Contact

Position:
IRAM, Granada, Spain and ECMWF, Reading, UK
Address:
United Kingdom

Miscellaneous Information

Abstract Reference: 30818
Identifier: O12.2
Presentation: Oral communication
Key Theme: 3 New Trends in HPC and Distributed Computing

Massive Scientific Workloads: Lessons Learned From Petaflop-Scale Weather Simulations

Authors:
Pierfederici Francesco

Weather forecasts run at the European Centre for Medium-Range Weather Forecasts (ECMWF) are complex workloads that use tens of thousands of CPU cores on two of the most powerful supercomputers in the world (both in the top twenty of the TOP500 list). They can run for weeks on end and process hundreds of millions of observation datasets.

Each of these forecast simulations is a heterogeneous mix of hybrid MPI-OpenMP Fortran/C/C++ numerical code surrounded by a host of Python and shell scripts that stage data in and out of databases, create high-level products, perform sanity checks on inputs and outputs, and so on. When running on an HPC cluster, each simulation spawns tens of thousands of jobs in a very deep dependency graph.

Monitoring, profiling, and debugging these complex workloads and their dependency rules is a herculean task, made more difficult by the fact that the tools used to analyse compiled executables (e.g. Darshan and Allinea MAP) lose much of their power, or become completely unusable, when dealing with scripts. Important issues of machine over-subscription and CPU power management are also left unaddressed.

Tools and techniques developed at ECMWF to approach whole-workload profiling of weather simulations will be presented. Their applicability to present and future astronomy processing needs will be investigated as well.