Automatic and transparent resource contention mitigation for improving large-scale parallel file system performance

Bibliographic Details
Main Authors: Neuwirth, Sarah (Author), Brüning, Ulrich (Author)
Format: Chapter/Article Conference Paper
Language: English
Published: 31 May 2018
In: 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS 2017)
Year: 2018, Pages: 604-613
DOI: 10.1109/ICPADS.2017.00084
Online Access: Publisher, full text: http://dx.doi.org/10.1109/ICPADS.2017.00084
Publisher, full text: https://ieeexplore.ieee.org/document/8368413/
Author Notes: Sarah Neuwirth, Feiyi Wang, Sarp Oral and Ulrich Bruening
Description
Summary: Proportional to the scale increases in HPC systems, many scientific applications are becoming increasingly data intensive, and parallel I/O has become one of the dominant factors impacting large-scale HPC application performance. On a typical large-scale HPC system, we have observed that the lack of global workload coordination, coupled with the shared nature of storage systems, causes load imbalance and resource contention over the end-to-end I/O paths, resulting in severe performance degradation. I/O load imbalance on HPC systems is generally a self-inflicted wound and mostly occurs between the I/O paths and resources consumed by each individual job.
Physical Description: Online Resource
ISBN: 9781538621295