1-3 December 2021
Africa/Johannesburg timezone

ANACIN-X: A Software Framework for Studying Non-Determinism in HPC Applications

Not scheduled
20m
Professional Micro-Talk HPC Techniques and Computer Science Micro-talks

Speakers

Mr Patrick Bell (The University of Tennessee Knoxville), Ms Kae Suarez (The University of Tennessee Knoxville)

Description

This content along with corresponding figures can be found in the uploaded PDF version of this abstract.

HPC applications that use message-passing programming interfaces such as MPI rely on asynchronous communication to achieve scalable performance. Asynchronous communication often makes executions non-deterministic, because the order in which messages are matched can change from run to run. This non-determinism can compromise the correctness of HPC and scientific simulations, and the resulting bugs can be very costly to debug. One study [3] reported more than 10,000 hours of compute time spent manually debugging a non-deterministic bug in the HYPRE 2.10.1 linear algebra software package [4].
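As a small illustration of why message order matters for correctness, the sketch below accumulates floating-point partial results in every possible arrival order. It is not taken from ANACIN-X or HYPRE; the values in `partials` and the helper `receive_and_accumulate` are hypothetical, and the permutations only stand in for the arrival orders a wildcard receive might observe.

```python
from itertools import permutations


def receive_and_accumulate(arrival_order):
    """Add partial results in the order their messages arrive."""
    total = 0.0
    for value in arrival_order:
        total += value
    return total


# Hypothetical partial results sent by three ranks to a single receiver.
partials = [1e16, 1.0, -1e16]

# Each permutation stands for one possible arrival order of the messages.
results = {
    order: receive_and_accumulate(order) for order in permutations(partials)
}

# Floating-point addition is not associative, so the total depends on the order.
distinct_totals = sorted(set(results.values()))
print(distinct_totals)
```

Because floating-point addition is not associative, the same three messages yield more than one total, which is the kind of run-to-run variation that makes such bugs hard to reproduce.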

Non-deterministic bugs such as the one in the referenced version of HYPRE produce inconsistent execution patterns, which makes it difficult to perform reproducible science and to identify and reason about the source of the bugs. For this reason, developing tools that automatically identify the root sources of non-determinism (i.e., the function call-stacks that produce non-deterministic executions) has become an important challenge.

To address this challenge, we present ANACIN-X, an open-source software framework that identifies regions of code in MPI applications exhibiting high amounts of non-determinism and reports the call-stacks that contribute to it. Specifically, we target MPI applications that use point-to-point communication (i.e., communication that does not use collective MPI function calls). ANACIN-X does this by quantifying the non-determinism in an MPI application as graph kernel distances. More details can be found in the authors' previous work [1, 2].
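To make the idea of a graph kernel distance concrete, here is a minimal sketch that compares two toy event graphs of a message race. It assumes a simplified Weisfeiler-Lehman-style subtree kernel over directed, labeled graphs; the vertex labels, the graph encoding, and the function names are illustrative assumptions and not the ANACIN-X implementation, which is described in [1, 2]. The distance is the standard kernel-induced one, sqrt(k(a,a) + k(b,b) - 2 k(a,b)).

```python
from collections import Counter
from math import sqrt


def wl_features(labels, edges, iterations=2):
    """Count Weisfeiler-Lehman-style labels for one directed graph.

    labels: dict mapping vertex -> initial label (str)
    edges:  list of (source, target) pairs
    """
    preds = {v: [] for v in labels}
    succs = {v: [] for v in labels}
    for src, dst in edges:
        succs[src].append(dst)
        preds[dst].append(src)
    current = dict(labels)
    features = Counter(current.values())
    for _ in range(iterations):
        refined = {}
        for v in current:
            ins = ",".join(sorted(current[u] for u in preds[v]))
            outs = ",".join(sorted(current[u] for u in succs[v]))
            refined[v] = f"{current[v]}[{ins}|{outs}]"
        current = refined
        features.update(current.values())
    return features


def kernel(fa, fb):
    """Dot product of two label-count vectors."""
    return sum(count * fb[label] for label, count in fa.items())


def kernel_distance(graph_a, graph_b, iterations=2):
    """Kernel-induced distance between two (labels, edges) graphs."""
    fa = wl_features(*graph_a, iterations)
    fb = wl_features(*graph_b, iterations)
    squared = kernel(fa, fa) + kernel(fb, fb) - 2 * kernel(fa, fb)
    return sqrt(max(squared, 0))


# Toy event graphs: rank 0 posts two receives, ranks 1 and 2 each send once.
labels = {"s1": "send@1", "s2": "send@2", "ra": "recv@0", "rb": "recv@0"}
# Run A: the first receive matches rank 1; run B: it matches rank 2.
run_a = (labels, [("s1", "ra"), ("s2", "rb"), ("ra", "rb")])
run_b = (labels, [("s2", "ra"), ("s1", "rb"), ("ra", "rb")])

print(kernel_distance(run_a, run_a))  # identical runs
print(kernel_distance(run_a, run_b))  # runs with a different match order
```

Two runs with the same match order are at distance zero, while swapping the match order yields a positive distance, so larger distances across repeated runs suggest more non-determinism.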

Figure 1 shows that graph kernel distances can effectively represent the amount of non-determinism present in an application's execution. Figure 2 illustrates that ANACIN-X can use these kernel distances to identify the root sources of non-determinism within an application. Both simulations used 16 MPI processes on one POWER9 compute node, with the percentage of non-determinism varied from 0% to 100% in increments of 10%, and the test was repeated over 100 executions. For demonstration purposes, ANACIN-X comes packaged with three benchmark applications that model communication patterns common in HPC codes: message race, algebraic multigrid (AMG) 2013, and unstructured mesh.
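The shape of the experiment described above can be sketched with a purely simulated message race. This does not run MPI and does not reproduce the figures. It assumes one receiver and 15 senders, that each receive matches a random pending sender with probability equal to the non-determinism percentage, and a simple edge-histogram kernel in place of the kernels used by ANACIN-X. All function names are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations
from math import sqrt


def simulate_match_order(num_senders, nd_fraction, rng):
    """Return the order in which rank 0 matches messages from the senders.

    Assumption: each receive picks a random pending sender with probability
    nd_fraction and the lowest-ranked pending sender otherwise.
    """
    pending = list(range(1, num_senders + 1))
    order = []
    while pending:
        if rng.random() < nd_fraction:
            order.append(pending.pop(rng.randrange(len(pending))))
        else:
            order.append(pending.pop(0))
    return order


def edge_histogram(order):
    """Features of a toy event graph: one labeled edge per matched message."""
    return Counter(
        (f"send@{sender}", f"recv#{slot}") for slot, sender in enumerate(order)
    )


def dot(fa, fb):
    return sum(count * fb[key] for key, count in fa.items())


def kernel_distance(fa, fb):
    return sqrt(max(dot(fa, fa) + dot(fb, fb) - 2 * dot(fa, fb), 0))


def mean_pairwise_distance(nd_fraction, num_senders=15, runs=100, seed=0):
    """Average kernel distance over all pairs of simulated executions."""
    rng = random.Random(seed)
    graphs = [
        edge_histogram(simulate_match_order(num_senders, nd_fraction, rng))
        for _ in range(runs)
    ]
    pairs = list(combinations(graphs, 2))
    return sum(kernel_distance(a, b) for a, b in pairs) / len(pairs)


# Sweep the non-determinism percentage from 0% to 100% in steps of 10%.
sweep = {pct: mean_pairwise_distance(pct / 100) for pct in range(0, 101, 10)}
for pct, dist in sweep.items():
    print(f"{pct:3d}% non-determinism -> mean kernel distance {dist:.2f}")
```

At 0% every simulated execution matches messages in the same order, so the mean distance is zero. As the percentage rises, the match orders diverge and the mean distance becomes positive.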

  1. Dylan Chapp, Nigel Tan, Sanjukta Bhowmick, and Michela Taufer. Identifying Degree and Sources of Non-Determinism in MPI Applications Via Graph Kernels. IEEE Transactions on Parallel and Distributed Systems, 32(12):2936–2952, 2021. doi: 10.1109/TPDS.2021.3081530.
  2. Patrick Bell, Kae Suarez, Dylan Chapp, Nigel Tan, Sanjukta Bhowmick, and Michela Taufer. ANACIN-X: A Software Framework for Studying Non-determinism in MPI Applications. Software Impacts, 10:100151, 2021.
  3. Kento Sato, Dong H Ahn, Ignacio Laguna, Gregory L Lee, Martin Schulz, and Christopher M Chambreau. Noise Injection Techniques to Expose Subtle and Unintended Message Races. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 89–101, 2017.
  4. Robert D Falgout and Ulrike Meier Yang. HYPRE: A Library of High Performance Preconditioners. In Proceedings of the International Conference on Computational Science, pages 632–641. Springer, 2002.

Primary authors

Mr Patrick Bell (The University of Tennessee Knoxville), Ms Kae Suarez (The University of Tennessee Knoxville), Mr Dylan Chapp (The University of Tennessee Knoxville), Mr Nigel Tan (The University of Tennessee Knoxville), Mrs Sanjukta Bhowmick (The University of North Texas), Mrs Michela Taufer (The University of Tennessee Knoxville)
