TY - GEN
T1 - Profiling directed NUMA optimization on Linux systems
T2 - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011
AU - Yang, Rui
AU - Antony, Joseph
AU - Rendell, Alistair
AU - Robson, Danny
AU - Strazdins, Peter
PY - 2011/10/3
Y1 - 2011/10/3
N2 - The parallel performance of applications running on Non-Uniform Memory Access (NUMA) platforms is strongly influenced by the relative placement of memory pages to the threads that access them. As a consequence there are Linux application programmer interfaces (APIs) to control this. For large parallel codes it can, however, be difficult to determine how and when to use these APIs. In this paper we introduce the NUMAgrind profiling tool which can be used to simplify this process. It extends the Valgrind binary translation framework to include a model which incorporates cache coherency, memory locality domains and interconnect traffic for arbitrary NUMA topologies. Using NUMAgrind, cache misses can be mapped to memory locality domains, page access modes determined, and pages that are referenced by multiple threads quickly determined. We show how the NUMAgrind tool can be used to guide the use of Linux memory and thread placement APIs in the Gaussian computational chemistry code. The performance of the code before and after use of these APIs is also presented for three different commodity NUMA platforms.
AB - The parallel performance of applications running on Non-Uniform Memory Access (NUMA) platforms is strongly influenced by the relative placement of memory pages to the threads that access them. As a consequence there are Linux application programmer interfaces (APIs) to control this. For large parallel codes it can, however, be difficult to determine how and when to use these APIs. In this paper we introduce the NUMAgrind profiling tool which can be used to simplify this process. It extends the Valgrind binary translation framework to include a model which incorporates cache coherency, memory locality domains and interconnect traffic for arbitrary NUMA topologies. Using NUMAgrind, cache misses can be mapped to memory locality domains, page access modes determined, and pages that are referenced by multiple threads quickly determined. We show how the NUMAgrind tool can be used to guide the use of Linux memory and thread placement APIs in the Gaussian computational chemistry code. The performance of the code before and after use of these APIs is also presented for three different commodity NUMA platforms.
KW - G09
KW - Gaussian 09
KW - NUMA
KW - NUMAgrind
KW - OpenMP
KW - thread and memory placement
KW - Valgrind
UR - http://www.scopus.com/inward/record.url?scp=80053236222&partnerID=8YFLogxK
UR - http://purl.org/au-research/grants/ARC/LP0347178
UR - http://purl.org/au-research/grants/ARC/LP0774896
U2 - 10.1109/IPDPS.2011.100
DO - 10.1109/IPDPS.2011.100
M3 - Conference contribution
AN - SCOPUS:80053236222
SN - 9780769543857
T3 - Proceedings - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011
SP - 1046
EP - 1057
BT - Proceedings - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011
Y2 - 16 May 2011 through 20 May 2011
ER -