International HPC Summer School 2016
Performance analysis and optimization (3 parts)
Phil Blood, Christian Feld, PRACE, XSEDE, RIKEN, Compute Canada
June 2016
Slide contents
Performance Engineering of Parallel Applications
Acknowledgment
Outline for Performance Sessions
Fitting algorithms to hardware… and vice versa
Code Development and Optimization Process
Performance engineering workflow
A little background...
Hardware Counters
Features of PAPI
Measurement Techniques
Inclusive and Exclusive Profiles
Applying Performance Tools to Improve Parallel Performance of the UNRES MD code
Structure of UNRES
Performance Engineering: Procedure
Is There a Performance Problem?
Detecting Performance Problems
Use a Sampling Tool for Initial Performance Check
UNRES: Serial Performance
UNRES: Parallel Performance
Performance Engineering: Procedure
Which Functions are Important?
Contributions of Functions
UNRES Function Summary
Performance Engineering: Procedure
Choose a tool: there are many!
TAU: Tuning and Analysis Utilities
General Instructions for TAU
Using TAU with Makefiles
Tiny Routines: High Overhead
Reducing Overhead
Selective Instrumentation File
Selective Instrumentation File
Getting a Call Path with TAU
Getting Call Path Information
Isolate regions of code execution
Key UNRES Functions in TAU (with Startup Time)
Key UNRES Functions (MD Time Only)
Performance Engineering: Procedure
Detecting Serial Performance Issues
Create a Derived Metric in Paraprof Manager
Performance of EELEC (peak is 2)
Performance Engineering: Procedure
Do compiler optimization first! EELEC – After forcing inlining with compiler
Further Info on Serial Optimization
Performance Engineering: Procedure
TAU Recipe #1: Detecting Serial Bottlenecks
Serial Bottleneck Detection in UNRES: Function Scaling
TAU Recipe #2: Detecting Parallel Load Imbalance
Load Imbalance Detection in UNRES
Major Serial Bottleneck and Load Imbalance in UNRES Eliminated
Next Iteration of Performance Engineering with Optimized Code
Use Call Path Information: MPI Calls
Performance Engineering: Procedure
Some Take-Home Points
International HPC Summer School 2016: Performance analysis and optimization, hands-on
Access to Bridges
Compiling & job submission
Local installation
NPB-MZ-MPI suite
Building an NPB-MZ-MPI benchmark
System topology
Building an NPB-MZ-MPI benchmark
NPB-MZ-MPI / BT (Block Tridiagonal Solver)
Building an NPB-MZ-MPI benchmark
NPB-MZ-MPI / BT reference execution
Score-P – A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir
Performance engineering workflow
Fragmentation of tools landscape
Scalasca, TAU, Vampir, Paraver
Score-P project idea
Score-P overview
Hands-on: NPB-MZ-MPI / BT