International HPC Summer School 2016
Parallel programming: accelerator track 3
John Urbanic, PRACE, XSEDE, RIKEN, Compute Canada
June 2016
Slide contents
Advanced OpenACC
Outline
Outline
Targeting the Architecture (But Not Admitting It)
OpenACC Task Granularity
Targeting the Architecture
NVIDIA GPU Task Granularity (Take Notes!)
Warps – on Kepler (Still taking notes?)
Determining block size – on Kepler (You can stop now)
Determining grid size – on Kepler
Mapping OpenACC to CUDA Threads and Blocks
SAXPY Returns For Some Fine Tuning
Rapid Evolution
Parallel Regions vs. Kernels
Parallel Construct
Parallel Clauses
Parallel Regions
Compare and Contrast
Compare and Contrast
Parallel Regions vs. Kernels
A parallel region will work differently
Parallel Regions vs. Kernels (Which is best?)
OpenACC 2.0 & 2.5
OpenACC 2.0 & 2.5
Procedure Calls
Nested Parallelism
Nested Parallelism
Device Specific Tuning
Multiple Devices and Multiple Threads
Asynchronous Behavior
Data Management
Profiling
Mandelbrot Code
Step 1 Profile
Pipelining with 32 blocks
Optimized In A Few Well-Informed Stages
OpenACC Things Not Covered