Parallel programming: accelerator track 3

  • John Urbanic, PRACE, XSEDE, RIKEN, Compute Canada
  • June 2016

Slide contents

  • Advanced OpenACC
  • Outline
  • Outline
  • Targeting the Architecture (But Not Admitting It)
  • OpenACC Task Granularity
  • Targeting the Architecture
  • NVIDIA GPU Task Granularity (Take Notes!)
  • Warps – on Kepler (Still taking notes?)
  • Determining block size – on Kepler (You can stop now)
  • Determining grid size – on Kepler
  • Mapping OpenACC to CUDA Threads and Blocks
  • SAXPY Returns For Some Fine Tuning
  • Rapid Evolution
  • Parallel Regions vs. Kernels
  • Parallel Construct
  • Parallel Clauses
  • Parallel Regions
  • Compare and Contrast
  • Compare and Contrast
  • Parallel Regions vs. Kernels
  • A parallel region will work differently
  • Parallel Regions vs. Kernels (Which is best?)
  • OpenACC 2.0 & 2.5
  • OpenACC 2.0 & 2.5
  • Procedure Calls
  • Nested Parallelism
  • Nested Parallelism
  • Device Specific Tuning
  • Multiple Devices and Multiple Threads
  • Asynchronous Behavior
  • Data Management
  • Profiling
  • Mandelbrot Code
  • Step 1 Profile
  • Pipelining with 32 blocks
  • Optimized In A Few Well-Informed Stages
  • OpenACC Things Not Covered