Parallel programming: accelerator track 4

  • John Urbanic, PRACE, XSEDE, RIKEN, Compute Canada
  • June 2016

Slide contents

  • Using OpenACC With CUDA Libraries
  • 3 Ways to Accelerate Applications
  • CUDA Libraries
  • NVIDIA cuBLAS
  • CUDA Math Libraries
  • How To Use CUDA Libraries With OpenACC
  • Sharing data with libraries
  • deviceptr Data Clause (see sketch below)
  • host_data Construct (see sketch below)
  • Example: 1D convolution using CUFFT
  • Source Excerpt
  • OpenACC Convolution Code
  • Linking CUFFT
  • Result
  • Summary
  • Appendix
  • cuFFT: Multi-dimensional FFTs
  • FFTs up to 10x Faster than MKL
  • CUDA 4.1 optimizes 3D transforms
  • cuBLAS: Dense Linear Algebra on GPUs
  • cuBLAS Level 3 Performance
  • ZGEMM Performance vs Intel MKL
  • cuBLAS Batched GEMM API improves performance on batches of small matrices
  • cuSPARSE: Sparse linear algebra routines
  • OpenMP 4.0 (now 4.5) for Accelerators
  • OpenACC vs. OpenMP
  • OpenMP Thread Control Philosophy
  • Intel’s MIC Approach
  • What is MIC?
  • MIC Architecture
  • OpenMP 4.0 Data Migration
  • SAXPY in OpenMP 4.0 on NVIDIA (see sketch below)
  • Comparing OpenACC with OpenMP 4.0 on NVIDIA & Phi
  • OpenMP 4.0 Across Architectures
  • Which way to go?
  • So, at this time…
  • Going Hostless
  • Some things we did not mention
  • Hybrid Programming
  • Assuming you know basic MPI
  • Hybrid OpenACC Programming (Fast & Wrong; see sketch below)
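
A few of the topics above lend themselves to short illustrative sketches, collected below. First, the deviceptr data clause covers the case where memory is allocated outside of OpenACC, e.g. with cudaMalloc, and an OpenACC region should use the pointer as-is. A minimal C sketch (the variable names are illustrative, not from the slides):

    #include <stdio.h>
    #include <cuda_runtime.h>

    #define N 1000

    int main(void)
    {
        double *d_x;                       /* raw CUDA device pointer */
        cudaMalloc((void **)&d_x, N * sizeof(double));

        /* deviceptr tells OpenACC that d_x is already a device address,
           so the runtime uses it directly instead of translating a host
           pointer through its present table. */
        #pragma acc parallel loop deviceptr(d_x)
        for (int i = 0; i < N; i++)
            d_x[i] = 2.0 * i;

        double x1;
        cudaMemcpy(&x1, &d_x[1], sizeof(double), cudaMemcpyDeviceToHost);
        printf("d_x[1] = %f\n", x1);       /* expect 2.0 */
        cudaFree(d_x);
        return 0;
    }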
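
The host_data construct goes the other way: OpenACC manages the data, and use_device hands the device address to a CUDA library call. A minimal sketch of the pattern behind the deck's 1D convolution example, reduced here to a single forward FFT (the full example would also multiply by the filter's transform and inverse-transform):

    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <cufft.h>

    #define N 1024

    int main(void)
    {
        cufftComplex signal[N];
        for (int i = 0; i < N; i++) {
            signal[i].x = (float)i;        /* real part      */
            signal[i].y = 0.0f;            /* imaginary part */
        }

        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, 1);

        #pragma acc data copy(signal)
        {
            /* Inside host_data, "signal" is the device address of the
               copy OpenACC created, so cuFFT operates on that copy. */
            #pragma acc host_data use_device(signal)
            {
                cufftExecC2C(plan, signal, signal, CUFFT_FORWARD);
            }
            /* Make sure the FFT has finished before the data region
               copies the transformed array back to the host. */
            cudaDeviceSynchronize();
        }

        cufftDestroy(plan);
        printf("bin 0: (%f, %f)\n", signal[0].x, signal[0].y);
        return 0;
    }

With the PGI compilers of this era, the build line for such a program is along the lines of pgcc -acc -Mcudalib=cufft example.c, which is what the "Linking CUFFT" slide covers; exact flags vary by compiler version.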
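
For the OpenMP 4.0 comparison slides, SAXPY is the usual vehicle. A sketch of the two directive dialects performing the same offload, assuming a compiler that maps omp target regions onto the GPU:

    #include <stdio.h>
    #include <stdlib.h>

    /* OpenMP 4.0: map the arrays onto the device and spread the loop
       across teams of device threads. */
    void saxpy_omp(int n, float a, float *x, float *y)
    {
        #pragma omp target teams distribute parallel for \
                    map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* OpenACC: the same operation as one descriptive directive; the
       compiler chooses the gang/vector decomposition. */
    void saxpy_acc(int n, float a, float *x, float *y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        int n = 1 << 20;
        float *x = malloc(n * sizeof *x);
        float *y = malloc(n * sizeof *y);
        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy_omp(n, 2.0f, x, y);          /* y becomes 4.0 everywhere */
        saxpy_acc(n, 2.0f, x, y);          /* y becomes 6.0 everywhere */
        printf("y[0] = %f\n", y[0]);

        free(x);
        free(y);
        return 0;
    }

The OpenMP version spells out the team/thread decomposition in the directive, while OpenACC leaves that choice to the compiler, which is essentially the contrast the "OpenMP Thread Control Philosophy" slide draws.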
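
Finally, "Hybrid OpenACC Programming (Fast & Wrong)" points at the standard MPI plus OpenACC pitfall: MPI reads the host copy of an array whose current values live on the GPU. A sketch of the corrected ring exchange (assumed names and sizes, not the slides' code); deleting the two update directives reproduces the fast-but-wrong behavior:

    #include <stdio.h>
    #include <mpi.h>

    #define N 8

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        double u[N];
        for (int i = 0; i < N; i++) u[i] = rank;

        #pragma acc data copy(u)
        {
            /* Values now change on the GPU only. */
            #pragma acc parallel loop
            for (int i = 0; i < N; i++) u[i] += 1.0;

            /* Without this update, MPI below would send the stale host
               copy of u[N-1]: that is the "fast & wrong" version. */
            #pragma acc update host(u[N-1:1])

            double halo;
            MPI_Sendrecv(&u[N-1], 1, MPI_DOUBLE, right, 0,
                         &halo,   1, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            u[0] = halo;                   /* host copy of received halo */
            #pragma acc update device(u[0:1])
        }

        printf("rank %d: u[0] = %.1f\n", rank, u[0]);
        MPI_Finalize();
        return 0;
    }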