Parallel programming: accelerator track 4

  • John Urbanic, PRACE, XSEDE, RIKEN, Compute Canada
  • June 2016

Slide contents

  • Using OpenACC With CUDA Libraries
  • 3 Ways to Accelerate Applications
  • CUDA Libraries
  • NVIDIA cuBLAS
  • CUDA Math Libraries
  • How To Use CUDA Libraries With OpenACC
  • Sharing data with libraries
  • deviceptr Data Clause (see sketch below)
  • host_data Construct (see sketch below)
  • Example: 1D convolution using CUFFT
  • Source Excerpt
  • OpenACC Convolution Code
  • Linking CUFFT
  • Result
  • Summary
  • Appendix
  • cuFFT: Multi-dimensional FFTs
  • FFTs up to 10x Faster than MKL
  • CUDA 4.1 optimizes 3D transforms
  • cuBLAS: Dense Linear Algebra on GPUs
  • cuBLAS Level 3 Performance
  • ZGEMM Performance vs Intel MKL
  • cuBLAS Batched GEMM API improves performance on batches of small matrices
  • cuSPARSE: Sparse linear algebra routines
  • OpenMP 4.0 (now 4.5) for Accelerators
  • OpenACC vs. OpenMP
  • OpenMP Thread Control Philosophy
  • Intel’s MIC Approach
  • What is MIC?
  • MIC Architecture
  • OpenMP 4.0 Data Migration
  • SAXPY in OpenMP 4.0 on NVIDIA (see sketch below)
  • Comparing OpenACC with OpenMP 4.0 on NVIDIA & Phi
  • OpenMP 4.0 Across Architectures
  • Which way to go?
  • So, at this time…
  • Going Hostless
  • Some things we did not mention
  • Hybrid Programming
  • Assuming you know basic MPI
  • Hybrid OpenACC Programming (Fast & Wrong; see sketch below)
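
A few of the topics above lend themselves to short illustrative sketches, collected below. First, the deviceptr data clause covers the case where memory is allocated outside of OpenACC, e.g. with cudaMalloc, and an OpenACC region should use the pointer as-is. A minimal C sketch (the variable names are illustrative, not from the slides):

    #include <stdio.h>
    #include <cuda_runtime.h>

    #define N 1000

    int main(void)
    {
        double *d_x;                       /* raw CUDA device pointer */
        cudaMalloc((void **)&d_x, N * sizeof(double));

        /* deviceptr tells OpenACC that d_x is already a device address,
           so the runtime uses it directly instead of translating a host
           pointer through its present table. */
        #pragma acc parallel loop deviceptr(d_x)
        for (int i = 0; i < N; i++)
            d_x[i] = 2.0 * i;

        double x1;
        cudaMemcpy(&x1, &d_x[1], sizeof(double), cudaMemcpyDeviceToHost);
        printf("d_x[1] = %f\n", x1);       /* expect 2.0 */
        cudaFree(d_x);
        return 0;
    }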
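
The host_data construct goes the other way: OpenACC manages the data, and use_device hands the device address to a CUDA library call. A minimal sketch of the pattern behind the deck's 1D convolution example, reduced here to a single forward FFT (the full example would also multiply by the filter's transform and inverse-transform):

    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <cufft.h>

    #define N 1024

    int main(void)
    {
        cufftComplex signal[N];
        for (int i = 0; i < N; i++) {
            signal[i].x = (float)i;        /* real part      */
            signal[i].y = 0.0f;            /* imaginary part */
        }

        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, 1);

        #pragma acc data copy(signal)
        {
            /* Inside host_data, "signal" is the device address of the
               copy OpenACC created, so cuFFT operates on that copy. */
            #pragma acc host_data use_device(signal)
            {
                cufftExecC2C(plan, signal, signal, CUFFT_FORWARD);
            }
            /* Make sure the FFT has finished before the data region
               copies the transformed array back to the host. */
            cudaDeviceSynchronize();
        }

        cufftDestroy(plan);
        printf("bin 0: (%f, %f)\n", signal[0].x, signal[0].y);
        return 0;
    }

With the PGI compilers of this era, the build line for such a program is along the lines of pgcc -acc -Mcudalib=cufft example.c, which is what the "Linking CUFFT" slide covers; exact flags vary by compiler version.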
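
For the OpenMP 4.0 comparison slides, SAXPY is the usual vehicle. A sketch of the two directive dialects performing the same offload, assuming a compiler that maps omp target regions onto the GPU:

    #include <stdio.h>
    #include <stdlib.h>

    /* OpenMP 4.0: map the arrays onto the device and spread the loop
       across teams of device threads. */
    void saxpy_omp(int n, float a, float *x, float *y)
    {
        #pragma omp target teams distribute parallel for \
                    map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* OpenACC: the same operation as one descriptive directive; the
       compiler chooses the gang/vector decomposition. */
    void saxpy_acc(int n, float a, float *x, float *y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        int n = 1 << 20;
        float *x = malloc(n * sizeof *x);
        float *y = malloc(n * sizeof *y);
        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy_omp(n, 2.0f, x, y);          /* y becomes 4.0 everywhere */
        saxpy_acc(n, 2.0f, x, y);          /* y becomes 6.0 everywhere */
        printf("y[0] = %f\n", y[0]);

        free(x);
        free(y);
        return 0;
    }

The OpenMP version spells out the team/thread decomposition in the directive, while OpenACC leaves that choice to the compiler, which is essentially the contrast the "OpenMP Thread Control Philosophy" slide draws.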
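
Finally, "Hybrid OpenACC Programming (Fast & Wrong)" points at the standard MPI plus OpenACC pitfall: MPI reads the host copy of an array whose current values live on the GPU. A sketch of the corrected ring exchange (assumed names and sizes, not the slides' code); deleting the two update directives reproduces the fast-but-wrong behavior:

    #include <stdio.h>
    #include <mpi.h>

    #define N 8

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        double u[N];
        for (int i = 0; i < N; i++) u[i] = rank;

        #pragma acc data copy(u)
        {
            /* Values now change on the GPU only. */
            #pragma acc parallel loop
            for (int i = 0; i < N; i++) u[i] += 1.0;

            /* Without this update, MPI below would send the stale host
               copy of u[N-1]: that is the "fast & wrong" version. */
            #pragma acc update host(u[N-1:1])

            double halo;
            MPI_Sendrecv(&u[N-1], 1, MPI_DOUBLE, right, 0,
                         &halo,   1, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            u[0] = halo;                   /* host copy of received halo */
            #pragma acc update device(u[0:1])
        }

        printf("rank %d: u[0] = %.1f\n", rank, u[0]);
        MPI_Finalize();
        return 0;
    }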