Introducing a New Course on Heterogeneous Computing

We are back with a new course on Heterogeneous Computing.

This time we are teaching both CUDA (4.0 & 4.1) and OpenCL.

This prepares you to tackle problems in heterogeneous computing in general, not just GPU computing: for example, an application that uses a DSP, an FPGA, a GPU, and a multicore CPU all in a single system.

Topics in OpenCL

OpenCL Training Syllabus for 5-Day Training (25 Hours)
The training consists of parallel theory and lab sessions.

Module One: An overview of Heterogeneous Computing using OpenCL
1-What is Parallel Computing?
– Necessity or Luxury?
– Opportunities & Challenges
– Alternatives to Parallel Computing
2-Fundamentals of Heterogeneous/GPU Computing
– ‘Dark Silicon’ and heterogeneous computing
– Why use GPUs?
– Why are GPUs fast?
– Basic components of a graphics Card
– Basic differences between GPUs and CPUs
– The APU (Accelerated Processing Unit)
– The processor of 2020
3-What is OpenCL ?
– Introduction to the language/API
– Getting started with the AMD APP SDK 2.4: installation, configuration, etc.
– Walk-through of sample OpenCL programs
– Hands-on session

Module Two: Architecture of some recent CPUs and GPUs
– Intel Dual Core Processors
– Nvidia Fermi
– AMD Fusion
– IBM Cell Broadband Engine

Module Three: Introduction to parallel programming using OpenCL
1-Introductory concepts
– Algorithms
– Task and data decomposition
– Load Balancing
– Software models
– Hardware architectures
– CPU-GPU Communication (PCI Express vs. PCI)
2- Getting started with an OpenCL program (Lab)
· The software development environment and tools
– Requirements
– Installing on Windows
– Installing on Linux
· The first program: Hello World!
· Compilation (on Linux and Windows)

Module Four: OpenCL Architecture
* An overview of other programming models: OpenMP, CUDA
OpenCL Architecture
* Platform Model
* Execution Model
* Memory Model (local, private, constant memory, etc.)
* Programming Model
* Threading and Scheduling (work-items, workgroups, and wavefronts)
* Dealing with buffers and images
* How the OpenCL model provides transparent scalability
* Hands on Session: A simple vector addition

Module Five
1- OpenCL programming in detail (Lab Exercises)
· Image Convolution
· Matrix Multiplication (Single and Multiple Block)
· Parallel Reduction
2- Discussion of the advantages of using local memory
3- OpenCL Implementation of Task Parallel and Data Parallel Algorithms
4- Overlapping GPU and CPU tasks
5- Performance metrics – speed-up, utilization, efficiency
6- Events & Timing

Module Six
1- OpenCL C programming language detail
· Supported features
· Restrictions
2-OpenCL C++ Bindings
3-Understanding GPU memories
* Bank Conflicts
* Memory Coalescing
4-Converting a CUDA kernel to OpenCL
5-AMD Accelerated Parallel Processing Math Libraries (APPML)
6- Lab exercise: Sobel edge detection

Module Seven: Debugging, Profiling and other useful tools
– AMD KernelAnalyzer,
– AMD APP Profiler
– CodeAnalyst
– gDEbugger

Module Eight: Advanced concepts
1- General Optimization Tips, removing bank conflicts, coalesced access
2-Vector operations
3- Multi-device programming
4- OpenCL extensions: atomics, device fission, GPU printf, OpenGL interoperability
5- Lab exercises
· Matrix Transpose
· Gaussian Noise
· Optimizing Image Convolution Kernel

Module Nine: OpenCL in real-world applications
Brief demonstrations on the following:
– N-body Simulation
– Mersenne Twister
– Application in Artificial Neural Network

Topics in CUDA 4.0/4.1

Introduction to the course
Parallel Computing and Heterogeneous Computing
Dark silicon
Parallel computing: necessity or luxury?
Models of Parallel Computation: SIMD (Single Instruction, Multiple Data), MIMD (Multiple Instruction, Multiple Data)
Architecture of Modern CPUs
What is a GPU? Architecture of the GTX 580
CUDA Programming Model
CUDA Memory architecture
– Contents
– Interaction with visual studio
Major modifications in CUDA SDK Ver4.x and performance gains w.r.t 3.x and 2.x
CUDA Tool chain
– Difference between various tool chains (nvcc, gcc etc)
– Pros and cons and ease of integration with custom build systems
Installation & compilation on Windows
Installation & compilation on Linux
‘Hello World’ in CUDA (hands-on)
Vector multiplication (hands-on)
Questions/ Resources
Details of the CUDA runtime APIs with examples/demos from the SDK; blocking vs. non-blocking functions; data transfer host-to-device and device-to-host
CUDA driver APIs Overview
Thread creation & synchronization at the device (GPU) level; avoiding race conditions between threads
Thread-Block-Warp Algebra  & Transparent scalability
Assigning indices to threads in single/multiple block
Thread creation and synchronization between CPU-GPU
Performance metrics – speed-up (events), utilization, efficiency; theoretical performance estimation
Practical performance estimation
Matrix Addition-Single Block
Matrix Addition -Multiple Blocks
Simple Matrix Multiplication- Single Block
Simple Matrix Multiplication - Multiple Blocks
Atomic Operations and their limitations
Bandwidth test
Questions/ Resources
Querying a Device for supported features
Error handling
Memory organization in CUDA(cache, shared, global, constant, texture )
Optimization(1) – Using shared memory in matrix addition and multiplication; performance improvement over the version without shared memory
Array reversal with and without shared memory.
Optimization(2) – Important compiler options and pragmas
Image rotation and image convolution on GPU
Constant Memory Usage
Calling a device function from a Kernel
Optimization(3) CUDA Warps And Occupancy Considerations
Questions/ Resources
Optimization(4) Sum Reduction
Optimization(5) Removing Bank Conflicts
Optimization(6) – Prefix sum
Optimization(7) – Texture memory usage
Global Memory optimization(8) Memory Coalescing
Optimization(9) -Using occupancy calculator
Optimization(10) Pinned Memory
Zero copy Host memory
Portable pinned memory
Optimization(11) – Using streams: overlapping GPU and CPU tasks, overlapping computation with memory copy
Optimization(12) – CUDA Video Decoder API (hardware accelerated)
Thrust Library & CUDA Data Parallel Primitives Library (CuDPP)
NVIDIA Performance Primitives (NPP) library for image/video processing
CUDA Profiler
NVIDIA Parallel Nsight
Sharing GPUs across multiple threads
Use all GPUs in the system concurrently from a single host thread
Unified Virtual Addressing, and Multiple kernels
GPUDirect v2.0 support for Peer-to-Peer Communication
GPU binary disassembler for Fermi architecture (cuobjdump)
Getting started with OpenCL
Converting a CUDA kernel into OpenCL
Summary and performance guidelines