Re-visited Tata Elxsi after close to two years and trained another generation of engineers on parallel computing, this time focused on NVIDIA Kepler and host-side optimization.
This is indeed an interesting time in the history of computing. Moore’s Law is coming to an end, perhaps sooner than expected, thanks to the finite speed of light, the atomic nature of matter, and the growing demand for performance. It is the rise and rise and rise of heterogeneous computing! But are we prepared to handle such heterogeneous systems?
At the same time, we are witnessing an era in which worldwide data is growing at a rate of 40 percent per year, an era in which big data dominates better algorithms.
Now these are two contrasting situations: more performance needed to process big data versus the end of Moore’s Law.
How are we going to tackle this problem, then?
Quantum computing! It is time to encourage your children toward a career in pure science and mathematics rather than engineering!!
I hope I am alive in the year 2020 to see how I write my “hello world” programs!
It is time to learn the art and science of GPU computing on the latest Kepler architecture.
Organizing a 5-day hands-on workshop at New Delhi.
Date: October 12-16
Well, just busy educating myself. Until April I am teaching (actually learning :-)) at Hamdard University, New Delhi. I am teaching these courses:
1- Advanced Algorithms (M.Tech)
2- Advanced Computer Architecture (M.Tech)
3- DSP (B.Tech)
I am also involved in some really interesting projects at AcmeEmsys.
Shall update soon with more on how I am having fun there 😉
Tata Elxsi invited us to deliver a 5-day training program at Bangalore from 5th to 9th February. It was an advanced training covering topics from CUDA, with some exposure to OpenCL. The sessions were mostly hands-on, and we provided example source code for each concept we explained.
We are back with a new course on Heterogeneous Computing.
This time we are teaching both CUDA (4.0 & 4.1) and OpenCL.
Thus preparing you to tackle problems in heterogeneous computing in general, not just in GPU computing, such as an application using a DSP, an FPGA, a GPU, and a multicore CPU all in a single system.
Topics in OpenCL
OpenCL Training Syllabus for 5-Day Training (25 Hours)
The training consists of parallel theory and lab sessions.
Module One: An overview of Heterogeneous Computing using OpenCL
1-What is Parallel Computing?
– Necessity or Luxury?
– Opportunities & Challenges
– Alternatives to Parallel Computing
2-Fundamentals of Heterogeneous/GPU Computing
– ‘Dark Silicon’ and heterogeneous computing
– Why use GPUs?
– Why are GPUs fast?
– Basic components of a graphics Card
– Basic differences between GPUs and CPUs
– The APU (Accelerated Processing Unit)
– The processor of 2020
3-What is OpenCL?
– Introduction to the language/API
– Getting started with APP SDK 2.4: installation, configuration, etc.
– Walk-through of sample OpenCL programs
– Hands-on session
Module Two: Architecture of some recent CPUs and GPUs
– Intel Dual Core Processors
– Nvidia Fermi
– AMD Fusion
– IBM Cell Broadband Engine
Module Three: Introduction to parallel programming using OpenCL
– Task and data decomposition
– Load Balancing
– Software models
– Hardware architectures
– CPU-GPU Communication (PCI Express vs. PCI)
2- Getting started with OpenCL programs (Lab)
· The Software Development Environment and Tools
– Installing on Windows
– Installing on Linux
· The first program: Hello World!
· Compilation (on Linux and Windows)
Module Four: OpenCL Architecture
* An overview of other programming models: OpenMP, CUDA
* Platform Model
* Execution Model
* Memory Model (Local, Private, Constant memory, etc.)
* Programming Model
* Threading and Scheduling (Work-items, Work-groups and Wavefronts)
* Dealing with Buffers and Images
* How the OpenCL model provides transparent scalability
* Hands on Session: A simple vector addition
Module Five: OpenCL programming in detail
1- Lab Exercises
· Image Convolution
· Matrix Multiplication (Single and Multiple Block)
· Parallel Reduction
2-Discussion on Advantages of using Local memory
3- OpenCL Implementation of Task Parallel and Data Parallel Algorithms
4- Overlapping GPU and CPU tasks
5- Performance metrics – speed-up, utilization, efficiency
6- Events & Timing
Module Six: OpenCL C and memory optimization
1- The OpenCL C programming language in detail
· Supported features
2-OpenCL C++ Bindings
3-Understanding GPU memories
* Bank Conflicts
* Memory Coalescing
4-Converting a CUDA kernel to OpenCL
5-AMD Accelerated Parallel Processing Math Libraries (APPML)
6- Lab Exercise: Sobel Edge Detection
Module Seven: Debugging, Profiling and other useful tools
– AMD KernelAnalyzer
– AMD APP Profiler
Module Eight: Advanced concepts
1- General Optimization Tips: removing bank conflicts, coalesced access
2- Programming Multiple Devices
3- OpenCL Extensions (Atomics, Device Fission, GPU printf, OpenGL Interoperability)
5- Lab exercises
· Matrix Transpose
· Gaussian Noise
· Optimizing Image Convolution Kernel
Module Nine: OpenCL in real-world applications
Brief demonstrations on the following:
– N-body Simulation
– Mersenne Twister
– Applications in Artificial Neural Networks
Topics in CUDA 4.0/4.1
– Introduction to the course
– Parallel Computing and Heterogeneous Computing
– The dark silicon
– Parallel computing: need or luxury?
– Models of Parallel Computation: SIMD (Single Instruction Multiple Data), MIMD (Multiple Instruction Multiple Data)
– Architecture of Modern CPUs
– What is a GPU? Architecture of the GTX 580
– CUDA Programming Model
– CUDA Memory Architecture
· Interaction with Visual Studio
– Major modifications in CUDA SDK 4.x and performance gains w.r.t. 3.x and 2.x
– CUDA Tool Chain
· Differences between the various tool chains (nvcc, gcc, etc.)
· Pros and cons, and ease of integration with custom build systems
– Installation & Compilation on Windows
– Installation & Compilation on Linux
– ‘Hello World’ in CUDA (Hands-on)
– Vector Multiplication (Hands-on)
– Details of the CUDA runtime APIs with examples/demos from the SDK; Blocking vs. Non-Blocking Functions; Data Transfer H to D and D to H
– CUDA Driver APIs Overview
– Thread creation & synchronization at the device (GPU) level; avoiding race conditions between threads
– Thread-Block-Warp Algebra & Transparent Scalability
– Assigning indices to threads in single/multiple blocks
– Thread creation and synchronization between CPU and GPU
– Performance metrics: speed-up (events), utilization, efficiency; theoretical performance estimation
– Matrix Addition: Single Block
– Matrix Addition: Multiple Blocks
– Simple Matrix Multiplication: Single Block
– Simple Matrix Multiplication: Multiple Blocks
– Atomic Operations and their limitations
– Querying a Device for supported features
– Memory organization in CUDA (cache, shared, global, constant, texture)
– Optimization (1): Using shared memory in Matrix Addition and Multiplication; performance improvement over versions without shared memory
– Array reversal with and without shared memory
– Optimization (2): Important compiler options and pragmas
– Image rotation and image convolution on the GPU
– Constant Memory Usage
– Calling a device function from a kernel
– Optimization (3): CUDA Warps and Occupancy Considerations
– Optimization (4): Sum Reduction
– Optimization (5): Removing Bank Conflicts
– Optimization (6): Prefix Sum
– Optimization (7): Texture memory usage
– Optimization (8): Global Memory Coalescing
– Optimization (9): Using the occupancy calculator
– Optimization (10): Pinned Memory
– Zero-copy Host memory
– Portable pinned memory
– Optimization (11): Using streams; overlapping GPU and CPU tasks; overlapping computation with memory copy
– Optimization (12): CUDA Video Decoder API (hardware accelerated)
– CUBLAS & CUFFT usage
– Thrust Library & CUDA Data Parallel Primitives Library (CUDPP)
– NVIDIA Performance Primitives (NPP) library for image/video processing
– Sharing GPUs across multiple threads
– Using all GPUs in the system concurrently from a single host thread
– Unified Virtual Addressing and Multiple Kernels
– GPUDirect v2.0 support for Peer-to-Peer Communication
– GPU binary disassembler for the Fermi architecture (cuobjdump)
– Getting started with OpenCL
– Converting a CUDA kernel into OpenCL
– Summary and Performance Guidelines
I was wondering why this F appears in the above equation. It is because of this F that the entire computing world is in trouble, or at least the jobs of software developers are. 🙂
Combing through my undergraduate books, I found a very good explanation of how the above equation arises in the book by Sedra and Smith, Section 4.10.
In his keynote address at the AMD Fusion Developer Summit (AFDS), Jem Davies, ARM Fellow and VP of Technology, Media Processing Division, emphasized that we must now focus on Functionality per Unit Cost per Unit Energy, not Functionality per Unit Cost per Unit Power. Those days are gone!
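For context, the equation in question is in all likelihood the standard CMOS dynamic-power relation (this reconstruction is my assumption, since the slide itself is not reproduced here; the notation follows the usual textbook treatment):

```latex
P_{\text{dynamic}} \approx \alpha \, C \, V_{DD}^{2} \, f
```

Here α is the switching-activity factor, C the switched capacitance, V_DD the supply voltage, and f the clock frequency, the troublesome F. Raising f buys performance but raises power density linearly, while the energy per operation, roughly αCV_DD², does not depend on f at all, which is exactly why an energy-based metric is the one that matters.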
What is Dark Silicon?
Refer to the above slide. As we pack more transistors onto a single chip, a basic problem arises: if so many transistors are powered simultaneously in such a small area, they generate a great deal of heat, which raises the temperature of the chip until it can ultimately burn out and become useless. Therefore we will have to switch off approximately 75% of a chip’s transistors by 2014, and this figure will grow to approximately 90% by 2020!!
These unlit transistors constitute DARK SILICON.