Revisiting Tata Elxsi @ B’lore

I revisited Tata Elxsi after close to two years and trained another generation of engineers on parallel computing, this time with a focus on NVIDIA Kepler and host-side optimization.


Random thoughts: computing, the end of Moore’s Law, etc.

This is indeed an interesting time in the history of computing. Moore’s Law is coming to an end, perhaps sooner than expected, thanks to the finite speed of light, the atomic nature of matter, and the growing demand for performance. It is the rise and rise and rise of heterogeneous computing! But are we prepared to handle such heterogeneous systems?

At the same time, we are witnessing an era in which worldwide data is growing at a rate of 40 percent per year, an era in which big data is beating better algorithms.
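
To put the 40 percent figure in perspective, here is a minimal sketch of the compound-growth arithmetic. The function names are made up for illustration, and the only input taken from the text is the 40 percent annual rate.

```python
import math


def doubling_time(annual_growth_rate):
    """Years needed to double at a fixed compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth_rate)


def growth_factor(annual_growth_rate, years):
    """Total multiplication factor after the given number of years."""
    return (1 + annual_growth_rate) ** years


# 0.40 is the annual data growth rate quoted in the text.
print(f"Doubling time: {doubling_time(0.40):.2f} years")
print(f"Growth over 10 years: {growth_factor(0.40, 10):.1f}x")
```

At this rate the data doubles roughly every two years and grows about 29-fold in a decade, which is about the doubling pace usually quoted for Moore’s Law. If transistor scaling slows down, the data keeps doubling but the hardware no longer keeps up on its own.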

Now these are two contrasting situations: the need for more performance to process big data versus the end of Moore’s Law.

How are we going to tackle this problem, then?

Quantum computing! It’s time to encourage your children towards a career in pure science and maths rather than engineering!!

I hope I am alive in the year 2020 to see how I write my “hello world” programs!


I will be back soon, super charged, in Top Gear :-)

Well, I am just busy educating myself. Till April I am teaching (actually learning :-)) at Hamdard University in New Delhi. I am teaching these courses:

1- Advanced Algorithms (M.Tech)

2- Advanced Computer Architecture (M.Tech)

3- DSP (B.Tech)

I am also involved in some really interesting projects at AcmeEmsys.

I shall post more soon on how I am having fun there 😉

Introducing a new course on Heterogeneous Computing

We are back with a new course on Heterogeneous Computing.

This time we are teaching both CUDA (4.0 & 4.1) and OpenCL.

This prepares you to tackle problems in heterogeneous computing in general, not just in GPU computing: for example, an application that uses a DSP, an FPGA, a GPU, and a multicore CPU all in a single system.

Topics in OpenCL

OpenCL Training Syllabus for 5-Day Training (25 Hours)
The training consists of theory and lab sessions running side by side.

Module One: An overview of Heterogeneous Computing using OpenCL
1-What is Parallel Computing?
– Necessity or Luxury?
– Opportunities & Challenges
– Alternatives to Parallel Computing
2-Fundamentals of Heterogeneous/GPU Computing
– ‘Dark Silicon’ and Heterogeneous computing
– Why use GPUs
– Why are GPUs fast?
– Basic components of a graphics Card
– Basic differences between GPUs and CPUs
– The APU (Accelerated Processing Unit)
– The processor of 2020
3-What is OpenCL?
– Introduction to the language/API
– Getting started with APP SDK 2.4, installation, configuration etc
– Sample OpenCL programs walk-through
– Hands on session

Module Two: Architecture of some recent CPUs and GPUs
– Intel Dual Core Processors
– Nvidia Fermi
– AMD Fusion
– IBM Cell Broadband Engine

Module Three: Introduction to parallel programming using OpenCL
1-Introductory concepts
– Algorithms
– Task and data decomposition
– Load Balancing
– Software models
– Hardware architectures
– CPU-GPU Communication (PCI-Express Vs PCI)
2- Getting started with the OpenCL program (Lab)
· The software Development Environment and Tools
– Requirements
– Installing on Windows
– Installing on Linux
· The first program: Hello World!
· Compilation (on Linux and Windows)

Module Four: OpenCL Architecture
* An overview of other programming models: OpenMP, CUDA
OpenCL Architecture
* Platform Model
* Execution Model
* Memory Model (Local Memory, Private Memory, Constant Memory, etc.)
* Programming Model
* Threading and Scheduling (Work Items, Workgroups and Wavefronts)
* Dealing with Buffers and Images
* How OpenCL model provides Transparent Scalability
* Hands on Session: A simple vector addition

Module Five
1- OpenCL programming in detail (Lab Exercises)
· Image Convolution
· Matrix Multiplication (Single and Multiple Block)
· Parallel Reduction
2-Discussion on Advantages of using Local memory
3- OpenCL Implementation of Task Parallel and Data Parallel Algorithms
4- Overlapping GPU and CPU tasks
5- Performance metrics – speed-up, utilization, efficiency
6- Events & Timing

Module Six
1- OpenCL C programming language detail
· Supported features
· Restrictions
2-OpenCL C++ Bindings
3-Understanding GPU memories
* Bank Conflicts
* Memory Coalescing
4-Converting a CUDA kernel to OpenCL
5-AMD Accelerated Parallel Processing Math Libraries (APPML)
6- Lab Exercises: · Sobel Edge detection

Module Seven: Debugging, Profiling and other useful tools
– AMD KernelAnalyzer
– AMD APP Profiler
– CodeAnalyst
– gDEBugger

Module Eight: Advanced concepts
1- General Optimization Tips, removing bank conflicts, coalesced access
2-Vector operations
3- Programming Multi Device
4-OpenCL Extensions: (Atomics, Device Fission, GPU Printf, OpenGL Interoperability)
5- Lab exercises
· Matrix Transpose
· Gaussian Noise
· Optimizing Image Convolution Kernel

Module Nine: OpenCL in real-world applications
Brief demonstrations on the following:
– N-body Simulation
– Mersenne Twister
– Application in Artificial Neural Network
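
As a taste of the work-item and work-group ideas from Module Four, the following is a minimal sketch that simulates the vector-addition exercise in plain Python. It is not OpenCL code: `vector_add_kernel` and `launch` are hypothetical names, and the nested loops stand in for work-items that a real device would run in parallel. The index formula assumes a one-dimensional range with no global offset.

```python
def vector_add_kernel(global_id, a, b, c):
    """Body executed by one work-item: add a single pair of elements."""
    # Guard: the last work-group may extend past the end of the data.
    if global_id < len(c):
        c[global_id] = a[global_id] + b[global_id]


def launch(a, b, local_size):
    """Simulate a 1-D launch, visiting every work-item of every work-group."""
    n = len(a)
    c = [0] * n
    # Round up so that the work-groups cover all n elements.
    num_groups = (n + local_size - 1) // local_size
    for group_id in range(num_groups):
        for local_id in range(local_size):
            # Global index = group index * group size + index within the group.
            global_id = group_id * local_size + local_id
            vector_add_kernel(global_id, a, b, c)
    return c


print(launch([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], local_size=2))
# prints [11, 22, 33, 44, 55]
```

Because the kernel body depends only on its own global index, the same code works for any number of work-groups, which is the idea behind transparent scalability. The CUDA counterpart of the index calculation is `blockIdx.x * blockDim.x + threadIdx.x`.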

Topics in CUDA 4.0/4.1

Introduction to the course
Parallel Computing and Heterogeneous Computing
The dark silicon
Parallel computing: Need or luxury?
Models of Parallel Computation: SIMD (Single Instruction Multiple Data), MIMD (Multiple Instruction Multiple Data)
Architecture of Modern CPUs
What is a GPU? Architecture of GTX 580
CUDA Programming Model
CUDA Memory architecture
– Contents
– Interaction with Visual Studio
Major modifications in CUDA SDK version 4.x and performance gains with respect to 3.x and 2.x
CUDA Tool chain
– Difference between various tool chains (nvcc, gcc etc)
– Pros and cons and ease of integration with custom build systems
Installation & Compilation on Windows
Installation & Compilation on Linux
‘Hello World’ in CUDA (Hands-on)
Vector Multiplication  (Hands-on)
Questions/ Resources
Details on CUDA runtime APIs and examples/demos of the SDK, blocking vs non-blocking functions, data transfer host to device (H to D) and device to host (D to H)
CUDA driver APIs Overview
Thread creation & synchronization at the device (GPU) level, avoiding race conditions between threads
Thread-Block-Warp Algebra  & Transparent scalability
Assigning indices to threads in single/multiple block
Thread creation and synchronization between CPU-GPU
Performance metrics – speed-up (events), utilization, efficiency
Performance Estimation – Theoretical
Performance Estimation – Practical
Matrix Addition – Single Block
Matrix Addition – Multiple Blocks
Simple Matrix Multiplication – Single Block
Simple Matrix Multiplication – Multiple Blocks
Atomic Operations and their limitations
Bandwidth test
Questions/ Resources
Querying a Device for supported features
Error Handling
Memory organization in CUDA (cache, shared, global, constant, texture)
Optimization(1) – Using shared memory in Matrix Addition and Multiplication, performance improvement over the version without shared memory
Array reversal with and without shared memory.
Optimization(2) Important compiler options and pragmas
Image rotation and image convolution on GPU
Constant Memory Usage
Calling a device function from a Kernel
Optimization(3) CUDA Warps And Occupancy Considerations
Questions/ Resources
Optimization(4) Sum Reduction
Optimization(5) Removing Bank Conflicts
Optimization(6) Prefix Sum
Optimization(7) – Texture memory usage
Global Memory optimization(8) Memory Coalescing
Optimization(9) -Using occupancy calculator
Optimization(10) Pinned Memory
Zero copy Host memory
Portable pinned memory
Optimization(11) – Using streams: Overlapping GPU and CPU tasks, Overlapping Computation with Memory Copy
Optimization(11) CUDA Video Decoder API (Hardware accelerated)
Thrust Library & CUDA Data Parallel Primitives Library (CuDPP)
NVIDIA Performance Primitives (NPP) library for image/video processing
CUDA Profiler
Parallel Nsight
Sharing GPUs across multiple threads
Use all GPUs in the system concurrently from a single host thread
Unified Virtual Addressing, and Multiple kernels
GPUDirect v2.0 support for Peer-to-Peer Communication
GPU binary disassembler for Fermi architecture (cuobjdump)
Getting started with OpenCL
Converting a CUDA kernel into OpenCL
Summary and Performance Guidelines
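
The performance metrics listed in both syllabi can be illustrated with a minimal sketch. It uses the common textbook definitions of speed-up and efficiency. The function names and the timings are hypothetical, chosen only to show the arithmetic, and are not measurements from the course.

```python
def speedup(serial_time, parallel_time):
    """How many times faster the parallel version runs than the serial one."""
    return serial_time / parallel_time


def efficiency(serial_time, parallel_time, num_processors):
    """Speed-up per processor: 1.0 means perfectly linear scaling."""
    return speedup(serial_time, parallel_time) / num_processors


# Hypothetical timings: 12 s on one core, 2 s on eight cores.
print(f"Speed-up:   {speedup(12.0, 2.0):.2f}x")
print(f"Efficiency: {efficiency(12.0, 2.0, 8):.2f}")
```

With these made-up numbers the speed-up is 6x on eight processors, so each processor is doing useful work only 75% of the time. In a real CUDA or OpenCL program the two timings would come from event-based timers rather than from constants.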

The Dark Silicon

I was wondering why this F appears in the above equation. It is because of this F that the entire computing world is in trouble, or at least the jobs of software developers are in trouble. 🙂

Combing through my undergraduate books, I found a very good explanation of where the above equations come from in Section 4.10 of the book by Sedra and Smith.
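
For reference, here is a sketch of the standard textbook argument for the dynamic power of a CMOS gate, on the assumption that this is the relation in which the F appears and that F is the switching (clock) frequency. The symbols C (load capacitance) and V_DD (supply voltage) are the usual textbook ones and are assumptions here.

```latex
% Energy drawn from the supply in one switching cycle, when the load
% capacitance C is charged to V_{DD} and then discharged again:
E_{\text{cycle}} = C \, V_{DD}^{2}

% Switching F times per second gives the dynamic power:
P_{\text{dyn}} = F \, E_{\text{cycle}} = F \, C \, V_{DD}^{2}
```

Under this relation, power grows in direct proportion to F, so raising the clock frequency without lowering C or V_DD raises the heat that the chip must dissipate.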

In his keynote address at the AMD Fusion Developer Summit (AFDS), Jem Davies, ARM Fellow and VP of Technology, Media Processing Division, emphasized that we must now focus on Functionality Per Unit Cost Per Unit Energy and not Functionality Per Unit Cost Per Unit Power. Those days are gone!

What is Dark Silicon? 

Refer to the above slide. As we put more transistors on a single chip, it creates a problem. If so many transistors are powered simultaneously in such a small area, they generate a lot of heat, which raises the temperature of the chip until it can ultimately burn out and become useless. Therefore we will have to switch off approximately 75% of the transistors by 2014. This figure will grow to approximately 90% by 2020!!

These unlit transistors constitute DARK SILICON.
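
A minimal sketch of the arithmetic behind such percentages follows. It assumes, purely for illustration, that chip power grows linearly with the number of transistors switched on. The function name and the wattages are hypothetical; they were chosen to reproduce the 75% and 90% figures and are not taken from the slide.

```python
def dark_fraction(power_budget_watts, full_on_power_watts):
    """Fraction of transistors that must stay off to respect the power budget.

    Assumes power is proportional to the number of active transistors.
    """
    if full_on_power_watts <= power_budget_watts:
        return 0.0  # everything can be lit at once
    return 1.0 - power_budget_watts / full_on_power_watts


# Hypothetical chips: a 100 W thermal budget against 400 W and 1000 W
# of power if every transistor were switched on simultaneously.
print(f"{dark_fraction(100, 400):.0%} dark")
print(f"{dark_fraction(100, 1000):.0%} dark")
```

The point of the sketch is that if the thermal budget stays roughly fixed while the power needed to light every transistor keeps growing, the dark fraction rises with each generation.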