18 Dec 2025 5 min read

Gentle Intro to CUDA

Terminology

Block: A (software-level) grouping of threads. Threads in the same block can corporate, share data using shared memory (which is local to that block), and synchronize with __syncthreads()
Warp: A (hardware-level) group of (32) GPU threads created by the hardware from our blocks. All threads in a warp can only execute the same instructions (SIMT)
Grid: A collection of blocks. Blocks don’t cooperate directly - they can’t share memory or synchronize with each other. Each block runs the same kernel code independently
GPUs have a maximum threads-per-block limit (limited by hardware). Grid size (no of blocks we can fit in a grid) is limited too, but extremely large

Key Ideas

Typical CUDA program

CPU allocates memory on the GPU via cudaMalloc()
CPU copies data to GPU via functions like cudaMemcpy()
CPU launches kernel on GPU via <<grid, block>> syntax
CPU copies result from GPU back to CPU via cudaMemcpy()

When we launch a CUDA kernel, we create a grid. A grid is a collection of blocks. Each block contains many threads and runs on one SM. Multiple blocks may run on the same SM concurrently.

Each CUDA block is broken into warps of 32 threads. All 32 threads in a warp share the same program counter, and they execute the same instruction at the same time. However, they each operate on different registers and memory.

SM doesn't run threads one-by-one - it schedules execution at the warp level. Each SM maintains many resident warps (commonly 32-64 or more), and its warp scheduler selects the warps to be executed by the SM in the current cycle.

A warp scheduler doesn’t run threads. It issues the warp’s next instructions to the SM’s execution units. If an SM has 4 warp schedulers, it can issue up to 4 instructions per cycle (one per scheduler)

NVIDIA GPUs hide latency by keeping many warps ready to run. If one warp stalls, the SM instantly switches to a different ready warp with zero overhead.

In a kernel, we use thread indices to divide the work.

Memory Hierarchy

CUDA provides several types of memory, each with different speeds & uses

Global Memory: The main memory of the GPU, visible to any kernel (i.e. accessible by all threads across all blocks across all kernels), as well as the CPU
- Slowest but largest in size. Used for data shared across all threads (e.g. Arrays, large datasets)
Shared Memory: Memory shared by all threads in the same block.
- Very fast but small in size. Used for data that threads in a block need to share frequently (e.g. temporary variables)
Registers: Fastest memory, private to each thread
- Used for local variables. Limited in numbers - use wisely. (e.g. Loop counters, intermediate calculations)
Constant Memory: Read-only memory visible to any kernel, as well as the CPU
- Cached for fast access. Use for data that doesn’t change during kernel execution (e..g constants, config params)
Local Memory: Device global memory (GPU’s global DRAM), used when registers are not enough
- Slow; avoid if possible (e.g. large arrays/variables that don’t fit in registers)

TLDR

Per-thread registers: Lowest possible latency
Per-thread local memory: Private storage, slowest access
Per-block shared memory: Visible by all threads in a block. Can be used to exchange data between threads in a thread block. Very fast access
Global memory: Visible by all threads in a grid. Slowest access

Thread Hierarchy

A CUDA kernel is launched on a grid of thread blocks

Thread blocks are a group of threads. Each block is scheduled & executed on an SM blocks are scheduled on any available SM, in any order. Multiple thread blocks can run on the same SM
Individual threads execute on scalar CUDA cores

At runtime, a thread block is divided into warps for SIMT execution

The way a block is partitioned into warps is always the same. Each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0

The total number of warps in a block is

CUDA Optimization

There are many ways to optimize CUDA kernels, including optimizing memory access & cache line usage, choosing good grid & block sizes, tiling with shared memory, minimizing thread divergence, etc..

We’ll focus on one of them - choosing good grid & block sizes

Thread blocks are divided into warps of 32 threads (warp size) each
In scenario #1, each thread block has 4 warps. But since we set block size to 100 (which isn’t a multiple of warp size), each block’s last warp will only use 4 threads
In scenario #2, each thread block has 4 warps. As we set the block size to 128 (which is a multiple of 32, the warp size), only the last block’s warp will not be fully used

We can maximize performance by saturating compute units & memory bandwidth

Little’s Law: Concurrency (In-flight work) = Bandwidth (Throughput) x Latency

E.g. We have 8 FP32 units with a latency of 24 cycles (i.e. each FP32 op takes 24 cycles from issue to completion), and we can issue 8 ops per cycle

To saturate all 8 units (to keep all 8 units fully busy every cycle), we need 8x24=192 operations in-flight. That means we need 192 instructions executing in our pipeline (across all 8 units) at the same time, to keep them fully busy
Even though each FP32 op takes 24 cycles to complete, the pipeline can accept a new op every cycle per FP32 unit

We can increase in-flight instructions (instructions currently being processed, but not yet completed) via 2 ways

Increase instruction-level parallelism: Have more independent instructions (don’t depend on each other) per thread

While instruction 1 is halfway through its 24-cycle FP pipeline, instruction 2 & 3 can be issued…

Increase Thread-level Parallelism: TLP = how many warps are resident (loaded & ready to run on an SM) on an SM at the same time. Increase no. of warps available for the SM to schedule by increasing block size (more threads -> warps per block), or increase no. of blocks in the grid so every SM always has blocks available to schedule

An SM has 3 separate limits: Max threads per SM, max warps per SM, max blocks per SM. SM occupancy is determined by whichever limit is hit first

Case A, we have a total of 4000 warps. Case B, we have a total of 128 warps
A: If an SM can hold 64 warps simultaneously, each SM can hold 16 blocks (4 warps/block)
B: The same SM can hold only 2 blocks (32 warps/block)
Large blocks reduce TLP, because fewer blocks = fewer scheduling opportunities. If a block stalls on memory, shared memory barrier, etc… the SM has no other blocks to switch to.
Case A is better...