π The Feynman GPU Lectures
"What I cannot create, I do not understand." ; Richard Feynman
Welcome. We'll start from the transistor, build up to a CUDA core, wire those into a monster called a Streaming Multiprocessor, and then ride the architectural wave from Volta all the way to Blackwell. I'll give you real code ; not toy code ; in CUDA and PTX. By the end, you'll know what
tcgen05.mmadoes at the wire level.No fluff. No marketing. Just physics, logic, and math.
Table of Contents
- Hour 1 ; The Transistor to the CUDA Core
- Hour 2 ; Memory: The Real Bottleneck
- Hour 3 ; The Generational Leap: Volta β Ampere β Hopper
- Hour 4 ; Blackwell: The Fifth Generation
- Appendix A ; PTX Quick Reference
- Appendix B ; Occupancy Calculator Cheat Sheet
Hour 1 ; The Transistor to the CUDA Core
1.1 The Transistor: Nature's Switch
Everything begins with a single idea: a voltage-controlled switch.
A MOSFET (Metal-Oxide-Semiconductor Field-Effect Transistor) has three terminals:
graph LR
G["Gate (Control)"] --> T["Channel"]
S["Source"] --> T
T --> D["Drain"]
style G fill:#ff6b6b,stroke:#333,color:#fff
style S fill:#4ecdc4,stroke:#333,color:#fff
style D fill:#45b7d1,stroke:#333,color:#fff
style T fill:#f9f9f9,stroke:#333
There are two flavors:
| Type | Conducts when Gate is... | Pulls output toward... |
|---|---|---|
| NMOS | HIGH (Vdd) | Ground (0) |
| PMOS | LOW (0) | Supply (1) |
The key insight: A transistor is not a mystical device. It's a water faucet. Gate voltage = handle. Current = water. That's it.
Modern GPUs use FinFET transistors ; the "fin" is a 3D gate that wraps around the channel on three sides, giving better electrostatic control at tiny scales:
graph TD
subgraph "FinFET Process Nodes"
V["Volta (2017)
TSMC 12nm
21.1B transistors"]
T["Turing (2018)
TSMC 12nm
18.6B transistors"]
A["Ampere (2020)
TSMC 7nm
54.2B transistors"]
H["Hopper (2022)
TSMC 4N
80B transistors"]
B["Blackwell (2024)
TSMC 4NP
208B transistors"]
end
V --> T --> A --> H --> B
style V fill:#6c5ce7,stroke:#333,color:#fff
style T fill:#a29bfe,stroke:#333,color:#fff
style A fill:#00b894,stroke:#333,color:#fff
style H fill:#fdcb6e,stroke:#333,color:#000
style B fill:#e17055,stroke:#333,color:#fff
208 billion transistors on Blackwell. That's about 26 transistors for every human alive, on a chip the size of your thumbnail (well, two chips fused together).
1.2 From Transistors to Logic Gates
Every digital circuit is built from three atomic operations. Here's how transistors make them:
The NOT Gate (Inverter) ; 2 Transistors
graph TD
VDD["Vdd (+)"] --> PMOS
PMOS -->|"when Input=0, PMOS ON"| OUT["Output"]
OUT --> NMOS
NMOS -->|"when Input=1, NMOS ON"| GND["Ground"]
INPUT["Input"] -.-> PMOS
INPUT -.-> NMOS
style VDD fill:#e74c3c,stroke:#333,color:#fff
style GND fill:#2c3e50,stroke:#333,color:#fff
style OUT fill:#f1c40f,stroke:#333,color:#000
style INPUT fill:#3498db,stroke:#333,color:#fff
| Input | PMOS | NMOS | Output |
|---|---|---|---|
| 0 | ON | OFF | 1 |
| 1 | OFF | ON | 0 |
The NAND Gate ; 4 Transistors
The NAND gate is the universal gate ; you can build ANY logic function from NANDs alone.
Vdd
βββ€ββ
A βββ€Pβ β
βββ¬ββ
βββ€ββ
B βββ€Pβ βββββ Output (= NOT(A AND B))
βββ¬ββ
βββ€ββ
A βββ€Nβ β
βββ¬ββ
βββ€ββ
B βββ€Nβ β
βββ¬ββ
GND
PMOS: Parallel (either A=0 OR B=0 β output=1)
NMOS: Series (both A=1 AND B=1 β output=0)
| A | B | NAND(A,B) |
|---|---|---|
| 0 | 0 | 1 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Why NAND is universal:
NOT(A) = NAND(A,A).AND(A,B) = NOT(NAND(A,B)).OR(A,B) = NAND(NOT(A), NOT(B)). Every other gate follows. This is why chip designers think in NANDs.
1.3 From Gates to an ALU
An ALU is just a network of gates that performs arithmetic. Let's build one from the bottom up.
Half Adder ; The Atom of Arithmetic
graph LR
A["A"] --> XOR["β XOR"]
B["B"] --> XOR
A --> AND["β§ AND"]
B --> AND
XOR --> S["Sum"]
AND --> C["Carry"]
style A fill:#3498db,stroke:#333,color:#fff
style B fill:#3498db,stroke:#333,color:#fff
style XOR fill:#e74c3c,stroke:#333,color:#fff
style AND fill:#e67e22,stroke:#333,color:#fff
style S fill:#2ecc71,stroke:#333,color:#fff
style C fill:#2ecc71,stroke:#333,color:#fff
Full Adder ; Handles a Carry-In
Gate count: ~9 gates = ~36 transistors per bit.
32-bit Adder
Chain 32 full adders. But naive chaining is slow (carry ripples through all 32 bits). GPUs use Carry-Lookahead Adders (CLA):
graph TD
subgraph "Carry-Lookahead Adder - 32-bit"
G0["Generate/Propagate
Bits 0-7"] --> CLA1["CLA Block"]
G1["Generate/Propagate
Bits 8-15"] --> CLA2["CLA Block"]
G2["Generate/Propagate
Bits 16-23"] --> CLA3["CLA Block"]
G3["Generate/Propagate
Bits 24-31"] --> CLA4["CLA Block"]
CLA1 --> SUPER["Carry-Lookahead
Super Block"]
CLA2 --> SUPER
CLA3 --> SUPER
CLA4 --> SUPER
end
style SUPER fill:#e74c3c,stroke:#333,color:#fff
style CLA1 fill:#3498db,stroke:#333,color:#fff
style CLA2 fill:#3498db,stroke:#333,color:#fff
style CLA3 fill:#3498db,stroke:#333,color:#fff
style CLA4 fill:#3498db,stroke:#333,color:#fff
Key math: For each bit position i:
- Generate: $G_i = A_i \wedge B_i$ (this bit produces a carry regardless)
- Propagate: $P_i = A_i \oplus B_i$ (this bit passes a carry through)
- Carry: $C_i = G_i \lor (P_i \wedge C_{i-1})$
The lookahead trick: expand the recursion so all carries compute in $O(\log n)$ gate delays instead of $O(n)$.
Total for a 32-bit CLA: ~200 gates β ~800 transistors. Delay: ~4-5 gate stages.
1.4 The CUDA Core: A Bare-Metal FPU
Now let's build an actual floating-point unit. An FP32 number (IEEE 754):
ββββββββ¬βββββββββββ¬ββββββββββββββββββββββββ
β Sign β Exponent β Mantissa β
β 1 bitβ 8 bits β 23 bits β
ββββββββ΄βββββββββββ΄ββββββββββββββββββββββββ
Value = (-1)^sign Γ 2^(exponent-127) Γ 1.mantissa
FP32 Multiply Pipeline
To compute $A \times B$:
graph TD
A["Input A
(sign, exp, mant)"] --> S1
B["Input B
(sign, exp, mant)"] --> S1
S1["Stage 1: Sign XOR
sign_result = sign_A β sign_B
1 gate"] --> S2
S2["Stage 2: Exponent Add
exp_result = exp_A + exp_B - 127
8-bit adder + subtractor"] --> S3
S3["Stage 3: Mantissa Multiply
mant_result = 1.mant_A x 1.mant_B
24x24 multiplier array"] --> S4
S4["Stage 4: Normalize
Shift mantissa, adjust exponent
Leading-zero detector + barrel shifter"] --> S5
S5["Stage 5: Round
IEEE 754 rounding
Round-to-nearest-even logic"] --> R["Result"]
style S1 fill:#e74c3c,stroke:#333,color:#fff
style S2 fill:#e67e22,stroke:#333,color:#fff
style S3 fill:#f1c40f,stroke:#333,color:#000
style S4 fill:#2ecc71,stroke:#333,color:#fff
style S5 fill:#3498db,stroke:#333,color:#fff
style R fill:#9b59b6,stroke:#333,color:#fff
The 24Γ24 mantissa multiplier is the big one. Using a Wallace tree multiplier: ~2,000-3,000 gates.
Total transistor count for one FP32 CUDA core: approximately 8,000β20,000 transistors.
Compare to a CPU core: An Intel Golden Cove core has ~500 million transistors. A CUDA core has ~15,000. That's a factor of 30,000Γ. A CUDA core is stupidly simple by design. That's the point.
What a CUDA Core Does NOT Have
graph LR
subgraph "CPU Core"
BP["Branch Predictor
~50K transistors"]
OOO["Out-of-Order Engine
~200K transistors"]
ROB["Reorder Buffer
~100K transistors"]
L1D["32KB L1D Cache
Private"]
L1I["32KB L1I Cache
Private"]
TLB["TLB
~1000 entries"]
end
subgraph "CUDA Core"
FPU["FP32 ALU
~15K transistors"]
DONE["That is it."]
end
style BP fill:#e74c3c,stroke:#333,color:#fff
style OOO fill:#e74c3c,stroke:#333,color:#fff
style ROB fill:#e74c3c,stroke:#333,color:#fff
style FPU fill:#2ecc71,stroke:#333,color:#fff
style DONE fill:#2ecc71,stroke:#333,color:#fff
No branch predictor. No out-of-order execution. No speculation. No private cache. Just: input β ALU β output. The GPU's philosophy is: "Why predict branches when you can just run 10,000 threads and always have something useful to do?"
1.5 SIMT: How 32 Threads Breathe Together
The Warp ; The Atom of GPU Execution
A warp is 32 threads that execute the same instruction at the same time on 32 different data.
This is SIMT ; Single Instruction, Multiple Threads. It's like SIMD (SSE/AVX), but each "lane" has its own registers and can branch independently (since Volta).
graph TD
subgraph "One Warp - 32 threads"
PC["Program Counter
(shared)"] --> INST["Instruction: FADD R1, R2, R3"]
INST --> T0["Thread 0
R1_0 = R2_0 + R3_0"]
INST --> T1["Thread 1
R1_1 = R2_1 + R3_1"]
INST --> T2["Thread 2
R1_2 = R2_2 + R3_2"]
INST --> DOTS["..."]
INST --> T31["Thread 31
R1_31 = R2_31 + R3_31"]
end
style PC fill:#e74c3c,stroke:#333,color:#fff
style INST fill:#f1c40f,stroke:#333,color:#000
style T0 fill:#3498db,stroke:#333,color:#fff
style T1 fill:#3498db,stroke:#333,color:#fff
style T2 fill:#3498db,stroke:#333,color:#fff
style T31 fill:#3498db,stroke:#333,color:#fff
The hardware implementation: The warp scheduler fetches ONE instruction. The dispatch unit sends it to 32 CUDA cores simultaneously. Each core reads from its own register bank and writes to its own register bank. One instruction, 32 executions. That's a 32Γ multiplier on instruction fetch/decode efficiency compared to a scalar processor.
Thread Identity
// Every thread knows exactly where it is:
int globalId = blockIdx.x * blockDim.x + threadIdx.x;
int warpId = threadIdx.x / 32; // which warp within this block
int laneId = threadIdx.x % 32; // which lane within the warp (0-31)
1.6 The Streaming Multiprocessor (SM)
The SM is the fundamental compute unit of a GPU. Everything above (cores, warps, schedulers) lives inside one SM.
graph TD
subgraph SM["Streaming Multiprocessor"]
subgraph Q0["Quadrant 0"]
WS0["Warp Scheduler 0
+ Dispatch Unit"]
CORES0["32 x FP32 Cores"]
TC0["1 x Tensor Core"]
LS0["8 x Load/Store Units"]
SFU0["4 x SFU"]
RF0["16,384 x 32-bit Registers"]
end
subgraph Q1["Quadrant 1"]
WS1["Warp Scheduler 1"]
CORES1["32 x FP32 Cores"]
TC1["1 x Tensor Core"]
RF1["16,384 Registers"]
end
subgraph Q2["Quadrant 2"]
WS2["Warp Scheduler 2"]
CORES2["32 x FP32 Cores"]
TC2["1 x Tensor Core"]
RF2["16,384 Registers"]
end
subgraph Q3["Quadrant 3"]
WS3["Warp Scheduler 3"]
CORES3["32 x FP32 Cores"]
TC3["1 x Tensor Core"]
RF3["16,384 Registers"]
end
SMEM["Shared Memory / L1 Cache
(Configurable, up to 228 KB on Hopper)"]
ICACHE["Instruction Cache"]
CCACHE["Constant Cache"]
TEX["Texture Units"]
end
style SM fill:#1a1a2e,stroke:#e94560,color:#fff
style Q0 fill:#16213e,stroke:#0f3460,color:#fff
style Q1 fill:#16213e,stroke:#0f3460,color:#fff
style Q2 fill:#16213e,stroke:#0f3460,color:#fff
style Q3 fill:#16213e,stroke:#0f3460,color:#fff
style SMEM fill:#e94560,stroke:#333,color:#fff
style WS0 fill:#0f3460,stroke:#333,color:#fff
style CORES0 fill:#533483,stroke:#333,color:#fff
style TC0 fill:#e74c3c,stroke:#333,color:#fff
SM Specs Across Generations
| Feature | Volta (SM70) | Ampere (SM80) | Hopper (SM90) | Blackwell (SM100) |
|---|---|---|---|---|
| FP32 Cores/SM | 64 | 128 | 128 | 128 |
| Tensor Cores/SM | 8 | 4 (2x capable) | 4 (4th gen) | 4 (5th gen) |
| Register File | 256 KB | 256 KB | 256 KB | 256 KB |
| Shared Mem (max) | 96 KB | 164 KB | 228 KB | 228 KB |
| Max Threads/SM | 2048 | 2048 | 2048 | 2048 |
| Max Warps/SM | 64 | 64 | 64 | 64 |
| Warp Schedulers | 4 | 4 | 4 | 4 |
| Max Blocks/SM | 32 | 32 | 32 | 32 |
The golden ratio of GPUs: 2048 threads / 256 KB registers = 128 bytes (32 registers) per thread at 100% occupancy. Every register you add above 32 reduces your occupancy. This is the fundamental tradeoff of GPU programming.
1.7 Warp Divergence: The Cost of if
Consider this code:
__global__ void divergent_kernel(float* data, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx % 2 == 0) {
data[idx] = expensive_function_A(data[idx]); // Even threads
} else {
data[idx] = expensive_function_B(data[idx]); // Odd threads
}
}
What Happens Inside the Warp
graph TD
subgraph "Warp Execution Timeline"
T1["Clock 1-10: Execute Branch A
Active Mask: 0x55555555
Threads 0,2,4,...,30 ACTIVE
Threads 1,3,5,...,31 IDLE"]
T2["Clock 11-20: Execute Branch B
Active Mask: 0xAAAAAAAA
Threads 1,3,5,...,31 ACTIVE
Threads 0,2,4,...,30 IDLE"]
T3["Clock 21+: Reconverge
Active Mask: 0xFFFFFFFF
All 32 threads ACTIVE"]
end
T1 --> T2 --> T3
style T1 fill:#e74c3c,stroke:#333,color:#fff
style T2 fill:#e67e22,stroke:#333,color:#fff
style T3 fill:#2ecc71,stroke:#333,color:#fff
Throughput: Both branches execute sequentially. If each branch takes 10 cycles, the divergent warp takes 20 cycles instead of 10. That's 50% efficiency.
The hardware truth: Those "idle" CUDA cores aren't powered down ; they're executing the instruction but their write-back is masked. The energy is mostly wasted.
The Fix: Think in Warps, Not Threads
// GOOD: All threads in a warp take the same branch
__global__ void coalesced_kernel(float* data, int N) {
int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
int laneId = threadIdx.x % 32;
if (warpId % 2 == 0) {
// Entire warp goes here ; no divergence
data[warpId * 32 + laneId] = expensive_function_A(data[warpId * 32 + laneId]);
} else {
// Entire warp goes here ; no divergence
data[warpId * 32 + laneId] = expensive_function_B(data[warpId * 32 + laneId]);
}
}
Hour 2 ; Memory: The Real Bottleneck
"The speed of computation doesn't matter if you can't feed the beast."
2.1 The Memory Hierarchy: A Map of Latencies
This is the single most important diagram in GPU programming:
graph TD
REG["Registers
256 KB per SM
0 cycles - same clock
~20 TB/s aggregate"] --> SMEM
SMEM["Shared Memory
Up to 228 KB per SM
~20-30 cycles
~19 TB/s per SM"] --> L1
L1["L1 Cache
Combined with Shared Mem
~30-40 cycles"] --> L2
L2["L2 Cache
40-96 MB
~200-300 cycles
~6-12 TB/s"] --> HBM
HBM["HBM / GDDR
80-192 GB
~400-800 cycles
2-8 TB/s"] --> PCIE
PCIE["System Memory via PCIe
Host RAM
~10,000+ cycles
64 GB/s"]
style REG fill:#2ecc71,stroke:#333,color:#fff
style SMEM fill:#3498db,stroke:#333,color:#fff
style L1 fill:#9b59b6,stroke:#333,color:#fff
style L2 fill:#e67e22,stroke:#333,color:#fff
style HBM fill:#e74c3c,stroke:#333,color:#fff
style PCIE fill:#c0392b,stroke:#333,color:#fff
The fundamental equation: At 1.5 GHz, 400 cycles = 267 nanoseconds. A single global memory access takes as long as 400 floating-point operations. This is why GPUs need thousands of threads ; to keep the ALUs busy while other threads wait for memory.
Let me put it differently. Say each of your threads needs data from HBM. At 400 cycles per access, you need:
But we only have 64 warps per SM. So we need shared memory and caches to reduce effective latency, or we'll always be memory-bound.
2.2 Registers: Zero-Cost Storage
Registers are the fastest memory in the system. Each SM has 65,536 x 32-bit registers (256 KB).
__global__ void register_demo(float* out, float* in, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// These live in REGISTERS ; zero-cost access
float a = in[idx]; // Register R1
float b = in[idx + N]; // Register R2
float c = a * b; // Register R3 = R1 * R2
float d = c + 1.0f; // Register R4 = R3 + 1.0
out[idx] = d; // Store R4 to memory
}
Register pressure ; the critical tradeoff:
Total registers per SM: 65,536
Max threads per SM: 2,048
Registers at 100% occupancy: 65,536 / 2,048 = 32 registers per thread
If your kernel uses 64 registers/thread:
Threads per SM = 65,536 / 64 = 1,024 threads = 32 warps
Occupancy = 32/64 = 50%
If your kernel uses 128 registers/thread:
Threads per SM = 65,536 / 128 = 512 threads = 16 warps
Occupancy = 16/64 = 25%
Feynman's rule: Every register you add above 32 is stealing occupancy from you. But sometimes that's a good trade ; more registers means fewer memory accesses. Profile, don't guess.
PTX Register Usage
// PTX: Registers are explicitly declared
.reg .f32 %f<64>; // 64 FP32 registers: %f0 through %f63
.reg .f64 %fd<16>; // 16 FP64 registers
.reg .b32 %r<32>; // 32 32-bit general registers
.reg .b64 %rd<16>; // 16 64-bit registers (for addresses)
.reg .pred %p<8>; // 8 predicate registers (for branching)
// Usage:
ld.global.f32 %f0, [%rd0]; // Load from global memory to register
add.f32 %f1, %f0, %f2; // Add two registers
st.global.f32 [%rd1], %f1; // Store register to global memory
2.3 Shared Memory & Bank Conflicts
Shared memory is on-chip SRAM, accessible by all threads in a thread block (CTA). It's 20-30x faster than global memory.
Physical Structure: 32 Banks
graph TD
subgraph "Shared Memory: 32 Banks x 4 Bytes Wide"
B0["Bank 0
Addr 0, 128, 256..."]
B1["Bank 1
Addr 4, 132, 260..."]
B2["Bank 2
Addr 8, 136, 264..."]
B3["Bank 3
Addr 12, 140, 268..."]
DOTS1["..."]
B30["Bank 30
Addr 120, 248..."]
B31["Bank 31
Addr 124, 252..."]
end
subgraph "Warp Threads"
T0["Thread 0"] --> B0
T1["Thread 1"] --> B1
T2["Thread 2"] --> B2
T3["Thread 3"] --> B3
T30["Thread 30"] --> B30
T31["Thread 31"] --> B31
end
style B0 fill:#3498db,stroke:#333,color:#fff
style B1 fill:#2980b9,stroke:#333,color:#fff
style B2 fill:#2471a3,stroke:#333,color:#fff
style B3 fill:#1f618d,stroke:#333,color:#fff
style B30 fill:#1a5276,stroke:#333,color:#fff
style B31 fill:#154360,stroke:#333,color:#fff
Bank formula: bank(address) = (address / 4) % 32
The Three Cases
__shared__ float smem[1024];
// CASE 1: No conflict ; stride-1 access (perfect)
float val = smem[threadIdx.x];
// Thread 0 -> Bank 0, Thread 1 -> Bank 1, ..., Thread 31 -> Bank 31
// One cycle. 32 simultaneous reads.
// CASE 2: 2-way conflict ; stride-2 access
float val = smem[threadIdx.x * 2];
// Thread 0 -> Bank 0, Thread 16 -> Bank 0 (CONFLICT!)
// Thread 1 -> Bank 2, Thread 17 -> Bank 2 (CONFLICT!)
// Two cycles instead of one. 50% throughput.
// CASE 3: 32-way conflict ; stride-32 access (catastrophic)
float val = smem[threadIdx.x * 32];
// ALL threads -> Bank 0
// 32 cycles instead of 1. 3.125% throughput.
// CASE 4: Broadcast ; all read SAME address (free!)
float val = smem[0];
// All 32 threads read address 0 -> Bank 0 broadcasts. One cycle.
Bank Conflict Diagram
graph LR
subgraph "No Conflict - stride 1"
direction TB
TA0["T0"] -->|"addr 0"| BA0["Bank 0"]
TA1["T1"] -->|"addr 4"| BA1["Bank 1"]
TA2["T2"] -->|"addr 8"| BA2["Bank 2"]
TA3["T3"] -->|"addr 12"| BA3["Bank 3"]
end
subgraph "2-way Conflict - stride 2"
direction TB
TB0["T0"] -->|"addr 0"| BB0["Bank 0"]
TB1["T1"] -->|"addr 8"| BB2["Bank 2"]
TB16["T16"] -->|"addr 128"| BB0
TB17["T17"] -->|"addr 136"| BB2
end
style BA0 fill:#2ecc71,stroke:#333,color:#fff
style BA1 fill:#2ecc71,stroke:#333,color:#fff
style BA2 fill:#2ecc71,stroke:#333,color:#fff
style BA3 fill:#2ecc71,stroke:#333,color:#fff
style BB0 fill:#e74c3c,stroke:#333,color:#fff
style BB2 fill:#e74c3c,stroke:#333,color:#fff
Fixing Bank Conflicts: Padding
// Problem: column-major access to a 32-wide array
__shared__ float tile[32][32]; // smem[row][col]
// Accessing column: tile[threadIdx.x][col] -> stride-32 -> 32-way conflict!
// Fix: add padding
__shared__ float tile[32][32 + 1]; // 33 columns
// Now stride is 33, and 33 % 32 = 1 -> perfect stride-1 access!
// One wasted float per row, but 32x throughput improvement.
2.4 Global Memory & Coalescing
Global memory (HBM) is accessed through 128-byte cache lines. The hardware coalesces requests from all 32 threads in a warp into the minimum number of transactions.
graph TD
subgraph "Coalesced Access - 1 transaction"
W0["Warp: 32 threads read consecutive floats
Addresses: 0, 4, 8, ..., 124"]
CL0["Cache Line: 128 bytes at address 0"]
W0 -->|"1 transaction"| CL0
end
subgraph "Strided Access - 32 transactions"
W1["Warp: 32 threads read stride-32
Addresses: 0, 128, 256, ..., 3968"]
CL1["Cache Line at addr 0"]
CL2["Cache Line at addr 128"]
CL3["Cache Line at addr 256"]
CL32["...32 cache lines total"]
W1 --> CL1
W1 --> CL2
W1 --> CL3
W1 --> CL32
end
style CL0 fill:#2ecc71,stroke:#333,color:#fff
style CL1 fill:#e74c3c,stroke:#333,color:#fff
style CL2 fill:#e74c3c,stroke:#333,color:#fff
style CL3 fill:#e74c3c,stroke:#333,color:#fff
The Coalescing Hardware
Inside each SM's Load/Store Unit (LSU):
- Collect all 32 addresses from the warp
- Sort them by 128-byte aligned segment
- Issue one memory transaction per unique segment
- Route returned bytes to correct thread registers
// Perfectly coalesced ; 1 transaction per warp
__global__ void coalesced(float* data) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float val = data[idx]; // Threads 0-31 read addresses 0-124
}
// Strided ; 32 transactions per warp (32x slower)
__global__ void strided(float* data, int stride) {
int idx = threadIdx.x * stride;
float val = data[idx]; // Each thread hits a different cache line
}
// Random ; up to 32 transactions per warp
__global__ void random_access(float* data, int* indices) {
float val = data[indices[threadIdx.x]]; // Unpredictable addresses
}
Coalescing Math
For a warp accessing addresses $a_0, a_1, \ldots, a_{31}$:
i.e., the number of distinct 128-byte aligned segments touched.
Best case: 1 transaction (128 bytes), all data used β 100% efficiency
Worst case: 32 transactions (4096 bytes), only 128 bytes used β 3.125% efficiency
2.5 Occupancy: The Art of Hiding Latency
Occupancy = Active warps / Maximum warps per SM
graph TD
subgraph "Occupancy Limiters"
REG["Register Usage
65,536 regs / regs per thread x 32
= max warps from registers"]
SMEM_L["Shared Memory Usage
Max shared mem / smem per block
= max blocks then max warps"]
BLK["Block Size
Max 32 blocks/SM
Small blocks waste slots"]
end
REG --> MIN["Take MINIMUM
= Active Warps"]
SMEM_L --> MIN
BLK --> MIN
MIN --> OCC["Occupancy = Active / 64"]
style REG fill:#e74c3c,stroke:#333,color:#fff
style SMEM_L fill:#3498db,stroke:#333,color:#fff
style BLK fill:#2ecc71,stroke:#333,color:#fff
style MIN fill:#f1c40f,stroke:#333,color:#000
style OCC fill:#9b59b6,stroke:#333,color:#fff
Worked Example
Kernel: 256 threads/block, 48 regs/thread, 16 KB shared memory
GPU: SM90 (Hopper)
Register limit:
65,536 regs / (48 regs x 256 threads) = 65,536 / 12,288 = 5.33 -> 5 blocks
5 blocks x 256 threads = 1,280 threads = 40 warps
Shared memory limit:
228 KB / 16 KB = 14.25 -> 14 blocks
14 blocks x 256 threads = 3,584 -> but max is 2,048 -> 64 warps
Block count limit:
32 blocks max -> 32 x 256 = 8,192 -> capped at 2,048 -> 64 warps
Final: min(40, 64, 64) = 40 warps -> Occupancy = 40/64 = 62.5%
Feynman's warning: Don't worship occupancy. Sometimes 50% occupancy with 64 registers per thread beats 100% occupancy with 32 registers, because you avoid expensive memory spills. The occupancy calculator is a guide, not a god.
2.6 The Warp Scheduler: Juggling Latency
Each SM has 4 warp schedulers, each managing a pool of warps. Every cycle, each scheduler:
graph TD
CHECK["Check: Is any warp ready?
Not stalled on memory, barrier, or dependency"] -->|"Yes"| SELECT
CHECK -->|"No"| STALL["Scheduler stall
wasted cycle"]
SELECT["Select ready warp
Greedy-Then-Oldest policy"] --> FETCH["Fetch instruction
from warp's PC"]
FETCH --> DECODE["Decode + Dispatch
to execution unit"]
DECODE --> EXEC["Execute on 32 CUDA cores
or other unit"]
EXEC --> SCOREBOARD["Update scoreboard
Mark dest regs as pending"]
SCOREBOARD --> CHECK
style CHECK fill:#3498db,stroke:#333,color:#fff
style SELECT fill:#2ecc71,stroke:#333,color:#fff
style FETCH fill:#e67e22,stroke:#333,color:#fff
style EXEC fill:#e74c3c,stroke:#333,color:#fff
style SCOREBOARD fill:#9b59b6,stroke:#333,color:#fff
style STALL fill:#c0392b,stroke:#333,color:#fff
The Scoreboard
Each warp has a scoreboard ; a bit vector tracking which registers have pending writes:
Warp 7 Scoreboard:
R0: ready R8: ready R16: PENDING (waiting for LD)
R1: ready R9: ready R17: ready
R2: PENDING R10: ready R18: ready
...
If next instruction is: ADD R5, R2, R3
-> R2 is PENDING -> warp 7 is STALLED
-> Scheduler picks another warp
This is why GPUs need many warps: every memory access stalls a warp for hundreds of cycles. The scheduler swaps to another warp in zero cycles (zero-cost context switching, because all warp state is always resident in registers).
Latency Hiding Math
For a memory-bound kernel:
Example: Memory latency = 400 cycles, one memory op every 8 arithmetic instructions (4 cycles each = 32 cycles):
With 4 schedulers: 52 warps minimum β ~81% occupancy needed.
Hour 3 ; The Generational Leap: Volta β Ampere β Hopper
3.1 Tensor Cores: Why Multiply-Accumulate Is King
Deep learning is dominated by one operation: matrix multiply-accumulate (MMA).
Where $A$ is $M \times K$, $B$ is $K \times N$, and $C, D$ are $M \times N$.
The arithmetic intensity of matrix multiplication:
For large square matrices ($M = N = K$): intensity $\approx \frac{2K}{3b}$, which grows with matrix size. This is why GPUs love large GEMMs ; they become compute-bound, not memory-bound.
A Tensor Core is a specialized circuit that computes a small MMA (e.g., 4x4x4) in a single clock cycle, rather than requiring 128 individual FMA (fused multiply-add) operations.
graph LR
subgraph "Without Tensor Core"
direction TB
C1["CUDA Core 0: a00 x b00"]
C2["CUDA Core 1: a01 x b10"]
C3["CUDA Core 2: a02 x b20"]
C4["CUDA Core 3: a03 x b30"]
ACC["Accumulate: d00 = c00 + Sum"]
C1 --> ACC
C2 --> ACC
C3 --> ACC
C4 --> ACC
end
subgraph "With Tensor Core"
direction TB
TC["Tensor Core:
Computes ENTIRE
4x4x4 MMA
in ONE cycle
= 128 FMA ops"]
end
style TC fill:#e74c3c,stroke:#333,color:#fff
style C1 fill:#3498db,stroke:#333,color:#fff
style C2 fill:#3498db,stroke:#333,color:#fff
style C3 fill:#3498db,stroke:#333,color:#fff
style C4 fill:#3498db,stroke:#333,color:#fff
3.2 Volta (SM70): Where Tensor Cores Were Born
Volta (2017) introduced the 1st generation Tensor Core and independent thread scheduling.
Tensor Core V1: 4x4x4 FP16 to FP32
Each Tensor Core computes: $D_{4 \times 4} = A_{4 \times 4} \cdot B_{4 \times 4} + C_{4 \times 4}$
- Input: FP16 (A, B)
- Accumulate: FP32 (C, D)
- Per Tensor Core per clock: 64 FMA operations
- 8 Tensor Cores per SM: 512 FMA ops/SM/clock
CUDA: WMMA API
#include <mma.h>
using namespace nvcuda;
__global__ void volta_wmma_gemm(half* A, half* B, float* C, float* D,
int M, int N, int K) {
// Declare fragments for a 16x16x16 tile
wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
// Initialize accumulator to zero
wmma::fill_fragment(c_frag, 0.0f);
// Compute warp's tile position
int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / 32 / (N/16);
int warpN = (blockIdx.x * blockDim.x + threadIdx.x) / 32 % (N/16);
int row = warpM * 16;
int col = warpN * 16;
// Loop over K dimension in tiles of 16
for (int k = 0; k < K; k += 16) {
// Load A and B tiles from global memory
wmma::load_matrix_sync(a_frag, A + row * K + k, K);
wmma::load_matrix_sync(b_frag, B + k * N + col, N);
// Tensor Core MMA: C += A x B
wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
}
// Store result
wmma::store_matrix_sync(D + row * N + col, c_frag, N, wmma::mem_row_major);
}
PTX: wmma.mma.sync
// PTX for Volta WMMA (16x16x16, FP16 to FP32)
// Warp-synchronous ; all 32 threads participate
// Declare register fragments
.reg .f32 %acc<8>; // 8 accumulator registers per thread (16x16 / 32 threads)
.reg .b32 %a<4>; // A fragment (packed FP16)
.reg .b32 %b<4>; // B fragment (packed FP16)
// Load A fragment from shared memory
wmma.load.a.sync.aligned.m16n16k16.shared.row.f16
{%a0, %a1, %a2, %a3}, [smem_a_addr], stride_a;
// Load B fragment from shared memory
wmma.load.b.sync.aligned.m16n16k16.shared.col.f16
{%b0, %b1, %b2, %b3}, [smem_b_addr], stride_b;
// Tensor Core MMA
wmma.mma.sync.aligned.m16n16k16.row.col.f32.f16.f16.f32
{%acc0, %acc1, %acc2, %acc3, %acc4, %acc5, %acc6, %acc7},
{%a0, %a1, %a2, %a3},
{%b0, %b1, %b2, %b3},
{%acc0, %acc1, %acc2, %acc3, %acc4, %acc5, %acc6, %acc7};
Key insight: The
wmma.mma.syncis warp-synchronous ; all 32 threads must execute it together. Each thread contributes a piece of the A, B, and C matrices. The Tensor Core hardware does the reduction internally.
3.3 Ampere (SM80): Async Copy & mbarrier
Ampere (2020) brought three game-changers:
- 3rd-gen Tensor Cores (TF32, BF16, FP64, structured sparsity)
- cp.async ; hardware asynchronous copy from global to shared memory
- mbarrier ; hardware-accelerated synchronization primitive
cp.async: Bypass the Register File
Before Ampere, loading from global to shared memory required:
Global Memory -> Register -> Shared Memory (2 steps, registers busy)
Ampere's cp.async does it in one hardware step:
Global Memory -> Shared Memory (1 step, registers free!)
graph LR
subgraph "Pre-Ampere via registers"
GM1["Global Memory"] -->|"LDG to register"| REG1["Register File"]
REG1 -->|"STS to shared"| SM1["Shared Memory"]
end
subgraph "Ampere cp.async bypass registers"
GM2["Global Memory"] -->|"cp.async
hardware DMA"| SM2["Shared Memory"]
end
style GM1 fill:#e74c3c,stroke:#333,color:#fff
style REG1 fill:#f1c40f,stroke:#333,color:#000
style SM1 fill:#3498db,stroke:#333,color:#fff
style GM2 fill:#e74c3c,stroke:#333,color:#fff
style SM2 fill:#3498db,stroke:#333,color:#fff
CUDA cp.async
#include <cuda_pipeline.h>
__global__ void async_copy_kernel(float4* global_data, int N) {
__shared__ float4 smem_buffer[2][64]; // Double buffer
int tid = threadIdx.x;
int stage = 0;
// Stage 0: Issue first async copy
__pipeline_memcpy_async(&smem_buffer[0][tid],
&global_data[tid], sizeof(float4));
__pipeline_commit();
for (int i = 1; i < N; i++) {
int next_stage = stage ^ 1; // Toggle 0 <-> 1
// Issue async copy for NEXT tile
__pipeline_memcpy_async(&smem_buffer[next_stage][tid],
&global_data[i * 64 + tid], sizeof(float4));
__pipeline_commit();
// Wait for CURRENT tile to be ready
__pipeline_wait_prior(1); // Wait until only 1 group pending
__syncthreads();
// Compute on current tile
compute(smem_buffer[stage]);
stage = next_stage;
}
// Process last tile
__pipeline_wait_prior(0);
__syncthreads();
compute(smem_buffer[stage]);
}
PTX: cp.async
// Copy 16 bytes from global to shared memory, asynchronously
cp.async.ca.shared.global [smem_addr], [global_addr], 16;
// Commit this group of copies
cp.async.commit_group;
// Wait until at most N groups are still pending
cp.async.wait_group N;
// Wait for ALL pending copies
cp.async.wait_all;
Ampere's Structured Sparsity (2:4)
Ampere introduced hardware-accelerated 2:4 structured sparsity: in every group of 4 elements, at least 2 must be zero.
Dense: [0.5, 0.0, 0.3, 0.0, 0.7, 0.0, 0.1, 0.0]
Sparse: [0.5, 0.3, 0.7, 0.1] + metadata: [0,2,0,2]
The Tensor Core skips zero multiplications β 2x throughput.
3.4 Deep Dive: mbarrier ; Phase-Based Synchronization
"Synchronization is the tax you pay for parallelism. mbarrier reduces that tax."
Why Not Just __syncthreads()?
__syncthreads() is a sledgehammer: it forces ALL threads in a block to reach the same point. It knows nothing about asynchronous operations. It's block-scoped only.
mbarrier is a scalpel: it can track async operations (like cp.async), works across CTAs in a cluster, and uses phases for pipelining.
| Feature | __syncthreads() |
mbarrier |
|---|---|---|
| Scope | Block-level only | Flexible (warp, block, cross-CTA in cluster) |
| Async awareness | No | Yes ; can track async operations |
| Hardware support | Barrier instruction | Dedicated hardware unit in SM |
| Phase tracking | No | Yes ; alternating phases |
| Completion tracking | Thread arrival only | Thread arrival + async byte count |
The mbarrier State Machine
stateDiagram-v2
[*] --> Init: mbarrier.init count=N
Init --> Phase0_Waiting: Phase 0 begins
Phase0_Waiting --> Phase0_Waiting: arrive count--
Phase0_Waiting --> Phase0_Complete: count reaches 0
Phase0_Complete --> Phase1_Waiting: Phase flips to 1
Phase1_Waiting --> Phase1_Waiting: arrive count--
Phase1_Waiting --> Phase1_Complete: count reaches 0
Phase1_Complete --> Phase0_Waiting: Phase flips back to 0
Phase 0 note: Threads call
mbarrier.arrive(), async ops arrive automatically. Waiters spin ontry_wait(phase=0).Phase 1 note: Same mechanism, opposite phase. Allows pipelining stages.
The mbarrier Object in Memory
An mbarrier lives in shared memory as an 8-byte (64-bit) opaque object:
Bits 63-32: Pending async byte count (for cp.async tracking)
Bits 31-16: Arrival count remaining
Bit 0: Phase bit (alternates 0 and 1)
The hardware manages these bits atomically.
Complete PTX mbarrier Example: Multi-Stage Pipeline
// ============================================
// Multi-stage async pipeline using mbarrier
// ============================================
.shared .align 8 .b64 mbar[4]; // 4 mbarrier objects (4 stages)
.shared .align 128 .b8 smem_buf[4][TILE_SZ]; // 4 stage buffers
// --- Initialization (thread 0 only) ---
.reg .pred %is_t0;
setp.eq.u32 %is_t0, %tid, 0;
@%is_t0 mbarrier.init.shared.b64 [mbar + 0], %block_size;
@%is_t0 mbarrier.init.shared.b64 [mbar + 8], %block_size;
@%is_t0 mbarrier.init.shared.b64 [mbar + 16], %block_size;
@%is_t0 mbarrier.init.shared.b64 [mbar + 24], %block_size;
bar.sync 0; // Ensure init is visible to all threads
// --- Prologue: Fill the pipeline ---
// Stage 0: Issue async copy and arrive
cp.async.ca.shared.global [smem_buf + 0 + %tid_offset],
[global_addr_0 + %tid_offset], 16;
cp.async.commit_group;
mbarrier.arrive.expect_tx.shared.b64 %state, [mbar + 0], 16;
// Tell the barrier: "expect 16 more bytes from async ops"
// Stage 1: Issue async copy
cp.async.ca.shared.global [smem_buf + TILE_SZ + %tid_offset],
[global_addr_1 + %tid_offset], 16;
cp.async.commit_group;
mbarrier.arrive.expect_tx.shared.b64 %state, [mbar + 8], 16;
// --- Main Loop ---
.reg .u32 %stage;
mov.u32 %stage, 0;
MAIN_LOOP:
// Wait for current stage data to be ready
.reg .u64 %mbar_addr;
// compute %mbar_addr = mbar + (%stage % 4) * 8
.reg .pred %wait_done;
TRY_WAIT:
mbarrier.try_wait.parity.shared.b64 %wait_done, [%mbar_addr], %phase;
@!%wait_done bra TRY_WAIT;
// Data is ready! Proceed.
// Issue async copy for stage+2 (prefetch ahead)
// ... (similar cp.async + mbarrier.arrive.expect_tx) ...
// Compute on current stage data
ld.shared.f32 %f0, [%current_buf + %tid_offset];
// ... computation ...
st.global.f32 [%output + %tid_offset], %f_result;
// Advance
add.u32 %stage, %stage, 1;
setp.lt.u32 %p_loop, %stage, %num_tiles;
@%p_loop bra MAIN_LOOP;
mbarrier with TMA (Hopper+)
On Hopper, the TMA hardware automatically arrives at an mbarrier when its copy completes ; no thread intervention needed:
sequenceDiagram
participant Thread as Warp Thread
participant TMA as TMA Engine
participant MBAR as mbarrier
participant SMEM as Shared Memory
participant GMEM as Global Memory
Thread->>MBAR: mbarrier.arrive.expect_tx(bytes)
Thread->>TMA: cp.async.bulk.tensor [smem], [tensorMap], [mbar]
Note right of Thread: Thread continues other work!
TMA->>GMEM: Read tensor tile
GMEM-->>TMA: Data
TMA->>SMEM: Write to shared memory
TMA->>MBAR: Hardware auto-arrive (tx_count -= bytes)
Note right of MBAR: When all arrivals + TX complete: phase flips
Thread->>MBAR: mbarrier.try_wait(phase) = complete!
3.5 Hopper (SM90): TMA, Clusters & WGMMA
Hopper (2022) was a quantum leap. Three major innovations:
1. Thread Block Clusters
A cluster is a group of up to 8 CTAs (thread blocks) that are guaranteed to be co-scheduled on the same GPC.
graph TD
subgraph GPC0["GPC 0"]
subgraph Cluster["Thread Block Cluster"]
CTA0["CTA 0
SM 0
Shared Mem 0"]
CTA1["CTA 1
SM 1
Shared Mem 1"]
CTA2["CTA 2
SM 2
Shared Mem 2"]
CTA3["CTA 3
SM 3
Shared Mem 3"]
CTA0 <-->|"DSMEM"| CTA1
CTA1 <-->|"DSMEM"| CTA2
CTA2 <-->|"DSMEM"| CTA3
CTA0 <-->|"DSMEM"| CTA3
end
end
style Cluster fill:#1a1a2e,stroke:#e94560,color:#fff
style CTA0 fill:#16213e,stroke:#0f3460,color:#fff
style CTA1 fill:#16213e,stroke:#0f3460,color:#fff
style CTA2 fill:#16213e,stroke:#0f3460,color:#fff
style CTA3 fill:#16213e,stroke:#0f3460,color:#fff
Distributed Shared Memory (DSMEM): CTA 0 can directly read/write CTA 1's shared memory ; no global memory round-trip needed.
__cluster_dims__(4, 1, 1) // 4 CTAs in a cluster
__global__ void cluster_kernel() {
// Get cluster info
auto cluster = cooperative_groups::this_cluster();
extern __shared__ int smem[];
// Access another CTA's shared memory directly!
int target_cta = (cluster.block_rank() + 1) % cluster.num_blocks();
int* remote_smem = cluster.map_shared_rank(smem, target_cta);
// Direct cross-CTA shared memory access (via DSMEM)
int val = *remote_smem; // No global memory needed!
}
2. Tensor Memory Accelerator (TMA)
The TMA is a dedicated hardware engine in each SM that moves multi-dimensional tensor tiles between global memory and shared memory.
graph LR
subgraph "Without TMA - Manual Tiling"
direction TB
T0["Thread 0: load gmem 0 to reg to smem 0"]
T1["Thread 1: load gmem 1 to reg to smem 1"]
T2["Thread 2: load gmem 2 to reg to smem 2"]
DOTS["...128 threads doing address math"]
end
subgraph "With TMA - 1 thread 1 instruction"
direction TB
TMA["TMA Engine:
cp.async.bulk.tensor.2d
[smem], [tensorMap, coords], [mbar]
Handles:
- Address calculation
- Bounds checking
- Swizzling
- Format conversion
- mbarrier notification"]
end
style TMA fill:#e74c3c,stroke:#333,color:#fff
style T0 fill:#95a5a6,stroke:#333,color:#fff
style T1 fill:#95a5a6,stroke:#333,color:#fff
style T2 fill:#95a5a6,stroke:#333,color:#fff
TMA Setup (Host Side)
// Create a tensor map descriptor on the host
CUtensorMap tensorMap;
CUresult result = cuTensorMapEncodeTiled(
&tensorMap,
CU_TENSOR_MAP_DATA_TYPE_FLOAT16, // Element type
2, // Number of dimensions
globalPtr, // Base pointer in global memory
globalDim, // {rows, cols} of global tensor
globalStrides, // {stride_row_bytes, stride_col_bytes}
boxDim, // {tile_rows, tile_cols} to transfer
elementStrides, // Element strides
CU_TENSOR_MAP_INTERLEAVE_NONE, // Interleaving
CU_TENSOR_MAP_SWIZZLE_128B, // Swizzle pattern
CU_TENSOR_MAP_L2_PROMOTION_L2_128B, // L2 policy
CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE // Out-of-bounds handling
);
TMA PTX (Device Side)
// Load a 2D tile from global to shared memory
// Only ONE thread needs to execute this ; TMA does all the work
.reg .pred %is_leader;
setp.eq.u32 %is_leader, %tid, 0;
@%is_leader cp.async.bulk.tensor.2d.shared::cluster.global.tile
.mbarrier::complete_tx::bytes
[smem_addr], // Destination in shared memory
[tensorMapPtr, {%coord_x, %coord_y}], // Source: tensor map + coordinates
[mbar_addr]; // mbarrier to signal on completion
// All threads wait for TMA to complete
WAIT:
mbarrier.try_wait.parity.shared.b64 %wait_result, [mbar_addr], %phase;
@!%wait_result bra WAIT;
TMA Multicast (Load once, deliver to multiple CTAs)
// Load tile and multicast to CTAs 0, 1, 2, 3 in a cluster
@%is_leader cp.async.bulk.tensor.2d.shared::cluster.global.tile
.mbarrier::complete_tx::bytes
.multicast::cluster
[smem_addr],
[tensorMapPtr, {%coord_x, %coord_y}],
[mbar_addr],
%ctaMask; // Bitmask: 0b1111 = CTAs 0,1,2,3
Why TMA is revolutionary: Without TMA, 128 threads waste time computing addresses, doing bounds checks, and issuing individual loads. With TMA, one instruction replaces all of that. The freed-up threads do useful compute instead.
3.6 WGMMA: The Warpgroup Matrix Machine
WGMMA (Warpgroup Matrix Multiply-Accumulate) is Hopper's Tensor Core interface. It operates at the warpgroup level: 4 warps (128 threads) acting as a unit.
Why Warpgroups?
graph TD
subgraph "Volta/Ampere: Warp-level MMA"
W0["1 Warp = 32 threads
wmma.mma.sync 16x16x16
= 8,192 FLOPs"]
end
subgraph "Hopper: Warpgroup-level MMA"
WG["4 Warps = 128 threads
wgmma.mma_async 64x256x16
= 524,288 FLOPs
64x more work per instruction!"]
end
style W0 fill:#3498db,stroke:#333,color:#fff
style WG fill:#e74c3c,stroke:#333,color:#fff
WGMMA Key Innovation
Operand B can come directly from shared memory ; no need to load it into registers first:
graph LR
subgraph "Volta wmma"
SMEM1["Shared Mem"] -->|"load to reg"| REG1["Registers A"]
SMEM1 -->|"load to reg"| REG2["Registers B"]
REG1 --> TC1["Tensor Core"]
REG2 --> TC1
end
subgraph "Hopper wgmma"
SMEM2["Shared Mem B"] -->|"direct read"| TC2["Tensor Core"]
REG3["Registers A"] --> TC2
end
style TC1 fill:#3498db,stroke:#333,color:#fff
style TC2 fill:#e74c3c,stroke:#333,color:#fff
Key: B reads shared memory directly! This saves register bandwidth.
WGMMA PTX
// ==========================================
// Hopper WGMMA: 64x256x16 FP16 to FP32
// ==========================================
// Step 1: Fence ; ensure memory ordering
wgmma.fence.sync.aligned;
// Step 2: Issue the MMA
// desc_a = 64-bit descriptor for A matrix in shared memory
// desc_b = 64-bit descriptor for B matrix in shared memory
wgmma.mma_async.sync.aligned.m64n256k16.f32.f16.f16
{%acc0, %acc1, %acc2, %acc3,
%acc4, %acc5, %acc6, %acc7,
%acc8, %acc9, %acc10, %acc11,
%acc12, %acc13, %acc14, %acc15,
%acc16, %acc17, %acc18, %acc19,
%acc20, %acc21, %acc22, %acc23,
%acc24, %acc25, %acc26, %acc27,
%acc28, %acc29, %acc30, %acc31,
%acc32, %acc33, %acc34, %acc35,
%acc36, %acc37, %acc38, %acc39,
%acc40, %acc41, %acc42, %acc43,
%acc44, %acc45, %acc46, %acc47,
%acc48, %acc49, %acc50, %acc51,
%acc52, %acc53, %acc54, %acc55,
%acc56, %acc57, %acc58, %acc59,
%acc60, %acc61, %acc62, %acc63,
%acc64, %acc65, %acc66, %acc67,
%acc68, %acc69, %acc70, %acc71,
%acc72, %acc73, %acc74, %acc75,
%acc76, %acc77, %acc78, %acc79,
%acc80, %acc81, %acc82, %acc83,
%acc84, %acc85, %acc86, %acc87,
%acc88, %acc89, %acc90, %acc91,
%acc92, %acc93, %acc94, %acc95,
%acc96, %acc97, %acc98, %acc99,
%acc100,%acc101,%acc102,%acc103,
%acc104,%acc105,%acc106,%acc107,
%acc108,%acc109,%acc110,%acc111,
%acc112,%acc113,%acc114,%acc115,
%acc116,%acc117,%acc118,%acc119,
%acc120,%acc121,%acc122,%acc123,
%acc124,%acc125,%acc126,%acc127},
%desc_a, // 64-bit shared memory descriptor for A
%desc_b, // 64-bit shared memory descriptor for B
1, // Scale D (multiply accumulator by 1)
1, 1, // Scale A, Scale B (negate flags)
0, 0; // Transpose A, Transpose B
// Step 3: Commit ; mark this MMA group for tracking
wgmma.commit_group.sync.aligned;
// Step 4: Wait ; block until MMA completes
wgmma.wait_group.sync.aligned 0; // 0 = wait for all groups
// Now %acc0..%acc127 contain the 64x256 result matrix
WGMMA Descriptor Layout
The 64-bit descriptor encodes the shared memory tile location:
Bits 63-49: Reserved
Bits 48-46: Leading dimension mode
Bits 45-32: Base address offset (in shared memory, 16B aligned)
Bits 31-16: Leading dimension byte offset
Bits 15-4: Stride dimension byte offset
Bits 3-0: Swizzle mode
Complete Hopper GEMM Pattern
sequenceDiagram
participant TMA as TMA Engine
participant SMEM as Shared Memory
participant WG as Warpgroup of 4 warps
participant TC as Tensor Cores
Note over TMA,TC: === Stage k (double-buffered) ===
WG->>TMA: Issue TMA load for tile k+1
TMA->>SMEM: Async bulk copy (tile k+1)
WG->>WG: wgmma.fence
WG->>TC: wgmma.mma_async (on tile k data)
WG->>WG: wgmma.commit_group
TMA-->>SMEM: TMA arrives at mbarrier
WG->>WG: Wait on mbarrier (tile k+1 ready)
WG->>WG: wgmma.wait_group (tile k MMA done)
Note over TMA,TC: === Stage k+1 ===
WG->>TMA: Issue TMA load for tile k+2
WG->>TC: wgmma.mma_async (on tile k+1 data)
Hour 4 ; Blackwell: The Fifth Generation
4.1 Blackwell Architecture Overview
Blackwell (2024) ; compute capability sm_100 (data center) and sm_120 (desktop).
graph TD
subgraph GB100["GB100 Full Die"]
direction TB
SMs["192 SMs
24,576 CUDA Cores
768 Tensor Cores - 5th gen"]
L2["96 MB L2 Cache"]
HBM["192 GB HBM3e
8 TB/s bandwidth"]
NVLINK["18x NVLink 5
1.8 TB/s bidirectional"]
subgraph PerSM["Per SM"]
CUDA_B["128 FP32 CUDA Cores"]
TC_B["4x 5th Gen Tensor Cores"]
TMEM_B["TENSOR MEMORY - TMEM
NEW! Dedicated TC storage"]
TMA_B["TMA Enhanced"]
RF_B["256 KB Register File"]
SMEM_B["228 KB Shared Mem / L1"]
MBAR_B["16 mbarrier slots"]
end
end
style SMs fill:#e74c3c,stroke:#333,color:#fff
style L2 fill:#e67e22,stroke:#333,color:#fff
style HBM fill:#f1c40f,stroke:#333,color:#000
style NVLINK fill:#2ecc71,stroke:#333,color:#fff
style TMEM_B fill:#e74c3c,stroke:#fff,color:#fff
style TC_B fill:#c0392b,stroke:#333,color:#fff
What's New in Blackwell
| Feature | Hopper | Blackwell | Improvement |
|---|---|---|---|
| Transistors | 80B | 208B | 2.6x |
| SMs | 132 | 192 | 1.45x |
| Tensor Core gen | 4th | 5th | New ISA |
| Peak FP4 (sparse) | N/A | ~9 PFLOPS | New! |
| Peak FP8 | 990 TFLOPS | ~1800 TFLOPS | ~1.8x |
| HBM Bandwidth | 3.35 TB/s | 8 TB/s | 2.4x |
| L2 Cache | 60 MB | 96 MB | 1.6x |
| NVLink BW | 900 GB/s | 1.8 TB/s | 2x |
| Key TC instruction | wgmma | tcgen05 | New paradigm |
| New memory space | ; | TMEM | New! |
| New data types | FP8 | FP4, MX formats | New! |
4.2 Tensor Memory (TMEM): A New Address Space
This is the single biggest architectural change in Blackwell. TMEM is a new, dedicated memory space that lives inside the SM, separate from both registers and shared memory.
graph TD
subgraph "Hopper Memory Spaces"
REG_H["Register File
256 KB
Accumulator lives HERE"]
SMEM_H["Shared Memory
228 KB
Operands A, B"]
GMEM_H["Global Memory
HBM3"]
end
subgraph "Blackwell Memory Spaces"
REG_BW["Register File
256 KB
General computation"]
TMEM_BW["TMEM - NEW
Accumulator lives HERE
Dedicated TC bandwidth"]
SMEM_BW["Shared Memory
228 KB
Operands A, B"]
GMEM_BW["Global Memory
HBM3e"]
end
style REG_H fill:#e74c3c,stroke:#333,color:#fff
style SMEM_H fill:#3498db,stroke:#333,color:#fff
style TMEM_BW fill:#e74c3c,stroke:#fff,color:#fff
style REG_BW fill:#f1c40f,stroke:#333,color:#000
style SMEM_BW fill:#3498db,stroke:#333,color:#fff
Why TMEM Exists
On Hopper, WGMMA results accumulate in registers. But the register file has limited bandwidth ; reading/writing 128 accumulator registers per warpgroup per MMA creates a bottleneck. The register file is also used for general computation.
TMEM solves this by giving the Tensor Cores their own private, high-bandwidth storage:
- Accumulators live in TMEM, not registers
- The Tensor Core reads/writes TMEM directly with dedicated pathways
- The register file is freed for other work (address computation, control flow)
- TMEM has higher bandwidth to the Tensor Cores than the register file
TMEM Access Patterns
// TMEM is accessed ONLY via tcgen05 instructions:
// Load from shared memory INTO TMEM
tcgen05.ld.16x256b [tmem_addr], [smem_addr];
// Store from TMEM to shared memory
tcgen05.st.16x256b [smem_addr], [tmem_addr];
// MMA writes results directly to TMEM
tcgen05.mma ... [tmem_addr], ...; // Accumulator in TMEM
// You CANNOT:
// - Load TMEM with regular LD instructions
// - Access TMEM from CUDA core ALUs directly
// - Use TMEM for general-purpose storage
4.3 tcgen05: The Fifth-Gen Tensor Core ISA
tcgen05 = Tensor Core Generation 05 ; the instruction set for Blackwell's 5th-gen Tensor Cores.
Instruction Overview
graph TD
subgraph "tcgen05 Instruction Family"
MMA["tcgen05.mma
Matrix Multiply-Accumulate
The main compute instruction"]
LD["tcgen05.ld
Load shared mem to TMEM"]
ST["tcgen05.st
Store TMEM to shared mem"]
FENCE["tcgen05.fence
Memory ordering"]
COMMIT["tcgen05.commit
Signal completion via mbarrier"]
end
LD -->|"Load operands"| MMA
MMA -->|"Results in TMEM"| ST
MMA -.->|"Order"| FENCE
MMA -.->|"Track completion"| COMMIT
style MMA fill:#e74c3c,stroke:#333,color:#fff
style LD fill:#3498db,stroke:#333,color:#fff
style ST fill:#2ecc71,stroke:#333,color:#fff
style FENCE fill:#e67e22,stroke:#333,color:#fff
style COMMIT fill:#9b59b6,stroke:#333,color:#fff
tcgen05.mma ; The Core Instruction
tcgen05.mma.cta_group::{1,2}.kind [d_tmem], a_desc, b_desc, idesc, enable;
Let's break down every operand:
| Operand | Description |
|---|---|
cta_group::{1,2} |
How many CTAs collaborate. 1 = single CTA. 2 = two CTAs in a cluster work together on one MMA. |
kind |
Data type combination: .f16, .tf32, .f8f6f4, .mxf8f6f4, etc. |
[d_tmem] |
Destination address in TMEM ; where the accumulator result is stored |
a_desc |
64-bit descriptor for matrix A (shared memory location, layout, size) |
b_desc |
64-bit descriptor for matrix B (shared memory location, layout, size) |
idesc |
Immediate descriptor ; encodes the MMA shape (M, N, K dimensions) |
enable |
Predicate ; allows conditional execution |
PTX Example: tcgen05.mma
// =============================================
// Blackwell tcgen05.mma: FP16 Matrix Multiply
// =============================================
// Prerequisites: A and B tiles in shared memory, TMEM allocated
// Step 1: Fence ; ensure previous TC ops are ordered
tcgen05.fence::before_thread_sync;
// Step 2: Issue the MMA
// Compute D[TMEM] = A[smem] x B[smem] + D[TMEM]
tcgen05.mma.cta_group::1.kind::f16
[%tmem_d], // Accumulator destination in TMEM
%desc_a, // 64-bit descriptor for A tile (in shared memory)
%desc_b, // 64-bit descriptor for B tile (in shared memory)
%idesc, // Immediate descriptor (shape info)
%enable; // Predicate enable
// Step 3: Commit ; signal completion via mbarrier
tcgen05.commit.cta_group::1.mbarrier::arrive [%mbar_addr];
// Step 4: Wait ; block until MMA completes
// (Check the mbarrier that tcgen05.commit arrived at)
TRY_WAIT_TC:
mbarrier.try_wait.parity.shared.b64 %done, [%mbar_addr], %phase;
@!%done bra TRY_WAIT_TC;
tcgen05 Load/Store ; Moving Data To/From TMEM
// Load: Shared Memory to TMEM
// Transfers a 16x256-bit tile (16 rows, each 256 bits = 32 bytes)
tcgen05.ld.16x256b [%tmem_addr], [%smem_addr];
// Store: TMEM to Shared Memory
tcgen05.st.16x256b [%smem_addr], [%tmem_addr];
tcgen05 vs WGMMA Comparison
graph LR
subgraph "Hopper WGMMA Flow"
direction TB
H1["1. Load A from smem to registers"]
H2["2. wgmma.mma_async: A reg, B smem to acc reg"]
H3["3. wgmma.wait: acc ready in registers"]
H4["4. Store registers to smem/gmem"]
H1 --> H2 --> H3 --> H4
end
subgraph "Blackwell tcgen05 Flow"
direction TB
B1["1. TMA loads A,B to smem"]
B2["2. tcgen05.mma: A smem, B smem to acc TMEM"]
B3["3. tcgen05.commit + mbar wait"]
B4["4. tcgen05.st: TMEM to smem"]
B1 --> B2 --> B3 --> B4
end
style H1 fill:#3498db,stroke:#333,color:#fff
style H2 fill:#3498db,stroke:#333,color:#fff
style H3 fill:#3498db,stroke:#333,color:#fff
style H4 fill:#3498db,stroke:#333,color:#fff
style B1 fill:#e74c3c,stroke:#333,color:#fff
style B2 fill:#e74c3c,stroke:#333,color:#fff
style B3 fill:#e74c3c,stroke:#333,color:#fff
style B4 fill:#e74c3c,stroke:#333,color:#fff
Key difference: In Blackwell, the accumulator never touches the register file during the MMA. It lives entirely in TMEM. This frees registers for address computation, control flow, and other work. The register file is no longer the bottleneck for Tensor Core throughput.
CTA-Group MMA: Two CTAs, One MMA
// Two CTAs in a cluster collaborate on a single, larger MMA
tcgen05.mma.cta_group::2.kind::f16
[%tmem_d], %desc_a, %desc_b, %idesc, %enable;
// Both CTAs contribute portions of A/B and receive portions of D
// The hardware coordinates the data sharing via DSMEM
// Result: larger effective tile size = better utilization
4.4 FP4 and Microscaling: Extreme Precision Engineering
FP4: E2M1 Format
+------+----------+----------+
| Sign | Exponent | Mantissa |
| 1 bit| 2 bits | 1 bit |
+------+----------+----------+
Representable values:
0, +/-0.5, +/-1.0, +/-1.5, +/-2.0, +/-3.0, +/-4.0, +/-6.0
Exponent bias = 1
Value = (-1)^s x 2^(e-1) x (1 + m/2)
= (-1)^s x 2^(e-1) x {1.0, 1.5}
Only 16 representable values! How can this possibly work?
Microscaling (MX) Formats
The trick: block-level scaling factors. Instead of each element having its own exponent, a block of elements shares one scale:
graph TD
subgraph "Standard FP16"
E0["elem 0: full 16-bit float"]
E1["elem 1: full 16-bit float"]
E2["elem 2: full 16-bit float"]
E3["elem 3: full 16-bit float"]
end
subgraph "MXFP4 Microscaling"
SCALE["Shared Scale Factor
8-bit E8M0 exponent
Per block of 32 elements"]
M0["elem 0: 4-bit"]
M1["elem 1: 4-bit"]
M2["elem 2: 4-bit"]
M31["elem 31: 4-bit"]
SCALE --> M0
SCALE --> M1
SCALE --> M2
SCALE --> M31
end
style SCALE fill:#e74c3c,stroke:#333,color:#fff
style M0 fill:#3498db,stroke:#333,color:#fff
style M1 fill:#3498db,stroke:#333,color:#fff
style M2 fill:#3498db,stroke:#333,color:#fff
style M31 fill:#3498db,stroke:#333,color:#fff
Effective value: $v_i = \text{scale} \times \text{fp4}_i = 2^{\text{shared\\_exp}} \times (-1)^{s_i} \times 2^{(e_i - 1)} \times (1 + m_i/2)$
This gives FP4 the dynamic range of ~FP16 with the storage cost of 4 bits + amortized scale overhead.
Why This Matters for Transformers
In transformer inference:
- Attention scores and FFN weights have relatively uniform magnitudes within blocks
- A shared scale factor captures the block-level magnitude
- Individual FP4 values capture the relative differences
- The 2nd-gen Transformer Engine automatically manages these scales
Result: 2x the throughput of FP8, with minimal accuracy loss.
MX Format Variants on Blackwell
| Format | Element bits | Scale | Elements/block | Use case |
|---|---|---|---|---|
| MXFP8 | 8 | E8M0 | 32 | High accuracy |
| MXFP6 | 6 | E8M0 | 32 | Balanced |
| MXFP4 | 4 | E8M0 | 32 | Max throughput |
4.5 Putting It All Together: A Blackwell GEMM Kernel
Here's the complete data flow for a high-performance GEMM on Blackwell:
graph TD
subgraph "Global Memory HBM3e"
GA["Matrix A
M x K"]
GB["Matrix B
K x N"]
end
subgraph "TMA Engine"
TMA_A["TMA Load A tile"]
TMA_B["TMA Load B tile"]
end
subgraph "Shared Memory"
SA["A tile buffer
double-buffered"]
SB["B tile buffer
double-buffered"]
end
subgraph "5th Gen Tensor Core"
MMA["tcgen05.mma"]
end
subgraph "TMEM"
ACC["Accumulator D
Lives in TMEM,
never touches registers!"]
end
subgraph "Output"
SC["Result tile in smem"]
GD["Matrix D: M x N
Global Memory"]
end
GA -->|"TMA descriptor"| TMA_A
GB -->|"TMA descriptor"| TMA_B
TMA_A -->|"async bulk copy"| SA
TMA_B -->|"async bulk copy"| SB
SA -->|"a_desc"| MMA
SB -->|"b_desc"| MMA
MMA -->|"accumulate"| ACC
ACC -->|"tcgen05.st"| SC
SC -->|"TMA store"| GD
style MMA fill:#e74c3c,stroke:#333,color:#fff
style ACC fill:#c0392b,stroke:#333,color:#fff
style TMA_A fill:#2ecc71,stroke:#333,color:#fff
style TMA_B fill:#2ecc71,stroke:#333,color:#fff
CUDA/PTX Pseudocode: Blackwell GEMM
// =============================================
// Blackwell GEMM Kernel (Conceptual Structure)
// Uses: TMA + tcgen05.mma + TMEM + mbarrier
// =============================================
#define TILE_M 128
#define TILE_N 256
#define TILE_K 64
#define STAGES 4
__cluster_dims__(2, 1, 1) // 2 CTAs per cluster
__global__ void blackwell_gemm(
const __grid_constant__ CUtensorMap tensorMapA,
const __grid_constant__ CUtensorMap tensorMapB,
half* D, int M, int N, int K)
{
extern __shared__ char smem[];
// Partition shared memory: double-buffered A & B tiles + mbarriers
constexpr int TILE_A_BYTES = TILE_M * TILE_K * sizeof(half);
constexpr int TILE_B_BYTES = TILE_K * TILE_N * sizeof(half);
half* smem_A[2] = {
(half*)(smem),
(half*)(smem + TILE_A_BYTES)
};
half* smem_B[2] = {
(half*)(smem + 2 * TILE_A_BYTES),
(half*)(smem + 2 * TILE_A_BYTES + TILE_B_BYTES)
};
// mbarrier array for pipeline stages
__shared__ __align__(8) uint64_t mbar_load[STAGES];
__shared__ __align__(8) uint64_t mbar_compute;
// =============================================
// Initialize mbarriers
// =============================================
if (threadIdx.x == 0) {
for (int i = 0; i < STAGES; i++)
asm volatile("mbarrier.init.shared.b64 [%0], %1;"
:: "l"(&mbar_load[i]), "r"(1));
asm volatile("mbarrier.init.shared.b64 [%0], %1;"
:: "l"(&mbar_compute), "r"(blockDim.x));
}
__syncthreads();
// Compute block tile coordinates
int block_m = blockIdx.x * TILE_M;
int block_n = blockIdx.y * TILE_N;
// =============================================
// PROLOGUE: Fill pipeline with first tiles
// =============================================
if (threadIdx.x == 0) {
// TMA load first A and B tiles
asm volatile(
"cp.async.bulk.tensor.2d.shared::cluster.global.tile"
".mbarrier::complete_tx::bytes"
" [%0], [%1, {%2, %3}], [%4];"
:: "l"(smem_A[0]),
"l"(&tensorMapA), "r"(0), "r"(block_m),
"l"(&mbar_load[0])
);
asm volatile(
"cp.async.bulk.tensor.2d.shared::cluster.global.tile"
".mbarrier::complete_tx::bytes"
" [%0], [%1, {%2, %3}], [%4];"
:: "l"(smem_B[0]),
"l"(&tensorMapB), "r"(block_n), "r"(0),
"l"(&mbar_load[0])
);
}
// =============================================
// MAIN LOOP: Iterate over K dimension
// =============================================
int num_k_tiles = K / TILE_K;
for (int k_tile = 0; k_tile < num_k_tiles; k_tile++) {
int buf = k_tile % 2;
// Wait for current tile's TMA load to complete
asm volatile(
"{\n\t"
".reg .pred p;\n\t"
"WAIT_LOAD_%=:\n\t"
"mbarrier.try_wait.parity.shared.b64 p, [%0], %1;\n\t"
"@!p bra WAIT_LOAD_%=;\n\t"
"}\n\t"
:: "l"(&mbar_load[k_tile % STAGES]),
"r"(k_tile / STAGES % 2)
);
// Issue TMA load for NEXT tile (prefetch)
if (threadIdx.x == 0 && k_tile + 1 < num_k_tiles) {
int next_k = (k_tile + 1) * TILE_K;
int next_buf = (k_tile + 1) % 2;
// TMA load next A and B tiles...
// (similar to prologue, targeting smem_A/B[next_buf])
}
// ----- tcgen05 MMA -----
asm volatile(
"tcgen05.fence::before_thread_sync;\n\t"
"tcgen05.mma.cta_group::1.kind::f16"
" [%0], %1, %2, %3, 1;\n\t"
"tcgen05.commit.cta_group::1"
".mbarrier::arrive [%4];\n\t"
:: "l"(/* tmem_addr */),
"l"(/* desc_a */),
"l"(/* desc_b */),
"r"(/* idesc */),
"l"(&mbar_compute)
);
// Wait for MMA to complete
asm volatile(
"{\n\t"
".reg .pred p;\n\t"
"WAIT_MMA_%=:\n\t"
"mbarrier.try_wait.parity.shared.b64 p, [%0], %1;\n\t"
"@!p bra WAIT_MMA_%=;\n\t"
"}\n\t"
:: "l"(&mbar_compute), "r"(k_tile % 2)
);
}
// =============================================
// EPILOGUE: Store results from TMEM
// =============================================
// Move accumulator from TMEM to shared memory
asm volatile(
"tcgen05.st.16x256b [%0], [%1];\n\t"
:: "l"(/* smem_result */), "l"(/* tmem_addr */)
);
__syncthreads();
// Store from shared memory to global memory
// (each thread stores its portion)
int tid = threadIdx.x;
int elems_per_thread = (TILE_M * TILE_N) / blockDim.x;
half* smem_out = (half*)(smem); // reuse shared memory
for (int i = 0; i < elems_per_thread; i++) {
int idx = tid * elems_per_thread + i;
int row = block_m + idx / TILE_N;
int col = block_n + idx % TILE_N;
if (row < M && col < N) {
D[row * N + col] = smem_out[idx];
}
}
}
4.6 The Full Picture: Architecture Comparison
Evolution of GPU Tensor Core Programming
graph TD
V["Volta 2017
wmma.mma.sync
Warp-level: 32 threads
16x16x16 tiles
FP16 only
Data: Reg to TC to Reg"]
A["Ampere 2020
mma.sync + cp.async
Warp-level: 32 threads
Async global to shared copy
mbarrier sync
TF32, BF16, FP64, Sparsity
Data: Reg to TC to Reg"]
H["Hopper 2022
wgmma.mma_async
Warpgroup-level: 128 threads
TMA hardware engine
Thread Block Clusters
DSMEM, FP8
Data: Smem/Reg to TC to Reg"]
B["Blackwell 2024
tcgen05.mma
CTA-group level: 1-2 CTAs
TMEM new address space
FP4, MX formats
Enhanced TMA
CTA-group collaboration
Data: Smem to TC to TMEM"]
V --> A --> H --> B
style V fill:#6c5ce7,stroke:#333,color:#fff
style A fill:#00b894,stroke:#333,color:#fff
style H fill:#fdcb6e,stroke:#333,color:#000
style B fill:#e17055,stroke:#333,color:#fff
Peak Performance Across Generations
| Generation | FP16 Tensor (TFLOPS) | FP8 Tensor (TFLOPS) | FP4 Tensor (TFLOPS) | HBM BW (TB/s) |
|---|---|---|---|---|
| V100 (Volta) | 125 | ; | ; | 0.9 |
| A100 (Ampere) | 312 | ; | ; | 2.0 |
| H100 (Hopper) | 990 | 1,979 | ; | 3.35 |
| B200 (Blackwell) | ~1,800 | ~3,600 | ~9,000 (sparse) | 8.0 |
The Memory Wall: Compute Grows Faster Than Memory
graph TD
COMPUTE["Compute Growth:
~4x per generation
V100 to B200: ~72x"] --> GAP
MEMORY["Memory BW Growth:
~2x per generation
V100 to B200: ~9x"] --> GAP
GAP["The Gap:
Compute outpaces memory ~8:1
We must reduce data movement!"]
GAP --> SOL1["1. Lower precision
FP16 to FP8 to FP4 = less data"]
GAP --> SOL2["2. Larger on-chip memory
96 to 228 KB smem"]
GAP --> SOL3["3. TMA
Eliminates wasted thread work"]
GAP --> SOL4["4. TMEM
Eliminates register file bottleneck"]
GAP --> SOL5["5. Async everything
Overlap compute and memory"]
style GAP fill:#e74c3c,stroke:#333,color:#fff
style SOL1 fill:#2ecc71,stroke:#333,color:#fff
style SOL2 fill:#2ecc71,stroke:#333,color:#fff
style SOL3 fill:#2ecc71,stroke:#333,color:#fff
style SOL4 fill:#2ecc71,stroke:#333,color:#fff
style SOL5 fill:#2ecc71,stroke:#333,color:#fff
Appendix A ; PTX Quick Reference
Memory Spaces
// PTX memory space qualifiers:
.global // GPU HBM / GDDR (slowest, largest)
.shared // Per-SM shared memory
.local // Per-thread local memory (actually in global, cached)
.const // Constant memory (cached, broadcast)
.param // Kernel parameters
.reg // Registers (fastest)
// Blackwell only:
// TMEM: accessed via tcgen05 instructions (not a PTX address space qualifier)
Essential Instructions
// ===== Arithmetic =====
add.f32 %f0, %f1, %f2; // f0 = f1 + f2
mul.f32 %f0, %f1, %f2; // f0 = f1 * f2
fma.rn.f32 %f0, %f1, %f2, %f3; // f0 = f1 * f2 + f3 (fused, round-nearest)
mad.lo.u32 %r0, %r1, %r2, %r3; // r0 = r1 * r2 + r3 (integer)
// ===== Memory =====
ld.global.f32 %f0, [%rd0]; // Load from global
st.global.f32 [%rd0], %f0; // Store to global
ld.shared.f32 %f0, [smem_addr]; // Load from shared
st.shared.f32 [smem_addr], %f0; // Store to shared
// ===== Async Copy (Ampere+) =====
cp.async.ca.shared.global [dst], [src], 16; // 16 bytes, global to shared
cp.async.commit_group; // Commit group
cp.async.wait_group N; // Wait for Nth group
// ===== TMA (Hopper+) =====
cp.async.bulk.tensor.2d.shared::cluster.global.tile
.mbarrier::complete_tx::bytes
[smem], [tensorMap, {x, y}], [mbar]; // TMA 2D tile load
// ===== mbarrier =====
mbarrier.init.shared.b64 [addr], count; // Initialize
mbarrier.arrive.shared.b64 state, [addr]; // Arrive
mbarrier.arrive.expect_tx.shared.b64 state, [addr], tx_count; // Arrive + expect async bytes
mbarrier.try_wait.parity.shared.b64 pred, [addr], phase; // Non-blocking wait
// ===== Tensor Core (Volta/Ampere) =====
wmma.load.a.sync.aligned.m16n16k16.shared.row.f16
{regs}, [addr], stride;
wmma.mma.sync.aligned.m16n16k16.row.col.f32.f16.f16.f32
{d_regs}, {a_regs}, {b_regs}, {c_regs};
// ===== WGMMA (Hopper) =====
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n256k16.f32.f16.f16
{acc_regs}, desc_a, desc_b, scale_d, scale_a, scale_b, trans_a, trans_b;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned N;
// ===== tcgen05 (Blackwell) =====
tcgen05.fence::before_thread_sync;
tcgen05.mma.cta_group::1.kind::f16
[tmem_d], desc_a, desc_b, idesc, enable;
tcgen05.commit.cta_group::1.mbarrier::arrive [mbar];
tcgen05.ld.16x256b [tmem], [smem];
tcgen05.st.16x256b [smem], [tmem];
// ===== Warp Shuffle (all generations) =====
shfl.sync.bfly.b32 %r0, %r1, lane_mask, 0x1f; // Butterfly shuffle
shfl.sync.down.b32 %r0, %r1, delta, 0x1f; // Shift down
shfl.sync.up.b32 %r0, %r1, delta, 0x0; // Shift up
shfl.sync.idx.b32 %r0, %r1, src_lane, 0x1f; // Indexed shuffle
// ===== Control Flow =====
setp.lt.f32 %p0, %f0, %f1; // p0 = (f0 < f1)
@%p0 bra TARGET; // Conditional branch
bar.sync 0; // __syncthreads()
Appendix B ; Occupancy Calculator Cheat Sheet
Quick Reference Table (SM90 ; Hopper)
| Regs/Thread | Max Threads/SM | Max Warps | Occupancy | Notes |
|---|---|---|---|---|
| 16 | 2048 | 64 | 100% | Very few regs ; likely register spills |
| 32 | 2048 | 64 | 100% | Sweet spot for simple kernels |
| 48 | 1536 | 48 | 75% | |
| 64 | 1024 | 32 | 50% | Common for complex kernels |
| 80 | 768 | 24 | 37.5% | |
| 96 | 640 | 20 | 31.25% | |
| 128 | 512 | 16 | 25% | Register-heavy kernel |
| 255 | 256 | 8 | 12.5% | Maximum registers allowed |
Formulas
"The thing that makes physics beautiful is that it's simple. The same is true of GPUs ; once you understand that everything is about hiding latency and maximizing throughput, the whole architecture falls into place like a jigsaw puzzle. Every generation, NVIDIA finds a new piece of latency to hide. Registers were the bottleneck for Tensor Cores? Invent TMEM. Threads wasted on address math? Invent TMA. Synchronization too coarse? Invent mbarrier. The physics hasn't changed ; electrons through transistors. But the engineering is magnificent."
; Your friend, Dick Feynman (if he wrote GPU code)
End of the Four-Hour Lecture. Go build something.