# Darshan Baslani - Full Blog Text This file aggregates the readable source for every blog post on the site. # Hello, Layout! ; Visualizing Memory in CuTe - URL: https://www.dcbaslani.xyz/blog/01_hello_layout/ - Markdown: https://www.dcbaslani.xyz/blog/01_hello_layout/index.md - Text: https://www.dcbaslani.xyz/blog/01_hello_layout/index.txt - Published: 2025-06-15 - Topic: CUDA - Description: Understanding CuTe Layouts: how shape and stride turn flat memory into multidimensional grids. # Hello, Layout! ; Visualizing Memory in CuTe **Difficulty:** Beginner **Prerequisites:** Basic CUDA (kernel launch, `__global__`), C/C++ pointers ## 1. The Problem (The "Why") You have a flat `float*` pointer ; 1D memory. But your algorithm thinks in rows and columns: a matrix, a tile, a 2D block of pixels. Right now you're hand-writing index math like `ptr[row * WIDTH + col]` and praying you got the constant right. Every time the layout changes (transpose? padding? different tile size?), you rewrite the formula and re-introduce bugs. CuTe's `Layout` solves this by making the coordinate-to-address mapping a **first-class object**. You declare the shape (how many rows, how many columns) and the stride (the distance between elements), and the Layout *computes* the flat index for you. Change from row-major to column-major? Swap the strides. Reshape from 2D to 3D? Add a mode. The algorithm code never changes. > **B200 Note:** On Hopper and Blackwell GPUs, the TMA (Tensor Memory Accelerator) hardware needs a *descriptor* that encodes shape and stride information. Understanding Layouts is the **only** way to program TMA correctly. ## 2. The Mental Model (The Visual) A `Layout` is a **lens** that lets you view flat, 1D memory as if it were a multidimensional grid. Take `Shape<2, 4>` ; that's 2 rows and 4 columns, so 8 elements total. The `Stride` decides *how* those coordinates map to physical addresses. ### Column-Major (The Default): `(2,4):(1,2)` Stride `(1, 2)` means: move 1 row down → address +1. Move 1 column right → address +2. ```text Physical Memory (1D): ┌───┬───┬───┬───┬───┬───┬───┬───┐ │ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ └───┴───┴───┴───┴───┴───┴───┴───┘ Layout maps coordinates → addresses: col 0 col 1 col 2 col 3 ┌───────┬───────┬───────┬───────┐ row 0 │ 0 │ 2 │ 4 │ 6 │ ├───────┼───────┼───────┼───────┤ row 1 │ 1 │ 3 │ 5 │ 7 │ └───────┴───────┴───────┴───────┘ address = row * 1 + col * 2 ─────── ─────── stride[0] stride[1] ``` Notice the columns are stored contiguously in memory: addresses `0,1` are column 0, addresses `2,3` are column 1, etc. That's **column-major** ; just like Fortran and MATLAB. ### Row-Major: `(2,4):(4,1)` Stride `(4, 1)` means: move 1 row down → address +4. Move 1 column right → address +1. ```text col 0 col 1 col 2 col 3 ┌───────┬───────┬───────┬───────┐ row 0 │ 0 │ 1 │ 2 │ 3 │ ├───────┼───────┼───────┼───────┤ row 1 │ 4 │ 5 │ 6 │ 7 │ └───────┴───────┴───────┴───────┘ address = row * 4 + col * 1 ─────── ─────── stride[0] stride[1] ``` Now rows are contiguous: addresses `0,1,2,3` are row 0. That's **row-major** ; the C/C++ convention. > **Key Insight:** Same shape, same data, same 8 elements. The *only* difference is the stride. That's the entire power of a Layout: **the shape says "what," the stride says "where."** ## 3. The Solution (The Code) A complete CUDA kernel that creates Layouts and prints the coordinate-to-address mapping. ```cpp #include #include using namespace cute; // Print a rank-2 layout as a grid of (row, col) -> address template __device__ void print_layout_2d(Layout const& layout) { // Only let thread 0 print if (threadIdx.x != 0 || blockIdx.x != 0) return; printf("Layout: "); print(layout); printf("\n"); for (int row = 0; row < size<0>(layout); ++row) { for (int col = 0; col < size<1>(layout); ++col) { printf(" (%d,%d)->%d", row, col, int(layout(row, col))); } printf("\n"); } printf("\n"); } __global__ void hello_layout_kernel() { // 1. Column-major (default when stride is omitted) auto col_major = make_layout(make_shape(Int<2>{}, Int<4>{})); printf("=== Column-Major ===\n"); print_layout_2d(col_major); // 2. Row-major (use LayoutRight tag) auto row_major = make_layout(make_shape(Int<2>{}, Int<4>{}), LayoutRight{}); printf("=== Row-Major ===\n"); print_layout_2d(row_major); // 3. Custom stride auto custom = make_layout(make_shape(Int<2>{}, Int<4>{}), make_stride(Int<8>{}, Int<1>{})); printf("=== Custom Stride (2,4):(8,1) ===\n"); print_layout_2d(custom); } int main() { hello_layout_kernel<<<1, 1>>>(); cudaDeviceSynchronize(); return 0; } ``` **Expected Output:** ```text === Column-Major === Layout: (_2,_4):(_1,_2) (0,0)->0 (0,1)->2 (0,2)->4 (0,3)->6 (1,0)->1 (1,1)->3 (1,2)->5 (1,3)->7 === Row-Major === Layout: (_2,_4):(_4,_1) (0,0)->0 (0,1)->1 (0,2)->2 (0,3)->3 (1,0)->4 (1,1)->5 (1,2)->6 (1,3)->7 === Custom Stride (2,4):(8,1) === Layout: (_2,_4):(_8,_1) (0,0)->0 (0,1)->1 (0,2)->2 (0,3)->3 (1,0)->8 (1,1)->9 (1,2)->10 (1,3)->11 ``` ## 4. Step-by-Step Explanation **Line: `auto col_major = make_layout(make_shape(Int<2>{}, Int<4>{}));`** - `make_shape(Int<2>{}, Int<4>{})` creates a compile-time shape of 2 rows by 4 columns. `Int{}` is CuTe's static integer ; the compiler knows the value, so the stride math becomes zero-cost. - `make_layout(shape)` with no stride argument defaults to `LayoutLeft` (column-major). CuTe computes the stride as an exclusive prefix product of the shape from left to right: stride₀ = 1, stride₁ = 1 × 2 = 2. Result: `(2,4):(1,2)`. **Line: `auto row_major = make_layout(make_shape(Int<2>{}, Int<4>{}), LayoutRight{});`** - The `LayoutRight{}` tag generates strides as an exclusive prefix product from *right to left*: stride₁ = 1, stride₀ = 1 × 4 = 4. Result: `(2,4):(4,1)`. **Line: `auto custom = make_layout(make_shape(...), make_stride(Int<8>{}, Int<1>{}));`** - Here we provide an explicit stride. Row stride of 8 means there's a gap between rows ; this is how you express a matrix that lives inside a larger allocation (e.g., a 2×4 submatrix of an 8-column matrix). **Line: `layout(row, col)`** - This is the layout's core operation: it takes a 2D coordinate and returns the flat 1D index. The formula is `row × stride[0] + col × stride[1]`. For column-major: `1×row + 2×col`. For row-major: `4×row + 1×col`. **Line: `size<0>(layout)` / `size<1>(layout)`** - `size<0>` returns the extent of mode 0 (rows = 2). `size<1>` returns mode 1 (columns = 4). This drives the loop bounds. Because the shape is static, the compiler can fully unroll these loops. ## 5. Engineer's Notebook (Latent Space Notes) **Analogy:** A Layout is a *spreadsheet formula* for memory. The Shape is how many rows and columns the spreadsheet has. The Stride is the formula in each cell that says "to find my data, take my row number × *this* + my column number × *that*." Change the formula, change the order data is fetched from RAM ; but the spreadsheet looks identical to the user. **Hardware Note:** On B200/Hopper GPUs, when you set up a TMA descriptor with `make_tma_copy`, you must provide a Layout that describes the tensor's shape and strides in global memory. If you get the strides wrong, the TMA engine will fetch garbage. The concepts from this tutorial ; `make_shape`, `make_stride`, column-major vs. row-major ; are *exactly* what the TMA descriptor encodes. **CuTe Notation Cheat-Sheet:** | Notation | Meaning | | :--- | :--- | | `(2,4):(1,2)` | Shape `(2,4)`, Stride `(1,2)` ; column-major | | `(2,4):(4,1)` | Shape `(2,4)`, Stride `(4,1)` ; row-major | | `_N` in output | Static (compile-time) integer | | `N` in output | Dynamic (run-time) integer | | `Int{}` | C++ way to write a static integer | | `LayoutLeft` | Column-major stride generation (default) | | `LayoutRight` | Row-major stride generation | --- # The Art of Slicing ; Partitioning Data Across Blocks and Threads - URL: https://www.dcbaslani.xyz/blog/02_the_art_of_slicing/ - Markdown: https://www.dcbaslani.xyz/blog/02_the_art_of_slicing/index.md - Text: https://www.dcbaslani.xyz/blog/02_the_art_of_slicing/index.txt - Published: 2025-06-20 - Topic: CUDA - Description: How CuTe's local_tile and local_partition replace manual index math to slice matrices across CTAs and threads. # The Art of Slicing ; Partitioning Data Across Blocks and Threads **Difficulty:** Beginner **Prerequisites:** [Tutorial 01: Hello, Layout!](https://www.dcbaslani.xyz/blog/01_hello_layout/), CUDA thread blocks (`blockIdx`, `threadIdx`) ## 1. The Problem (The "Why") You have a big matrix ; say 512×512 ; sitting in global memory. Your GPU launches a grid of thread blocks, each of which should work on a small tile (e.g., 128×128). Inside each block, 256 threads need to split *that* tile further so every thread handles its own little piece. Without CuTe, you write something like: ```cpp int row_start = blockIdx.x * TILE_M; int col_start = blockIdx.y * TILE_N; int thr_row = threadIdx.x % THR_M; int thr_col = threadIdx.x / THR_M; // ... and then more math for every matrix, every stage, every layout change ``` This is fragile. Change the tile size? Rewrite the math. Transpose a matrix? Rewrite it again. Add a K-dimension for GEMM? Rewrite it *again*. CuTe replaces all of this with two composable operations: | Operation | Level | What it does | | :--- | :--- | :--- | | `local_tile` | Block (CTA) | "Give this block its tile from the big matrix." | | `local_partition` | Thread | "Give this thread its elements from the tile." | Both work on any shape, any stride, any number of dimensions. > **B200 Note:** When you program the TMA engine on Hopper/Blackwell, you still need `local_tile` to select *which* tile the TMA should fetch. The TMA descriptor handles the copy, but you tell it *where* to point using the exact same tiling logic described here. ## 2. The Mental Model (The Visual) ### Step 1: The Big Picture ; Tiling for CTAs Imagine an 8×8 matrix divided into 4×4 tiles. Four thread blocks share the work: ```text Global Matrix (8×8) col 0 col 1 col 2 col 3 col 4 col 5 col 6 col 7 ┌───────────────────────────┬───────────────────────────┐ row 0 │ │ │ row 1 │ Block (0,0) │ Block (0,1) │ row 2 │ rows 0-3, cols 0-3 │ rows 0-3, cols 4-7 │ row 3 │ │ │ ├───────────────────────────┼───────────────────────────┤ row 4 │ │ │ row 5 │ Block (1,0) │ Block (1,1) │ row 6 │ rows 4-7, cols 0-3 │ rows 4-7, cols 4-7 │ row 7 │ │ │ └───────────────────────────┴───────────────────────────┘ local_tile(matrix, Shape<4,4>{}, make_coord(blockIdx.x, blockIdx.y)) → extracts one 4×4 tile for each block ``` `local_tile` applies a **tiler** (here `Shape<4,4>`) to produce tiles, then uses the block coordinate to pick the right one. Under the hood it calls `zipped_divide`, which reshapes the matrix into `((Tile), (Rest))` ; tile elements in the first mode, tile indices in the second ; then slices the rest-mode with your coordinate. ### Step 2: Inside a Tile ; Partitioning for Threads Now zoom into Block (0,0)'s 4×4 tile. We have 4 threads arranged as a 2×2 layout. Each thread gets a 2×2 chunk: ```text Block (0,0)'s Tile (4×4) col 0 col 1 col 2 col 3 ┌─────────────────┬─────────────────┐ row 0 │ T0 T0 │ T1 T1 │ row 1 │ T0 T0 │ T1 T1 │ ├─────────────────┼─────────────────┤ row 2 │ T2 T2 │ T3 T3 │ row 3 │ T2 T2 │ T3 T3 │ └─────────────────┴─────────────────┘ Thread layout: (2,2):(1,2) → 4 threads in column-major Tile shape: (4,4) Each thread: (4/2, 4/2) = (2,2) elements local_partition(tile, Layout>{}, threadIdx.x) → Thread 0 gets the top-left 2×2 subtensor ``` `local_partition` takes a **thread layout** ; a Layout that maps coordinates to thread indices ; and uses `zipped_divide` to split the tile into thread-sized pieces, then slices into the tile-mode with the thread's *coordinate* (derived from its flat index via the thread layout). > **Key Insight:** `local_tile` slices the **rest** (picks which tile). `local_partition` slices the **tile** (picks which thread's elements). They are two sides of the same `zipped_divide` coin. ## 3. The Solution (The Code) A CUDA kernel that tiles a matrix across blocks, partitions each tile across threads, and prints what each agent owns. ```cpp #include using namespace cute; template __global__ void partition_kernel(Shape_MN shape_mn, CtaTiler cta_tiler, ThreadLayout thr_layout) { // 1. Build the full matrix as a counting tensor (value at (i,j) = flat index) // In real code this would be make_tensor(make_gmem_ptr(ptr), shape, stride) auto matrix = make_counting_tensor(make_layout(shape_mn)); // (M,N) // 2. CTA-level: each block picks its tile auto cta_coord = make_coord(blockIdx.x, blockIdx.y); auto cta_tile = local_tile(matrix, cta_tiler, cta_coord); // (BLK_M,BLK_N) // 3. Thread-level: each thread picks its elements from the tile auto thr_tile = local_partition(cta_tile, thr_layout, threadIdx.x); // (THR_M,THR_N) // 4. Print from Block (0,0), Thread 0 only if (blockIdx.x == 0 && blockIdx.y == 0 && threadIdx.x == 0) { printf("=== Full matrix layout ===\n"); print(matrix.layout()); printf("\n\n"); printf("=== Block (0,0) tile ===\n"); print(cta_tile.layout()); printf("\n"); for (int m = 0; m < size<0>(cta_tile); ++m) { for (int n = 0; n < size<1>(cta_tile); ++n) { printf("%3d ", int(cta_tile(m, n))); } printf("\n"); } printf("\n=== Block (0,0), Thread 0 partition ===\n"); print(thr_tile.layout()); printf("\n"); printf("Thread 0 owns elements: "); for (int i = 0; i < size(thr_tile); ++i) { printf("%d ", int(thr_tile(i))); } printf("\n"); } } int main() { // 8×8 matrix, tiled into 4×4 blocks, 4 threads per block (2×2) auto shape_mn = make_shape(Int<8>{}, Int<8>{}); auto cta_tiler = make_shape(Int<4>{}, Int<4>{}); auto thr_layout = make_layout(make_shape(Int<2>{}, Int<2>{})); // (m,n)->thr_idx dim3 grid(size(ceil_div(get<0>(shape_mn), get<0>(cta_tiler))), size(ceil_div(get<1>(shape_mn), get<1>(cta_tiler)))); // 2×2 blocks dim3 block(size(thr_layout)); // 4 threads partition_kernel<<>>(shape_mn, cta_tiler, thr_layout); cudaDeviceSynchronize(); return 0; } ``` **Expected Output:** ```text === Full matrix layout === (_8,_8):(_1,_8) === Block (0,0) tile === (_4,_4):(_1,_8) 0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27 === Block (0,0), Thread 0 partition === (_2,_2):(_1,_8) Thread 0 owns elements: 0 1 8 9 ``` Block (0,0) gets the top-left 4×4 tile (addresses 0–3 down column 0, 8–11 down column 1, etc. ; column-major). Thread 0 gets the top-left 2×2 corner of that tile. ## 4. Step-by-Step Explanation **Line: `auto matrix = make_counting_tensor(make_layout(shape_mn));`** - `make_counting_tensor` creates a "virtual" tensor where the value at each coordinate equals its flat index. No actual memory ; it's a trick for visualizing the Layout's mapping. In a real kernel, replace this with `make_tensor(make_gmem_ptr(ptr), shape, stride)`. **Line: `auto cta_tile = local_tile(matrix, cta_tiler, cta_coord);`** - `local_tile` is CuTe's **inner partition**. Under the hood: 1. `zipped_divide(matrix, cta_tiler)` reshapes the `(8,8)` matrix into `((4,4),(2,2))` ; a 4×4 tile mode and a 2×2 "rest" mode (2 tiles per row, 2 tiles per column). 2. The `cta_coord` slices the rest-mode: `(blockIdx.x, blockIdx.y)` picks one tile. - Block (0,0) gets the tile at rest-coordinate (0,0) ; the top-left 4×4 region. **Line: `auto thr_tile = local_partition(cta_tile, thr_layout, threadIdx.x);`** - `local_partition` is CuTe's **outer partition**. Under the hood: 1. It takes the *shape* of `thr_layout` ; here `(2,2)` ; and uses it as a tiler on the `(4,4)` tile, producing `((2,2),(2,2))` ; 2×2 elements per thread, tiled over 2×2 threads. 2. It converts `threadIdx.x` into a coordinate using `thr_layout`. Thread 0 → coord `(0,0)`, Thread 1 → `(1,0)`, Thread 2 → `(0,1)`, Thread 3 → `(1,1)`. 3. That coordinate slices the tile-mode, so each thread gets the matching 2×2 block of elements. **Line: `ceil_div(get<0>(shape_mn), get<0>(cta_tiler))`** - `ceil_div(8, 4) = 2`. This gives us 2 blocks in the M-dimension (and 2 in N), for a total of 4 blocks. **`cosize` (used in real kernels for shared memory):** - `cosize(layout)` returns the size of the layout's codomain ; the number of unique addresses needed. Use it to allocate shared memory: `__shared__ float smem[cosize_v];`. ## 5. Engineer's Notebook (Latent Space Notes) **Analogy:** Think of partitioning as **dealing cards at a poker table**: 1. **`local_tile`** = The casino floor manager assigns a *table* (tile) to each *dealer* (CTA). "Table 0, you get the top-left quarter of the deck." 2. **`local_partition`** = The dealer at each table divides the cards among the *players* (threads). "Player 0, you get these 4 cards." 3. **`zipped_divide`** = The reorganization step that makes dealing possible. It splits the deck into `((hand), (tables))` so you can ask for any specific table's hand in one shot. The key invariant: **every element is owned by exactly one block and exactly one thread within that block.** No overlap, no gaps (assuming tile sizes evenly divide the matrix ; see the [predication tutorial](https://www.dcbaslani.xyz/blog/0y_predication/) for handling remainders). **Hardware Note:** On Hopper/B200, the tile coordinate you compute with `local_tile` is the *same* coordinate you pass to `cute::copy` with a TMA atom. The TMA engine uses the descriptor's shape/stride (from Tutorial 01) and this coordinate to DMA the right tile into shared memory ; zero address math in registers. **The `local_tile` / `local_partition` Cheat-Sheet:** | Function | Level | Slices into... | You provide... | You get... | | :--- | :--- | :--- | :--- | :--- | | `local_tile` | CTA | Rest (which tile) | Tile shape + block coord | The tile data | | `local_partition` | Thread | Tile (which thread) | Thread layout + thread idx | That thread's elements | **Under the hood ; both use `zipped_divide`:** ```text zipped_divide(data, tiler) → ((Tile), (Rest)) │ │ local_partition local_tile slices here slices here ``` --- # The Naive Copy ; Scalar vs. Vectorized Memory Movement - URL: https://www.dcbaslani.xyz/blog/03_the_naive_copy/ - Markdown: https://www.dcbaslani.xyz/blog/03_the_naive_copy/index.md - Text: https://www.dcbaslani.xyz/blog/03_the_naive_copy/index.txt - Published: 2026-02-22 - Topic: CUDA - Description: Why scalar copies leave 75% of memory bandwidth on the table, and how CuTe's auto-vectorization fixes it. # The Naive Copy ; Scalar vs. Vectorized Memory Movement **Difficulty:** Beginner **Prerequisites:** [Tutorial 01: Hello, Layout!](https://www.dcbaslani.xyz/blog/01_hello_layout/), [Tutorial 02: The Art of Slicing](https://www.dcbaslani.xyz/blog/02_the_art_of_slicing/), basic CUDA memory model ## 1. The Problem (The "Why") You've partitioned your matrix beautifully. Every thread knows which elements it owns. Now it needs to actually *move* data ; from global memory to shared memory, or from shared memory to registers. The simplest approach: a `for` loop that copies one `float` at a time. ```cpp // The "obvious" way ; one element per iteration for (int i = 0; i < N; ++i) { dst[i] = src[i]; } ``` This works. It's also leaving **75% of your memory bandwidth on the table.** Why? Because the GPU's memory bus is 128 bits wide. A single `float` is 32 bits. Every load instruction fetches 128 bits from DRAM anyway ; but if you only asked for 32 bits, the other 96 bits are thrown away. You're paying for a 4-lane highway but driving in one lane. The fix: **vectorized loads**. Instead of `LDG.32` (load 32 bits), the compiler emits `LDG.128` (load 128 bits = 4 floats at once). Same number of memory transactions, 4× the useful data per transaction. CuTe handles this automatically ; *if* your data meets two conditions. This tutorial shows you what those conditions are and how to make sure your copies always hit the fast path. > **B200 Note:** On Hopper/Blackwell, vectorized copies are the minimum bar. The real prize is `cp.async` and TMA (Tutorials 04 and 08), which bypass registers entirely. But even those advanced paths require the same alignment and contiguity fundamentals covered here. ## 2. The Mental Model (The Visual) ### Scalar Copy: One Element at a Time ```text Global Memory (128-bit bus): ┌─────────────────────────────────────┐ │ float₀ float₁ float₂ float₃ │ ← 128 bits fetched from DRAM └─────────────────────────────────────┘ ↓ LDG.32 grabs only float₀ (32 bits) The other 96 bits? Wasted. Thread issues 4 separate loads: LDG.32 → float₀ LDG.32 → float₁ LDG.32 → float₂ LDG.32 → float₃ = 4 instructions, 4 transactions ``` ### Vectorized Copy: Four Elements at Once ```text Global Memory (128-bit bus): ┌─────────────────────────────────────┐ │ float₀ float₁ float₂ float₃ │ ← 128 bits fetched from DRAM └─────────────────────────────────────┘ ↓ LDG.128 grabs ALL FOUR floats in one shot Thread issues 1 load: LDG.128 → float₀, float₁, float₂, float₃ = 1 instruction, 1 transaction ``` ### The Two Requirements for Vectorization ```text Requirement 1: CONTIGUITY (stride-1) ✅ Contiguous: [ f₀ | f₁ | f₂ | f₃ ] stride = 1 +1 +1 +1 ❌ Strided: [ f₀ | | f₁ | ] stride = 2 +2 +2 Can't bundle into one load Requirement 2: ALIGNMENT (address divisible by vector width) ✅ Aligned: addr 0x0000 → [ f₀ f₁ f₂ f₃ ] 0 % 16 == 0 ❌ Misaligned: addr 0x0004 → [ f₁ f₂ f₃ f₄ ] 4 % 16 ≠ 0 Can't do 128-bit load from a 32-bit boundary ``` > **The Dolly Rule:** Think of vectorization like moving boxes on a loading dock. `UniversalCopy` hand-carries one box at a time ; always works, always slow. `AutoVectorizingCopy` is the smart worker who checks: "Are these boxes side-by-side (contiguous) and does the stack start at a dolly-sized slot (aligned)?" If yes, load them all onto the dolly in one trip. If not, fall back to one-at-a-time. ## 3. The Solution (The Code) A CUDA kernel that copies a tile from global to shared memory using CuTe's `copy()`, then measures effective bandwidth for both scalar and vectorized paths. ```cpp #include #include #include #include #include using namespace cute; // ─── Kernel: copy a tile from global → shared memory ─── template __global__ void copy_kernel(float const* __restrict__ g_ptr, GmemLayout gmem_layout, SmemLayout smem_layout) { extern __shared__ float smem[]; // Wrap raw pointers in CuTe Tensors auto g_tensor = make_tensor(make_gmem_ptr(g_ptr), gmem_layout); // (M, N) auto s_tensor = make_tensor(make_smem_ptr(smem), smem_layout); // (M, N) // Each thread gets its partition of the tile auto thr_g = local_partition(g_tensor, Layout>{}, threadIdx.x); auto thr_s = local_partition(s_tensor, Layout>{}, threadIdx.x); // ── The actual copy ── copy(CopyOp{}, thr_g, thr_s); } // ─── Benchmark harness ─── template float benchmark_copy(const char* label, int M, int N, int iters) { size_t bytes = M * N * sizeof(float); float *d_src; cudaMalloc(&d_src, bytes); cudaMemset(d_src, 1, bytes); // fill with something // Layouts: column-major for both gmem and smem auto gmem_layout = make_layout(make_shape(M, N), LayoutLeft{}); auto smem_layout = make_layout(make_shape(M, N), LayoutLeft{}); int smem_bytes = M * N * sizeof(float); dim3 block(256); dim3 grid(1); // Warmup copy_kernel<<>>(d_src, gmem_layout, smem_layout); cudaDeviceSynchronize(); // Timed loop cudaEvent_t start, stop; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start); for (int i = 0; i < iters; ++i) { copy_kernel<<>>(d_src, gmem_layout, smem_layout); } cudaEventRecord(stop); cudaEventSynchronize(stop); float ms = 0; cudaEventElapsedTime(&ms, start, stop); float gb_per_sec = (float(bytes) * iters / 1e9) / (ms / 1e3); printf("%-40s %8.2f GB/s (%6.3f ms for %d iters)\n", label, gb_per_sec, ms, iters); cudaEventDestroy(start); cudaEventDestroy(stop); cudaFree(d_src); return gb_per_sec; } int main() { int M = 128, N = 256, iters = 1000; printf("Copying %dx%d tile (%zu bytes) from Global → Shared Memory\n\n", M, N, M*N*sizeof(float)); // 1. Scalar: UniversalCopy copies one float (32 bits) per call benchmark_copy>( "UniversalCopy (scalar, 32-bit)", M, N, iters); // 2. Vectorized: AutoVectorizingCopy bundles up to 128 bits benchmark_copy>( "AutoVectorizingCopy (vector, 128-bit)", M, N, iters); // 3. For reference: explicit 64-bit vectorization benchmark_copy>( "AutoVectorizingCopy (vector, 64-bit)", M, N, iters); return 0; } ``` **Expected Output (approximate ; numbers vary by GPU):** ```text Copying 128x256 tile (131072 bytes) from Global → Shared Memory UniversalCopy (scalar, 32-bit) 85.20 GB/s ( 1.539 ms for 1000 iters) AutoVectorizingCopy (vector, 128-bit) 310.45 GB/s ( 0.422 ms for 1000 iters) AutoVectorizingCopy (vector, 64-bit) 205.30 GB/s ( 0.638 ms for 1000 iters) ``` The vectorized 128-bit copy is roughly **3–4× faster** than scalar. The 64-bit variant lands in between. ## 4. Step-by-Step Explanation **Line: `auto g_tensor = make_tensor(make_gmem_ptr(g_ptr), gmem_layout);`** - `make_gmem_ptr(g_ptr)` wraps the raw pointer and tags it as global memory. CuTe uses this tag to select the right load instruction (`LDG` vs. `LDS` vs. register move). Without it, CuTe can't distinguish memory spaces. **Line: `auto thr_g = local_partition(g_tensor, Layout>{}, threadIdx.x);`** - The thread layout `(8, 32)` means 8 rows × 32 columns = 256 threads. This is the `thr_layout` from Tutorial 02. Each thread gets a `(M/8, N/32)` = `(16, 8)` subtensor ; 128 floats to copy. **Line: `copy(CopyOp{}, thr_g, thr_s);`** - This is the core call. CuTe dispatches based on `CopyOp`: - **`UniversalCopy`**: Copies one `float` at a time. The inner loop is `dst(i) = src(i)` ; a simple scalar assignment. The compiler emits `LDG.32` + `STS.32` per element. - **`AutoVectorizingCopyWithAssumedAlignment<128>`**: CuTe inspects the source and destination layouts at compile time and asks three questions: ```text 1. max_common_vector(src, dst): How many elements are contiguous (stride-1) in BOTH tensors? For column-major (16,8):(1,16) → 16 elements. 2. max_alignment(src, dst): What's the natural alignment? GCD of shape and stride components. 3. gcd(contiguous_bits, alignment_bits, MaxVecBits): The actual vector width to use. min(16 * 32, alignment, 128) = 128 bits. ``` When the result is 128 bits and both tensors qualify, CuTe calls `recast(tensor)` to reinterpret 4 floats as a single 128-bit integer, then copies those. The compiler emits `LDG.128` + `STS.128` ; one instruction moves 4 floats. **What happens when vectorization fails:** If you use a strided layout (e.g., stride-2 between elements), `max_common_vector` returns 1. CuTe falls back to element-by-element copy ; same speed as `UniversalCopy`. Your code doesn't crash, it just silently takes the slow path. This is why understanding contiguity matters. **Line: `copy(src, dst)` (no CopyOp argument)** - When you call `copy()` *without* specifying a copy operation, CuTe picks one for you: - Static layout? → `AutoVectorizingCopyWithAssumedAlignment<128>` (assumes 128-bit aligned pointers) - Dynamic layout? → `AutoVectorizingCopyWithAssumedAlignment<8>` (conservative ; no alignment assumed) - **Takeaway:** The default `copy(src, dst)` already tries to vectorize. You don't need to pass `AutoVectorizingCopy` explicitly unless you want to control `MaxVecBits`. ## 5. Engineer's Notebook (Latent Space Notes) **Analogy:** Copying data is like **moving boxes on a loading dock**. `UniversalCopy` hand-carries one box at a time ; it always works, but it's slow. `AutoVectorizingCopy` is the smart worker who checks: "Are these boxes side-by-side on the shelf (contiguous, stride-1)? Does the stack start at a slot that fits my dolly (aligned to 128-bit boundary)?" If yes, load them onto the dolly and move 4 boxes in one trip. If not, fall back to hand-carrying. **The Three Laws of Auto-Vectorization:** 1. **Contiguity** ; Both source and destination must have stride-1 elements in memory. Column-major `(M,N):(1,M)` is contiguous along mode-0. Row-major `(M,N):(N,1)` is contiguous along mode-1. Irregular strides (e.g., stride-3) → scalar fallback. 2. **Alignment** ; The starting address must be divisible by the vector width. `cudaMalloc` returns 256-byte-aligned pointers, so global memory is fine. Shared memory offsets can break alignment if your tile size isn't a multiple of 4 floats (16 bytes). 3. **MaxVecBits** ; CuTe caps vectorization at this value (default 128 bits). You can lower it for testing, but 128 is the sweet spot for most GPU architectures. **The Copy Atom Hierarchy:** | Copy Operation | Vector Width | When to Use | | :--- | :--- | :--- | | `UniversalCopy` | `sizeof(T)` bits | Fallback. Always works, never fast. | | `AutoVectorizingCopyWithAssumedAlignment<128>` | Up to 128 bits | Default for static layouts. The "just works" choice. | | `AutoVectorizingCopyWithAssumedAlignment<64>` | Up to 64 bits | When you know alignment is only 8-byte. | | `AutoVectorizingCopyWithAssumedAlignment<8>` | Up to 8 bits | Conservative ; CuTe uses this for dynamic layouts. Essentially scalar. | **Hardware Note:** The GPU memory bus fetches 128 bits (or 32 bytes on some architectures) per transaction regardless of how much you asked for. Scalar loads waste bus bandwidth. Vectorized loads use the full transaction. This is why `LDG.128` doesn't cost more latency than `LDG.32` ; the memory controller does the same work either way. You're just *using* more of what it already fetched. > **Gotcha:** CuTe's auto-vectorization is a **compile-time** decision. If your layouts are dynamic (runtime shapes/strides), CuTe can't prove contiguity at compile time and falls back to scalar. Use `Int{}` (static integers) for shapes and strides whenever possible ; it's not just about performance, it's about enabling vectorization. **What comes next:** This tutorial copies with *one thread per partition*. But moving a 128×128 tile with 256 threads ; where all threads cooperate on the same tile ; requires coordinating who copies what. That's `TiledCopy` (Tutorial 04), which combines a `Copy_Atom` (the vectorization from this tutorial) with a thread layout and a value layout to orchestrate the full tile copy. --- # The Parallel Copy ; Orchestrating Threads with TiledCopy - URL: https://www.dcbaslani.xyz/blog/04_the_parallel_copy/ - Markdown: https://www.dcbaslani.xyz/blog/04_the_parallel_copy/index.md - Text: https://www.dcbaslani.xyz/blog/04_the_parallel_copy/index.txt - Published: 2026-02-24 - Topic: CUDA - Description: How TiledCopy bundles thread layout, copy atoms, and value layout into one declarative object for coordinated, vectorized parallel copies. # The Parallel Copy ; Orchestrating Threads with TiledCopy **Difficulty:** Beginner–Intermediate **Prerequisites:** [Tutorial 01: Hello, Layout!](https://www.dcbaslani.xyz/blog/01_hello_layout/), [Tutorial 02: The Art of Slicing](https://www.dcbaslani.xyz/blog/02_the_art_of_slicing/), [Tutorial 03: The Naive Copy](https://www.dcbaslani.xyz/blog/03_the_naive_copy/) ## 1. The Problem (The "Why") Tutorial 03 showed how one thread can vectorize its copy ; using a dolly instead of hand-carrying boxes. But a real kernel doesn't have one worker on the loading dock; it has 32, 128, or 256 workers that all need to copy the *same tile together*, without stepping on each other's toes. You could combine Tutorial 02's `local_partition` with Tutorial 03's `copy()`: ```cpp // Manual approach ; works, but fragile auto thr_g = local_partition(g_tensor, Layout>{}, threadIdx.x); auto thr_s = local_partition(s_tensor, Layout>{}, threadIdx.x); copy(AutoVectorizingCopy{}, thr_g, thr_s); ``` This works ; but *you* have to manually choose a thread layout, make sure it divides the tile evenly, and hope the resulting per-thread partition is contiguous enough for vectorization. Change the tile size? Redo the math. Transpose the layout? Redo it again. Switch from `LDG` to `cp.async`? Rewrite the whole thing. CuTe's `TiledCopy` bundles all three decisions into one declarative object: | Piece | What it controls | From Tutorial... | | :--- | :--- | :--- | | `Copy_Atom` | What each thread can carry in one trip (the dolly) | 03 | | `thr_layout` | Where each thread stands on the tile (their grid position) | 02 | | `val_layout` | How many values each thread handles per trip | New | You declare these once. CuTe generates the partitioning, the vectorization, and the thread-to-element mapping automatically. > **B200 Note:** On Hopper/Blackwell, `TiledCopy` is also the container for TMA and `cp.async` atoms. The `thr_layout`/`val_layout` mechanism is identical ; only the `Copy_Atom` changes. Master this pattern now and TMA (Tutorial 08) becomes a one-line swap. ## 2. The Mental Model (The Visual) Imagine you have a **16×8 tile** to copy and **32 threads** (one warp). You need to answer three questions: 1. **What tool does each thread use?** → `Copy_Atom` with `UniversalCopy` ; each thread moves 128 bits (4 floats) per load instruction. This is the dolly from Tutorial 03. 2. **How are threads arranged?** → `thr_layout = (4, 8)` ; 4 rows of threads, 8 columns. 3. **How many extra values per thread?** → `val_layout = (1, 1)` ; the atom's 4 floats are enough; no extra tiling needed. ```text Tile coverage = thr_layout × atom_values × val_layout = (4, 8) × (4, 1) × (1, 1) = (16, 8) ✓ matches our tile! ``` ### The Ownership Map Each cell shows which thread (T00–T31) copies it. The atom's 4 floats go along mode-0 (down the column), so each thread owns a 4×1 vertical strip: ```text col 0 col 1 col 2 col 3 col 4 col 5 col 6 col 7 ┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐ row 0 │ T00 │ T04 │ T08 │ T12 │ T16 │ T20 │ T24 │ T28 │ row 1 │ T00 │ T04 │ T08 │ T12 │ T16 │ T20 │ T24 │ T28 │ ← T00 owns row 2 │ T00 │ T04 │ T08 │ T12 │ T16 │ T20 │ T24 │ T28 │ rows 0–3, row 3 │ T00 │ T04 │ T08 │ T12 │ T16 │ T20 │ T24 │ T28 │ col 0 ├────────┼────────┼────────┼────────┼────────┼────────┼────────┼────────┤ row 4 │ T01 │ T05 │ T09 │ T13 │ T17 │ T21 │ T25 │ T29 │ row 5 │ T01 │ T05 │ T09 │ T13 │ T17 │ T21 │ T25 │ T29 │ row 6 │ T01 │ T05 │ T09 │ T13 │ T17 │ T21 │ T25 │ T29 │ row 7 │ T01 │ T05 │ T09 │ T13 │ T17 │ T21 │ T25 │ T29 │ ├────────┼────────┼────────┼────────┼────────┼────────┼────────┼────────┤ row 8 │ T02 │ T06 │ T10 │ T14 │ T18 │ T22 │ T26 │ T30 │ row 9 │ T02 │ T06 │ T10 │ T14 │ T18 │ T22 │ T26 │ T30 │ row 10 │ T02 │ T06 │ T10 │ T14 │ T18 │ T22 │ T26 │ T30 │ row 11 │ T02 │ T06 │ T10 │ T14 │ T18 │ T22 │ T26 │ T30 │ ├────────┼────────┼────────┼────────┼────────┼────────┼────────┼────────┤ row 12 │ T03 │ T07 │ T11 │ T15 │ T19 │ T23 │ T27 │ T31 │ row 13 │ T03 │ T07 │ T11 │ T15 │ T19 │ T23 │ T27 │ T31 │ row 14 │ T03 │ T07 │ T11 │ T15 │ T19 │ T23 │ T27 │ T31 │ row 15 │ T03 │ T07 │ T11 │ T15 │ T19 │ T23 │ T27 │ T31 │ └────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘ ``` Notice the two critical patterns: - **Within each thread:** T00's 4 values (rows 0–3 of col 0) are contiguous in column-major memory. That's 4 floats = 128 bits ; a single `LDG.128`. The atom's `uint128_t` type made this happen. - **Across threads:** Each column has a different thread (T00, T04, T08, ...). No two threads touch the same cell. The `thr_layout = (4,8)` made this happen. ### The Three-Part Recipe ```text ┌──────────────────────────────────┐ │ Copy_Atom │ ← The tool: "what can one thread carry?" │ (UniversalCopy, │ Each load moves 4 floats / 128 bits. │ float) │ (the dolly from Tutorial 03) └────────────────┬─────────────────┘ │ ┌────────────┴────────────┐ │ │ ┌───┴──────────┐ ┌─────────┴──────────┐ │ thr_layout │ │ val_layout │ │ (4, 8) │ │ (1, 1) │ │ │ │ │ │ "32 workers │ │ "no extra values │ │ in a 4×8 │ │ beyond what the │ │ grid" │ │ atom handles" │ └──────────────┘ └────────────────────┘ Tile covered = (4 thr × 4 atom × 1 val, 8 thr × 1 atom × 1 val) = (16, 8) ``` > **The Moving Crew Rule:** Tutorial 03 was about one mover and their dolly. `TiledCopy` is the **shift supervisor's clipboard** ; a plan that assigns every mover to a grid position on the warehouse floor and specifies what tool to use. The `Copy_Atom` says what each mover carries (the dolly). The `thr_layout` says where each mover stands. The `val_layout` says how many extra trips each mover makes beyond one atom load. Together, these three things tile the entire room with no gaps and no overlaps. ## 3. The Solution (The Code) Two kernels: one visualizes the thread-to-element ownership map, one performs an actual global → shared copy. ```cpp #include #include #include #include using namespace cute; // ─── Kernel 1: Visualize which thread owns which tile cell ─── template __global__ void visualize_ownership(TiledCopy tiled_copy, TileLayout tile_layout) { extern __shared__ float smem[]; // Build a shared-memory tensor to stamp thread IDs into auto tile = make_tensor(make_smem_ptr(smem), tile_layout); // Get this thread's copy slice auto thr_copy = tiled_copy.get_thread_slice(threadIdx.x); auto thr_view = thr_copy.partition_D(tile); // Each thread writes its ID into the cells it owns for (int i = 0; i < size(thr_view); ++i) { thr_view(i) = float(threadIdx.x); } __syncthreads(); // Thread 0 prints the full ownership map if (threadIdx.x == 0) { printf("Tile ownership map (%d x %d), %d threads:\n\n", int(size<0>(tile)), int(size<1>(tile)), int(blockDim.x)); printf(" "); for (int n = 0; n < size<1>(tile); ++n) printf("c%-4d ", n); printf("\n"); for (int m = 0; m < size<0>(tile); ++m) { printf("r%-4d ", m); for (int n = 0; n < size<1>(tile); ++n) { printf("T%02d ", int(tile(m, n))); } printf("\n"); } } } // ─── Kernel 2: Actual Global → Shared copy using TiledCopy ─── template __global__ void tiled_copy_kernel(float const* __restrict__ g_ptr, GmemLayout gmem_layout, SmemLayout smem_layout, TiledCopy tiled_copy) { extern __shared__ float smem[]; auto g_tensor = make_tensor(make_gmem_ptr(g_ptr), gmem_layout); // (M, N) auto s_tensor = make_tensor(make_smem_ptr(smem), smem_layout); // (M, N) // ── The key lines: partition via TiledCopy ── auto thr_copy = tiled_copy.get_thread_slice(threadIdx.x); auto thr_g = thr_copy.partition_S(g_tensor); // source partition auto thr_s = thr_copy.partition_D(s_tensor); // destination partition // ── Execute the copy ── copy(tiled_copy, thr_g, thr_s); __syncthreads(); // Thread 0 prints a few rows to verify correctness if (threadIdx.x == 0) { printf("\nShared memory after TiledCopy (column-major, showing row by row):\n"); for (int m = 0; m < size<0>(s_tensor); ++m) { printf(" row %2d: ", m); for (int n = 0; n < size<1>(s_tensor); ++n) { printf("%5.0f ", s_tensor(m, n)); } printf("\n"); } } } int main() { constexpr int M = 16, N = 8; // ─── Build the TiledCopy ─── // 1. Copy_Atom: each load moves 128 bits (4 floats) via uint128_t // 2. thr_layout: 32 threads in a 4×8 grid (column-major) // 3. val_layout: 1×1 ; the atom already handles 4 floats per thread auto tiled_copy = make_tiled_copy( Copy_Atom, float>{}, Layout>{}, // thr_layout: 4 rows × 8 cols = 32 threads Layout>{} // val_layout: atom covers everything ); // Tile layouts: column-major for both auto tile_layout = make_layout(make_shape(Int{}, Int{}), LayoutLeft{}); int smem_bytes = M * N * sizeof(float); // ─── Kernel 1: Visualize the ownership map ─── printf("=== TiledCopy Ownership Map ===\n\n"); visualize_ownership<<<1, 32, smem_bytes>>>(tiled_copy, tile_layout); cudaDeviceSynchronize(); // ─── Kernel 2: Actual Global → Shared copy ─── // Fill source: h_data[i] = i (flat index in column-major order) float h_data[M * N]; for (int i = 0; i < M * N; ++i) h_data[i] = float(i); float* d_data; cudaMalloc(&d_data, sizeof(h_data)); cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice); printf("\n=== TiledCopy: Global -> Shared ===\n"); tiled_copy_kernel<<<1, 32, smem_bytes>>>(d_data, tile_layout, tile_layout, tiled_copy); cudaDeviceSynchronize(); cudaFree(d_data); return 0; } ``` **Expected Output:** ```text === TiledCopy Ownership Map === Tile ownership map (16 x 8), 32 threads: c0 c1 c2 c3 c4 c5 c6 c7 r0 T00 T04 T08 T12 T16 T20 T24 T28 r1 T00 T04 T08 T12 T16 T20 T24 T28 r2 T00 T04 T08 T12 T16 T20 T24 T28 r3 T00 T04 T08 T12 T16 T20 T24 T28 r4 T01 T05 T09 T13 T17 T21 T25 T29 r5 T01 T05 T09 T13 T17 T21 T25 T29 r6 T01 T05 T09 T13 T17 T21 T25 T29 r7 T01 T05 T09 T13 T17 T21 T25 T29 r8 T02 T06 T10 T14 T18 T22 T26 T30 r9 T02 T06 T10 T14 T18 T22 T26 T30 r10 T02 T06 T10 T14 T18 T22 T26 T30 r11 T02 T06 T10 T14 T18 T22 T26 T30 r12 T03 T07 T11 T15 T19 T23 T27 T31 r13 T03 T07 T11 T15 T19 T23 T27 T31 r14 T03 T07 T11 T15 T19 T23 T27 T31 r15 T03 T07 T11 T15 T19 T23 T27 T31 === TiledCopy: Global -> Shared === Shared memory after TiledCopy (column-major, showing row by row): row 0: 0 16 32 48 64 80 96 112 row 1: 1 17 33 49 65 81 97 113 row 2: 2 18 34 50 66 82 98 114 row 3: 3 19 35 51 67 83 99 115 row 4: 4 20 36 52 68 84 100 116 row 5: 5 21 37 53 69 85 101 117 row 6: 6 22 38 54 70 86 102 118 row 7: 7 23 39 55 71 87 103 119 row 8: 8 24 40 56 72 88 104 120 row 9: 9 25 41 57 73 89 105 121 row 10: 10 26 42 58 74 90 106 122 row 11: 11 27 43 59 75 91 107 123 row 12: 12 28 44 60 76 92 108 124 row 13: 13 29 45 61 77 93 109 125 row 14: 14 30 46 62 78 94 110 126 row 15: 15 31 47 63 79 95 111 127 ``` The values confirm column-major order: element 0 at (0,0), element 1 at (1,0), ..., element 16 at (0,1). The `TiledCopy` moved every element to the right shared memory cell ; 32 threads, zero conflicts, all vectorized. ## 4. Step-by-Step Explanation **Line: `auto tiled_copy = make_tiled_copy(Copy_Atom{...}, thr_layout, val_layout);`** This is the declaration ; the shift supervisor's clipboard. Three ingredients: 1. **`Copy_Atom, float>{}`** ; The copy atom. `uint128_t` means each thread loads 128 bits = 4 floats in a single instruction (`LDG.128`). The `float` tells CuTe the logical element type. 2. **`Layout>{}`** ; Thread layout. 32 threads arranged as 4 rows × 8 columns in column-major order. Thread 0 → position (0,0), Thread 1 → (1,0), ..., Thread 4 → (0,1), ..., Thread 31 → (3,7). 3. **`Layout>{}`** ; Value layout. No extra values per thread beyond the atom's 4 floats. If the tile were bigger, you'd increase this to cover the extra elements (see "Bigger Tiles" below). **Tile coverage:** `(4 threads × 4 atom_vals × 1 val, 8 threads × 1 atom_val × 1 val) = (16, 8)`. **Line: `auto thr_copy = tiled_copy.get_thread_slice(threadIdx.x);`** Each thread asks the `TiledCopy` for its personal *slice* ; a lightweight object that knows which tile cells belong to this thread. This is each mover reading their assignment off the clipboard. **Line: `auto thr_g = thr_copy.partition_S(g_tensor);`** Partitions the source (global memory) tensor for this thread. The result is a tensor containing only the elements this thread should read. The shape is `(ValuesPerAtom, Repetitions...)`: - For our 16×8 tile: shape `(_4)` ; four contiguous floats loaded as one `uint128_t`. No repetitions because the tile matches the `TiledCopy` coverage exactly. - If the tile were **32×8**: shape `(_4, _2)` ; four floats per load × 2 rounds along mode-0. CuTe adds the repetition dimension automatically. **Line: `auto thr_s = thr_copy.partition_D(s_tensor);`** Same for the destination (shared memory). Its shape matches `thr_g` so that `copy()` knows where each source value lands. **Line: `copy(tiled_copy, thr_g, thr_s);`** Executes the copy. CuTe loops over all repetition dimensions (if any), applying the `Copy_Atom` to each batch of values. For our 16×8 tile: - 32 threads × 4 floats each = 128 floats = entire tile - Each thread issues one `LDG.128` (global load) and one `STS.128` (shared store) - Total: 32 loads + 32 stores, all vectorized ; the entire tile moves in one round ### Bigger Tiles and Repetitions What if your tile is 128×128 but you still have 32 threads? ```cpp auto big_copy = make_tiled_copy( Copy_Atom, float>{}, Layout>{}, // still 32 threads Layout>{} // same atom ); // TiledCopy covers (16, 8) per round. // Tile is (128, 128) → repetitions = (128/16, 128/8) = (8, 16). // partition_S returns shape (_4, _8, _16). // copy() loops over the 8×16 = 128 repetitions automatically. ``` Or you can increase `val_layout` to give each thread more work per round: ```cpp auto wide_copy = make_tiled_copy( Copy_Atom, float>{}, Layout>{}, // 32 threads Layout>{} // each thread handles 2 columns of atom-loads ); // Coverage per round: (4×4×2, 8×1×1) = (32, 8). // partition_S for a 128×128 tile: shape (_4, _2, _4, _16). ``` Don't worry about memorizing these shapes ; the point is that `copy()` handles the loops. You just set up the `TiledCopy` and call `copy()`. ## 5. Engineer's Notebook (Latent Space Notes) **Analogy:** Tutorial 03 was about one mover and their dolly. `TiledCopy` is the **shift supervisor's clipboard** ; a plan that assigns every mover to a grid position on the warehouse floor. The `Copy_Atom` says what tool each mover uses (the dolly). The `thr_layout` is where they stand. The `val_layout` is how many extra trips each mover makes. Three pieces, one plan, zero overlap. **The `val_layout` and Vectorization:** The atom's vector direction must align with the contiguous memory dimension, or you lose vectorization: | Memory Layout | Atom Type | Why It Works | | :--- | :--- | :--- | | Column-major `(M,N):(1,M)` | `UniversalCopy` with thr (4,8) | 4 floats along mode-0 are stride-1 → `LDG.128` | | Row-major `(M,N):(N,1)` | `UniversalCopy` with thr (8,4) | Swap thread arrangement so the atom's 4 values hit stride-1 mode-1 | | Strided / dynamic | `UniversalCopy` | Scalar fallback ; always works, always slow | If the atom's 4-element chunk lands on non-contiguous addresses, the hardware can't coalesce them and you're back to scalar speed. Always check: *"which dimension has stride 1?"* ; and point the atom's vector width in that direction. **The `TiledCopy` API Cheat-Sheet:** | API Call | What It Does | | :--- | :--- | | `make_tiled_copy(atom, thr, val)` | Build the plan from three ingredients | | `tiled_copy.get_thread_slice(tid)` | Each worker reads their assignment | | `slice.partition_S(src)` | Partition source for this thread | | `slice.partition_D(dst)` | Partition destination for this thread | | `copy(tiled_copy, thr_src, thr_dst)` | Execute the copy (loops over repetitions) | **`TiledCopy` vs. Manual `local_partition` + `copy`:** | | `local_partition` + `copy` | `TiledCopy` | | :--- | :--- | :--- | | Thread mapping | You compute it | Declared once | | Vectorization | Depends on partition contiguity | Follows from atom type | | Swap to `cp.async` / TMA | Rewrite everything | Change one `Copy_Atom` | | Repetitions for big tiles | Manual loop | Automatic | **Hardware Note:** For global memory loads, the GPU coalesces requests from threads in the same warp that hit the same 128-byte cache line. To maximize coalescing, threads with adjacent IDs should access adjacent memory addresses. For column-major data, spread threads along mode-0 (rows) first ; which is what `thr_layout = (T_rows, T_cols)` in column-major order does naturally: threads 0, 1, 2, 3 hit rows 0–3 of the same column, which are adjacent in memory. > **Gotcha ; partition shape surprise:** `partition_S` and `partition_D` return tensors shaped `(ValuesPerAtom, Repetitions...)`, NOT `(M, N)`. The first mode is the "hand" for a single atom invocation; the remaining modes are how many times the atom repeats to cover the tile. If you `print()` the partition and see an unexpected number of modes, it's because your tile is bigger than the `TiledCopy`'s single-round coverage. This is normal ; `copy()` iterates over the repetitions automatically. **What comes next:** Each thread's copy is now vectorized and coordinated. But when 32 threads simultaneously write to shared memory, they can collide on the *same bank*. If threads 0, 8, 16, 24 all write to addresses that map to shared memory bank 0, the hardware serializes those writes ; a **bank conflict**. That's the topic of Tutorial 05 (Swizzling), where `composition` with a `Swizzle` layout remaps addresses to spread them across banks. --- # Swizzling ; Avoiding Shared Memory Bank Conflicts - URL: https://www.dcbaslani.xyz/blog/05_swizzling/ - Markdown: https://www.dcbaslani.xyz/blog/05_swizzling/index.md - Text: https://www.dcbaslani.xyz/blog/05_swizzling/index.txt - Published: 2026-02-26 - Topic: CUDA - Description: How CuTe's Swizzle XORs address bits to eliminate shared memory bank conflicts with a single line of code. # Swizzling ; Avoiding Shared Memory Bank Conflicts **Difficulty:** Intermediate **Prerequisites:** [Tutorial 01: Hello, Layout!](https://www.dcbaslani.xyz/blog/01_hello_layout/), [Tutorial 04: The Parallel Copy](https://www.dcbaslani.xyz/blog/04_the_parallel_copy/), basic shared memory understanding ## 1. The Problem (The "Why") Tutorial 04's `TiledCopy` moves data from global to shared memory ; 32 threads, 128-bit vectorized stores, zero coordination issues. Beautiful. But there's a hidden trap. Shared memory isn't one big flat buffer. It's split into **32 banks**, each serving one 4-byte access per cycle. When two threads in the same warp access the *same bank* at the *same time* (different addresses within that bank), the hardware serializes their accesses. This is a **bank conflict**. An N-way conflict means N sequential accesses instead of 1 ; your shared memory bandwidth drops to 1/N. Here's the catch: bank conflicts depend on the **access pattern**, not just the data layout. Your `TiledCopy` might *write* to smem conflict-free (column-first, nicely spread across banks). But the next stage ; an MMA reading that data row-first ; might collide on every access. The culprit is regularity. Column-major strides like 8, 16, 32, or 128 divide evenly into 32 banks, so different columns in the same row keep landing on the same bank. The fix: **swizzle** the shared memory layout. CuTe's `Swizzle` XORs parts of the address to break this regularity ; one line of code, zero bank conflicts. > **B200 Note:** On Blackwell (SM100), all supported MMA swizzle modes ; including no-swizzle (8×16B interleaved) ; are **bank-conflict-free on both the MMA read side and the TMA write side**. Swizzling still matters, though: using no-swizzle or smaller swizzle modes can reduce **TMA achievable throughput** when populating shared memory. So on Hopper/Blackwell the swizzle is primarily about maximizing TMA write bandwidth, not avoiding MMA read conflicts. ## 2. The Mental Model (The Visual) ### How Shared Memory Banks Work Shared memory is divided into 32 banks, each 4 bytes wide. Bank assignment is cyclic: ```text Float index: 0 1 2 3 4 ... 31 32 33 ... Bank: B00 B01 B02 B03 B04 ... B31 B00 B01 ... bank = float_index % 32 ``` Within a single warp (32 threads executing simultaneously): - Each thread accesses a *different* bank → **1 cycle** (full parallelism) - K threads access the *same* bank → **K cycles** (serialized) ### The Problem: Column-Major + Row Access = Conflict Consider an 8×8 tile stored column-major in shared memory (stride = `(1, 8)`): ```text WITHOUT Swizzle ; Bank map for 8×8 column-major (1,8): c0 c1 c2 c3 c4 c5 c6 c7 r0 B00 B08 B16 B24 B00 B08 B16 B24 ← 2-way conflict! r1 B01 B09 B17 B25 B01 B09 B17 B25 c0 & c4 share B00 r2 B02 B10 B18 B26 B02 B10 B18 B26 c1 & c5 share B08 r3 B03 B11 B19 B27 B03 B11 B19 B27 c2 & c6 share B16 r4 B04 B12 B20 B28 B04 B12 B20 B28 c3 & c7 share B24 r5 B05 B13 B21 B29 B05 B13 B21 B29 r6 B06 B14 B22 B30 B06 B14 B22 B30 r7 B07 B15 B23 B31 B07 B15 B23 B31 ``` Reading **down a column** (the TiledCopy write path): banks 0,1,2,3,4,5,6,7 ; all different. No conflict. Reading **across a row** (e.g., MMA consuming data): row 0 hits banks 0,8,16,24,**0**,8,16,24 ; **2-way conflict** on every bank! Columns 0 and 4 always collide, 1 and 5 collide, and so on. Why? Because the column stride (8) divides evenly into 32: `(col × 8) % 32` cycles with period 4, so columns 0 and 4 produce the same bank. The regularity of the stride creates a repeating pattern that piles threads onto the same banks. > With larger strides the problem gets worse. A stride of 32 maps *every column in the same row* to the same bank ; an 8-way conflict for 8 columns. A stride of 128? Still hits the same bank pattern, just at a bigger scale. ### The Fix: Swizzle Breaks the Pattern `Swizzle<3, 2, 3>` XORs high address bits into low address bits, scrambling the bank assignment so no two columns in the same row share a bank: ```text WITH Swizzle<3,2,3> ; Bank map for the same 8×8 tile: c0 c1 c2 c3 c4 c5 c6 c7 r0 B00 B08 B16 B24 B04 B12 B20 B28 ← all 8 unique! r1 B01 B09 B17 B25 B05 B13 B21 B29 r2 B02 B10 B18 B26 B06 B14 B22 B30 r3 B03 B11 B19 B27 B07 B15 B23 B31 r4 B04 B12 B20 B28 B00 B08 B16 B24 ← all 8 unique! r5 B05 B13 B21 B29 B01 B09 B17 B25 r6 B06 B14 B22 B30 B02 B10 B18 B26 r7 B07 B15 B23 B31 B03 B11 B19 B27 ``` Read across row 0: banks 0,8,16,24,**4**,12,20,28 ; **all 8 unique!** Read across row 4: banks 4,12,20,28,**0**,8,16,24 ; **all 8 unique!** Read down column 4: banks 4,5,6,7,0,1,2,3 ; **all 8 unique!** Zero bank conflicts in any direction. The XOR shifted columns 4–7 by 4 banks relative to columns 0–3, breaking the collision pattern. ### The Brick Wall Analogy ```text Without swizzle (joints aligned): With swizzle (staggered): ┌────┬────┬────┬────┐ ┌────┬────┬────┬────┐ │ B0 │ B8 │ B0 │ B8 │ │ B0 │ B8 │ B4 │B12 │ ├────┼────┼────┼────┤ ├────┼────┼────┼────┤ │ B0 │ B8 │ B0 │ B8 │ │ B4 │B12 │ B0 │ B8 │ ├────┼────┼────┼────┤ ├────┼────┼────┼────┤ │ B0 │ B8 │ B0 │ B8 │ │ B0 │ B8 │ B4 │B12 │ └────┴────┴────┴────┘ └────┴────┴────┴────┘ Same banks every row Banks shift per row → conflicts stack up → conflicts eliminated ``` Swizzle is the GPU equivalent of **staggered brick-laying**. In a brick wall, each row is offset by half a brick so the joints don't line up ; this prevents cracks from running straight through. In shared memory, each row's addresses are XOR-shifted so the bank assignments don't repeat ; this prevents bank conflicts from stacking up. ### How `Swizzle` Works The swizzle modifies a flat address by XOR-ing two groups of bits: ```text Address bit layout for Swizzle<3, 2, 3>: bit: 7 6 5 │ 4 3 2 │ 1 0 ────────│─────────│────── source │ target │ free (B=3) │ (B=3) │(M=2) │ ↑ │ └────XOR──┘ (shift S=3) swizzled = addr ^ ((addr >> S) & mask) where mask covers B bits at the target position. ``` The three parameters: | Parameter | Meaning | Effect | | :--- | :--- | :--- | | **M** (free bits) | Bottom M bits are untouched | 2^M elements stay contiguous. **M=2 → 4 floats (128 bits) stay together → vectorized loads still work!** | | **B** (XOR width) | Number of bits to XOR | Scrambles across 2^B banks at a time. B=3 → 8-bank groups. | | **S** (shift) | Distance between source and target bit fields | Target bits = `[M : M+B)`. Source bits = `[M+S : M+S+B)`. | The critical parameter is **M**: it controls the granularity of the swizzle. Because bits `[0:M)` are untouched, blocks of 2^M consecutive elements remain contiguous after swizzling. With M=2, blocks of 4 floats (= 128 bits) are preserved ; exactly what `LDG.128` / `STS.128` needs. ## 3. The Solution (The Code) A bank conflict visualizer that prints the bank assignment map with and without swizzle, followed by a `TiledCopy` demonstrating that the swizzle is transparent to the copy. ```cpp #include #include #include using namespace cute; // ─── Kernel: Print bank assignment for every tile cell ─── template __global__ void bank_conflict_visualizer(Layout smem_layout, const char* label) { if (threadIdx.x != 0) return; int M = size<0>(smem_layout); int N = size<1>(smem_layout); // ── Print bank map ── printf("%s (%d x %d):\n\n", label, M, N); printf(" "); for (int n = 0; n < N; ++n) printf("c%-4d ", n); printf("\n"); for (int m = 0; m < M; ++m) { printf("r%-4d ", m); for (int n = 0; n < N; ++n) { int addr = smem_layout(m, n); // flat offset (in floats) int bank = addr % 32; printf("B%02d ", bank); } printf("\n"); } // ── Count row-wise conflicts ── int total_conflicts = 0; for (int m = 0; m < M; ++m) { int bank_hits[32] = {}; for (int n = 0; n < N; ++n) { bank_hits[smem_layout(m, n) % 32]++; } for (int b = 0; b < 32; ++b) { if (bank_hits[b] > 1) total_conflicts += bank_hits[b] - 1; } } printf("\nRow-wise bank conflicts: %d (%s)\n\n", total_conflicts, total_conflicts == 0 ? "CLEAN" : "CONFLICTS!"); } // ─── Kernel: TiledCopy into swizzled smem ─── template __global__ void copy_with_swizzle(float const* __restrict__ g_ptr, GmemLayout gmem_layout, SmemLayout smem_layout, TiledCopy tiled_copy) { extern __shared__ float smem[]; auto g_tensor = make_tensor(make_gmem_ptr(g_ptr), gmem_layout); auto s_tensor = make_tensor(make_smem_ptr(smem), smem_layout); auto thr_copy = tiled_copy.get_thread_slice(threadIdx.x); auto thr_g = thr_copy.partition_S(g_tensor); auto thr_s = thr_copy.partition_D(s_tensor); // ── Copy ; swizzle is completely transparent ── copy(tiled_copy, thr_g, thr_s); __syncthreads(); // Thread 0 verifies the data (reading through the swizzled layout) if (threadIdx.x == 0) { printf("Shared memory (logical view through swizzled layout):\n"); for (int m = 0; m < size<0>(s_tensor); ++m) { printf(" row %d: ", m); for (int n = 0; n < size<1>(s_tensor); ++n) { printf("%5.0f ", s_tensor(m, n)); } printf("\n"); } } } int main() { constexpr int M = 8, N = 8; // ─── 1. Bank conflict visualizer ─── printf("=== Bank Conflict Visualizer ===\n\n"); // Plain column-major layout: (8,8):(1,8) auto plain = Layout, Stride<_1, _8>>{}; // Swizzled layout: composition(swizzle, layout) // composition applies the XOR to the flat offset that layout produces auto swizzled = composition(Swizzle<3, 2, 3>{}, plain); bank_conflict_visualizer<<<1, 1>>>(plain, "WITHOUT Swizzle"); cudaDeviceSynchronize(); bank_conflict_visualizer<<<1, 1>>>(swizzled, "WITH Swizzle<3,2,3>"); cudaDeviceSynchronize(); // ─── 2. TiledCopy with swizzled smem ─── printf("=== TiledCopy + Swizzle ===\n\n"); auto gmem_layout = Layout, Stride<_1, _8>>{}; // TiledCopy: 32 threads, 2 floats (64 bits) per atom auto tiled_copy = make_tiled_copy( Copy_Atom, float>{}, // 64 bits = 2 floats Layout>{}, // 32 threads in 4×8 Layout>{} ); float h_data[M * N]; for (int i = 0; i < M * N; ++i) h_data[i] = float(i); float* d_data; cudaMalloc(&d_data, sizeof(h_data)); cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice); int smem_bytes = M * N * sizeof(float); copy_with_swizzle<<<1, 32, smem_bytes>>>( d_data, gmem_layout, swizzled, tiled_copy); cudaDeviceSynchronize(); cudaFree(d_data); return 0; } ``` **Expected Output:** ```text === Bank Conflict Visualizer === WITHOUT Swizzle (8 x 8): c0 c1 c2 c3 c4 c5 c6 c7 r0 B00 B08 B16 B24 B00 B08 B16 B24 r1 B01 B09 B17 B25 B01 B09 B17 B25 r2 B02 B10 B18 B26 B02 B10 B18 B26 r3 B03 B11 B19 B27 B03 B11 B19 B27 r4 B04 B12 B20 B28 B04 B12 B20 B28 r5 B05 B13 B21 B29 B05 B13 B21 B29 r6 B06 B14 B22 B30 B06 B14 B22 B30 r7 B07 B15 B23 B31 B07 B15 B23 B31 Row-wise bank conflicts: 8 (CONFLICTS!) WITH Swizzle<3,2,3> (8 x 8): c0 c1 c2 c3 c4 c5 c6 c7 r0 B00 B08 B16 B24 B04 B12 B20 B28 r1 B01 B09 B17 B25 B05 B13 B21 B29 r2 B02 B10 B18 B26 B06 B14 B22 B30 r3 B03 B11 B19 B27 B07 B15 B23 B31 r4 B04 B12 B20 B28 B00 B08 B16 B24 r5 B05 B13 B21 B29 B01 B09 B17 B25 r6 B06 B14 B22 B30 B02 B10 B18 B26 r7 B07 B15 B23 B31 B03 B11 B19 B27 Row-wise bank conflicts: 0 (CLEAN!) === TiledCopy + Swizzle === Shared memory (logical view through swizzled layout): row 0: 0 8 16 24 32 40 48 56 row 1: 1 9 17 25 33 41 49 57 row 2: 2 10 18 26 34 42 50 58 row 3: 3 11 19 27 35 43 51 59 row 4: 4 12 20 28 36 44 52 60 row 5: 5 13 21 29 37 45 53 61 row 6: 6 14 22 30 38 46 54 62 row 7: 7 15 23 31 39 47 55 63 ``` The data is logically correct ; `s_tensor(m, n)` returns the right value even though the physical addresses in shared memory have been scrambled. The swizzle is completely transparent to the reader: you access `(row, col)` the same way you always did, and CuTe routes to the swizzled address behind the scenes. ## 4. Step-by-Step Explanation **Line: `auto plain = Layout, Stride<_1, _8>>{};`** The unswizzled column-major layout. `plain(m, n) = m * 1 + n * 8`. This is the address formula from Tutorial 01 ; the same `address = coord · stride` dot product. The bank for element `(m, n)` is `(m + 8n) % 32`. **Line: `auto swizzled = composition(Swizzle<3, 2, 3>{}, plain);`** This is the one-line fix. `composition(f, g)` creates a new function `h(x) = f(g(x))`: 1. `plain(m, n)` converts coordinates to a flat offset: `m + 8*n` 2. `Swizzle<3, 2, 3>` XORs high bits into low bits of that offset The result is a new layout where `swizzled(m, n)` gives a *different* flat offset ; one that spreads banks evenly. CuTe stores and retrieves data through this remapped offset, so the logical view is unchanged but the physical addresses avoid conflicts. **How the XOR works for `Swizzle<3, 2, 3>`:** ```text Example: plain address 32 (row=0, col=4) in binary = 0b100000 Step 1: Extract source bits [5:8] → 0b1 (bit 5 is set) Step 2: Shift right by S=3 → 0b100 (now at bit position 2) Step 3: XOR with original → 0b100000 ^ 0b000100 = 0b100100 = 36 Plain bank: 32 % 32 = 0 (same as col 0!) Swizzled bank: 36 % 32 = 4 (different ; conflict eliminated) ``` The XOR takes "which region of the tile am I in?" (the high bits) and mixes it into "which bank do I hit?" (the low bits). Different regions get different scrambles, so they never collide. **Line: `auto swizzled = composition(Swizzle<3, 2, 3>{}, plain);`** *(why these specific numbers?)* - **M=2 (free bits):** The bottom 2 bits of the address are untouched. 2^2 = 4 consecutive floats stay contiguous = 128 bits. This preserves vectorized `STS.128` stores. If M were 0, even adjacent elements could get scrambled and vectorization would break. - **B=3 (XOR width):** 3 bits → scramble across groups of 2^3 = 8 banks. Enough to break the 8-column pattern of our 8×8 tile. - **S=3 (shift):** Source bits start at position M+S = 5, right above the target bits at position M = 2. No overlap between source and target. **Line: `copy(tiled_copy, thr_g, thr_s);`** The `TiledCopy` doesn't know about the swizzle ; and doesn't need to. It partitions `s_tensor` based on its layout (which now includes the swizzle), and the `copy()` call stores through the swizzled addresses. Each thread's 2-float store still lands on contiguous addresses (because M=2 preserves 4-element contiguity), so vectorization is unaffected. **Line: `s_tensor(m, n)` in the print loop** Reading back through the swizzled layout is also transparent. `s_tensor(m, n)` computes the swizzled address, reads from that location, and returns the correct value. The logical view is identical to the plain layout ; the scrambling only affects the physical address. ## 5. Engineer's Notebook (Latent Space Notes) **Analogy:** Swizzle is **staggered brick-laying** for shared memory. In a brick wall, each row is offset so joints don't line up vertically ; this prevents cracks from running straight through. In shared memory, each row's addresses are XOR-shifted so bank assignments don't repeat across columns ; this prevents bank conflicts from stacking up. The `composition` call is the mortar that binds the swizzle to your layout: one line, and every access goes through the staggered pattern automatically. **Choosing Swizzle Parameters:** | Parameter | Rule of Thumb | | :--- | :--- | | **M** (free bits) | Set to `log2(vector_width / sizeof(element))`. For 128-bit loads on `float`: M = log2(128/32) = 2. For `half`: M = log2(128/16) = 3. | | **B** (XOR width) | Set to `log2(num_columns_to_disambiguate)`. For 8 columns: B=3. For 16: B=4 (but you'll need the address space to support it). | | **S** (shift) | Usually = B (non-overlapping source and target fields). This is the simplest and most common choice. | **Common Swizzle Configurations in CUTLASS:** | Swizzle | Use Case | Free Bits | Scramble Width | | :--- | :--- | :--- | :--- | | `Swizzle<3, 3, 3>` | 128-byte smem tiles, `half` elements | 8 halfs = 128 bits | 8 banks | | `Swizzle<3, 2, 3>` | 128-byte smem tiles, `float` elements | 4 floats = 128 bits | 8 banks | | `Swizzle<2, 3, 3>` | 64-byte smem tiles | 8 elements | 4 banks | | `Swizzle<1, 3, 3>` | 32-byte smem tiles | 8 elements | 2 banks | | `Swizzle<0, 0, 0>` | No swizzle (identity) | ; | ; | **Why the swizzle doesn't break vectorization:** The M "free" bits guarantee that blocks of 2^M consecutive elements remain at consecutive addresses after swizzling. For M=2, any group of 4 adjacent floats stays contiguous ; exactly what `STS.128` needs. The swizzle only shuffles *which group of 4* goes where, not the elements within the group. **`composition` ; the key CuTe operation:** `composition(f, g)` computes `f(g(x))`. When `f` is a `Swizzle` and `g` is a `Layout`, the result is a new layout-like object that maps coordinates to swizzled offsets. You can use it anywhere a layout is expected: ```cpp // Unswizzled ; has bank conflicts auto smem_layout = Layout, Stride<_1, _128>>{}; // Swizzled ; bank-conflict-free, one-line change auto smem_layout = composition(Swizzle<3, 2, 3>{}, Layout, Stride<_1, _128>>{}); // Use it exactly like a normal layout auto s_tensor = make_tensor(make_smem_ptr(smem), smem_layout); ``` **Hardware Note:** Shared memory bank conflicts show up in `ncu` (NVIDIA Nsight Compute) under the metric `l1tex__data_bank_conflicts_pipe_lsu_mem_shared`. If this number is non-zero, you have conflicts. The fix is almost always a swizzle on your smem layout. On Hopper/Blackwell, all MMA swizzle modes (including no-swizzle) are bank-conflict-free on the MMA read side ; the swizzle in CUTLASS's default smem layouts for WGMMA/tcgen05 is there to maximize **TMA write throughput** when populating shared memory, not to avoid read-side bank conflicts. > **Gotcha ; swizzle and `cosize`:** A swizzled layout may produce offsets larger than the plain layout's maximum. Always allocate shared memory based on `cosize(swizzled_layout)`, not `size(plain_layout)`. In practice, for well-chosen parameters (where B+M+S ≤ address bits), the max offset stays within the original range, but it's good practice to use `cosize` regardless. > **Gotcha ; debugging swizzled smem:** If you `printf` raw smem addresses, the data looks scrambled. This is expected ; the physical layout *is* scrambled. Always access through the CuTe tensor (using logical coordinates), and the swizzle is transparent. If you need to dump raw smem for debugging, compose with the inverse swizzle (XOR is its own inverse ; applying the same swizzle twice gives the original address). **What comes next:** With vectorized, parallel, bank-conflict-free copies from global to shared memory, the data movement story is complete. Tutorial 06 (Hello, MMA) shifts to the *compute* side: feeding that data into a Tensor Core instruction to trigger a hardware matrix multiply. --- # Hello, MMA — Your First Tensor Core Instruction - URL: https://www.dcbaslani.xyz/blog/06_hello_mma/ - Markdown: https://www.dcbaslani.xyz/blog/06_hello_mma/index.md - Text: https://www.dcbaslani.xyz/blog/06_hello_mma/index.txt - Published: 2026-03-03 - Topic: CUDA - Description: How to use CuTe's TiledMMA to execute a matrix multiply-accumulate on NVIDIA Tensor Cores. # Hello, MMA ; Your First Tensor Core Instruction **Difficulty:** Intermediate **Prerequisites:** [Tutorial 01: Hello, Layout!](https://www.dcbaslani.xyz/blog/01_hello_layout/), [Tutorial 04: The Parallel Copy](https://www.dcbaslani.xyz/blog/04_the_parallel_copy/) (same partition API pattern) ## 1. The Problem (The "Why") Tutorials 03–05 were about *moving data* ; getting floats from global memory into shared memory, fast and conflict-free. But at some point, you need to actually *compute*. On modern NVIDIA GPUs, the fastest math unit isn't the CUDA core ; it's the **Tensor Core**. A Tensor Core executes a matrix multiply-accumulate (MMA) as a single hardware instruction. The SM80 `mma.sync.aligned.m16n8k16` instruction multiplies a 16×16 half-precision matrix by an 8×16 half-precision matrix and accumulates into a 16×8 float matrix ; **2048 multiply-adds in a single instruction**, across all 32 threads of a warp. (Note: this instruction has a latency of 20+ cycles; "single instruction" means one issue slot, not one clock cycle. See *Hardware note ; throughput* below.) The catch: you can't just pass pointers. The hardware expects A, B, and C fragments to be distributed across 32 threads in a **very specific register layout**. Thread 0 holds certain rows and columns of A, thread 7 holds others, and the MMA instruction reaches into all 32 threads' registers simultaneously. Get the distribution wrong and you get garbage ; or a hardware trap. CuTe's `TiledMMA` handles this distribution for you, using the **exact same partition pattern** you learned with `TiledCopy`: | TiledCopy (Tutorial 04) | TiledMMA (This Tutorial) | | :--- | :--- | | `make_tiled_copy(Copy_Atom, thr, val)` | `make_tiled_mma(MMA_Atom{})` | | `get_thread_slice(tid)` | `get_thread_slice(tid)` | | `partition_S` / `partition_D` | `partition_A` / `partition_B` / `partition_C` | | `copy(tiled_copy, src, dst)` | `gemm(tiled_mma, fragA, fragB, fragC)` | Same API shape, different hardware backend. If you understood Tutorial 04's ownership maps and partitions, you already know 80% of this tutorial. > **B200 Note:** On Hopper/Blackwell, the same `TiledMMA` pattern drives WGMMA (Warpgroup MMA) ; 128 threads feeding a larger Tensor Core instruction. Tutorial 09 covers that. The API is identical; only the atom changes. ## 2. The Mental Model (The Visual) ### The Stamping Press Think of a Tensor Core as a **stamping press** in a factory. You load raw-material trays (A and B fragments) into specific slots, press the button, and out comes the product (C accumulator). The press is fixed-size ; it always stamps a 16×8 tile of results from 16×16 and 8×16 inputs. ```text ┌─────────────────────────────┐ │ STAMPING PRESS │ │ (One MMA Instruction) │ │ │ ┌──────────────┐ │ │ ┌──────────┐ │ A (16×16) │────▶│ 32 threads cooperate: │────▶│ C (16×8) │ │ half │ │ each loads its tray slot, │ │ float │ │ 256 values │ │ hardware does the rest │ │ 128 vals │ └──────────────┘ │ │ └──────────┘ ┌──────────────┐ │ 2048 multiply-adds │ │ B (8×16) │────▶│ in a SINGLE instruction │ │ half │ │ │ │ 128 values │ └─────────────────────────────-┘ └──────────────┘ ``` ### Per-Thread Fragment Sizes The 256 values of A, 128 values of B, and 128 values of C are split evenly across 32 threads: ```text Thread #7 (one of 32): ┌────────────────────────────────────────────────────┐ │ Registers: │ │ A fragment: 8 half values (256 total / 32) │ │ B fragment: 4 half values (128 total / 32) │ │ C fragment: 4 float values (128 total / 32) │ │ │ │ The hardware knows exactly which matrix cells │ │ these 8+4+4 values correspond to. │ │ CuTe knows too ; that's what partition_A/B/C do. │ └────────────────────────────────────────────────────┘ ``` You don't need to know the exact register-to-coordinate mapping (it's in the PTX docs if you're curious). CuTe's partition functions handle it transparently ; just like `TiledCopy` handles thread-to-element mapping without you memorizing a table. ### The API Flow ```text make_tiled_mma(MMA_Atom{}) │ get_thread_slice(tid) │ ┌────┴──────────────────┬──────────────────┐ │ │ │ partition_A(sA) partition_B(sB) partition_C(gC) │ │ │ make_fragment_like make_fragment_like make_fragment_like │ │ │ copy(smem → reg) copy(smem → reg) clear(accum) │ │ │ └───────────┬───────────┘ │ │ │ gemm(tiled_mma, fragA, fragB, accum)───┘ │ copy(accum → gC) ``` ## 3. The Solution (The Code) Two kernels: first, an ownership map that shows which thread owns which C element (same technique as Tutorial 04). Second, the actual micro-GEMM: load A and B into shared memory, partition them for the MMA, execute one Tensor Core instruction, and write C back to global memory. ```cpp #include #include #include #include #include #include using namespace cute; // ─── Kernel 1: MMA Ownership Map ─── // Which thread owns which elements of the C matrix? __global__ void mma_ownership_map() { using MMA_Op = SM80_16x8x16_F32F16F16F32_TN; auto tiled_mma = make_tiled_mma(MMA_Atom{}); // Shared memory to stamp thread IDs into C positions __shared__ int smem_C[16 * 8]; auto sC_layout = make_layout(make_shape(Int<16>{}, Int<8>{})); auto sC = make_tensor(make_smem_ptr(smem_C), sC_layout); // Clear for (int i = threadIdx.x; i < 128; i += blockDim.x) smem_C[i] = -1; __syncthreads(); // Each thread stamps its ID into its owned C cells auto thr_mma = tiled_mma.get_thread_slice(threadIdx.x); auto tCsC = thr_mma.partition_C(sC); for (int i = 0; i < size(tCsC); ++i) tCsC(i) = threadIdx.x; __syncthreads(); // Thread 0 prints the map if (threadIdx.x == 0) { printf("MMA C Ownership Map (16x8) ; SM80_16x8x16_F32F16F16F32_TN\n"); printf("Each thread holds 4 float accumulators.\n\n"); printf(" "); for (int n = 0; n < 8; ++n) printf("n=%-4d", n); printf("\n"); for (int m = 0; m < 16; ++m) { printf("m=%-4d ", m); for (int n = 0; n < 8; ++n) printf("T%02d ", sC(m, n)); printf("\n"); } } } // ─── Kernel 2: Micro-GEMM ─── // One warp computes C(16×8) += A(16×16) × B(8×16)^T using a single MMA instruction. __global__ void micro_gemm(half const* A_ptr, half const* B_ptr, float* C_ptr) { using MMA_Op = SM80_16x8x16_F32F16F16F32_TN; auto tiled_mma = make_tiled_mma(MMA_Atom{}); // ── Shared memory ── __shared__ half smem_A[16 * 16]; // M × K __shared__ half smem_B[8 * 16]; // N × K // A layout: (M,K) = (16,16), K stride-1 (required by TN atom) auto sA_layout = make_layout(make_shape(Int<16>{}, Int<16>{}), make_stride(Int<16>{}, Int<1>{})); // B layout: (N,K) = (8,16), K stride-1 auto sB_layout = make_layout(make_shape(Int<8>{}, Int<16>{}), make_stride(Int<16>{}, Int<1>{})); // C layout: (M,N) = (16,8), column-major auto gC_layout = make_layout(make_shape(Int<16>{}, Int<8>{})); auto sA = make_tensor(make_smem_ptr(smem_A), sA_layout); auto sB = make_tensor(make_smem_ptr(smem_B), sB_layout); auto gC = make_tensor(make_gmem_ptr(C_ptr), gC_layout); // ── 1. Load A, B from global to shared (simple loop ; not TiledCopy) ── auto gA = make_tensor(make_gmem_ptr(A_ptr), sA_layout); auto gB = make_tensor(make_gmem_ptr(B_ptr), sB_layout); for (int i = threadIdx.x; i < size(sA); i += blockDim.x) sA(i) = gA(i); for (int i = threadIdx.x; i < size(sB); i += blockDim.x) sB(i) = gB(i); __syncthreads(); // ── 2. Partition A, B, C for this thread ── auto thr_mma = tiled_mma.get_thread_slice(threadIdx.x); auto tCsA = thr_mma.partition_A(sA); // (MMA, MMA_M, MMA_K) = (8, 1, 1) auto tCsB = thr_mma.partition_B(sB); // (MMA, MMA_N, MMA_K) = (4, 1, 1) auto tCgC = thr_mma.partition_C(gC); // (MMA, MMA_M, MMA_N) = (4, 1, 1) // ── 3. Create register fragments ── auto tCrA = make_fragment_like(tCsA); // 8 half regs for A auto tCrB = make_fragment_like(tCsB); // 4 half regs for B auto accum = make_fragment_like(tCgC); // 4 float regs for C clear(accum); // ── 4. Copy A, B from shared memory to registers ── copy(tCsA, tCrA); copy(tCsB, tCrB); // ── 5. Execute the MMA! ── gemm(tiled_mma, tCrA, tCrB, accum); // ── 6. Write result back to global memory ── copy(accum, tCgC); } int main() { constexpr int M = 16, N = 8, K = 16; // ─── 1. Ownership Map ─── printf("=== MMA Ownership Map ===\n\n"); mma_ownership_map<<<1, 32>>>(); cudaDeviceSynchronize(); // ─── 2. Micro-GEMM ─── printf("\n=== Micro-GEMM: C(16x8) = A(16x16) * B(8x16)^T ===\n\n"); // Test data: A = all ones, B = all ones → C[m][n] = K = 16.0 half h_A[M * K], h_B[N * K]; float h_C[M * N] = {}; for (int i = 0; i < M * K; ++i) h_A[i] = __float2half(1.0f); for (int i = 0; i < N * K; ++i) h_B[i] = __float2half(1.0f); half *d_A, *d_B; float *d_C; cudaMalloc(&d_A, M * K * sizeof(half)); cudaMalloc(&d_B, N * K * sizeof(half)); cudaMalloc(&d_C, M * N * sizeof(float)); cudaMemcpy(d_A, h_A, M * K * sizeof(half), cudaMemcpyHostToDevice); cudaMemcpy(d_B, h_B, N * K * sizeof(half), cudaMemcpyHostToDevice); micro_gemm<<<1, 32>>>(d_A, d_B, d_C); cudaDeviceSynchronize(); cudaMemcpy(h_C, d_C, M * N * sizeof(float), cudaMemcpyDeviceToHost); // C is column-major: element (m,n) is at offset m + M*n printf("C (16x8) ; each entry = K * 1.0 * 1.0 = 16.0:\n\n"); printf(" "); for (int n = 0; n < N; ++n) printf("n=%-6d", n); printf("\n"); for (int m = 0; m < M; ++m) { printf("m=%-4d ", m); for (int n = 0; n < N; ++n) printf("%-8.1f", h_C[m + M * n]); printf("\n"); } cudaFree(d_A); cudaFree(d_B); cudaFree(d_C); return 0; } ``` **Expected Output:** ```text === MMA Ownership Map === MMA C Ownership Map (16x8) ; SM80_16x8x16_F32F16F16F32_TN Each thread holds 4 float accumulators. n=0 n=1 n=2 n=3 n=4 n=5 n=6 n=7 m=0 T00 T00 T04 T04 T08 T08 T12 T12 m=1 T01 T01 T05 T05 T09 T09 T13 T13 m=2 T02 T02 T06 T06 T10 T10 T14 T14 m=3 T03 T03 T07 T07 T11 T11 T15 T15 m=4 T16 T16 T20 T20 T24 T24 T28 T28 m=5 T17 T17 T21 T21 T25 T25 T29 T29 m=6 T18 T18 T22 T22 T26 T26 T30 T30 m=7 T19 T19 T23 T23 T27 T27 T31 T31 m=8 T00 T00 T04 T04 T08 T08 T12 T12 m=9 T01 T01 T05 T05 T09 T09 T13 T13 m=10 T02 T02 T06 T06 T10 T10 T14 T14 m=11 T03 T03 T07 T07 T11 T11 T15 T15 m=12 T16 T16 T20 T20 T24 T24 T28 T28 m=13 T17 T17 T21 T21 T25 T25 T29 T29 m=14 T18 T18 T22 T22 T26 T26 T30 T30 m=15 T19 T19 T23 T23 T27 T27 T31 T31 === Micro-GEMM: C(16x8) = A(16x16) * B(8x16)^T === C (16x8) ; each entry = K * 1.0 * 1.0 = 16.0: n=0 n=1 n=2 n=3 n=4 n=5 n=6 n=7 m=0 16.0 16.0 16.0 16.0 16.0 16.0 16.0 16.0 m=1 16.0 16.0 16.0 16.0 16.0 16.0 16.0 16.0 ... m=15 16.0 16.0 16.0 16.0 16.0 16.0 16.0 16.0 ``` The ownership map reveals the hardware's register distribution: threads 0–3 and 16–19 alternate in 4-row blocks along M, and each thread owns pairs of adjacent columns. Thread T00 holds C[0][0], C[0][1], C[8][0], C[8][1] ; four floats scattered across two 4-row blocks. You don't need to memorize this pattern; `partition_C` handles it automatically. The GEMM result is 16.0 everywhere: each entry is the dot product of a length-16 all-ones row from A with a length-16 all-ones row from B. One instruction, 2048 multiply-adds, 32 threads cooperating through their registers. (The instruction has 20+ cycles of latency, but it occupies only one issue slot ; so pipelining multiple MMAs is how real kernels hide that cost.) ## 4. Step-by-Step Explanation **Line: `using MMA_Op = SM80_16x8x16_F32F16F16F32_TN;`** This names the PTX instruction we want: `mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32`. The naming convention: | Part | Meaning | | :--- | :--- | | `SM80` | Ampere architecture (compute capability 8.0+) | | `16x8x16` | M × N × K ; the tile dimensions of one MMA instruction | | `F32F16F16F32` | Types: D=float, A=half, B=half, C=float (D, A, B, C order) | | `TN` | A is K-major (Transposed), B is K-major (Normal) | The "TN" layout is the most common: both A and B have their K dimension as stride-1. This means your shared memory layouts for A(M,K) and B(N,K) should both have K as the fast-moving axis. **Line: `auto tiled_mma = make_tiled_mma(MMA_Atom{});`** Wraps the raw hardware instruction in CuTe's `MMA_Atom`, then promotes it to a `TiledMMA`. With no extra tiling arguments, this creates the simplest possible MMA ; one atom, one warp, no repetition. The tile size is exactly 16×8×16. Compare with Tutorial 04's `make_tiled_copy(Copy_Atom, thr_layout, val_layout)`. For MMA, the thread layout and value layout are determined by the hardware instruction itself ; there's nothing to choose. The atom *is* the complete plan. **Lines: `sA_layout` with `make_stride(Int<16>{}, Int<1>{})`** A is (M, K) = (16, 16) with K stride-1. This means `sA(m, k) = m * 16 + k * 1` ; moving along K is contiguous in memory. The TN atom requires this: the PTX instruction's "row" descriptor for A means K is the fast index. If you accidentally used column-major (M stride-1, K stride-16), the MMA would read the wrong values ; you'd be feeding columns of A where it expects rows. **Lines: `sB_layout` with `make_stride(Int<16>{}, Int<1>{})`** B is (N, K) = (8, 16), also K stride-1. Same reason: the TN atom's "col" descriptor for B also means K is the fast index. Both A and B want K contiguous. > This is why "TN" is popular: when A and B are both K-major, your global memory loads can be coalesced along the K dimension. You'll see this pattern again in Tutorial 07's full GEMM. **Line: `auto tCsA = thr_mma.partition_A(sA); // (MMA, MMA_M, MMA_K) = (8, 1, 1)`** `partition_A` asks: "Given the full shared memory tensor `sA` and the MMA's register layout, which elements belong to this thread?" It returns a 3-mode tensor: - **Mode 0 (MMA):** The 8 half values this thread feeds into one atom invocation. - **Mode 1 (MMA_M):** How many times the atom repeats along M. Our A has M=16 and the atom handles M=16, so 1 repetition. - **Mode 2 (MMA_K):** How many times the atom repeats along K. Our K=16 matches the atom's K=16, so 1 repetition. Compare with Tutorial 04: `partition_S` returned `(ValuesPerAtom, Reps...)`. Same idea, different dimension names. **Line: `auto tCrA = make_fragment_like(tCsA);`** Creates a register tensor with the same shape as the smem partition: `(8, 1, 1)` of half-precision values. These 8 registers are this thread's personal tray ; the slot in the stamping press that it's responsible for loading. **Line: `copy(tCsA, tCrA);`** Copies this thread's 8 half values from shared memory to registers. Each thread reads from different smem addresses (determined by `partition_A`), so there's no conflict. After this line, all 32 threads have their A fragments loaded and ready. **Line: `gemm(tiled_mma, tCrA, tCrB, accum);`** This is the magic line. `gemm()` dispatches the PTX `mma.sync` instruction, which: 1. Reads A fragments from all 32 threads' registers (8 halfs each = 256 total = 16×16 matrix). 2. Reads B fragments from all 32 threads' registers (4 halfs each = 128 total = 8×16 matrix). 3. Multiplies the 16×16 matrix by the 8×16 matrix (as C += A × B^T). 4. Accumulates the 16×8 result into all 32 threads' C registers (4 floats each = 128 total). This is a single instruction, but **not** a single clock cycle ; the MMA has a latency of 20+ cycles. However, it only occupies one issue slot, so the warp scheduler can issue other independent instructions (or another MMA from a different warp) while the Tensor Core is still working. Real GEMM kernels pipeline multiple MMAs to hide this latency. The `.sync` in the PTX name means all threads in the warp participate simultaneously ; this is a collective operation, not a per-thread one. **Line: `copy(accum, tCgC);`** Copies the 4 float accumulator values from registers to their corresponding positions in global memory C. Each thread writes to different addresses (determined by `partition_C`), so no conflicts. The host then reads back the complete 16×8 result. ## 5. Engineer's Notebook (Latent Space Notes) **Analogy:** `MMA_Atom` is a **stamping press** ; a fixed-size hardware machine that takes raw-material trays (A and B register fragments), processes them all at once, and stamps out a product (C accumulator). Each worker (thread) loads their assigned tray slot. `TiledMMA` is the factory floor plan that coordinates the workers. `gemm()` is pressing the button. This analogy extends the Tutorial 03–04 warehouse metaphor: data arrives from the truck (global memory) via the loading dock (TiledCopy + smem), and now the stamping press (TiledMMA) processes it in the factory. The same workers (threads) do both jobs ; they just switch from mover role to machine-operator role. **Fragment size formula:** For any MMA atom with shape M×N×K using W threads: | Fragment | Per-thread values | Total | | :--- | :--- | :--- | | A | M × K / W | M × K | | B | N × K / W | N × K | | C | M × N / W | M × N | For SM80_16x8x16 with W=32: A = 8 halfs, B = 4 halfs, C = 4 floats. **What "TN" means for your layouts:** | Layout | A (M, K) | B (N, K) | | :--- | :--- | :--- | | **TN** | K stride-1 `(K, 1)` | K stride-1 `(K, 1)` | | TT | K stride-1 `(K, 1)` | N stride-1 `(1, N)` | | NN | M stride-1 `(1, M)` | N stride-1 `(1, N)` | | NT | M stride-1 `(1, M)` | K stride-1 `(K, 1)` | TN is the most common because both A and B have K contiguous ; global memory loads along the K-reduction dimension are coalesced. **The ownership map tells you something important:** Thread T00 owns C[0][0], C[0][1], C[8][0], C[8][1]. These four values are *not* contiguous in column-major memory (offsets 0, 16, 8, 24). This means writing the C accumulator to global memory via `copy(accum, tCgC)` produces scattered stores ; each thread writes 4 separate floats to non-contiguous addresses. For a real GEMM kernel (Tutorial 07), the epilogue stage would use a TiledCopy to reorganize the writes through shared memory for coalesced global stores. **Hardware note ; checking your GPU:** The SM80 MMA atom requires compute capability 8.0 or higher (A100, RTX 3090, etc.). To check: ```bash nvidia-smi --query-gpu=compute_cap --format=csv,noheader # Should print 8.0 or higher ``` **Hardware note ; latency vs. throughput:** A single `mma.sync.m16n8k16` has a **latency** of 20+ clock cycles ; the result isn't ready to read for that long. But the **maximum issue frequency** (throughput) is much higher than one-per-20-cycles, because the Tensor Core is pipelined. The exact throughput depends on the GPU architecture: | GPU | Max issue frequency (per SM partition) | | :--- | :--- | | A100 (SM80) | 1 MMA per 8 cycles | | Consumer Ampere (e.g. RTX 3090) | 1 MMA per 16 cycles | | Some other consumer GPUs | 1 MMA per 32 cycles | On A100, each SM has 4 partitions (sub-partitions / warp schedulers), so the SM-wide throughput is one MMA per 2 cycles, or 2048 FMAs every 2 cycles. Multiply by 108 SMs and a 1.41 GHz clock and you get the advertised Tensor Core TFLOPS ; but only if the data pipeline (Tutorials 03–05) keeps the Tensor Cores fed and you have enough independent MMAs in flight to hide the 20+ cycle latency. > **Gotcha ; layout mismatch:** If your smem layout doesn't match the TN convention (K stride-1), the MMA will read the wrong elements. The symptom is silently wrong results, not a crash. Always double-check that your A and B shared memory strides put K as stride-1 for TN atoms. > **Gotcha ; half precision:** The input matrices *must* be `half` (F16). If you accidentally store `float` values in smem and cast the pointer to `half*`, you'll get garbage. Use `__float2half()` to convert before storing. **What comes next:** This tutorial computed a single 16×8 tile using one MMA instruction. But real matrices are much larger than 16×8. Tutorial 07 (The Global GEMM) shows how to tile over M, N, and K ; looping the MMA across tiles, piping data from global memory through shared memory into the Tensor Core in a continuous stream. --- # The Global GEMM — Putting It All Together - URL: https://www.dcbaslani.xyz/blog/07_the_global_gemm/ - Markdown: https://www.dcbaslani.xyz/blog/07_the_global_gemm/index.md - Text: https://www.dcbaslani.xyz/blog/07_the_global_gemm/index.txt - Published: 2026-03-07 - Topic: CUDA - Description: Writing a complete three-level tiled GEMM kernel from scratch using CuTe's TiledCopy, TiledMMA, and swizzled shared memory. # The Global GEMM ; Putting It All Together **Difficulty:** Intermediate–Advanced **Prerequisites:** [Tutorial 02: Partitioning](https://www.dcbaslani.xyz/blog/02_the_art_of_slicing/), [Tutorial 04: TiledCopy](https://www.dcbaslani.xyz/blog/04_the_parallel_copy/), [Tutorial 05: Swizzling](https://www.dcbaslani.xyz/blog/05_swizzling/), [Tutorial 06: Hello, MMA](https://www.dcbaslani.xyz/blog/06_hello_mma/) ## 1. The Problem (The "Why") Tutorial 06 triggered a single Tensor Core instruction on a 16×8×16 (M×N×K) micro-tile. That's 2048 multiply-adds ; impressive for one instruction, but real-world matrices are thousands of elements wide. You need to compute C(M×N) = A(M×K) × Bᵀ(N×K) where M, N, and K can each be 4096 or more. The answer isn't one giant instruction. It's **tiling** ; the same idea from Tutorial 02, now applied at every level: | Level | What's tiled | Controlled by | | :--- | :--- | :--- | | **CTA grid** | Output C is divided into BLK_M×BLK_N tiles, one per thread block | `local_tile` + `blockIdx` | | **K-loop** | The reduction dimension K is divided into BLK_K chunks, processed sequentially | The mainloop `for` loop | | **Warp MMA** | Each BLK_M×BLK_N tile is further divided by the `TiledMMA` across 4 warps | `partition_A/B/C` + `gemm()` | This is the structure of every high-performance GEMM on GPUs ; from CUTLASS to cuBLAS. The names change but the three-level tiling is always the same. This tutorial writes that kernel from scratch, using nothing but the CuTe primitives you already know. > **B200 Note:** On Hopper/Blackwell, the structure is identical but the data-movement layer changes: TiledCopy with `cp.async` or TMA replaces explicit loads, and the K-loop becomes a software pipeline (Tutorial 10). The compute layer (`TiledMMA` + `gemm()`) stays the same. ## 2. The Mental Model (The Visual) ### Running the Factory for a Full Production Day Tutorials 03–06 built the factory: - **Loading dock** (TiledCopy) ; moves raw materials from the truck (global memory) into the warehouse (shared memory), using dollies (vectorized loads) and staggered shelving (swizzle). - **Stamping press** (TiledMMA) ; processes materials from the warehouse shelves into finished products (C accumulator). Now we run the factory for a full **production day**: ```text Global Memory (The Truck) ┌────────────────────────────────────────────────┐ │ A (M × K) B (N × K) │ └──────────┬─────────────────────┬───────────────┘ │ │ ═══════════╪═════════════════════╪══════ K-loop (the shift) ════ ║ ▼ ▼ ║ ║ ┌─────────────┐ ┌─────────────┐ ║ ║ │ A tile │ │ B tile │ ← TiledCopy ║ ║ │ (128 × 32) │ │ (128 × 32) │ gmem → smem ║ ║ └──────┬──────┘ └──────┬──────┘ ║ ║ │ Shared Memory │ ║ ║ ▼ (swizzled) ▼ ║ ║ ┌─────────────────────────────────┐ ║ ║ │ STAMPING PRESS │ ← TiledMMA ║ ║ │ partition → fragment → gemm │ smem → regs → MMA ║ ║ └────────────────┬────────────────┘ ║ ║ │ ║ ║ accum += partial C ║ ║ ║ ═══════════════════════════════════════ repeat for each K tile ══ │ ▼ Epilogue (end of shift ; ship the product) ┌──────────┐ │ C tile │ ← copy accum → gmem │ (128×128) │ └──────────┘ ``` Each thread block is an independent factory. The grid of blocks tiles the M×N output: ```text Grid of CTAs over the output C (M × N): N → ┌─────────┬─────────┬─────────┬───┐ │ CTA │ CTA │ CTA │ │ M │ (0,0) │ (0,1) │ (0,2) │...│ Each CTA computes ↓ │ 128×128 │ 128×128 │ 128×128 │ │ one 128×128 tile of C ├─────────┼─────────┼─────────┼───┤ │ CTA │ CTA │ CTA │ │ │ (1,0) │ (1,1) │ (1,2) │...│ │ 128×128 │ 128×128 │ 128×128 │ │ ├─────────┼─────────┼─────────┼───┤ │ ... │ ... │ ... │ │ └─────────┴─────────┴─────────┴───┘ grid = (M / 128, N / 128) ``` ### The Three Phases ```text Phase 1: COPY (the loading dock) gmem A[BLK_M × BLK_K] ──TiledCopy──▶ smem_A (swizzled) gmem B[BLK_N × BLK_K] ──TiledCopy──▶ smem_B (swizzled) __syncthreads() Phase 2: COMPUTE (the stamping press) smem_A ──partition_A──▶ registers ──┐ ├──gemm()──▶ accum += partial smem_B ──partition_B──▶ registers ──┘ __syncthreads() Phase 3: EPILOGUE (shipping) accum ──partition_C──▶ gmem C[BLK_M × BLK_N] ``` Phases 1 and 2 repeat `ceil(K / BLK_K)` times. Phase 3 happens once. ## 3. The Solution (The Code) A complete GEMM kernel: `C (M×N, float) += A (M×K, half) × B^T (N×K, half)`. Uses 128 threads (4 warps), 128×128 CTA tiles, and SM80 Tensor Cores. **Constraints:** M and N must be multiples of 128; K must be a multiple of 32. (Removing these constraints requires boundary masking ; important for production, orthogonal to the core GEMM structure.) ```cpp #include #include #include #include #include #include #include #include #include #include using namespace cute; // Tile sizes constexpr int BLK_M = 128; constexpr int BLK_N = 128; constexpr int BLK_K = 32; constexpr int NUM_THREADS = 128; // 4 warps __global__ void gemm_kernel( half const* __restrict__ A_ptr, half const* __restrict__ B_ptr, float* __restrict__ C_ptr, int M, int N, int K) { // SETUP: Layouts, TiledCopy, TiledMMA // Global memory tensors // A: (M, K) with K stride-1 B: (N, K) with K stride-1 // C: (M, N) with N stride-1 auto mA = make_tensor(make_gmem_ptr(A_ptr), make_shape(M, K), make_stride(K, Int<1>{})); auto mB = make_tensor(make_gmem_ptr(B_ptr), make_shape(N, K), make_stride(K, Int<1>{})); auto mC = make_tensor(make_gmem_ptr(C_ptr), make_shape(M, N), make_stride(N, Int<1>{})); // CTA tiling: each block claims its tile of A, B, C auto cta_coord = make_coord(blockIdx.x, blockIdx.y); // gA: (BLK_M, BLK_K, k_tiles) ; this block's strip of A, sliced along K auto gA = local_tile(mA, make_shape(Int{}, Int{}), make_coord(get<0>(cta_coord), _)); // gB: (BLK_N, BLK_K, k_tiles) auto gB = local_tile(mB, make_shape(Int{}, Int{}), make_coord(get<1>(cta_coord), _)); // gC: (BLK_M, BLK_N) ; this block's output tile auto gC = local_tile(mC, make_shape(Int{}, Int{}), cta_coord); // Shared memory layouts: K stride-1, swizzled // Swizzle<3,3,3>: M=3 → 8 halfs (128 bits) stay contiguous for vectorized loads auto sA_layout = composition(Swizzle<3, 3, 3>{}, make_layout(make_shape(Int{}, Int{}), make_stride(Int{}, Int<1>{}))); auto sB_layout = composition(Swizzle<3, 3, 3>{}, make_layout(make_shape(Int{}, Int{}), make_stride(Int{}, Int<1>{}))); // Shared memory allocation (static) __shared__ half smem_A[cosize_v]; __shared__ half smem_B[cosize_v]; auto sA = make_tensor(make_smem_ptr(smem_A), sA_layout); auto sB = make_tensor(make_smem_ptr(smem_B), sB_layout); // TiledCopy: gmem → smem via cp.async // Each atom loads 128 bits = 8 halfs along the K (stride-1) direction. // 128 threads, one per M-row. Each thread loads 32 K-values (4 × 128-bit async copies). auto tiled_copy = make_tiled_copy( Copy_Atom, half>{}, Layout>{}, // 128 threads all in M Layout>{} // 32 K-values per thread (4 atoms of 8 halfs) ); // Coverage: (128 × 1, 1 × 32) = (128, 32) ✓ // TiledMMA: the Tensor Core plan // SM80_16x8x16 atom, 4 warps arranged (2,2) in M×N. // Base coverage: (2×16, 2×8) = (32, 16) per atom group. // For 128×128: gemm() iterates 4× in M, 8× in N, 2× in K automatically. auto tiled_mma = make_tiled_mma( MMA_Atom{}, Layout>{} // 4 warps: 2 in M, 2 in N ); // PARTITIONS: thread-level views // Copy partitions auto thr_copy = tiled_copy.get_thread_slice(threadIdx.x); auto tAgA = thr_copy.partition_S(gA); // (CPY_VEC, CPY_M, CPY_K, k_tiles) auto tAsA = thr_copy.partition_D(sA); // (CPY_VEC, CPY_M, CPY_K) auto tBgB = thr_copy.partition_S(gB); // same structure for B auto tBsB = thr_copy.partition_D(sB); // MMA partitions auto thr_mma = tiled_mma.get_thread_slice(threadIdx.x); auto tCsA = thr_mma.partition_A(sA); // (MMA_VAL, MMA_M, MMA_K) auto tCsB = thr_mma.partition_B(sB); // (MMA_VAL, MMA_N, MMA_K) auto tCgC = thr_mma.partition_C(gC); // (MMA_VAL, MMA_M, MMA_N) // Register fragments auto tCrA = make_fragment_like(tCsA); auto tCrB = make_fragment_like(tCsB); auto accum = make_fragment_like(tCgC); // C accumulator (float) clear(accum); // MAINLOOP: iterate over K tiles int k_tiles = size<2>(gA); // = K / BLK_K for (int k = 0; k < k_tiles; ++k) { // Phase 1: COPY gmem → smem copy(tiled_copy, tAgA(_, _, _, k), tAsA); copy(tiled_copy, tBgB(_, _, _, k), tBsB); cp_async_fence(); cp_async_wait<0>(); __syncthreads(); // Phase 2: COMPUTE smem → registers → MMA copy(tCsA, tCrA); copy(tCsB, tCrB); gemm(tiled_mma, tCrA, tCrB, accum); __syncthreads(); } // EPILOGUE: write C accumulator to global memory copy(accum, tCgC); } // Host: verification + timing int main() { constexpr int M = 512, N = 512, K = 256; // Allocate and initialize int sizeA = M * K, sizeB = N * K, sizeC = M * N; half *h_A = new half[sizeA]; half *h_B = new half[sizeB]; float *h_C = new float[sizeC]; // A = all 1.0h, B = all 1.0h → C[m][n] = K for (int i = 0; i < sizeA; ++i) h_A[i] = __float2half(1.0f); for (int i = 0; i < sizeB; ++i) h_B[i] = __float2half(1.0f); half *d_A, *d_B; float *d_C; cudaMalloc(&d_A, sizeA * sizeof(half)); cudaMalloc(&d_B, sizeB * sizeof(half)); cudaMalloc(&d_C, sizeC * sizeof(float)); cudaMemcpy(d_A, h_A, sizeA * sizeof(half), cudaMemcpyHostToDevice); cudaMemcpy(d_B, h_B, sizeB * sizeof(half), cudaMemcpyHostToDevice); // Launch dim3 grid(M / BLK_M, N / BLK_N); // (4, 4) for 512×512 dim3 block(NUM_THREADS); // 128 gemm_kernel<<>>(d_A, d_B, d_C, M, N, K); cudaDeviceSynchronize(); // Verify cudaMemcpy(h_C, d_C, sizeC * sizeof(float), cudaMemcpyDeviceToHost); float max_err = 0.0f; float expected = float(K); // K * 1.0 * 1.0 for (int i = 0; i < sizeC; ++i) max_err = fmaxf(max_err, fabsf(h_C[i] - expected)); printf("GEMM: C(%d x %d) = A(%d x %d) * B^T(%d x %d)\n", M, N, M, K, N, K); printf("Tile: %d x %d x %d, Threads: %d\n\n", BLK_M, BLK_N, BLK_K, NUM_THREADS); // Print a 4×4 corner printf("C[0:4][0:4] (expected %.0f everywhere):\n", expected); for (int m = 0; m < 4; ++m) { printf(" "); for (int n = 0; n < 4; ++n) printf("%8.1f", h_C[m * N + n]); printf("\n"); } printf("\nMax absolute error: %e\n", max_err); printf("Result: %s\n", max_err < 1e-3f ? "PASS" : "FAIL"); // Timing cudaEvent_t start, stop; cudaEventCreate(&start); cudaEventCreate(&stop); int warmup = 5, iters = 20; for (int i = 0; i < warmup; ++i) gemm_kernel<<>>(d_A, d_B, d_C, M, N, K); cudaDeviceSynchronize(); cudaEventRecord(start); for (int i = 0; i < iters; ++i) gemm_kernel<<>>(d_A, d_B, d_C, M, N, K); cudaEventRecord(stop); cudaEventSynchronize(stop); float ms; cudaEventElapsedTime(&ms, start, stop); float avg_ms = ms / iters; double flops = 2.0 * M * N * K; double tflops = (flops / (avg_ms / 1000.0)) / 1e12; printf("\nPerformance: %.3f ms (%.2f TFLOPS)\n", avg_ms, tflops); // Cleanup cudaFree(d_A); cudaFree(d_B); cudaFree(d_C); cudaEventDestroy(start); cudaEventDestroy(stop); delete[] h_A; delete[] h_B; delete[] h_C; return 0; } ``` **Expected Output:** ```text GEMM: C(512 x 512) = A(512 x 256) * B^T(512 x 256) Tile: 128 x 128 x 32, Threads: 128 C[0:4][0:4] (expected 256 everywhere): 256.0 256.0 256.0 256.0 256.0 256.0 256.0 256.0 256.0 256.0 256.0 256.0 256.0 256.0 256.0 256.0 Max absolute error: 0.000000e+00 Result: PASS Performance: X.XXX ms (Y.YY TFLOPS) ``` Each entry of C is 256.0: the dot product of a length-256 all-ones row from A with a length-256 all-ones row from B. The K-loop ran `256 / 32 = 8` iterations, each accumulating a BLK_K=32 partial product. After 8 iterations, the accumulator held the full result. ## 4. Step-by-Step Explanation ### Setup Phase **Lines: Global tensor creation with `make_tensor`** ```cpp auto mA = make_tensor(make_gmem_ptr(A_ptr), make_shape(M, K), make_stride(K, Int<1>{})); ``` Creates a CuTe tensor wrapping global memory. The stride `(K, 1)` means K is stride-1 ; the TN layout required by our MMA atom (Tutorial 06). Note that `K` is a runtime value while `Int<1>{}` is static. CuTe handles mixed static/dynamic strides. **Lines: CTA tiling with `local_tile`** ```cpp auto gA = local_tile(mA, make_shape(Int{}, Int{}), make_coord(get<0>(cta_coord), _)); ``` This is Tutorial 02's `local_tile` at the CTA level ; the casino floor manager assigning tables: - Divides A's M dimension into chunks of BLK_M (128). Block `blockIdx.x` claims its chunk. - Divides A's K dimension into chunks of BLK_K (32). The underscore `_` means "give me all K-tiles as a third mode." - Result shape: `(128, 32, k_tiles)` ; this block's full strip of A, with `k_tiles = K / BLK_K` slices along K. For C, both coordinates are fixed (no underscore): ```cpp auto gC = local_tile(mC, make_shape(Int{}, Int{}), cta_coord); ``` Result shape: `(128, 128)` ; this block's one output tile. **Lines: Shared memory with swizzle** ```cpp auto sA_layout = composition(Swizzle<3, 3, 3>{}, make_layout(make_shape(Int<128>{}, Int<32>{}), make_stride(Int<32>{}, Int<1>{}))); ``` The base layout is `(128, 32):(32, 1)` ; K stride-1, same as gmem. `composition(Swizzle<3,3,3>{}, ...)` applies the XOR remapping from Tutorial 05. The parameters: M=3 preserves 8-half (128-bit) contiguity for vectorized stores, B=3 scrambles 8-bank groups, S=3 separates the bit fields. This is the standard swizzle for half-precision data. **Lines: TiledCopy construction** ```cpp auto tiled_copy = make_tiled_copy( Copy_Atom, half>{}, Layout>{}, Layout>{} ); ``` This is Tutorial 04's supervisor's clipboard, now sized for the GEMM tile: - **Copy_Atom:** `SM80_CP_ASYNC_CACHEALWAYS` ; hardware async copy, 128 bits = 8 halfs per transaction along K (stride-1). Data goes directly from gmem to smem, bypassing registers. - **thr_layout `(128, 1)`:** All 128 threads assigned to the M dimension ; one thread per row. - **val_layout `(1, 32)`:** Each thread copies 32 K-values (4 atoms × 8 halfs each). - **Coverage:** `(128 × 1, 1 × 32) = (128, 32)` ; exactly one smem tile. Each K-loop iteration fills the tile with one `copy()` call per matrix. **Lines: TiledMMA construction** ```cpp auto tiled_mma = make_tiled_mma( MMA_Atom{}, Layout>{} ); ``` Tutorial 06's stamping press, now with multiple warps. The second argument arranges 4 warps in a 2×2 grid over M and N: - Base atom covers 16M × 8N × 16K. - 2×2 warps → combined base: 32M × 16N × 16K. - For the 128×128×32 smem tile, `gemm()` internally loops 4× in M (128/32), 8× in N (128/16), and 2× in K (32/16). You don't write these loops ; `gemm()` handles them. ### Partition Phase **Lines: Copy partitions** ```cpp auto tAgA = thr_copy.partition_S(gA); // (CPY_VEC, CPY_M, CPY_K, k_tiles) auto tAsA = thr_copy.partition_D(sA); // (CPY_VEC, CPY_M, CPY_K) ``` Same as Tutorial 04: `partition_S` creates this thread's view of the source (gmem), `partition_D` for the destination (smem). The extra `k_tiles` mode on `tAgA` comes from gA's third mode ; it passes through untouched. In the mainloop, `tAgA(_, _, _, k)` selects the k-th K-tile for copying. **Lines: MMA partitions and fragments** ```cpp auto tCsA = thr_mma.partition_A(sA); // (MMA_VAL, MMA_M, MMA_K) auto tCrA = make_fragment_like(tCsA); // register copy auto accum = make_fragment_like(tCgC); // C accumulator ``` Same as Tutorial 06: `partition_A` maps smem elements to this thread's MMA fragment slots. `make_fragment_like` creates a register tensor with the same shape. The accumulator `accum` lives in registers for the entire kernel ; it's never written to smem. ### Mainloop ```cpp for (int k = 0; k < k_tiles; ++k) { copy(tiled_copy, tAgA(_, _, _, k), tAsA); // gmem → smem copy(tiled_copy, tBgB(_, _, _, k), tBsB); cp_async_fence(); cp_async_wait<0>(); __syncthreads(); copy(tCsA, tCrA); // smem → registers copy(tCsB, tCrB); gemm(tiled_mma, tCrA, tCrB, accum); // MMA! __syncthreads(); } ``` Each iteration processes one BLK_K=32 slice of the K dimension: 1. **Copy** ; TiledCopy loads A's k-th tile `(128×32)` and B's k-th tile `(128×32)` from global to shared memory. The swizzle is transparent ; the TiledCopy writes through swizzled addresses automatically. 2. **Sync** ; All 128 threads must finish writing to smem before any thread starts reading for the MMA. 3. **Compute** ; Copy from swizzled smem to register fragments, then `gemm()` executes the MMA atoms. Because the tile is larger than one atom, `gemm()` internally loops over M, N, and K repetitions. The result **accumulates** into `accum` ; each K-iteration adds to the running sum, computing the partial dot products. 4. **Sync** ; All threads must finish reading smem before the next iteration overwrites it. After `k_tiles` iterations, `accum` holds the complete C tile. ### Epilogue ```cpp copy(accum, tCgC); ``` Each thread writes its accumulator fragment (128 floats for this config) to global memory via the MMA's C partition. As noted in Tutorial 06, the C fragments are scattered ; each thread writes to non-contiguous addresses. For a production kernel, CUTLASS routes the epilogue through shared memory for coalesced writes. For correctness, the simple scattered write works fine. ## 5. Engineer's Notebook (Latent Space Notes) **Analogy: The Full Production Day.** - **CTA grid** = multiple independent factories, one per output tile. They never interact. - **K-loop (mainloop)** = one production shift. Each iteration loads a batch of raw materials (A/B K-tiles via TiledCopy), processes them on the stamping press (TiledMMA + gemm), and adds the result to the running accumulator. - **Epilogue** = end of shift. Ship the finished product (accumulated C tile) to global memory. The three-level tiling is the single most important structural insight for GPU GEMM. Everything else ; pipelining, double buffering, warpgroup MMA ; is optimization on top of this skeleton. **Tile size cheat sheet:** | Parameter | Typical Values | Trade-off | | :--- | :--- | :--- | | BLK_M × BLK_N | 128×128, 128×256, 256×128 | Larger tiles → more compute per gmem load (better arithmetic intensity). But more smem usage → fewer CTAs per SM → less occupancy. | | BLK_K | 32, 64 | Larger K → fewer mainloop iterations → less loop overhead. But more smem per tile (BLK_M × BLK_K × sizeof(half) per matrix). | | Threads | 128 (4 warps), 256 (8 warps) | More threads → more copy bandwidth, but more MMA rep sharing overhead. | **Why the swizzle parameters are `<3, 3, 3>` for half:** - M=3: 2^3 = 8 halfs = 128 bits stay contiguous. Matches the 128-bit vectorized copy. - B=3: 2^3 = 8 banks scrambled. Breaks the column-stride conflict pattern. - S=3: Source bits `[6:9)`, target bits `[3:6)`. Non-overlapping. Compare with Tutorial 05's `Swizzle<3, 2, 3>` for float: M=2 (4 floats = 128 bits). The only difference is the free-bit count M, which must match the element size for 128-bit vectorization. **Why two `__syncthreads()` in the mainloop:** ```text Iteration k: Iteration k+1: copy gmem → smem copy gmem → smem ← OVERWRITES smem! ──sync── (1) ──sync── gemm reads smem gemm reads smem ──sync── (2) ──sync── ``` Sync (1): Ensure all threads finished writing smem before any thread reads it for gemm. Sync (2): Ensure all threads finished reading smem before the next iteration's copy overwrites it. Without sync (2), a fast thread could start writing the (k+1)-th A tile into smem while a slow thread is still reading the k-th A tile for its gemm. **The epilogue bottleneck:** Our epilogue uses `copy(accum, tCgC)` ; each thread writes its C fragments directly to global memory with scattered stores. On A100, this wastes memory bus bandwidth because each 4-byte float write fetches a full 128-byte cache line. CUTLASS's production epilogues: 1. Each warp writes its C fragment to smem (in a contiguous layout). 2. Threads read back from smem in a coalesced pattern and write to gmem. 3. Optional: fuse element-wise operations (bias add, ReLU) into the epilogue. For this tutorial, the simple epilogue is correct and clear. Optimizing it is a straightforward extension. **Performance expectations:** This kernel will not match cuBLAS. Missing optimizations: - No double-buffering (the copy and compute phases are serialized) - No software pipelining (Tutorial 10) - No TMA (Tutorial 08) - Scattered C epilogue - No predication for boundary tiles On A100 with 512×512×256, expect 5–15 TFLOPS (vs. ~150 TFLOPS peak for cuBLAS with large matrices). The point is not speed ; it's understanding the structure that every fast GEMM kernel shares. > **Gotcha ; dynamic vs. static shapes:** The global tensors use dynamic shapes (runtime M, N, K), but the smem layouts and tile shapes use static `Int{}`. This is intentional ; CuTe needs static shapes to generate efficient code (compile-time layout math, unrolled loops). Dynamic shapes in gmem are fine because the copy and MMA operate on the statically-shaped smem/register tiles. > **Gotcha ; smem capacity:** Two smem tiles of 128×32 halfs = 2 × 128 × 32 × 2 bytes = 16 KB. Well within SM80's 164 KB limit. But 128×128 tiles with BLK_K=64 would need 32 KB ; still fine, but leaves less for double buffering. Always check: `smem_bytes = 2 × BLK_M × BLK_K × sizeof(half)`. **What comes next:** This kernel serializes copy and compute ; the Tensor Cores are idle while waiting for data loads. Tutorial 08 (The TMA Revolution) introduces hardware-accelerated copies that free the threads. Tutorial 10 (Pipeline Overlap) shows how to overlap copy and compute using ping-pong buffering, approaching peak throughput. --- # The TMA Revolution (Async Copy) - URL: https://www.dcbaslani.xyz/blog/08_the_tma_revolution/ - Markdown: https://www.dcbaslani.xyz/blog/08_the_tma_revolution/index.md - Text: https://www.dcbaslani.xyz/blog/08_the_tma_revolution/index.txt - Published: 2026-03-23 - Topic: CUDA - Description: With the Hopper and Blackwell architectures, NVIDIA introduced the Tensor Memory Accelerator (TMA). Instead of having threads manually calculating pointers and copying data, a single thread can offload the entire tile copy to dedicated hardware. # The TMA Revolution (Async Copy) **Difficulty:** Advanced **Prerequisites:** [Tutorial 04: The Parallel Copy](https://www.dcbaslani.xyz/blog/04_the_parallel_copy/) ## 1. The Problem (The "Why") "The CPU is wasting time calculating addresses for copies. Let the hardware do it." Until now, we have used `TiledCopy` to coordinate warps and threads to fetch data from global memory into shared memory. The problem? **Every single thread is computing memory addresses.** For each element loaded, threads execute instructions just to resolve `address = coord · stride`. This burns registers and arithmetic logic unit (ALU) cycles that should be spent on matrix multiplication. With the Hopper and Blackwell architectures, NVIDIA introduced the **Tensor Memory Accelerator (TMA)**. Instead of having 128 threads manually calculating pointers and copying data, a single thread can offload the entire tile copy to dedicated hardware. ## 2. The Mental Model (The Visual) If `TiledCopy` is a warehouse crew of 128 workers carrying boxes (where each worker calculates their path), `TMA` is an **autonomous forklift with a manifest**. ```text TiledCopy (threads do address math): 128 workers compute addresses: address = coord · stride Then load/store each element. [threads] ──address math──▶ LDG/LDS ──▶ smem TMA (hardware does address math): 1 dispatcher submits a manifest (descriptor). The TMA engine moves the entire pallet. [one thread] ──submit manifest──▶ TMA engine ──▶ smem ``` You create the manifest on the host, hand it to the forklift inside the kernel, and the hardware handles the rest asynchronously. ## 3. The Solution (The Code) Here is a minimal, complete C++ example of copying a 128×64 tile using TMA. Notice how no threads are computing layouts inside the kernel loop! ```Cpp #include #include #include #include #include using namespace cute; constexpr int BLK_M = 128; constexpr int BLK_K = 64; // Shared memory workspace struct SharedStorage { alignas(128) cutlass::half_t smem[BLK_M * BLK_K]; uint64_t tma_barrier; }; template __global__ void tma_copy_kernel(TmaDesc tma) { __shared__ SharedStorage ss; // 1. Describe the shared memory target auto sA_layout = make_layout(make_shape(Int{}, Int{}), make_stride(Int{}, Int<1>{})); auto sA = make_tensor(make_smem_ptr(ss.smem), sA_layout); // 2. Fetch the global memory tensor from the TMA descriptor auto gA = tma.get_tma_tensor(make_shape(Int{}, Int{})); // 3. Partition the TMA copy auto [tAgA, tAsA] = tma_partition( tma, Int<0>{}, // CTA block coordinate Layout<_1>{}, // Grid layout group_modes<0,2>(sA), // Collapse modes for flat copy group_modes<0,2>(gA) ); // 4. Dispatch the autonomous forklift! if (threadIdx.x == 0) { using Bar = cutlass::arch::ClusterTransactionBarrier; // Initialize the mbarrier Bar::init(&ss.tma_barrier, 1); int bytes_to_copy = BLK_M * BLK_K * sizeof(cutlass::half_t); // Announce the expected transaction size Bar::arrive_and_expect_tx(&ss.tma_barrier, bytes_to_copy); // Issue the async copy using the manifest copy(tma.with(ss.tma_barrier), tAgA, tAsA); // Block until the forklift delivers the pallet Bar::wait(&ss.tma_barrier, 0); } __syncthreads(); // Now sA is filled and ready for computation! } void host_launch_example() { // We must build the TMA descriptor on the HOST. cutlass::half_t* d_A; cudaMalloc(&d_A, BLK_M * BLK_K * sizeof(cutlass::half_t)); auto mA_layout = make_layout(make_shape(Int{}, Int{}), make_stride(Int{}, Int<1>{})); auto mA = make_tensor(make_gmem_ptr(d_A), mA_layout); auto sA_layout = make_layout(make_shape(Int{}, Int{}), make_stride(Int{}, Int<1>{})); // Printing the manifest for the autonomous forklift auto tma = make_tma_atom(SM90_TMA_LOAD{}, mA, sA_layout, make_shape(Int{}, Int{})); // Launch the kernel tma_copy_kernel<<<1, 128>>>(tma); cudaDeviceSynchronize(); cudaFree(d_A); } int main() { host_launch_example(); printf("TMA Copy Successful!\n"); return 0; } ``` ## 4. Step-by-Step Explanation Line `make_tma_atom(SM90_TMA_LOAD{}, mA, sA_layout, ...)`: This runs entirely on the host. It bundles the `Layout` math into a hardware-readable manifest. We are telling the GPU how global memory and shared memory relate, *before* the kernel starts. Line `auto [tAgA, tAsA] = tma_partition(...)`: We slice up the tensors for the copy. Unlike `TiledCopy` which assigns slices per-thread, `tma_partition` treats the CTA as one large worker. Line `if (threadIdx.x == 0)`: Only **one dispatcher thread** is needed to initiate the TMA load. Line `copy(tma.with(&ss.tma_barrier), tAgA, tAsA)`: The actual dispatch command. The hardware asynchronously begins fetching the entire tile into shared memory. Line `Bar::wait(...)`: Because the forklift works asynchronously, we must wait at the loading dock (the `mbarrier`) before we can safely read the data. ## 5. Engineer's Notebook (Latent Space Notes) **Analogy:** `Tensor Memory Accelerator (TMA)` is an **autonomous forklift with a manifest**. You don’t have 128 workers carrying boxes anymore. One dispatcher hands the forklift a manifest (the TMA descriptor), and the hardware moves the entire pallet into shared memory. `make_tma_atom` is printing the manifest for the autonomous forklift (the TMA descriptor). **Hardware Constraints & Gotchas:** > **Gotcha ; TMA descriptors are host-built.** `make_tma_atom` must run on the CPU. The descriptor encodes the address math for the tile and cannot be created inside the kernel. > **Gotcha ; TMA is async and barrier-driven.** One thread launches the copy; an mbarrier (`ClusterTransactionBarrier`) is required before any thread reads from smem. Without the `wait()`, you will read garbage data. > **Gotcha ; TMA prefers static layouts.** Tile shape and strides should be `Int{}` so the descriptor is fully static and the copy is vector-friendly. CuTe uses these static layouts at compile-time to guarantee contiguity and optimize the transfer. > **B200/Hopper Note:** TMA is the primary way to saturate memory bandwidth on SM90+ architectures. Understanding strides (`address = coord · stride`) is the only way to program the TMA descriptor correctly! --- # WGMMA ; Warpgroup MMA - URL: https://www.dcbaslani.xyz/blog/09_wgmma/ - Markdown: https://www.dcbaslani.xyz/blog/09_wgmma/index.md - Text: https://www.dcbaslani.xyz/blog/09_wgmma/index.txt - Published: 2026-04-04 - Topic: CUDA - Description: How to use Warpgroup MMA (WGMMA) to feed NVIDIA Tensor Cores directly from shared memory, bypassing the register file bottleneck. # 09. WGMMA (Warpgroup MMA) **Difficulty:** Advanced **Prerequisites:** Tutorial 06: Hello, MMA, Tutorial 08: The TMA Revolution ## 1. The Problem (The "Why") Single warps are too small. We need 128 threads driving the Tensor Core together. In previous generations, threads had to load memory from shared memory into their own registers before they could perform an MMA (Matrix Multiply Accumulate). This creates a massive bottleneck: the register file gets saturated, and the `ldmatrix` instructions waste precious clock cycles. We need a way to feed the Tensor Core directly from the shared memory staging area where the TMA forklift just dropped our data. ## 2. The Mental Model (The Visual) WGMMA is the **Heavy Industrial Press**. Unlike a standard press where each worker loads their own tray, the WGMMA press reads directly from the forklift's delivery zone. ```text Single-Warp MMA (SM80/SM89): smem ──(ldmatrix)──▶ registers ──(mma.sync)──▶ accumulators WGMMA (SM90): smem ──(wgmma descriptor)────────────────────▶ accumulators (128 threads coordinate) ``` And to make things even more efficient, we have **TMA Multicast** ; the forklift driver shouting "who else needs this?" and dropping one pallet into multiple CTA factories at once: ```text Global Memory │ (TMA Forklift reads once) ▼ Cluster Broadcast Network ┌──┴──┐ ▼ ▼ smem smem (Delivered to multiple CTA factories simultaneously) CTA0 CTA1 ``` ## 3. The Solution (The Code) Here is how you define a WGMMA instruction using CuTe: ```cpp #include #include #include #include using namespace cute; void example() { // 1. Define the Heavy Industrial Press (WGMMA) // 64x64x16 atom. F32 accumulators, F16 inputs. // SS = both A and B are read directly from Shared Memory. // GMMA::Major::K means the K dimension is contiguous (TN layout). using WGMMA_Atom = SM90_64x64x16_F32F16F16_SS; // 2. Build the factory floor plan for the 4-warp crew auto tiled_mma = make_tiled_mma(WGMMA_Atom{}); // 3. Assume smem_A and smem_B are descriptors to our shared memory // (In reality, you would build these using make_smdest_ptr) // auto frag_A = ... // auto frag_B = ... // auto accum = ... // 4. Press the button! (Requires exactly 128 threads) // gemm(tiled_mma, frag_A, frag_B, accum); } ``` ## 4. Step-by-Step Explanation Line 14: `SM90_64x64x16_F32F16F16_SS<...>` defines our specific `WGMMA` operation. It calculates a 64x64x16 matrix multiply. Crucially, the `_SS` suffix means both A and B operands will be sourced directly from Shared Memory descriptors, not from registers. The template parameters `` indicate that both A and B have K as their fast-moving contiguous dimension. Line 17: `make_tiled_mma` binds the instruction to a `Warpgroup`. CuTe knows that this SM90 atom requires exactly 128 threads (4 warps) and configures the `TiledMMA` layout automatically to coordinate them. Line 26: `gemm(...)` dispatches the asynchronous `wgmma` instruction to the hardware. All 128 threads must participate in this call. ## 5. Engineer's Notebook (Latent Space Notes) Analogy: Think of the `Warpgroup` as a synchronized crew of 4 warps working together as a single unit, and `SM90_WGMMA` as the Heavy Industrial Press. It takes inputs directly from the forklift's delivery zone (`smem` desc) rather than personal trays. `tma_multicast` is the forklift driver shouting "who else needs this?", delivering one pallet to multiple factories simultaneously. Hardware Note: On Hopper (SM90), WGMMA requires exactly 128 threads. You cannot launch a warpgroup MMA with a single warp; it is a collaborative operation across 4 warps. TMA Multicast requires Threadblock Clusters ; you can only multicast a TMA load to CTAs that are part of the same hardware cluster. --- # Cute-DSL: I Wrote a CUDA Kernel in Python and My GPU Didn't Even Cry - URL: https://www.dcbaslani.xyz/blog/cute-dsl-blog/ - Markdown: https://www.dcbaslani.xyz/blog/cute-dsl-blog/index.md - Text: https://www.dcbaslani.xyz/blog/cute-dsl-blog/index.txt - Published: 2026-04-19 - Topic: CUDA - Description: Welcome to the ultimate guide to cute-dsl! Bringing the power of CuTe's concepts like Layouts, Tilers, and vectorized memory operations into a familiar, Pythonic interface. # Cute-DSL: I Wrote a CUDA Kernel in Python and My GPU Didn't Even Cry Welcome to the ultimate guide to `cute-dsl`! If you've tinkered with CUDA programming and are curious about CUTLASS's CuTe but felt overwhelmed by the complex C++ templates, `cute-dsl` is your new best friend. It brings the power of CuTe's concepts; like Layouts, Tilers, and vectorized memory operations; straight into a familiar, Pythonic interface. In this blog, we'll dive into the basics, starting with simple APIs, building intuition around memory layouts with ASCII diagrams, and graduating to vectorized kernels. ## The Essentials: `cute.kernel` and `cute.jit` At the core of `cute-dsl` are two crucial decorators: - `@cute.kernel`: This marks a Python function as a device kernel, meaning it will run natively on the GPU. It behaves similarly to PyTorch's Triton `@triton.jit` or Numba's `@cuda.jit`. - `@cute.jit`: This marks a host function that will compile and launch your device kernels. Think of it as the bridge between your standard Python code and your GPU parallel executions. ### A Gentle Start: Hello, GPU Let's look at the absolute simplest program you can write: ```python import cutlass import cutlass.cute as cute @cute.kernel def hello_kernel(): # Fetch thread coordinates just like CUDA! tidx, _, _ = cute.arch.thread_idx() if tidx == 0: cute.printf("hello from gpu") @cute.jit def hello_world(): cutlass.cuda.initialize_cuda_context() hello_kernel().launch(grid=(1, 1, 1), block=(32, 1, 1)) if __name__ == "__main__": compiled = cute.compile(hello_world) compiled() ``` When you define `hello_kernel`, it gets compiled into PTX code behind the scenes. Inside it, we use `cute.arch.thread_idx()` ; which returns `(threadIdx.x, threadIdx.y, threadIdx.z)`, just like in normal CUDA. We then launch it using standard `(grid, block)` dimensions inside a `@cute.jit`-compiled host function. Beautifully simple! ## Demystifying Layouts A **Layout** in `cute` is the fundamental building block of memory access. It dictates how a logical multi-dimensional coordinate maps to a 1D flat memory index. A Layout is parameterized by two things: **Shape** and **Stride**. Let's imagine an 8x8 matrix in **Column-Major** layout: `shape=(8, 8)` and `stride=(1, 8)`. The formula to convert a multi-dimensional coordinate to a flat memory index is: `index = row * stride[0] + col * stride[1]`. With stride `(1, 8)`, moving one step down a column (incrementing row) moves by 1 in memory, while moving one step across a row (incrementing column) jumps by 8. ```text Logical 8x8 View (showing flat index) Flat Memory (Column-Major) +----+----+----+----- ----------- | 0 | 8 | 16 | ... [0] -> (row=0, col=0) +----+----+----+----- [1] -> (row=1, col=0) | 1 | 9 | 17 | ... [2] -> (row=2, col=0) +----+----+----+----- ... | 2 | 10 | 18 | ... [8] -> (row=0, col=1) +----+----+----+----- [9] -> (row=1, col=1) ``` In **Row-Major** layout, the stride would be `(8, 1)` ; meaning moving to the *next row* jumps by 8 in flat memory, while moving to the *next column* steps by just 1. So elements within the same row are contiguous in memory, which is the opposite of column-major. Layouts in `cute` can also be **hierarchically nested**. For example, a shape of `(2, (2, 4))` with stride `(1, (2, 4))` describes a 2-element outer dimension where each element contains a nested 2×4 sub-layout. This nesting lets you mathematically model complex memory access patterns like blocked or swizzled layouts ; essential for high-performance GPU kernels. ## Logical Divide vs Zipped Divide When we want to transfer data between different memory hierarchies (like Global Memory -> Shared Memory -> Registers), we often do so in small parallelized blocks called **Tiles**. Let's take our 8x8 matrix layout `L = (8, 8)` and divide it using a **Tiler** of size `(2, 4)`. A tiler dictates the shape of the blocks we want to chop our layout into. ```python L = cute.make_layout(shape=(8, 8), stride=(1, 8)) tiler = (2, 4) L_logical = cute.logical_divide(L, tiler) L_zipped = cute.zipped_divide(L, tiler) ``` ### Logical Divide A **Logical Divide** splits each mode independently into `(tile_size, num_tiles)`: - Rows (size 8) ÷ Tiler Row (2) → `(2, 4)` meaning "2 rows per tile, 4 tile-rows" - Cols (size 8) ÷ Tiler Col (4) → `(4, 2)` meaning "4 cols per tile, 2 tile-cols" The combined output shape becomes: `((2, 4), (4, 2))`. ```text Logical Divide: ((2, 4), (4, 2)) Mode 0 (Rows): (row_within_tile, which_tile_row) = (2, 4) Mode 1 (Cols): (col_within_tile, which_tile_col) = (4, 2) Indexing requires specifying all 4 sub-dimensions: Data[ (row_in_tile, tile_row_idx), (col_in_tile, tile_col_idx) ] The row and column modes remain separate ; they are NOT grouped by "tile" vs "grid". This makes it awkward to grab a full 2x4 tile. ``` ### Zipped Divide A **Zipped Divide** first performs a logical divide, then *re-groups* the results: it bundles all the "within-tile" dimensions into Mode 0 and all the "which-tile" dimensions into Mode 1. - **Mode 0 (Tile)**: `(2, 4)` ; the shape of one tile (rows-in-tile, cols-in-tile) - **Mode 1 (Grid)**: `(4, 2)` ; how many tiles (tile-rows, tile-cols) The numbers look identical to logical divide ; `((2, 4), (4, 2))` ; but the *meaning* of each mode has changed completely! ```text Zipped Divide: Original (8, 8) ÷ Tiler (2, 4) Mode 0 = One Tile (2, 4): Mode 1 = Grid of Tiles (4, 2): col 0 col 1 col 2 col 3 tile_row tile_col +------+------+------+------+ +-------+-------+ | | | | | row0 | (0,0) | (0,1) | +------+------+------+------+ +-------+-------+ | | | | | row1 | (1,0) | (1,1) | +------+------+------+------+ +-------+-------+ A single 2x4 tile | (2,0) | (2,1) | +-------+-------+ | (3,0) | (3,1) | +-------+-------+ 4×2 grid = 8 tiles total ``` ### Why is Zipped Divide Preferred? Zipped divide gives you a clean two-level hierarchy: `(Tile_Coordinate, Grid_Coordinate)`. When doing vectorized loads or memory copies, you typically want to grab a *whole tile* at a specific *grid position*. Compare the two approaches: - **Logical Divide**: To grab a tile you must index 4 sub-dimensions separately: `Data[(row_in_tile, tile_row_idx), (col_in_tile, tile_col_idx)]` ; cumbersome and error-prone. - **Zipped Divide**: The layout already mirrors `(Tile, Grid)`, so you cleanly index: `Data[tile_coord, grid_coord]` ; much simpler! This is why `zipped_divide` is the go-to operation for partitioning data across threads and memory hierarchies in CuTe kernels. ## Anatomy of the Naive Elementwise Add Kernel Let's see how indexing looks across different paradigms for a simple elementwise `C = A + B` kernel. ```python @cute.kernel def naive_elementwise_add_kernel(A: cute.Tensor, B: cute.Tensor, C: cute.Tensor): tidx, _, _ = cute.arch.thread_idx() bidx, _, _ = cute.arch.block_idx() bdim, _, _ = cute.arch.block_dim() idx = bidx * bdim + tidx m, n = A.shape ni = idx % n mi = idx // n C[mi, ni] = A[mi, ni] + B[mi, ni] ``` ### The Indexing Difference - **PyTorch**: Fully abstracted ; you simply write `C = A + B` and PyTorch handles parallelism, memory layout, and iteration internally. - **CUDA**: GPU memory is flat 1D, so you typically work with raw pointers: `C[idx] = A[idx] + B[idx]`. If you need multi-dimensional indexing, you must manually compute row/column indices from the flat thread index and handle bounds checking yourself. - **cute-dsl**: The best of both worlds! Each thread computes a flat `idx`, then uses simple arithmetic (`idx % n` and `idx // n`) to recover 2D coordinates `(mi, ni)`. You then index with clean multi-dimensional syntax: `A[mi, ni]`. Under the hood, CuTe's Layout algebra automatically translates these logical coordinates to the correct physical memory offset ; no raw pointer math needed! ## Graduating to Vectorized Execution GPUs love vectorized operations. Why fetch one float per thread when you can comfortably fetch four? Vectorized pipelines saturate memory bandwidth efficiently. To vectorize seamlessly, we systematically reshape our datasets using ; you guessed it ; **tilers**! ```python @cute.jit def vectorized_add(A: cute.Tensor, B: cute.Tensor, C: cute.Tensor): # Defining a tile of 4 elements wide tiler = (1, 4) # Zipping it chunks our layout! A = cute.zipped_divide(A, tiler) B = cute.zipped_divide(B, tiler) ... ``` By applying `zipped_divide`, our flat element dimensions transform into chunks of `(1, 4)`. The `A`, `B`, and `C` layouts effectively represent shape `((1, 4), (M, N/4))`: - **Mode 0**: `(1, 4)` is our Inner Tile footprint (yielding 4 elements horizontally). - **Mode 1**: `(M, N/4)` is our Outer Grid footprint. Each coordinate in the grid points to a vector chunk! Let's witness the magic in the vectorized kernel: ```python @cute.kernel def vectorized_add_kernel(A, B, C): tidx, _, _ = cute.arch.thread_idx() bidx, _, _ = cute.arch.block_idx() bdim, _, _ = cute.arch.block_dim() idx = bidx * bdim + tidx # Retrieve boundaries from Mode 1 (the grid) m, n = A.shape[1] ni = idx % n mi = idx // n # MAGIC HAPPENS HERE a_val = A[(None, (mi, ni))].load() b_val = B[(None, (mi, ni))].load() C[(None, (mi, ni))] = a_val + b_val ``` ### Breaking down `A[(None, (mi, ni))]` This is why we advocated for Zipped Divide earlier! After `zipped_divide`, `A` expects coordinates in the format `[TileCoord, GridCoord]`: - **`None` for TileCoord**: In CuTe DSL, `None` works like Python's `:` slice ; it means "select everything along this mode." Since our tile shape is `(1, 4)`, passing `None` selects all 4 elements of the tile. This gives us back a small tensor (a 4-element vector) rather than a single scalar. - **`(mi, ni)` for GridCoord**: This pinpoints *which* tile in the `(M, N/4)` grid this thread should access. - **`.load()`**: This tells the compiler to emit a hardware-level **vectorized memory load** (e.g., `ld.global.v4`), fetching all 4 elements in a single memory transaction directly into registers ; far more efficient than 4 separate scalar loads! ### Conclusion By leveraging Layouts, divide operations, and tilers, `cute-dsl` abstracts away the complex C++ template machinery of raw CUTLASS while preserving its performance. You get optimized layout mapping, zero raw pointer math, and vectorized register loading ; all with syntax that feels like writing everyday PyTorch or NumPy code. Time to fire up your GPUs! --- # Breaking PyTorch Boundaries: Fusing RMSNorm and GDN in Triton for Qwen 3.5 - URL: https://www.dcbaslani.xyz/blog/qwen_3.5/ - Markdown: https://www.dcbaslani.xyz/blog/qwen_3.5/index.md - Text: https://www.dcbaslani.xyz/blog/qwen_3.5/index.txt - Published: 2026-05-25 - Topic: Triton - Description: Rebuilding and optimizing the Qwen 3.5 (9B) Gated Delta Attention inference stack with custom Triton kernels, achieving 5.2x speedup on NVIDIA B200. # Breaking PyTorch Boundaries: Fusing RMSNorm and GDN in Triton for Qwen 3.5 I rebuilt the inference stack for Qwen 3.5 (9B) from scratch in PyTorch, profiled it, and replaced the key bottlenecks with hand-written Triton GPU kernels. By fusing memory-bound operations and rewriting the Gated Delta Network (GDN) linear attention layers to run without slow Python loops, I bumped throughput from **16 to 83 tokens/second** on a single NVIDIA B200 (a 5.2x speedup). --- ## Table of Contents - [Project Overview](#project-overview) - [Hardware Setup](#hardware-setup) - [Finding the Bottlenecks (Baseline Profiling)](#finding-the-bottlenecks-baseline-profiling) - [Fusing the Residual Stream](#fusing-the-residual-stream) - [Optimizing Gated Delta Attention](#optimizing-gated-delta-attention) - [Triton vs. Other Frameworks](#triton-vs-other-frameworks) - [Comparison with vLLM](#comparison-with-vllm) - [Takeaways](#takeaways) --- ## Project Overview Unlike vanilla transformers, Qwen 3.5 uses a **hybrid architecture** that interleaves standard multi-head attention layers with **Gated Delta Network (GDN)** linear attention layers. The GDN layers use a recurrent state-space mechanism that compresses past context into a fixed-size state matrix instead of maintaining a growing KV cache. ``` Layer 0: Linear Attention (GDN) Layer 1: Linear Attention (GDN) Layer 2: Linear Attention (GDN) Layer 3: Full Attention (SDPA) ← every 4th layer Layer 4: Linear Attention (GDN) ... Layer 31: Full Attention (SDPA) ``` Because **75% of the decoder layers** are GDN layers, optimizing them is critical; they dominate both prefill and decode latency. The goal here wasn't to build a production serving system, but to understand what happens between the CUDA cores and HBM when you run text generation, and to see how much performance we can recover through targeted Triton kernels. ### Code Components | Component | Description | |---|---| | [qwen.py](https://github.com/Darshan-Baslani/qwen3.5-optimized/blob/main/qwen.py) | Custom PyTorch implementation of the full Qwen 3.5 model (983 LOC) | | [fused_zero_centered_rmsnorm.py](https://github.com/Darshan-Baslani/qwen3.5-optimized/blob/main/kernels/triton/fused_zero_centered_rmsnorm.py) | Triton kernel fusing residual additions and zero-centered RMSNorm | | [prefill.py](https://github.com/Darshan-Baslani/qwen3.5-optimized/blob/main/kernels/linear_attention/prefill.py) | Triton chunkwise GDN prefill kernel using WY decomposition (993 LOC) | | [decode.py](https://github.com/Darshan-Baslani/qwen3.5-optimized/blob/main/kernels/linear_attention/decode.py) | Triton single-token GDN decode kernel | | [modal/inference.py](https://github.com/Darshan-Baslani/qwen3.5-optimized/blob/main/modal/inference.py) | Inference pipeline deployed to Modal (NVIDIA B200) | | [modal/vllm_bench.py](https://github.com/Darshan-Baslani/qwen3.5-optimized/blob/main/modal/vllm_bench.py) | Benchmarking script comparing our performance against vLLM | --- ## Hardware Setup I developed local tests on a GTX 1650 to iterate fast, and ran final benchmarks on an NVIDIA B200 via Modal. ```mermaid graph LR subgraph Local["Local Development"] GPU1["GTX 1650
4GB VRAM
Turing SM75"] DEV["Functional correctness
Kernel compilation
Numerical validation"] GPU1 --> DEV end subgraph Cloud["Modal Cloud"] GPU2["NVIDIA B200
192GB HBM3e
Blackwell SM100"] PERF["Throughput benchmarks
Profiler traces
Production metrics"] GPU2 --> PERF end DEV -->|"modal run"| PERF ``` | Platform | Local Development | Cloud Benchmark | |---|---|---| | **GPU** | GTX 1650 (Turing SM75) | NVIDIA B200 (Blackwell SM100) | | **VRAM** | 4 GB GDDR6 | 192 GB HBM3e | | **Memory Bandwidth** | 128 GB/s | 8,000 GB/s | | **Role** | Fast functional testing | Throughput and profiling | ### Profiling Setup I captured profiler traces on the B200 using PyTorch's built-in `torch.profiler` with CPU + CUDA activity recording and memory tracking. The profiling script ([inference-torch-profiler.py](https://github.com/Darshan-Baslani/qwen3.5-optimized/blob/main/modal/inference-torch-profiler.py)) profiles the prefill pass and a few decode steps, exporting Chrome traces for analysis. I ran profiles at three main development stages: > [!WARNING] > **Profiling Overhead:** The absolute times in the traces are much slower than actual inference because the profiler (with memory tracking and `CUDA_LAUNCH_BLOCKING=1`) adds significant overhead. The throughput numbers below reflect actual, unprofiled runs, while the trace times show the relative gains under profiling conditions. | Trace | Stage | Profiled Time (with overhead) | Actual Throughput | |---|---|---|---| | `initial_trace.json` | Pure PyTorch baseline | 10.53 s | ~16 tok/s | | `fused_rms_norm.json` | + Fused Triton RMSNorm | 10.40 s | ~50 tok/s | | `triton_linear_attention.json` | + Triton GDN kernels | 6.51 s | ~83 tok/s | --- ## Finding the Bottlenecks ### The Initial Trace I started by profiling the baseline PyTorch model on the B200 to see where the actual execution time was spent. ![Initial trace showing a single decode step and the cascade of tiny CUDA kernels](/blog/images/initial_trace.jpg) The trace showed a classic memory-bandwidth bottleneck. A single decode step; which should ideally consist of a few fused operations; was launching dozens of tiny individual CUDA kernels: ``` aten::add → 50 µs (residual addition) aten::pow → 30 µs (variance computation) aten::mean → 40 µs (reduction for RMSNorm) aten::rsqrt → 20 µs (inverse sqrt) aten::mul → 25 µs (weight application) aten::to → 15 µs (dtype cast) ``` Each of these operations touches HBM; reading the hidden state (8 KB per token), performing one arithmetic step, and writing the result back. The real bottleneck wasn't the arithmetic; it was the kernel launch overhead and memory roundtrips. ### The Abstraction Bottleneck The standard PyTorch pattern treats each layer and operation as an isolated object: ```mermaid graph TB subgraph Standard["Standard PyTorch Pattern (Memory-Bound)"] direction TB H0["hidden_states"] -->|"Write to HBM"| HBM1["HBM"] HBM1 -->|"Read from HBM"| ADD["aten::add (residual)"] ADD -->|"Write to HBM"| HBM2["HBM"] HBM2 -->|"Read from HBM"| POW["aten::pow"] POW -->|"Write to HBM"| HBM3["HBM"] HBM3 -->|"Read from HBM"| MEAN["aten::mean"] MEAN -->|"Write to HBM"| HBM4["HBM"] HBM4 -->|"Read from HBM"| RSQRT["aten::rsqrt"] RSQRT -->|"Write to HBM"| HBM5["HBM"] HBM5 -->|"Read from HBM"| MUL["aten::mul (weight)"] MUL -->|"Write to HBM"| HBM6["HBM"] end ``` That means **six HBM round-trips for a single normalization**. On the B200, each round-trip for a 4096-dim hidden state costs very little time in raw transfer, but the launch overhead for each tiny kernel is 20-50 µs. Multiply that by 32 layers and 2 norms per layer, and you get **64 normalization passes per decode step**. Essentially, PyTorch's module abstraction (where each layer/operation is a self-contained forward pass) forces separate kernel launches and intermediate memory writes, which kills performance on the GPU. --- ## Fusing the Residual Stream ### Passing the Residual Stream Down To fix this, we can adopt a trick used by systems like vLLM: pass the residual stream as an independent tensor so that the next layer's normalization kernel can fuse both the addition and the normalization step. Instead of: ```python # Standard: Layer owns its residual hidden = layer_norm(hidden + residual) ``` We restructure the decoder to return both: ```python # Fused: Residual flows between layers hidden, residual = fused_norm(hidden, residual) ``` ### The Fused Triton Kernel Here is the Triton implementation: ```python @triton.jit def _fused_zero_centered_rmsnorm( Y_ptr, Y_row_stride, S_ptr, S_row_stride, # output residual X_ptr, X_row_stride, R_ptr, R_row_stride, # input residual W_ptr, n_cols, eps, BLOCK_SIZE: tl.constexpr, ): row_idx = tl.program_id(0) col_offsets = tl.arange(0, BLOCK_SIZE) mask = col_offsets < n_cols # Step 1: Load X and R from HBM (the ONLY HBM read) X_row = tl.load(X_ptr + row_idx * X_row_stride + col_offsets, mask=mask) R_row = tl.load(R_ptr + row_idx * R_row_stride + col_offsets, mask=mask) # Step 2: Fused residual add (stays in SRAM) S_row = X_row + R_row tl.store(S_ptr + row_idx * S_row_stride + col_offsets, S_row, mask=mask) # Step 3: RMSNorm in FP32 (stays in SRAM) S_row = S_row.to(tl.float32) W_row = tl.load(W_ptr + col_offsets, mask=mask).to(tl.float32) mean_square = tl.sum(S_row * S_row, axis=0) / n_cols rstd = tl.rsqrt(mean_square + eps) S_row = S_row * rstd # Step 4: Zero-centered weight trick (Qwen-specific) Y_row = S_row * (1.0 + W_row) # Step 5: Write normalized output to HBM (the ONLY HBM write) tl.store(Y_ptr + row_idx * Y_row_stride + col_offsets, Y_row.to(S_row_dtype), mask=mask) ``` ### Key Implementation Details 1. **Zero-Centered Weights:** Standard RMSNorm applies `output = norm(x) * weight`. Qwen 3.5 uses zero-centered weights initialized at 0, applying `output = norm(x) * (1.0 + weight)`. Standard RMSNorm kernels will output incorrect activations and cause the model to diverge. 2. **FP32 Accumulation:** We must upcast the accumulated sum to FP32 before computing the variance. BF16's limited precision causes compounding rounding errors across 32 layers, leading to numerical divergence. 3. **Dual Outputs:** The kernel outputs the normalized hidden state `Y` and the new residual `S` in one go. While `S` still gets written to HBM for the next layer, the fusion eliminates 5 intermediate round-trips and launches. ### Memory Traffic and Kernel Launches ```mermaid graph TB subgraph Fused["Fused Triton Kernel (1 Launch)"] direction TB READ["Read X, R from HBM"] --> SRAM["SRAM (Shared Memory)"] SRAM --> ADD_F["Add: S = X + R"] ADD_F --> NORM_F["Normalize in FP32"] NORM_F --> WEIGHT_F["Apply (1 + W)"] WEIGHT_F --> WRITE["Write Y, S to HBM"] end ``` | Metric | Before (PyTorch) | After (Fused Triton) | |---|---|---| | HBM round-trips per norm | 6 | 1 | | Kernel launches per norm | 6 | 1 | | Total norms per decode step | 64 | 64 | | Kernel launches eliminated | ; | **320 per step** | ### Restructuring the Decoder Layer We also need to map the standard HuggingFace weights into our fused model format during loading (duplicating `input_layernorm.weight` into both a `standard` and `fused` slot). ```python class Qwen3_5DecoderLayer(nn.Module): def forward(self, hidden_states, residual, ...): if residual is None: # First layer: no residual yet, use standard norm residual = hidden_states hidden_states = self.input_layernorm_standard(hidden_states) else: # Subsequent layers: fused residual + norm hidden_states, residual = self.input_layernorm_fused(X=hidden_states, R=residual) # Token mixer (attention or GDN) hidden_states = self.token_mixer(hidden_states) # Post-attention: always fused hidden_states, residual = self.post_attention_layernorm(X=hidden_states, R=residual) hidden_states = self.mlp(hidden_states) return hidden_states, residual # ← residual flows to next layer ``` ### Performance Gains (Stage 1) ![Profiler trace after applying fused RMSNorm, showing reduced kernel launch density](/blog/images/fused_rms_norm.png) | Implementation | Throughput (B200, 150 tokens) | |---|---| | Pure PyTorch baseline | ~16 tok/s | | + Fused Triton RMSNorm | **~50 tok/s** (3.1x speedup) | Fusing RMSNorm and residual addition got us to ~50 tokens/sec. However, the profiler showed we were still heavily bottlenecked by the Gated Delta Network (GDN) linear attention layers, which were still running in pure PyTorch with sequential Python loops. --- ## Optimizing Gated Delta Attention ### GDN Attention vs. Standard Attention Standard multi-head attention processes all tokens in parallel via $Q K^\top V$. The KV cache grows linearly with sequence length, but the attention computation itself is highly parallelizable. GDN linear attention works differently. It maintains a **fixed-size recurrent state matrix** $H \in \mathbb{R}^{K \times V}$ (128 × 128 = 16 KB per head) that compresses all past context: ```mermaid graph LR subgraph Standard["Standard Attention"] Q1["Q"] --> DOT1["Q @ Kᵀ"] K1["K cache
(grows with seq)"] --> DOT1 DOT1 --> SOFT["softmax"] SOFT --> DOT2["... @ V"] V1["V cache
(grows with seq)"] --> DOT2 end subgraph GDN["GDN Linear Attention"] Q2["q_t"] --> OUT["o_t = q_t @ H_t"] STATE["H_t
(fixed 128×128)"] --> OUT K2["k_t"] --> UPDATE["H_{t+1} = γ·H_t + k_t ⊗ δ_t"] V2["v_t"] --> DELTA["δ_t = β·(v_t - k_t @ H_t)"] STATE --> DELTA DELTA --> UPDATE end ``` The recurrence relation for each token: $$H_{t+1} = \gamma_t \cdot H_t + k_t \otimes \beta_t \cdot (v_t - k_t^\top H_t)$$ $$o_t = q_t^\top \cdot H_{t+1}$$ Where the learned gating decay $\gamma_t$ is defined as: $$\gamma_t = \exp\left(-\exp(A_{\text{log}}) \cdot \text{softplus}(a_t + \text{dt}_{\text{bias}})\right)$$ ### The Bottleneck: Sequential Loops in PyTorch The initial PyTorch implementation processed chunks sequentially using a Python `for` loop: ```python # The bottleneck: sequential iteration for i in range(total_sequence_length // chunk_size): q_i, k_i, v_i = query[:, :, i], key[:, :, i], value[:, :, i] attn = q_i @ k_i.transpose(-1, -2) * decay_mask[:, :, i] v_prime = k_cumdecay[:, :, i] @ last_recurrent_state # ← HBM read v_new = v_i - v_prime attn_inter = (q_i * g[:, :, i, :, None].exp()) @ last_recurrent_state # ← HBM read core_attn_out[:, :, i] = attn_inter + attn @ v_new last_recurrent_state = ( # ← HBM write, then read again next iteration last_recurrent_state * g[:, :, i, -1, None, None].exp() + (k_i * ...).transpose(-1, -2) @ v_new ) ``` For a prefill sequence of 512 tokens with a chunk size of 64, we have $\frac{512}{64} = 8$ chunks. In every loop iteration: 1. **Read** the recurrent state matrix $H$ from HBM (128 × 128 float32 elements = 64 KB/head × 32 heads = 2 MB total). 2. Perform small matrix multiplications. 3. **Write** the updated 2 MB state matrix back to HBM. 4. Yield execution back to Python to start the next iteration. This results in: $$\text{Total HBM Traffic} = 8 \text{ chunks} \times (2\text{ MB read} + 2\text{ MB write}) = 32\text{ MB}$$ This is 32 MB of redundant memory transfers per layer. All of this state could instead live entirely in the GPU's SRAM (shared memory) during the entire forward pass. ### Writing the Triton Chunkwise GDN Kernel To parallelize this, we use the chunkwise formulation from the [Flash Linear Attention (FLA)](https://github.com/fla-org/flash-linear-attention) paper: intra-chunk interactions are computed in parallel via matrix multiplications, while the inter-chunk state updates are propagated sequentially while keeping the state matrix in SRAM. #### Prefill Kernel Architecture The prefill is split into two phases for maximum parallelism: ```mermaid flowchart TB subgraph Phase1["Phase 1: WY Prepass (Embarrassingly Parallel)"] direction LR P1["Grid: (num_seqs, num_heads, num_chunks)"] P1 --> C1["Chunk 0"] P1 --> C2["Chunk 1"] P1 --> C3["Chunk 2"] P1 --> CN["Chunk N"] C1 --> WY1["Compute:
• K @ Kᵀ gram matrix
• (I+N)⁻¹ via Neumann series
• Gate cumulative products
• Q @ Kᵀ interaction matrix"] C2 --> WY2["Same per-chunk
computation"] C3 --> WY3["Same per-chunk
computation"] CN --> WYN["Same per-chunk
computation"] end subgraph Phase2["Phase 2: Sequential State Propagation (State in SRAM)"] direction TB S0["State H₀
(loaded once)"] --> PROC0["Process Chunk 0
using precomputed WY"] PROC0 -->|"State stays in SRAM"| PROC1["Process Chunk 1
using precomputed WY"] PROC1 -->|"State stays in SRAM"| PROC2["Process Chunk 2
using precomputed WY"] PROC2 -->|"State stays in SRAM"| PROCN["Process Chunk N
using precomputed WY"] PROCN --> SN["State H_N
(written once)"] end Phase1 -->|"WY data in HBM"| Phase2 ``` #### The WY Decomposition The intra-chunk math uses the delta rule to formulate a lower-triangular system: $$(I + N) \cdot X = \beta \cdot \left(\frac{V}{G} - K \cdot S_{\text{in}}^\top\right)$$ Where $N$ is a strictly lower-triangular nilpotent matrix ($N[j,i] = \beta_j \cdot (k_j \cdot k_i)$ for $i < j$). Since $N$ is nilpotent of order $C$ (chunk size), we can invert $(I+N)$ exactly using the Neumann series: $$(I + N)^{-1} = (I - N)(I + N^2)(I + N^4) \cdots$$ This is implemented as a fixed-depth doubling chain: ```python @triton.jit def _apply_unit_lower_inverse(nil, rhs, BV: tl.constexpr, CHUNK: tl.constexpr): """(I+N)^{-1} via doubling. Uses TF32 tensor cores for numerical stability.""" sol = rhs - _dot_f32(nil, rhs) # (I - N) @ rhs power = _dot_f32(nil, nil) # N² if CHUNK >= 4: sol = sol + _dot_f32(power, sol) # += N² @ sol power = _dot_f32(power, power) # N⁴ if CHUNK >= 8: sol = sol + _dot_f32(power, sol) # += N⁴ @ sol power = _dot_f32(power, power) # N⁸ if CHUNK >= 16: sol = sol + _dot_f32(power, sol) # += N⁸ @ sol power = _dot_f32(power, power) # N¹⁶ if CHUNK >= 32: sol = sol + _dot_f32(power, sol) # += N¹⁶ @ sol return sol ``` #### Blackwell-Specific Optimizations Getting clean performance on the B200 required a few low-level adjustments: 1. **Precision Splitting (TF32 vs. BF16):** The nilpotent inverse chain uses TF32 (19-bit mantissa) since numerical errors compound quickly across the log-depth doubling chain. For the large K=128 contractions (`K @ K^T`, `Q @ state^T`), we use BF16 tensor cores to get a 4x speedup, where rounding is bounded for single-shot matmuls. 2. **Blackwell Code-Gen Bug:** We hit a bug in Triton's compiler on Blackwell (specifically the `TritonGPUHoistTMEMAlloc` pass) where it incorrectly fused `tl.dot` outputs. We bypassed it by wrapping the dot product in an inline PTX `mov.f32` barrier: ```python @triton.jit def _dot_f32(a, b): out = tl.dot(a, b, input_precision="tf32", out_dtype=tl.float32) return tl.inline_asm_elementwise( asm="mov.f32 $0, $1;", constraints="=r,r", args=[out], dtype=tl.float32, is_pure=True, pack=1, ) ``` 3. **Adaptive Tiling:** We adjusted tile sizes dynamically depending on the workload: | Parameter | Small Batch | Large Batch | Rationale | |---|---|---|---| | `CHUNK` | 32 | 16 | Larger chunks amortize launch overhead; smaller chunks save Gram matrix computation | | `BV` (V-tile) | 16 | 16-32 | Controls shared memory occupancy vs. register pressure | | `num_warps` | 4 | 2 | Reduces sync overhead on smaller sequence batches | #### The Decode Kernel For the single-token decode kernel, things are simpler since we don't need chunking. We map the grid over batch and heads: 1. Load the state tile $H[\text{BV}, K]$ from HBM (single read). 2. Compute the gate decay value $\gamma = \exp\left(-\exp(A_{\text{log}}) \cdot \text{softplus}(a + \text{dt}_{\text{bias}})\right)$. 3. Run the recurrent update step directly in registers. 4. Write out the output token and the new state back to HBM. ```python @triton.jit def gdn_decode_kernel(...): # Gate computation g = tl.exp(-tl.exp(A_log_val) * softplus_x) beta = tl.sigmoid(b_val) # Decay existing state old_state = g * b_h # Delta rule update old_v = tl.sum(old_state * b_k[None, :], axis=1) delta_v = beta * (b_v - old_v) # Output BEFORE state update (frees registers) old_o = tl.sum(old_state * b_q[None, :], axis=1) kq = tl.sum(b_k * b_q) b_o = scale * (old_o + delta_v * kq) tl.store(out_ptr + ..., b_o.to(tl.bfloat16)) # State update (register-only, no extra HBM read) state_out = old_state + delta_v[:, None] * b_k[None, :] tl.store(new_state_ptr + ..., state_out) ``` > [!TIP] > Storing the output *before* computing `state_out` frees up registers and prevents local memory spilling. Since Blackwell caps registers at 255 per thread, this manual scheduling prevents register spills. ### Performance Gains (Stage 2) ![Profiler trace with fused GDN kernels, showing a clean decode step free of Python loop overhead](/blog/images/triton_linear_attention.png) | Implementation | End-to-End Time (150 tok) | Throughput | |---|---|---| | Pure PyTorch baseline | 9.28 s | ~16 tok/s | | + Fused Triton RMSNorm | 3.02 s | ~50 tok/s | | + Triton GDN Kernels | 1.81 s | **~83 tok/s** (5.2x speedup) | --- ## Triton vs. Other Frameworks When choosing how to write these kernels, there were three main options: 1. **Triton:** Fast iteration, Python-based, portable. But we lose low-level control over TMA (Tensor Memory Accelerator) and ran into compiler bugs on Blackwell. 2. **TileLang:** Good compromise, offers warp specialization and better occupancy control, but the ecosystem is very young. 3. **CuTe / CUTLASS (C++):** Maximum performance, explicit layout swizzling, and full control over asynchronous memory copies. But it requires writing 3,000+ lines of C++ template metaprogramming per kernel, adding weeks of development. ### Where Triton Falls Short on Blackwell Writing kernels for the B200 highlighted a few areas where Triton's compiler abstractions hit limits: - **TMA Control:** The B200 features a Tensor Memory Accelerator to prefetch data asynchronously. Triton abstracts this, meaning we cannot manually orchestrate tile prefetching or schedule memory loads. - **Swizzle Layouts:** Preventing bank conflicts in shared memory requires specific data layouts. Triton manages this automatically but can generate suboptimal swizzling patterns for non-power-of-two tiles. - **Compiler Bugs:** The code-generation bug in `TritonGPUHoistTMEMAlloc` meant we had to resort to inline PTX assembly barriers. ### Why I Stuck With Triton Even with these issues, Triton was the right choice for this project: | Factor | Triton | CuTe/CUTLASS | |---|---|---| | Iteration Time | Minutes | Hours (compile + run) | | Lines of Code | ~1,100 total | ~3,000+ per kernel | | Maintainability | High (Python-like syntax) | Low (highly complex C++ templates) | | Throughput | 83 tokens/s | Theoretical peak (maybe ~90 tok/s) | | Development Time | 2 weeks | 2+ months(I have fulltime college :( ) | Ultimately, hitting 83 tokens/second in a couple of weeks with Triton is a much better engineering tradeoff than spending months in C++ to chase a marginal single-digit performance gain. --- ## Comparison with vLLM To see where these optimizations stand relative to a production serving system, I benchmarked our setup against vLLM on the same B200 hardware: | System | Throughput | Setup | |---|---|---| | Custom Triton Kernels | **83 tok/s** | Single request, no batching, raw Triton kernels | | vLLM v0.20.1 | **~250 tok/s** | Production stack (PagedAttention, CUDA Graphs, scheduling) | ### Analyzing the Performance Difference The 3x throughput gap is due to systems-level architecture rather than raw kernel execution speeds: | Optimization | Custom Pipeline | vLLM | |---|---|---| | **CUDA Graphs** | No (adds CPU overhead per decode step) | Yes (captures and replays decode graphs) | | **Continuous Batching**| No (runs single request at a time) | Yes (batches active requests dynamically) | | **PagedAttention** | No (pre-allocates flat tensors) | Yes (dynamic KV page management) | | **Decode Loop** | Python-driven loop | Optimized C++ scheduling / engine | In a production environment, kernel and systems-level optimizations are complementary. The custom Triton kernels developed here could be plugged directly into an engine like vLLM to gain the benefits of PagedAttention and CUDA Graphs. ### When to Stop Optimizing At 83 tokens/second, we hit the point of diminishing returns for kernel optimization: - Getting from 16 to 83 tokens/sec required about 1,100 lines of Triton. - To push past 83 tokens/sec, we would need to eliminate CPU launch overhead using CUDA Graphs (a system orchestration task) or rewrite the kernels in CuTe to squeak out a few percentage points of memory efficiency. Because the core performance bottleneck is now CPU dispatch and framework overhead, further micro-tuning the Triton kernels wouldn't make sense. --- ## Takeaways Writing high-performance GPU kernels is mostly about respecting the memory hierarchy: 1. **Keep data in SRAM:** Moving intermediate states back and forth to HBM wastes bandwidth. Keep inputs in shared memory as long as possible. 2. **Minimize kernel launches:** Fusing dependent operations (like residual additions and normalization) cuts launch overhead. 3. **Eliminate Python loops:** Replace sequential execution paths with chunked, parallel CUDA blocks. ### Quick Stats - **Model:** Qwen 3.5-9B (hybrid GDN + multi-head attention) - **Hardware:** NVIDIA B200 (192 GB HBM3e) - **Baseline Speed:** 16 tokens/s - **Optimized Speed:** 83 tokens/s (5.2x speedup) - **Triton Code:** ~1,100 lines - **PyTorch Model Code:** ~983 lines --- *Built by Darshan Baslani. Kernels prototyped on a GTX 1650, benchmarked on NVIDIA B200 via Modal.* *Source: [github.com/Darshan-Baslani/qwen3.5-optimized](https://github.com/Darshan-Baslani/qwen3.5-optimized)* --- # The Feynman GPU Lectures - URL: https://www.dcbaslani.xyz/blog/gpu_masterclass/ - Markdown: https://www.dcbaslani.xyz/blog/gpu_masterclass/index.md - Text: https://www.dcbaslani.xyz/blog/gpu_masterclass/index.txt - Published: 2026-06-05 - Topic: CUDA - Description: A GPU masterclass that builds from transistors and CUDA cores up through SM architecture, memory systems, Tensor Cores, Hopper, and Blackwell. # 🎓 The Feynman GPU Lectures ### *"What I cannot create, I do not understand."* ; Richard Feynman > **Welcome.** We'll start from the transistor, build up to a CUDA core, wire those into a monster called a Streaming Multiprocessor, and then ride the architectural wave from Volta all the way to Blackwell. I'll give you real code ; not toy code ; in CUDA and PTX. By the end, you'll know what `tcgen05.mma` does at the wire level. > > No fluff. No marketing. Just physics, logic, and math. --- ## Table of Contents - [Hour 1 ; The Transistor to the CUDA Core](#hour-1--the-transistor-to-the-cuda-core) - [1.1 The Transistor: Nature's Switch](#11-the-transistor-natures-switch) - [1.2 From Transistors to Logic Gates](#12-from-transistors-to-logic-gates) - [1.3 From Gates to an ALU](#13-from-gates-to-an-alu) - [1.4 The CUDA Core: A Bare-Metal FPU](#14-the-cuda-core-a-bare-metal-fpu) - [1.5 SIMT: How 32 Threads Breathe Together](#15-simt-how-32-threads-breathe-together) - [1.6 The Streaming Multiprocessor (SM)](#16-the-streaming-multiprocessor-sm) - [1.7 Warp Divergence: The Cost of `if`](#17-warp-divergence-the-cost-of-if) - [Hour 2 ; Memory: The Real Bottleneck](#hour-2--memory-the-real-bottleneck) - [2.1 The Memory Hierarchy: A Map of Latencies](#21-the-memory-hierarchy-a-map-of-latencies) - [2.2 Registers: Zero-Cost Storage](#22-registers-zero-cost-storage) - [2.3 Shared Memory & Bank Conflicts](#23-shared-memory--bank-conflicts) - [2.4 Global Memory & Coalescing](#24-global-memory--coalescing) - [2.5 Occupancy: The Art of Hiding Latency](#25-occupancy-the-art-of-hiding-latency) - [2.6 The Warp Scheduler: Juggling Latency](#26-the-warp-scheduler-juggling-latency) - [Hour 3 ; The Generational Leap: Volta → Ampere → Hopper](#hour-3--the-generational-leap-volta--ampere--hopper) - [3.1 Tensor Cores: Why Multiply-Accumulate Is King](#31-tensor-cores-why-multiply-accumulate-is-king) - [3.2 Volta (SM70): Where Tensor Cores Were Born](#32-volta-sm70-where-tensor-cores-were-born) - [3.3 Ampere (SM80): Async Copy & mbarrier](#33-ampere-sm80-async-copy--mbarrier) - [3.4 Deep Dive: mbarrier ; Phase-Based Synchronization](#34-deep-dive-mbarrier--phase-based-synchronization) - [3.5 Hopper (SM90): TMA, Clusters & WGMMA](#35-hopper-sm90-tma-clusters--wgmma) - [3.6 WGMMA: The Warpgroup Matrix Machine](#36-wgmma-the-warpgroup-matrix-machine) - [Hour 4 ; Blackwell: The Fifth Generation](#hour-4--blackwell-the-fifth-generation) - [4.1 Blackwell Architecture Overview](#41-blackwell-architecture-overview) - [4.2 Tensor Memory (TMEM): A New Address Space](#42-tensor-memory-tmem-a-new-address-space) - [4.3 tcgen05: The Fifth-Gen Tensor Core ISA](#43-tcgen05-the-fifth-gen-tensor-core-isa) - [4.4 FP4 and Microscaling: Extreme Precision Engineering](#44-fp4-and-microscaling-extreme-precision-engineering) - [4.5 Putting It All Together: A Blackwell GEMM Kernel](#45-putting-it-all-together-a-blackwell-gemm-kernel) - [4.6 The Full Picture: Architecture Comparison](#46-the-full-picture-architecture-comparison) - [Appendix A ; PTX Quick Reference](#appendix-a--ptx-quick-reference) - [Appendix B ; Occupancy Calculator Cheat Sheet](#appendix-b--occupancy-calculator-cheat-sheet) --- # Hour 1 ; The Transistor to the CUDA Core ## 1.1 The Transistor: Nature's Switch Everything begins with a single idea: **a voltage-controlled switch**. A **MOSFET** (Metal-Oxide-Semiconductor Field-Effect Transistor) has three terminals: ```mermaid graph LR G["Gate (Control)"] --> T["Channel"] S["Source"] --> T T --> D["Drain"] style G fill:#ff6b6b,stroke:#333,color:#fff style S fill:#4ecdc4,stroke:#333,color:#fff style D fill:#45b7d1,stroke:#333,color:#fff style T fill:#f9f9f9,stroke:#333 ``` There are two flavors: | Type | Conducts when Gate is... | Pulls output toward... | |------|--------------------------|------------------------| | **NMOS** | HIGH (Vdd) | Ground (0) | | **PMOS** | LOW (0) | Supply (1) | > **The key insight**: A transistor is not a mystical device. It's a water faucet. Gate voltage = handle. Current = water. That's it. Modern GPUs use **FinFET** transistors ; the "fin" is a 3D gate that wraps around the channel on three sides, giving better electrostatic control at tiny scales: ```mermaid graph TD subgraph "FinFET Process Nodes" V["Volta (2017)
TSMC 12nm
21.1B transistors"] T["Turing (2018)
TSMC 12nm
18.6B transistors"] A["Ampere (2020)
TSMC 7nm
54.2B transistors"] H["Hopper (2022)
TSMC 4N
80B transistors"] B["Blackwell (2024)
TSMC 4NP
208B transistors"] end V --> T --> A --> H --> B style V fill:#6c5ce7,stroke:#333,color:#fff style T fill:#a29bfe,stroke:#333,color:#fff style A fill:#00b894,stroke:#333,color:#fff style H fill:#fdcb6e,stroke:#333,color:#000 style B fill:#e17055,stroke:#333,color:#fff ``` > **208 billion transistors** on Blackwell. That's about 26 transistors for every human alive, on a chip the size of your thumbnail (well, two chips fused together). --- ## 1.2 From Transistors to Logic Gates Every digital circuit is built from three atomic operations. Here's how transistors make them: ### The NOT Gate (Inverter) ; 2 Transistors ```mermaid graph TD VDD["Vdd (+)"] --> PMOS PMOS -->|"when Input=0, PMOS ON"| OUT["Output"] OUT --> NMOS NMOS -->|"when Input=1, NMOS ON"| GND["Ground"] INPUT["Input"] -.-> PMOS INPUT -.-> NMOS style VDD fill:#e74c3c,stroke:#333,color:#fff style GND fill:#2c3e50,stroke:#333,color:#fff style OUT fill:#f1c40f,stroke:#333,color:#000 style INPUT fill:#3498db,stroke:#333,color:#fff ``` | Input | PMOS | NMOS | Output | |-------|------|------|--------| | 0 | ON | OFF | **1** | | 1 | OFF | ON | **0** | ### The NAND Gate ; 4 Transistors The NAND gate is the **universal gate** ; you can build ANY logic function from NANDs alone. ``` Vdd ┌─┤─┐ A ──┤P₁ │ └─┬─┘ ┌─┤─┐ B ──┤P₂ │──── Output (= NOT(A AND B)) └─┬─┘ ┌─┤─┐ A ──┤N₁ │ └─┬─┘ ┌─┤─┐ B ──┤N₂ │ └─┬─┘ GND PMOS: Parallel (either A=0 OR B=0 → output=1) NMOS: Series (both A=1 AND B=1 → output=0) ``` | A | B | NAND(A,B) | |---|---|-----------| | 0 | 0 | **1** | | 0 | 1 | **1** | | 1 | 0 | **1** | | 1 | 1 | **0** | > **Why NAND is universal**: `NOT(A) = NAND(A,A)`. `AND(A,B) = NOT(NAND(A,B))`. `OR(A,B) = NAND(NOT(A), NOT(B))`. Every other gate follows. This is why chip designers think in NANDs. --- ## 1.3 From Gates to an ALU An ALU is just a **network of gates** that performs arithmetic. Let's build one from the bottom up. ### Half Adder ; The Atom of Arithmetic ```mermaid graph LR A["A"] --> XOR["⊕ XOR"] B["B"] --> XOR A --> AND["∧ AND"] B --> AND XOR --> S["Sum"] AND --> C["Carry"] style A fill:#3498db,stroke:#333,color:#fff style B fill:#3498db,stroke:#333,color:#fff style XOR fill:#e74c3c,stroke:#333,color:#fff style AND fill:#e67e22,stroke:#333,color:#fff style S fill:#2ecc71,stroke:#333,color:#fff style C fill:#2ecc71,stroke:#333,color:#fff ``` $$Sum = A \oplus B$$ $$Carry = A \wedge B$$ ### Full Adder ; Handles a Carry-In $$Sum = A \oplus B \oplus C_{in}$$ $$C_{out} = (A \wedge B) \lor (C_{in} \wedge (A \oplus B))$$ **Gate count**: ~9 gates = ~36 transistors per bit. ### 32-bit Adder Chain 32 full adders. But naive chaining is slow (carry ripples through all 32 bits). GPUs use **Carry-Lookahead Adders (CLA)**: ```mermaid graph TD subgraph "Carry-Lookahead Adder - 32-bit" G0["Generate/Propagate
Bits 0-7"] --> CLA1["CLA Block"] G1["Generate/Propagate
Bits 8-15"] --> CLA2["CLA Block"] G2["Generate/Propagate
Bits 16-23"] --> CLA3["CLA Block"] G3["Generate/Propagate
Bits 24-31"] --> CLA4["CLA Block"] CLA1 --> SUPER["Carry-Lookahead
Super Block"] CLA2 --> SUPER CLA3 --> SUPER CLA4 --> SUPER end style SUPER fill:#e74c3c,stroke:#333,color:#fff style CLA1 fill:#3498db,stroke:#333,color:#fff style CLA2 fill:#3498db,stroke:#333,color:#fff style CLA3 fill:#3498db,stroke:#333,color:#fff style CLA4 fill:#3498db,stroke:#333,color:#fff ``` **Key math**: For each bit position *i*: - **Generate**: $G_i = A_i \wedge B_i$ (this bit produces a carry regardless) - **Propagate**: $P_i = A_i \oplus B_i$ (this bit passes a carry through) - **Carry**: $C_i = G_i \lor (P_i \wedge C_{i-1})$ The lookahead trick: expand the recursion so all carries compute in $O(\log n)$ gate delays instead of $O(n)$. **Total for a 32-bit CLA**: ~200 gates ≈ ~800 transistors. Delay: ~4-5 gate stages. --- ## 1.4 The CUDA Core: A Bare-Metal FPU Now let's build an actual floating-point unit. An FP32 number (IEEE 754): ``` ┌──────┬──────────┬───────────────────────┐ │ Sign │ Exponent │ Mantissa │ │ 1 bit│ 8 bits │ 23 bits │ └──────┴──────────┴───────────────────────┘ Value = (-1)^sign × 2^(exponent-127) × 1.mantissa ``` ### FP32 Multiply Pipeline To compute $A \times B$: ```mermaid graph TD A["Input A
(sign, exp, mant)"] --> S1 B["Input B
(sign, exp, mant)"] --> S1 S1["Stage 1: Sign XOR
sign_result = sign_A ⊕ sign_B
1 gate"] --> S2 S2["Stage 2: Exponent Add
exp_result = exp_A + exp_B - 127
8-bit adder + subtractor"] --> S3 S3["Stage 3: Mantissa Multiply
mant_result = 1.mant_A x 1.mant_B
24x24 multiplier array"] --> S4 S4["Stage 4: Normalize
Shift mantissa, adjust exponent
Leading-zero detector + barrel shifter"] --> S5 S5["Stage 5: Round
IEEE 754 rounding
Round-to-nearest-even logic"] --> R["Result"] style S1 fill:#e74c3c,stroke:#333,color:#fff style S2 fill:#e67e22,stroke:#333,color:#fff style S3 fill:#f1c40f,stroke:#333,color:#000 style S4 fill:#2ecc71,stroke:#333,color:#fff style S5 fill:#3498db,stroke:#333,color:#fff style R fill:#9b59b6,stroke:#333,color:#fff ``` The **24×24 mantissa multiplier** is the big one. Using a Wallace tree multiplier: ~2,000-3,000 gates. **Total transistor count for one FP32 CUDA core**: approximately **8,000–20,000 transistors**. > **Compare to a CPU core**: An Intel Golden Cove core has ~500 million transistors. A CUDA core has ~15,000. That's a factor of **30,000×**. A CUDA core is *stupidly simple* by design. That's the point. ### What a CUDA Core Does NOT Have ```mermaid graph LR subgraph "CPU Core" BP["Branch Predictor
~50K transistors"] OOO["Out-of-Order Engine
~200K transistors"] ROB["Reorder Buffer
~100K transistors"] L1D["32KB L1D Cache
Private"] L1I["32KB L1I Cache
Private"] TLB["TLB
~1000 entries"] end subgraph "CUDA Core" FPU["FP32 ALU
~15K transistors"] DONE["That is it."] end style BP fill:#e74c3c,stroke:#333,color:#fff style OOO fill:#e74c3c,stroke:#333,color:#fff style ROB fill:#e74c3c,stroke:#333,color:#fff style FPU fill:#2ecc71,stroke:#333,color:#fff style DONE fill:#2ecc71,stroke:#333,color:#fff ``` No branch predictor. No out-of-order execution. No speculation. No private cache. Just: **input → ALU → output**. The GPU's philosophy is: *"Why predict branches when you can just run 10,000 threads and always have something useful to do?"* --- ## 1.5 SIMT: How 32 Threads Breathe Together ### The Warp ; The Atom of GPU Execution A **warp** is 32 threads that execute the **same instruction** at the **same time** on 32 different data. This is **SIMT** ; Single Instruction, Multiple Threads. It's like SIMD (SSE/AVX), but each "lane" has its own registers and can branch independently (since Volta). ```mermaid graph TD subgraph "One Warp - 32 threads" PC["Program Counter
(shared)"] --> INST["Instruction: FADD R1, R2, R3"] INST --> T0["Thread 0
R1_0 = R2_0 + R3_0"] INST --> T1["Thread 1
R1_1 = R2_1 + R3_1"] INST --> T2["Thread 2
R1_2 = R2_2 + R3_2"] INST --> DOTS["..."] INST --> T31["Thread 31
R1_31 = R2_31 + R3_31"] end style PC fill:#e74c3c,stroke:#333,color:#fff style INST fill:#f1c40f,stroke:#333,color:#000 style T0 fill:#3498db,stroke:#333,color:#fff style T1 fill:#3498db,stroke:#333,color:#fff style T2 fill:#3498db,stroke:#333,color:#fff style T31 fill:#3498db,stroke:#333,color:#fff ``` **The hardware implementation**: The warp scheduler fetches ONE instruction. The dispatch unit sends it to 32 CUDA cores simultaneously. Each core reads from its own register bank and writes to its own register bank. One instruction, 32 executions. That's a **32× multiplier on instruction fetch/decode efficiency** compared to a scalar processor. ### Thread Identity ```cuda // Every thread knows exactly where it is: int globalId = blockIdx.x * blockDim.x + threadIdx.x; int warpId = threadIdx.x / 32; // which warp within this block int laneId = threadIdx.x % 32; // which lane within the warp (0-31) ``` --- ## 1.6 The Streaming Multiprocessor (SM) The SM is the **fundamental compute unit** of a GPU. Everything above (cores, warps, schedulers) lives inside one SM. ```mermaid graph TD subgraph SM["Streaming Multiprocessor"] subgraph Q0["Quadrant 0"] WS0["Warp Scheduler 0
+ Dispatch Unit"] CORES0["32 x FP32 Cores"] TC0["1 x Tensor Core"] LS0["8 x Load/Store Units"] SFU0["4 x SFU"] RF0["16,384 x 32-bit Registers"] end subgraph Q1["Quadrant 1"] WS1["Warp Scheduler 1"] CORES1["32 x FP32 Cores"] TC1["1 x Tensor Core"] RF1["16,384 Registers"] end subgraph Q2["Quadrant 2"] WS2["Warp Scheduler 2"] CORES2["32 x FP32 Cores"] TC2["1 x Tensor Core"] RF2["16,384 Registers"] end subgraph Q3["Quadrant 3"] WS3["Warp Scheduler 3"] CORES3["32 x FP32 Cores"] TC3["1 x Tensor Core"] RF3["16,384 Registers"] end SMEM["Shared Memory / L1 Cache
(Configurable, up to 228 KB on Hopper)"] ICACHE["Instruction Cache"] CCACHE["Constant Cache"] TEX["Texture Units"] end style SM fill:#1a1a2e,stroke:#e94560,color:#fff style Q0 fill:#16213e,stroke:#0f3460,color:#fff style Q1 fill:#16213e,stroke:#0f3460,color:#fff style Q2 fill:#16213e,stroke:#0f3460,color:#fff style Q3 fill:#16213e,stroke:#0f3460,color:#fff style SMEM fill:#e94560,stroke:#333,color:#fff style WS0 fill:#0f3460,stroke:#333,color:#fff style CORES0 fill:#533483,stroke:#333,color:#fff style TC0 fill:#e74c3c,stroke:#333,color:#fff ``` ### SM Specs Across Generations | Feature | Volta (SM70) | Ampere (SM80) | Hopper (SM90) | Blackwell (SM100) | |---------|-------------|---------------|---------------|-------------------| | FP32 Cores/SM | 64 | 128 | 128 | 128 | | Tensor Cores/SM | 8 | 4 (2x capable) | 4 (4th gen) | 4 (5th gen) | | Register File | 256 KB | 256 KB | 256 KB | 256 KB | | Shared Mem (max) | 96 KB | 164 KB | 228 KB | 228 KB | | Max Threads/SM | 2048 | 2048 | 2048 | 2048 | | Max Warps/SM | 64 | 64 | 64 | 64 | | Warp Schedulers | 4 | 4 | 4 | 4 | | Max Blocks/SM | 32 | 32 | 32 | 32 | > **The golden ratio of GPUs**: 2048 threads / 256 KB registers = 128 bytes (32 registers) per thread at 100% occupancy. Every register you add above 32 reduces your occupancy. This is the *fundamental tradeoff* of GPU programming. --- ## 1.7 Warp Divergence: The Cost of `if` Consider this code: ```cuda __global__ void divergent_kernel(float* data, int N) { int idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx % 2 == 0) { data[idx] = expensive_function_A(data[idx]); // Even threads } else { data[idx] = expensive_function_B(data[idx]); // Odd threads } } ``` ### What Happens Inside the Warp ```mermaid graph TD subgraph "Warp Execution Timeline" T1["Clock 1-10: Execute Branch A
Active Mask: 0x55555555
Threads 0,2,4,...,30 ACTIVE
Threads 1,3,5,...,31 IDLE"] T2["Clock 11-20: Execute Branch B
Active Mask: 0xAAAAAAAA
Threads 1,3,5,...,31 ACTIVE
Threads 0,2,4,...,30 IDLE"] T3["Clock 21+: Reconverge
Active Mask: 0xFFFFFFFF
All 32 threads ACTIVE"] end T1 --> T2 --> T3 style T1 fill:#e74c3c,stroke:#333,color:#fff style T2 fill:#e67e22,stroke:#333,color:#fff style T3 fill:#2ecc71,stroke:#333,color:#fff ``` **Throughput**: Both branches execute sequentially. If each branch takes 10 cycles, the divergent warp takes **20 cycles** instead of 10. That's **50% efficiency**. **The hardware truth**: Those "idle" CUDA cores aren't powered down ; they're executing the instruction but their **write-back is masked**. The energy is mostly wasted. ### The Fix: Think in Warps, Not Threads ```cuda // GOOD: All threads in a warp take the same branch __global__ void coalesced_kernel(float* data, int N) { int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32; int laneId = threadIdx.x % 32; if (warpId % 2 == 0) { // Entire warp goes here ; no divergence data[warpId * 32 + laneId] = expensive_function_A(data[warpId * 32 + laneId]); } else { // Entire warp goes here ; no divergence data[warpId * 32 + laneId] = expensive_function_B(data[warpId * 32 + laneId]); } } ``` --- # Hour 2 ; Memory: The Real Bottleneck > *"The speed of computation doesn't matter if you can't feed the beast."* ## 2.1 The Memory Hierarchy: A Map of Latencies This is the single most important diagram in GPU programming: ```mermaid graph TD REG["Registers
256 KB per SM
0 cycles - same clock
~20 TB/s aggregate"] --> SMEM SMEM["Shared Memory
Up to 228 KB per SM
~20-30 cycles
~19 TB/s per SM"] --> L1 L1["L1 Cache
Combined with Shared Mem
~30-40 cycles"] --> L2 L2["L2 Cache
40-96 MB
~200-300 cycles
~6-12 TB/s"] --> HBM HBM["HBM / GDDR
80-192 GB
~400-800 cycles
2-8 TB/s"] --> PCIE PCIE["System Memory via PCIe
Host RAM
~10,000+ cycles
64 GB/s"] style REG fill:#2ecc71,stroke:#333,color:#fff style SMEM fill:#3498db,stroke:#333,color:#fff style L1 fill:#9b59b6,stroke:#333,color:#fff style L2 fill:#e67e22,stroke:#333,color:#fff style HBM fill:#e74c3c,stroke:#333,color:#fff style PCIE fill:#c0392b,stroke:#333,color:#fff ``` > **The fundamental equation**: At 1.5 GHz, 400 cycles = **267 nanoseconds**. A single global memory access takes as long as 400 floating-point operations. This is why GPUs need **thousands of threads** ; to keep the ALUs busy while other threads wait for memory. Let me put it differently. Say each of your threads needs data from HBM. At 400 cycles per access, you need: $$\text{Warps needed to hide latency} = \frac{\text{Memory latency}}{\text{Instruction throughput}} = \frac{400 \text{ cycles}}{4 \text{ cycles/instruction}} = 100 \text{ warps}$$ But we only have 64 warps per SM. So we **need** shared memory and caches to reduce effective latency, or we'll always be memory-bound. --- ## 2.2 Registers: Zero-Cost Storage Registers are the fastest memory in the system. Each SM has **65,536 x 32-bit registers** (256 KB). ```cuda __global__ void register_demo(float* out, float* in, int N) { int idx = blockIdx.x * blockDim.x + threadIdx.x; // These live in REGISTERS ; zero-cost access float a = in[idx]; // Register R1 float b = in[idx + N]; // Register R2 float c = a * b; // Register R3 = R1 * R2 float d = c + 1.0f; // Register R4 = R3 + 1.0 out[idx] = d; // Store R4 to memory } ``` **Register pressure** ; the critical tradeoff: ``` Total registers per SM: 65,536 Max threads per SM: 2,048 Registers at 100% occupancy: 65,536 / 2,048 = 32 registers per thread If your kernel uses 64 registers/thread: Threads per SM = 65,536 / 64 = 1,024 threads = 32 warps Occupancy = 32/64 = 50% If your kernel uses 128 registers/thread: Threads per SM = 65,536 / 128 = 512 threads = 16 warps Occupancy = 16/64 = 25% ``` > **Feynman's rule**: *Every register you add above 32 is stealing occupancy from you.* But sometimes that's a good trade ; more registers means fewer memory accesses. Profile, don't guess. ### PTX Register Usage ``` // PTX: Registers are explicitly declared .reg .f32 %f<64>; // 64 FP32 registers: %f0 through %f63 .reg .f64 %fd<16>; // 16 FP64 registers .reg .b32 %r<32>; // 32 32-bit general registers .reg .b64 %rd<16>; // 16 64-bit registers (for addresses) .reg .pred %p<8>; // 8 predicate registers (for branching) // Usage: ld.global.f32 %f0, [%rd0]; // Load from global memory to register add.f32 %f1, %f0, %f2; // Add two registers st.global.f32 [%rd1], %f1; // Store register to global memory ``` --- ## 2.3 Shared Memory & Bank Conflicts Shared memory is on-chip SRAM, accessible by all threads in a **thread block** (CTA). It's 20-30x faster than global memory. ### Physical Structure: 32 Banks ```mermaid graph TD subgraph "Shared Memory: 32 Banks x 4 Bytes Wide" B0["Bank 0
Addr 0, 128, 256..."] B1["Bank 1
Addr 4, 132, 260..."] B2["Bank 2
Addr 8, 136, 264..."] B3["Bank 3
Addr 12, 140, 268..."] DOTS1["..."] B30["Bank 30
Addr 120, 248..."] B31["Bank 31
Addr 124, 252..."] end subgraph "Warp Threads" T0["Thread 0"] --> B0 T1["Thread 1"] --> B1 T2["Thread 2"] --> B2 T3["Thread 3"] --> B3 T30["Thread 30"] --> B30 T31["Thread 31"] --> B31 end style B0 fill:#3498db,stroke:#333,color:#fff style B1 fill:#2980b9,stroke:#333,color:#fff style B2 fill:#2471a3,stroke:#333,color:#fff style B3 fill:#1f618d,stroke:#333,color:#fff style B30 fill:#1a5276,stroke:#333,color:#fff style B31 fill:#154360,stroke:#333,color:#fff ``` **Bank formula**: `bank(address) = (address / 4) % 32` ### The Three Cases ```cuda __shared__ float smem[1024]; // CASE 1: No conflict ; stride-1 access (perfect) float val = smem[threadIdx.x]; // Thread 0 -> Bank 0, Thread 1 -> Bank 1, ..., Thread 31 -> Bank 31 // One cycle. 32 simultaneous reads. // CASE 2: 2-way conflict ; stride-2 access float val = smem[threadIdx.x * 2]; // Thread 0 -> Bank 0, Thread 16 -> Bank 0 (CONFLICT!) // Thread 1 -> Bank 2, Thread 17 -> Bank 2 (CONFLICT!) // Two cycles instead of one. 50% throughput. // CASE 3: 32-way conflict ; stride-32 access (catastrophic) float val = smem[threadIdx.x * 32]; // ALL threads -> Bank 0 // 32 cycles instead of 1. 3.125% throughput. // CASE 4: Broadcast ; all read SAME address (free!) float val = smem[0]; // All 32 threads read address 0 -> Bank 0 broadcasts. One cycle. ``` ### Bank Conflict Diagram ```mermaid graph LR subgraph "No Conflict - stride 1" direction TB TA0["T0"] -->|"addr 0"| BA0["Bank 0"] TA1["T1"] -->|"addr 4"| BA1["Bank 1"] TA2["T2"] -->|"addr 8"| BA2["Bank 2"] TA3["T3"] -->|"addr 12"| BA3["Bank 3"] end subgraph "2-way Conflict - stride 2" direction TB TB0["T0"] -->|"addr 0"| BB0["Bank 0"] TB1["T1"] -->|"addr 8"| BB2["Bank 2"] TB16["T16"] -->|"addr 128"| BB0 TB17["T17"] -->|"addr 136"| BB2 end style BA0 fill:#2ecc71,stroke:#333,color:#fff style BA1 fill:#2ecc71,stroke:#333,color:#fff style BA2 fill:#2ecc71,stroke:#333,color:#fff style BA3 fill:#2ecc71,stroke:#333,color:#fff style BB0 fill:#e74c3c,stroke:#333,color:#fff style BB2 fill:#e74c3c,stroke:#333,color:#fff ``` ### Fixing Bank Conflicts: Padding ```cuda // Problem: column-major access to a 32-wide array __shared__ float tile[32][32]; // smem[row][col] // Accessing column: tile[threadIdx.x][col] -> stride-32 -> 32-way conflict! // Fix: add padding __shared__ float tile[32][32 + 1]; // 33 columns // Now stride is 33, and 33 % 32 = 1 -> perfect stride-1 access! // One wasted float per row, but 32x throughput improvement. ``` --- ## 2.4 Global Memory & Coalescing Global memory (HBM) is accessed through **128-byte cache lines**. The hardware **coalesces** requests from all 32 threads in a warp into the minimum number of transactions. ```mermaid graph TD subgraph "Coalesced Access - 1 transaction" W0["Warp: 32 threads read consecutive floats
Addresses: 0, 4, 8, ..., 124"] CL0["Cache Line: 128 bytes at address 0"] W0 -->|"1 transaction"| CL0 end subgraph "Strided Access - 32 transactions" W1["Warp: 32 threads read stride-32
Addresses: 0, 128, 256, ..., 3968"] CL1["Cache Line at addr 0"] CL2["Cache Line at addr 128"] CL3["Cache Line at addr 256"] CL32["...32 cache lines total"] W1 --> CL1 W1 --> CL2 W1 --> CL3 W1 --> CL32 end style CL0 fill:#2ecc71,stroke:#333,color:#fff style CL1 fill:#e74c3c,stroke:#333,color:#fff style CL2 fill:#e74c3c,stroke:#333,color:#fff style CL3 fill:#e74c3c,stroke:#333,color:#fff ``` ### The Coalescing Hardware Inside each SM's Load/Store Unit (LSU): 1. Collect all 32 addresses from the warp 2. Sort them by 128-byte aligned segment 3. Issue one memory transaction per unique segment 4. Route returned bytes to correct thread registers ```cuda // Perfectly coalesced ; 1 transaction per warp __global__ void coalesced(float* data) { int idx = blockIdx.x * blockDim.x + threadIdx.x; float val = data[idx]; // Threads 0-31 read addresses 0-124 } // Strided ; 32 transactions per warp (32x slower) __global__ void strided(float* data, int stride) { int idx = threadIdx.x * stride; float val = data[idx]; // Each thread hits a different cache line } // Random ; up to 32 transactions per warp __global__ void random_access(float* data, int* indices) { float val = data[indices[threadIdx.x]]; // Unpredictable addresses } ``` ### Coalescing Math For a warp accessing addresses $a_0, a_1, \ldots, a_{31}$: $$\text{Transactions} = \left|\left\lbrace \left\lfloor \frac{a_i}{128} \right\rfloor \mid i \in [0, 31] \right\rbrace\right|$$ i.e., the number of **distinct 128-byte aligned segments** touched. **Best case**: 1 transaction (128 bytes), all data used → **100% efficiency** **Worst case**: 32 transactions (4096 bytes), only 128 bytes used → **3.125% efficiency** --- ## 2.5 Occupancy: The Art of Hiding Latency **Occupancy** = Active warps / Maximum warps per SM ```mermaid graph TD subgraph "Occupancy Limiters" REG["Register Usage
65,536 regs / regs per thread x 32
= max warps from registers"] SMEM_L["Shared Memory Usage
Max shared mem / smem per block
= max blocks then max warps"] BLK["Block Size
Max 32 blocks/SM
Small blocks waste slots"] end REG --> MIN["Take MINIMUM
= Active Warps"] SMEM_L --> MIN BLK --> MIN MIN --> OCC["Occupancy = Active / 64"] style REG fill:#e74c3c,stroke:#333,color:#fff style SMEM_L fill:#3498db,stroke:#333,color:#fff style BLK fill:#2ecc71,stroke:#333,color:#fff style MIN fill:#f1c40f,stroke:#333,color:#000 style OCC fill:#9b59b6,stroke:#333,color:#fff ``` ### Worked Example ``` Kernel: 256 threads/block, 48 regs/thread, 16 KB shared memory GPU: SM90 (Hopper) Register limit: 65,536 regs / (48 regs x 256 threads) = 65,536 / 12,288 = 5.33 -> 5 blocks 5 blocks x 256 threads = 1,280 threads = 40 warps Shared memory limit: 228 KB / 16 KB = 14.25 -> 14 blocks 14 blocks x 256 threads = 3,584 -> but max is 2,048 -> 64 warps Block count limit: 32 blocks max -> 32 x 256 = 8,192 -> capped at 2,048 -> 64 warps Final: min(40, 64, 64) = 40 warps -> Occupancy = 40/64 = 62.5% ``` > **Feynman's warning**: *Don't worship occupancy.* Sometimes 50% occupancy with 64 registers per thread beats 100% occupancy with 32 registers, because you avoid expensive memory spills. The occupancy calculator is a guide, not a god. --- ## 2.6 The Warp Scheduler: Juggling Latency Each SM has **4 warp schedulers**, each managing a pool of warps. Every cycle, each scheduler: ```mermaid graph TD CHECK["Check: Is any warp ready?
Not stalled on memory, barrier, or dependency"] -->|"Yes"| SELECT CHECK -->|"No"| STALL["Scheduler stall
wasted cycle"] SELECT["Select ready warp
Greedy-Then-Oldest policy"] --> FETCH["Fetch instruction
from warp's PC"] FETCH --> DECODE["Decode + Dispatch
to execution unit"] DECODE --> EXEC["Execute on 32 CUDA cores
or other unit"] EXEC --> SCOREBOARD["Update scoreboard
Mark dest regs as pending"] SCOREBOARD --> CHECK style CHECK fill:#3498db,stroke:#333,color:#fff style SELECT fill:#2ecc71,stroke:#333,color:#fff style FETCH fill:#e67e22,stroke:#333,color:#fff style EXEC fill:#e74c3c,stroke:#333,color:#fff style SCOREBOARD fill:#9b59b6,stroke:#333,color:#fff style STALL fill:#c0392b,stroke:#333,color:#fff ``` ### The Scoreboard Each warp has a **scoreboard** ; a bit vector tracking which registers have pending writes: ``` Warp 7 Scoreboard: R0: ready R8: ready R16: PENDING (waiting for LD) R1: ready R9: ready R17: ready R2: PENDING R10: ready R18: ready ... If next instruction is: ADD R5, R2, R3 -> R2 is PENDING -> warp 7 is STALLED -> Scheduler picks another warp ``` This is why GPUs **need many warps**: every memory access stalls a warp for hundreds of cycles. The scheduler swaps to another warp **in zero cycles** (zero-cost context switching, because all warp state is always resident in registers). ### Latency Hiding Math For a **memory-bound** kernel: $$\text{Warps needed} \geq \frac{\text{Memory latency (cycles)}}{\text{Cycles between memory ops}} $$ Example: Memory latency = 400 cycles, one memory op every 8 arithmetic instructions (4 cycles each = 32 cycles): $$\text{Warps needed} \geq \frac{400}{32} = 12.5 \rightarrow 13 \text{ warps per scheduler}$$ With 4 schedulers: **52 warps minimum** → ~81% occupancy needed. --- # Hour 3 ; The Generational Leap: Volta → Ampere → Hopper ## 3.1 Tensor Cores: Why Multiply-Accumulate Is King Deep learning is dominated by one operation: **matrix multiply-accumulate (MMA)**. $$D = A \times B + C$$ Where $A$ is $M \times K$, $B$ is $K \times N$, and $C, D$ are $M \times N$. The arithmetic intensity of matrix multiplication: $$\text{FLOPs} = 2 \times M \times N \times K$$ $$\text{Bytes loaded} = (M \times K + K \times N + M \times N) \times \text{bytes per element}$$ $$\text{Arithmetic intensity} = \frac{2MNK}{(MK + KN + MN) \times b}$$ For large square matrices ($M = N = K$): intensity $\approx \frac{2K}{3b}$, which **grows with matrix size**. This is why GPUs love large GEMMs ; they become compute-bound, not memory-bound. A **Tensor Core** is a specialized circuit that computes a small MMA (e.g., 4x4x4) in a **single clock cycle**, rather than requiring 128 individual FMA (fused multiply-add) operations. ```mermaid graph LR subgraph "Without Tensor Core" direction TB C1["CUDA Core 0: a00 x b00"] C2["CUDA Core 1: a01 x b10"] C3["CUDA Core 2: a02 x b20"] C4["CUDA Core 3: a03 x b30"] ACC["Accumulate: d00 = c00 + Sum"] C1 --> ACC C2 --> ACC C3 --> ACC C4 --> ACC end subgraph "With Tensor Core" direction TB TC["Tensor Core:
Computes ENTIRE
4x4x4 MMA
in ONE cycle
= 128 FMA ops"] end style TC fill:#e74c3c,stroke:#333,color:#fff style C1 fill:#3498db,stroke:#333,color:#fff style C2 fill:#3498db,stroke:#333,color:#fff style C3 fill:#3498db,stroke:#333,color:#fff style C4 fill:#3498db,stroke:#333,color:#fff ``` --- ## 3.2 Volta (SM70): Where Tensor Cores Were Born **Volta** (2017) introduced the **1st generation Tensor Core** and **independent thread scheduling**. ### Tensor Core V1: 4x4x4 FP16 to FP32 Each Tensor Core computes: $D_{4 \times 4} = A_{4 \times 4} \cdot B_{4 \times 4} + C_{4 \times 4}$ - **Input**: FP16 (A, B) - **Accumulate**: FP32 (C, D) - **Per Tensor Core per clock**: 64 FMA operations - **8 Tensor Cores per SM**: 512 FMA ops/SM/clock ### CUDA: WMMA API ```cuda #include using namespace nvcuda; __global__ void volta_wmma_gemm(half* A, half* B, float* C, float* D, int M, int N, int K) { // Declare fragments for a 16x16x16 tile wmma::fragment a_frag; wmma::fragment b_frag; wmma::fragment c_frag; // Initialize accumulator to zero wmma::fill_fragment(c_frag, 0.0f); // Compute warp's tile position int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / 32 / (N/16); int warpN = (blockIdx.x * blockDim.x + threadIdx.x) / 32 % (N/16); int row = warpM * 16; int col = warpN * 16; // Loop over K dimension in tiles of 16 for (int k = 0; k < K; k += 16) { // Load A and B tiles from global memory wmma::load_matrix_sync(a_frag, A + row * K + k, K); wmma::load_matrix_sync(b_frag, B + k * N + col, N); // Tensor Core MMA: C += A x B wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); } // Store result wmma::store_matrix_sync(D + row * N + col, c_frag, N, wmma::mem_row_major); } ``` ### PTX: wmma.mma.sync ``` // PTX for Volta WMMA (16x16x16, FP16 to FP32) // Warp-synchronous ; all 32 threads participate // Declare register fragments .reg .f32 %acc<8>; // 8 accumulator registers per thread (16x16 / 32 threads) .reg .b32 %a<4>; // A fragment (packed FP16) .reg .b32 %b<4>; // B fragment (packed FP16) // Load A fragment from shared memory wmma.load.a.sync.aligned.m16n16k16.shared.row.f16 {%a0, %a1, %a2, %a3}, [smem_a_addr], stride_a; // Load B fragment from shared memory wmma.load.b.sync.aligned.m16n16k16.shared.col.f16 {%b0, %b1, %b2, %b3}, [smem_b_addr], stride_b; // Tensor Core MMA wmma.mma.sync.aligned.m16n16k16.row.col.f32.f16.f16.f32 {%acc0, %acc1, %acc2, %acc3, %acc4, %acc5, %acc6, %acc7}, {%a0, %a1, %a2, %a3}, {%b0, %b1, %b2, %b3}, {%acc0, %acc1, %acc2, %acc3, %acc4, %acc5, %acc6, %acc7}; ``` > **Key insight**: The `wmma.mma.sync` is **warp-synchronous** ; all 32 threads must execute it together. Each thread contributes a piece of the A, B, and C matrices. The Tensor Core hardware does the reduction internally. --- ## 3.3 Ampere (SM80): Async Copy & mbarrier **Ampere** (2020) brought three game-changers: 1. **3rd-gen Tensor Cores** (TF32, BF16, FP64, structured sparsity) 2. **cp.async** ; hardware asynchronous copy from global to shared memory 3. **mbarrier** ; hardware-accelerated synchronization primitive ### cp.async: Bypass the Register File Before Ampere, loading from global to shared memory required: ``` Global Memory -> Register -> Shared Memory (2 steps, registers busy) ``` Ampere's `cp.async` does it in one hardware step: ``` Global Memory -> Shared Memory (1 step, registers free!) ``` ```mermaid graph LR subgraph "Pre-Ampere via registers" GM1["Global Memory"] -->|"LDG to register"| REG1["Register File"] REG1 -->|"STS to shared"| SM1["Shared Memory"] end subgraph "Ampere cp.async bypass registers" GM2["Global Memory"] -->|"cp.async
hardware DMA"| SM2["Shared Memory"] end style GM1 fill:#e74c3c,stroke:#333,color:#fff style REG1 fill:#f1c40f,stroke:#333,color:#000 style SM1 fill:#3498db,stroke:#333,color:#fff style GM2 fill:#e74c3c,stroke:#333,color:#fff style SM2 fill:#3498db,stroke:#333,color:#fff ``` ### CUDA cp.async ```cuda #include __global__ void async_copy_kernel(float4* global_data, int N) { __shared__ float4 smem_buffer[2][64]; // Double buffer int tid = threadIdx.x; int stage = 0; // Stage 0: Issue first async copy __pipeline_memcpy_async(&smem_buffer[0][tid], &global_data[tid], sizeof(float4)); __pipeline_commit(); for (int i = 1; i < N; i++) { int next_stage = stage ^ 1; // Toggle 0 <-> 1 // Issue async copy for NEXT tile __pipeline_memcpy_async(&smem_buffer[next_stage][tid], &global_data[i * 64 + tid], sizeof(float4)); __pipeline_commit(); // Wait for CURRENT tile to be ready __pipeline_wait_prior(1); // Wait until only 1 group pending __syncthreads(); // Compute on current tile compute(smem_buffer[stage]); stage = next_stage; } // Process last tile __pipeline_wait_prior(0); __syncthreads(); compute(smem_buffer[stage]); } ``` ### PTX: cp.async ``` // Copy 16 bytes from global to shared memory, asynchronously cp.async.ca.shared.global [smem_addr], [global_addr], 16; // Commit this group of copies cp.async.commit_group; // Wait until at most N groups are still pending cp.async.wait_group N; // Wait for ALL pending copies cp.async.wait_all; ``` ### Ampere's Structured Sparsity (2:4) Ampere introduced hardware-accelerated **2:4 structured sparsity**: in every group of 4 elements, at least 2 must be zero. ``` Dense: [0.5, 0.0, 0.3, 0.0, 0.7, 0.0, 0.1, 0.0] Sparse: [0.5, 0.3, 0.7, 0.1] + metadata: [0,2,0,2] ``` The Tensor Core skips zero multiplications → **2x throughput**. --- ## 3.4 Deep Dive: mbarrier ; Phase-Based Synchronization > *"Synchronization is the tax you pay for parallelism. mbarrier reduces that tax."* ### Why Not Just `__syncthreads()`? `__syncthreads()` is a **sledgehammer**: it forces ALL threads in a block to reach the same point. It knows nothing about asynchronous operations. It's block-scoped only. **mbarrier** is a **scalpel**: it can track async operations (like `cp.async`), works across CTAs in a cluster, and uses phases for pipelining. | Feature | `__syncthreads()` | `mbarrier` | |---------|-------------------|-----------| | Scope | Block-level only | Flexible (warp, block, cross-CTA in cluster) | | Async awareness | No | Yes ; can track async operations | | Hardware support | Barrier instruction | Dedicated hardware unit in SM | | Phase tracking | No | Yes ; alternating phases | | Completion tracking | Thread arrival only | Thread arrival + async byte count | ### The mbarrier State Machine ```mermaid stateDiagram-v2 [*] --> Init: mbarrier.init count=N Init --> Phase0_Waiting: Phase 0 begins Phase0_Waiting --> Phase0_Waiting: arrive count-- Phase0_Waiting --> Phase0_Complete: count reaches 0 Phase0_Complete --> Phase1_Waiting: Phase flips to 1 Phase1_Waiting --> Phase1_Waiting: arrive count-- Phase1_Waiting --> Phase1_Complete: count reaches 0 Phase1_Complete --> Phase0_Waiting: Phase flips back to 0 ``` > **Phase 0 note**: Threads call `mbarrier.arrive()`, async ops arrive automatically. Waiters spin on `try_wait(phase=0)`. > > **Phase 1 note**: Same mechanism, opposite phase. Allows pipelining stages. ### The mbarrier Object in Memory An mbarrier lives in **shared memory** as an 8-byte (64-bit) opaque object: ``` Bits 63-32: Pending async byte count (for cp.async tracking) Bits 31-16: Arrival count remaining Bit 0: Phase bit (alternates 0 and 1) ``` The hardware manages these bits atomically. ### Complete PTX mbarrier Example: Multi-Stage Pipeline ``` // ============================================ // Multi-stage async pipeline using mbarrier // ============================================ .shared .align 8 .b64 mbar[4]; // 4 mbarrier objects (4 stages) .shared .align 128 .b8 smem_buf[4][TILE_SZ]; // 4 stage buffers // --- Initialization (thread 0 only) --- .reg .pred %is_t0; setp.eq.u32 %is_t0, %tid, 0; @%is_t0 mbarrier.init.shared.b64 [mbar + 0], %block_size; @%is_t0 mbarrier.init.shared.b64 [mbar + 8], %block_size; @%is_t0 mbarrier.init.shared.b64 [mbar + 16], %block_size; @%is_t0 mbarrier.init.shared.b64 [mbar + 24], %block_size; bar.sync 0; // Ensure init is visible to all threads // --- Prologue: Fill the pipeline --- // Stage 0: Issue async copy and arrive cp.async.ca.shared.global [smem_buf + 0 + %tid_offset], [global_addr_0 + %tid_offset], 16; cp.async.commit_group; mbarrier.arrive.expect_tx.shared.b64 %state, [mbar + 0], 16; // Tell the barrier: "expect 16 more bytes from async ops" // Stage 1: Issue async copy cp.async.ca.shared.global [smem_buf + TILE_SZ + %tid_offset], [global_addr_1 + %tid_offset], 16; cp.async.commit_group; mbarrier.arrive.expect_tx.shared.b64 %state, [mbar + 8], 16; // --- Main Loop --- .reg .u32 %stage; mov.u32 %stage, 0; MAIN_LOOP: // Wait for current stage data to be ready .reg .u64 %mbar_addr; // compute %mbar_addr = mbar + (%stage % 4) * 8 .reg .pred %wait_done; TRY_WAIT: mbarrier.try_wait.parity.shared.b64 %wait_done, [%mbar_addr], %phase; @!%wait_done bra TRY_WAIT; // Data is ready! Proceed. // Issue async copy for stage+2 (prefetch ahead) // ... (similar cp.async + mbarrier.arrive.expect_tx) ... // Compute on current stage data ld.shared.f32 %f0, [%current_buf + %tid_offset]; // ... computation ... st.global.f32 [%output + %tid_offset], %f_result; // Advance add.u32 %stage, %stage, 1; setp.lt.u32 %p_loop, %stage, %num_tiles; @%p_loop bra MAIN_LOOP; ``` ### mbarrier with TMA (Hopper+) On Hopper, the **TMA hardware** automatically arrives at an mbarrier when its copy completes ; no thread intervention needed: ```mermaid sequenceDiagram participant Thread as Warp Thread participant TMA as TMA Engine participant MBAR as mbarrier participant SMEM as Shared Memory participant GMEM as Global Memory Thread->>MBAR: mbarrier.arrive.expect_tx(bytes) Thread->>TMA: cp.async.bulk.tensor [smem], [tensorMap], [mbar] Note right of Thread: Thread continues other work! TMA->>GMEM: Read tensor tile GMEM-->>TMA: Data TMA->>SMEM: Write to shared memory TMA->>MBAR: Hardware auto-arrive (tx_count -= bytes) Note right of MBAR: When all arrivals + TX complete: phase flips Thread->>MBAR: mbarrier.try_wait(phase) = complete! ``` --- ## 3.5 Hopper (SM90): TMA, Clusters & WGMMA **Hopper** (2022) was a quantum leap. Three major innovations: ### 1. Thread Block Clusters A **cluster** is a group of up to 8 CTAs (thread blocks) that are **guaranteed to be co-scheduled** on the same GPC. ```mermaid graph TD subgraph GPC0["GPC 0"] subgraph Cluster["Thread Block Cluster"] CTA0["CTA 0
SM 0
Shared Mem 0"] CTA1["CTA 1
SM 1
Shared Mem 1"] CTA2["CTA 2
SM 2
Shared Mem 2"] CTA3["CTA 3
SM 3
Shared Mem 3"] CTA0 <-->|"DSMEM"| CTA1 CTA1 <-->|"DSMEM"| CTA2 CTA2 <-->|"DSMEM"| CTA3 CTA0 <-->|"DSMEM"| CTA3 end end style Cluster fill:#1a1a2e,stroke:#e94560,color:#fff style CTA0 fill:#16213e,stroke:#0f3460,color:#fff style CTA1 fill:#16213e,stroke:#0f3460,color:#fff style CTA2 fill:#16213e,stroke:#0f3460,color:#fff style CTA3 fill:#16213e,stroke:#0f3460,color:#fff ``` **Distributed Shared Memory (DSMEM)**: CTA 0 can directly read/write CTA 1's shared memory ; no global memory round-trip needed. ```cuda __cluster_dims__(4, 1, 1) // 4 CTAs in a cluster __global__ void cluster_kernel() { // Get cluster info auto cluster = cooperative_groups::this_cluster(); extern __shared__ int smem[]; // Access another CTA's shared memory directly! int target_cta = (cluster.block_rank() + 1) % cluster.num_blocks(); int* remote_smem = cluster.map_shared_rank(smem, target_cta); // Direct cross-CTA shared memory access (via DSMEM) int val = *remote_smem; // No global memory needed! } ``` ### 2. Tensor Memory Accelerator (TMA) The TMA is a **dedicated hardware engine** in each SM that moves multi-dimensional tensor tiles between global memory and shared memory. ```mermaid graph LR subgraph "Without TMA - Manual Tiling" direction TB T0["Thread 0: load gmem 0 to reg to smem 0"] T1["Thread 1: load gmem 1 to reg to smem 1"] T2["Thread 2: load gmem 2 to reg to smem 2"] DOTS["...128 threads doing address math"] end subgraph "With TMA - 1 thread 1 instruction" direction TB TMA["TMA Engine:
cp.async.bulk.tensor.2d
[smem], [tensorMap, coords], [mbar]

Handles:
- Address calculation
- Bounds checking
- Swizzling
- Format conversion
- mbarrier notification"] end style TMA fill:#e74c3c,stroke:#333,color:#fff style T0 fill:#95a5a6,stroke:#333,color:#fff style T1 fill:#95a5a6,stroke:#333,color:#fff style T2 fill:#95a5a6,stroke:#333,color:#fff ``` ### TMA Setup (Host Side) ```cuda // Create a tensor map descriptor on the host CUtensorMap tensorMap; CUresult result = cuTensorMapEncodeTiled( &tensorMap, CU_TENSOR_MAP_DATA_TYPE_FLOAT16, // Element type 2, // Number of dimensions globalPtr, // Base pointer in global memory globalDim, // {rows, cols} of global tensor globalStrides, // {stride_row_bytes, stride_col_bytes} boxDim, // {tile_rows, tile_cols} to transfer elementStrides, // Element strides CU_TENSOR_MAP_INTERLEAVE_NONE, // Interleaving CU_TENSOR_MAP_SWIZZLE_128B, // Swizzle pattern CU_TENSOR_MAP_L2_PROMOTION_L2_128B, // L2 policy CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE // Out-of-bounds handling ); ``` ### TMA PTX (Device Side) ``` // Load a 2D tile from global to shared memory // Only ONE thread needs to execute this ; TMA does all the work .reg .pred %is_leader; setp.eq.u32 %is_leader, %tid, 0; @%is_leader cp.async.bulk.tensor.2d.shared::cluster.global.tile .mbarrier::complete_tx::bytes [smem_addr], // Destination in shared memory [tensorMapPtr, {%coord_x, %coord_y}], // Source: tensor map + coordinates [mbar_addr]; // mbarrier to signal on completion // All threads wait for TMA to complete WAIT: mbarrier.try_wait.parity.shared.b64 %wait_result, [mbar_addr], %phase; @!%wait_result bra WAIT; ``` ### TMA Multicast (Load once, deliver to multiple CTAs) ``` // Load tile and multicast to CTAs 0, 1, 2, 3 in a cluster @%is_leader cp.async.bulk.tensor.2d.shared::cluster.global.tile .mbarrier::complete_tx::bytes .multicast::cluster [smem_addr], [tensorMapPtr, {%coord_x, %coord_y}], [mbar_addr], %ctaMask; // Bitmask: 0b1111 = CTAs 0,1,2,3 ``` > **Why TMA is revolutionary**: Without TMA, 128 threads waste time computing addresses, doing bounds checks, and issuing individual loads. With TMA, **one instruction** replaces all of that. The freed-up threads do useful compute instead. --- ## 3.6 WGMMA: The Warpgroup Matrix Machine **WGMMA** (Warpgroup Matrix Multiply-Accumulate) is Hopper's Tensor Core interface. It operates at the **warpgroup** level: 4 warps (128 threads) acting as a unit. ### Why Warpgroups? ```mermaid graph TD subgraph "Volta/Ampere: Warp-level MMA" W0["1 Warp = 32 threads
wmma.mma.sync 16x16x16
= 8,192 FLOPs"] end subgraph "Hopper: Warpgroup-level MMA" WG["4 Warps = 128 threads
wgmma.mma_async 64x256x16
= 524,288 FLOPs

64x more work per instruction!"] end style W0 fill:#3498db,stroke:#333,color:#fff style WG fill:#e74c3c,stroke:#333,color:#fff ``` ### WGMMA Key Innovation **Operand B can come directly from shared memory** ; no need to load it into registers first: ```mermaid graph LR subgraph "Volta wmma" SMEM1["Shared Mem"] -->|"load to reg"| REG1["Registers A"] SMEM1 -->|"load to reg"| REG2["Registers B"] REG1 --> TC1["Tensor Core"] REG2 --> TC1 end subgraph "Hopper wgmma" SMEM2["Shared Mem B"] -->|"direct read"| TC2["Tensor Core"] REG3["Registers A"] --> TC2 end style TC1 fill:#3498db,stroke:#333,color:#fff style TC2 fill:#e74c3c,stroke:#333,color:#fff ``` > **Key**: B reads shared memory directly! This saves register bandwidth. ### WGMMA PTX ``` // ========================================== // Hopper WGMMA: 64x256x16 FP16 to FP32 // ========================================== // Step 1: Fence ; ensure memory ordering wgmma.fence.sync.aligned; // Step 2: Issue the MMA // desc_a = 64-bit descriptor for A matrix in shared memory // desc_b = 64-bit descriptor for B matrix in shared memory wgmma.mma_async.sync.aligned.m64n256k16.f32.f16.f16 {%acc0, %acc1, %acc2, %acc3, %acc4, %acc5, %acc6, %acc7, %acc8, %acc9, %acc10, %acc11, %acc12, %acc13, %acc14, %acc15, %acc16, %acc17, %acc18, %acc19, %acc20, %acc21, %acc22, %acc23, %acc24, %acc25, %acc26, %acc27, %acc28, %acc29, %acc30, %acc31, %acc32, %acc33, %acc34, %acc35, %acc36, %acc37, %acc38, %acc39, %acc40, %acc41, %acc42, %acc43, %acc44, %acc45, %acc46, %acc47, %acc48, %acc49, %acc50, %acc51, %acc52, %acc53, %acc54, %acc55, %acc56, %acc57, %acc58, %acc59, %acc60, %acc61, %acc62, %acc63, %acc64, %acc65, %acc66, %acc67, %acc68, %acc69, %acc70, %acc71, %acc72, %acc73, %acc74, %acc75, %acc76, %acc77, %acc78, %acc79, %acc80, %acc81, %acc82, %acc83, %acc84, %acc85, %acc86, %acc87, %acc88, %acc89, %acc90, %acc91, %acc92, %acc93, %acc94, %acc95, %acc96, %acc97, %acc98, %acc99, %acc100,%acc101,%acc102,%acc103, %acc104,%acc105,%acc106,%acc107, %acc108,%acc109,%acc110,%acc111, %acc112,%acc113,%acc114,%acc115, %acc116,%acc117,%acc118,%acc119, %acc120,%acc121,%acc122,%acc123, %acc124,%acc125,%acc126,%acc127}, %desc_a, // 64-bit shared memory descriptor for A %desc_b, // 64-bit shared memory descriptor for B 1, // Scale D (multiply accumulator by 1) 1, 1, // Scale A, Scale B (negate flags) 0, 0; // Transpose A, Transpose B // Step 3: Commit ; mark this MMA group for tracking wgmma.commit_group.sync.aligned; // Step 4: Wait ; block until MMA completes wgmma.wait_group.sync.aligned 0; // 0 = wait for all groups // Now %acc0..%acc127 contain the 64x256 result matrix ``` ### WGMMA Descriptor Layout The 64-bit descriptor encodes the shared memory tile location: ``` Bits 63-49: Reserved Bits 48-46: Leading dimension mode Bits 45-32: Base address offset (in shared memory, 16B aligned) Bits 31-16: Leading dimension byte offset Bits 15-4: Stride dimension byte offset Bits 3-0: Swizzle mode ``` ### Complete Hopper GEMM Pattern ```mermaid sequenceDiagram participant TMA as TMA Engine participant SMEM as Shared Memory participant WG as Warpgroup of 4 warps participant TC as Tensor Cores Note over TMA,TC: === Stage k (double-buffered) === WG->>TMA: Issue TMA load for tile k+1 TMA->>SMEM: Async bulk copy (tile k+1) WG->>WG: wgmma.fence WG->>TC: wgmma.mma_async (on tile k data) WG->>WG: wgmma.commit_group TMA-->>SMEM: TMA arrives at mbarrier WG->>WG: Wait on mbarrier (tile k+1 ready) WG->>WG: wgmma.wait_group (tile k MMA done) Note over TMA,TC: === Stage k+1 === WG->>TMA: Issue TMA load for tile k+2 WG->>TC: wgmma.mma_async (on tile k+1 data) ``` --- # Hour 4 ; Blackwell: The Fifth Generation ## 4.1 Blackwell Architecture Overview **Blackwell** (2024) ; compute capability **sm_100** (data center) and **sm_120** (desktop). ```mermaid graph TD subgraph GB100["GB100 Full Die"] direction TB SMs["192 SMs
24,576 CUDA Cores
768 Tensor Cores - 5th gen"] L2["96 MB L2 Cache"] HBM["192 GB HBM3e
8 TB/s bandwidth"] NVLINK["18x NVLink 5
1.8 TB/s bidirectional"] subgraph PerSM["Per SM"] CUDA_B["128 FP32 CUDA Cores"] TC_B["4x 5th Gen Tensor Cores"] TMEM_B["TENSOR MEMORY - TMEM
NEW! Dedicated TC storage"] TMA_B["TMA Enhanced"] RF_B["256 KB Register File"] SMEM_B["228 KB Shared Mem / L1"] MBAR_B["16 mbarrier slots"] end end style SMs fill:#e74c3c,stroke:#333,color:#fff style L2 fill:#e67e22,stroke:#333,color:#fff style HBM fill:#f1c40f,stroke:#333,color:#000 style NVLINK fill:#2ecc71,stroke:#333,color:#fff style TMEM_B fill:#e74c3c,stroke:#fff,color:#fff style TC_B fill:#c0392b,stroke:#333,color:#fff ``` ### What's New in Blackwell | Feature | Hopper | Blackwell | Improvement | |---------|--------|-----------|-------------| | Transistors | 80B | 208B | 2.6x | | SMs | 132 | 192 | 1.45x | | Tensor Core gen | 4th | 5th | New ISA | | Peak FP4 (sparse) | N/A | ~9 PFLOPS | New! | | Peak FP8 | 990 TFLOPS | ~1800 TFLOPS | ~1.8x | | HBM Bandwidth | 3.35 TB/s | 8 TB/s | 2.4x | | L2 Cache | 60 MB | 96 MB | 1.6x | | NVLink BW | 900 GB/s | 1.8 TB/s | 2x | | Key TC instruction | wgmma | **tcgen05** | New paradigm | | New memory space | ; | **TMEM** | New! | | New data types | FP8 | **FP4, MX formats** | New! | --- ## 4.2 Tensor Memory (TMEM): A New Address Space This is the single biggest architectural change in Blackwell. **TMEM** is a new, dedicated memory space that lives inside the SM, separate from both registers and shared memory. ```mermaid graph TD subgraph "Hopper Memory Spaces" REG_H["Register File
256 KB
Accumulator lives HERE"] SMEM_H["Shared Memory
228 KB
Operands A, B"] GMEM_H["Global Memory
HBM3"] end subgraph "Blackwell Memory Spaces" REG_BW["Register File
256 KB
General computation"] TMEM_BW["TMEM - NEW
Accumulator lives HERE
Dedicated TC bandwidth"] SMEM_BW["Shared Memory
228 KB
Operands A, B"] GMEM_BW["Global Memory
HBM3e"] end style REG_H fill:#e74c3c,stroke:#333,color:#fff style SMEM_H fill:#3498db,stroke:#333,color:#fff style TMEM_BW fill:#e74c3c,stroke:#fff,color:#fff style REG_BW fill:#f1c40f,stroke:#333,color:#000 style SMEM_BW fill:#3498db,stroke:#333,color:#fff ``` ### Why TMEM Exists On Hopper, WGMMA results accumulate in **registers**. But the register file has limited bandwidth ; reading/writing 128 accumulator registers per warpgroup per MMA creates a bottleneck. The register file is also used for general computation. TMEM solves this by giving the Tensor Cores their own **private, high-bandwidth storage**: - **Accumulators live in TMEM**, not registers - The Tensor Core reads/writes TMEM directly with dedicated pathways - The register file is freed for other work (address computation, control flow) - TMEM has **higher bandwidth** to the Tensor Cores than the register file ### TMEM Access Patterns ``` // TMEM is accessed ONLY via tcgen05 instructions: // Load from shared memory INTO TMEM tcgen05.ld.16x256b [tmem_addr], [smem_addr]; // Store from TMEM to shared memory tcgen05.st.16x256b [smem_addr], [tmem_addr]; // MMA writes results directly to TMEM tcgen05.mma ... [tmem_addr], ...; // Accumulator in TMEM // You CANNOT: // - Load TMEM with regular LD instructions // - Access TMEM from CUDA core ALUs directly // - Use TMEM for general-purpose storage ``` --- ## 4.3 tcgen05: The Fifth-Gen Tensor Core ISA **tcgen05** = **T**ensor **C**ore **Gen**eration **05** ; the instruction set for Blackwell's 5th-gen Tensor Cores. ### Instruction Overview ```mermaid graph TD subgraph "tcgen05 Instruction Family" MMA["tcgen05.mma
Matrix Multiply-Accumulate
The main compute instruction"] LD["tcgen05.ld
Load shared mem to TMEM"] ST["tcgen05.st
Store TMEM to shared mem"] FENCE["tcgen05.fence
Memory ordering"] COMMIT["tcgen05.commit
Signal completion via mbarrier"] end LD -->|"Load operands"| MMA MMA -->|"Results in TMEM"| ST MMA -.->|"Order"| FENCE MMA -.->|"Track completion"| COMMIT style MMA fill:#e74c3c,stroke:#333,color:#fff style LD fill:#3498db,stroke:#333,color:#fff style ST fill:#2ecc71,stroke:#333,color:#fff style FENCE fill:#e67e22,stroke:#333,color:#fff style COMMIT fill:#9b59b6,stroke:#333,color:#fff ``` ### tcgen05.mma ; The Core Instruction ``` tcgen05.mma.cta_group::{1,2}.kind [d_tmem], a_desc, b_desc, idesc, enable; ``` Let's break down every operand: | Operand | Description | |---------|-------------| | `cta_group::{1,2}` | How many CTAs collaborate. `1` = single CTA. `2` = two CTAs in a cluster work together on one MMA. | | `kind` | Data type combination: `.f16`, `.tf32`, `.f8f6f4`, `.mxf8f6f4`, etc. | | `[d_tmem]` | Destination address in **TMEM** ; where the accumulator result is stored | | `a_desc` | 64-bit descriptor for matrix A (shared memory location, layout, size) | | `b_desc` | 64-bit descriptor for matrix B (shared memory location, layout, size) | | `idesc` | Immediate descriptor ; encodes the MMA shape (M, N, K dimensions) | | `enable` | Predicate ; allows conditional execution | ### PTX Example: tcgen05.mma ``` // ============================================= // Blackwell tcgen05.mma: FP16 Matrix Multiply // ============================================= // Prerequisites: A and B tiles in shared memory, TMEM allocated // Step 1: Fence ; ensure previous TC ops are ordered tcgen05.fence::before_thread_sync; // Step 2: Issue the MMA // Compute D[TMEM] = A[smem] x B[smem] + D[TMEM] tcgen05.mma.cta_group::1.kind::f16 [%tmem_d], // Accumulator destination in TMEM %desc_a, // 64-bit descriptor for A tile (in shared memory) %desc_b, // 64-bit descriptor for B tile (in shared memory) %idesc, // Immediate descriptor (shape info) %enable; // Predicate enable // Step 3: Commit ; signal completion via mbarrier tcgen05.commit.cta_group::1.mbarrier::arrive [%mbar_addr]; // Step 4: Wait ; block until MMA completes // (Check the mbarrier that tcgen05.commit arrived at) TRY_WAIT_TC: mbarrier.try_wait.parity.shared.b64 %done, [%mbar_addr], %phase; @!%done bra TRY_WAIT_TC; ``` ### tcgen05 Load/Store ; Moving Data To/From TMEM ``` // Load: Shared Memory to TMEM // Transfers a 16x256-bit tile (16 rows, each 256 bits = 32 bytes) tcgen05.ld.16x256b [%tmem_addr], [%smem_addr]; // Store: TMEM to Shared Memory tcgen05.st.16x256b [%smem_addr], [%tmem_addr]; ``` ### tcgen05 vs WGMMA Comparison ```mermaid graph LR subgraph "Hopper WGMMA Flow" direction TB H1["1. Load A from smem to registers"] H2["2. wgmma.mma_async: A reg, B smem to acc reg"] H3["3. wgmma.wait: acc ready in registers"] H4["4. Store registers to smem/gmem"] H1 --> H2 --> H3 --> H4 end subgraph "Blackwell tcgen05 Flow" direction TB B1["1. TMA loads A,B to smem"] B2["2. tcgen05.mma: A smem, B smem to acc TMEM"] B3["3. tcgen05.commit + mbar wait"] B4["4. tcgen05.st: TMEM to smem"] B1 --> B2 --> B3 --> B4 end style H1 fill:#3498db,stroke:#333,color:#fff style H2 fill:#3498db,stroke:#333,color:#fff style H3 fill:#3498db,stroke:#333,color:#fff style H4 fill:#3498db,stroke:#333,color:#fff style B1 fill:#e74c3c,stroke:#333,color:#fff style B2 fill:#e74c3c,stroke:#333,color:#fff style B3 fill:#e74c3c,stroke:#333,color:#fff style B4 fill:#e74c3c,stroke:#333,color:#fff ``` **Key difference**: In Blackwell, the accumulator **never touches the register file** during the MMA. It lives entirely in TMEM. This frees registers for address computation, control flow, and other work. The register file is no longer the bottleneck for Tensor Core throughput. ### CTA-Group MMA: Two CTAs, One MMA ``` // Two CTAs in a cluster collaborate on a single, larger MMA tcgen05.mma.cta_group::2.kind::f16 [%tmem_d], %desc_a, %desc_b, %idesc, %enable; // Both CTAs contribute portions of A/B and receive portions of D // The hardware coordinates the data sharing via DSMEM // Result: larger effective tile size = better utilization ``` --- ## 4.4 FP4 and Microscaling: Extreme Precision Engineering ### FP4: E2M1 Format ``` +------+----------+----------+ | Sign | Exponent | Mantissa | | 1 bit| 2 bits | 1 bit | +------+----------+----------+ Representable values: 0, +/-0.5, +/-1.0, +/-1.5, +/-2.0, +/-3.0, +/-4.0, +/-6.0 Exponent bias = 1 Value = (-1)^s x 2^(e-1) x (1 + m/2) = (-1)^s x 2^(e-1) x {1.0, 1.5} ``` **Only 16 representable values!** How can this possibly work? ### Microscaling (MX) Formats The trick: **block-level scaling factors**. Instead of each element having its own exponent, a **block** of elements shares one scale: ```mermaid graph TD subgraph "Standard FP16" E0["elem 0: full 16-bit float"] E1["elem 1: full 16-bit float"] E2["elem 2: full 16-bit float"] E3["elem 3: full 16-bit float"] end subgraph "MXFP4 Microscaling" SCALE["Shared Scale Factor
8-bit E8M0 exponent
Per block of 32 elements"] M0["elem 0: 4-bit"] M1["elem 1: 4-bit"] M2["elem 2: 4-bit"] M31["elem 31: 4-bit"] SCALE --> M0 SCALE --> M1 SCALE --> M2 SCALE --> M31 end style SCALE fill:#e74c3c,stroke:#333,color:#fff style M0 fill:#3498db,stroke:#333,color:#fff style M1 fill:#3498db,stroke:#333,color:#fff style M2 fill:#3498db,stroke:#333,color:#fff style M31 fill:#3498db,stroke:#333,color:#fff ``` **Effective value**: $v_i = \text{scale} \times \text{fp4}_i = 2^{\text{shared\\_exp}} \times (-1)^{s_i} \times 2^{(e_i - 1)} \times (1 + m_i/2)$ This gives FP4 the **dynamic range** of ~FP16 with the **storage cost** of 4 bits + amortized scale overhead. ### Why This Matters for Transformers In transformer inference: - Attention scores and FFN weights have **relatively uniform magnitudes within blocks** - A shared scale factor captures the block-level magnitude - Individual FP4 values capture the relative differences - The 2nd-gen **Transformer Engine** automatically manages these scales **Result**: 2x the throughput of FP8, with minimal accuracy loss. ### MX Format Variants on Blackwell | Format | Element bits | Scale | Elements/block | Use case | |--------|-------------|-------|----------------|----------| | MXFP8 | 8 | E8M0 | 32 | High accuracy | | MXFP6 | 6 | E8M0 | 32 | Balanced | | MXFP4 | 4 | E8M0 | 32 | Max throughput | --- ## 4.5 Putting It All Together: A Blackwell GEMM Kernel Here's the complete data flow for a high-performance GEMM on Blackwell: ```mermaid graph TD subgraph "Global Memory HBM3e" GA["Matrix A
M x K"] GB["Matrix B
K x N"] end subgraph "TMA Engine" TMA_A["TMA Load A tile"] TMA_B["TMA Load B tile"] end subgraph "Shared Memory" SA["A tile buffer
double-buffered"] SB["B tile buffer
double-buffered"] end subgraph "5th Gen Tensor Core" MMA["tcgen05.mma"] end subgraph "TMEM" ACC["Accumulator D
Lives in TMEM,
never touches registers!"] end subgraph "Output" SC["Result tile in smem"] GD["Matrix D: M x N
Global Memory"] end GA -->|"TMA descriptor"| TMA_A GB -->|"TMA descriptor"| TMA_B TMA_A -->|"async bulk copy"| SA TMA_B -->|"async bulk copy"| SB SA -->|"a_desc"| MMA SB -->|"b_desc"| MMA MMA -->|"accumulate"| ACC ACC -->|"tcgen05.st"| SC SC -->|"TMA store"| GD style MMA fill:#e74c3c,stroke:#333,color:#fff style ACC fill:#c0392b,stroke:#333,color:#fff style TMA_A fill:#2ecc71,stroke:#333,color:#fff style TMA_B fill:#2ecc71,stroke:#333,color:#fff ``` ### CUDA/PTX Pseudocode: Blackwell GEMM ```cuda // ============================================= // Blackwell GEMM Kernel (Conceptual Structure) // Uses: TMA + tcgen05.mma + TMEM + mbarrier // ============================================= #define TILE_M 128 #define TILE_N 256 #define TILE_K 64 #define STAGES 4 __cluster_dims__(2, 1, 1) // 2 CTAs per cluster __global__ void blackwell_gemm( const __grid_constant__ CUtensorMap tensorMapA, const __grid_constant__ CUtensorMap tensorMapB, half* D, int M, int N, int K) { extern __shared__ char smem[]; // Partition shared memory: double-buffered A & B tiles + mbarriers constexpr int TILE_A_BYTES = TILE_M * TILE_K * sizeof(half); constexpr int TILE_B_BYTES = TILE_K * TILE_N * sizeof(half); half* smem_A[2] = { (half*)(smem), (half*)(smem + TILE_A_BYTES) }; half* smem_B[2] = { (half*)(smem + 2 * TILE_A_BYTES), (half*)(smem + 2 * TILE_A_BYTES + TILE_B_BYTES) }; // mbarrier array for pipeline stages __shared__ __align__(8) uint64_t mbar_load[STAGES]; __shared__ __align__(8) uint64_t mbar_compute; // ============================================= // Initialize mbarriers // ============================================= if (threadIdx.x == 0) { for (int i = 0; i < STAGES; i++) asm volatile("mbarrier.init.shared.b64 [%0], %1;" :: "l"(&mbar_load[i]), "r"(1)); asm volatile("mbarrier.init.shared.b64 [%0], %1;" :: "l"(&mbar_compute), "r"(blockDim.x)); } __syncthreads(); // Compute block tile coordinates int block_m = blockIdx.x * TILE_M; int block_n = blockIdx.y * TILE_N; // ============================================= // PROLOGUE: Fill pipeline with first tiles // ============================================= if (threadIdx.x == 0) { // TMA load first A and B tiles asm volatile( "cp.async.bulk.tensor.2d.shared::cluster.global.tile" ".mbarrier::complete_tx::bytes" " [%0], [%1, {%2, %3}], [%4];" :: "l"(smem_A[0]), "l"(&tensorMapA), "r"(0), "r"(block_m), "l"(&mbar_load[0]) ); asm volatile( "cp.async.bulk.tensor.2d.shared::cluster.global.tile" ".mbarrier::complete_tx::bytes" " [%0], [%1, {%2, %3}], [%4];" :: "l"(smem_B[0]), "l"(&tensorMapB), "r"(block_n), "r"(0), "l"(&mbar_load[0]) ); } // ============================================= // MAIN LOOP: Iterate over K dimension // ============================================= int num_k_tiles = K / TILE_K; for (int k_tile = 0; k_tile < num_k_tiles; k_tile++) { int buf = k_tile % 2; // Wait for current tile's TMA load to complete asm volatile( "{\n\t" ".reg .pred p;\n\t" "WAIT_LOAD_%=:\n\t" "mbarrier.try_wait.parity.shared.b64 p, [%0], %1;\n\t" "@!p bra WAIT_LOAD_%=;\n\t" "}\n\t" :: "l"(&mbar_load[k_tile % STAGES]), "r"(k_tile / STAGES % 2) ); // Issue TMA load for NEXT tile (prefetch) if (threadIdx.x == 0 && k_tile + 1 < num_k_tiles) { int next_k = (k_tile + 1) * TILE_K; int next_buf = (k_tile + 1) % 2; // TMA load next A and B tiles... // (similar to prologue, targeting smem_A/B[next_buf]) } // ----- tcgen05 MMA ----- asm volatile( "tcgen05.fence::before_thread_sync;\n\t" "tcgen05.mma.cta_group::1.kind::f16" " [%0], %1, %2, %3, 1;\n\t" "tcgen05.commit.cta_group::1" ".mbarrier::arrive [%4];\n\t" :: "l"(/* tmem_addr */), "l"(/* desc_a */), "l"(/* desc_b */), "r"(/* idesc */), "l"(&mbar_compute) ); // Wait for MMA to complete asm volatile( "{\n\t" ".reg .pred p;\n\t" "WAIT_MMA_%=:\n\t" "mbarrier.try_wait.parity.shared.b64 p, [%0], %1;\n\t" "@!p bra WAIT_MMA_%=;\n\t" "}\n\t" :: "l"(&mbar_compute), "r"(k_tile % 2) ); } // ============================================= // EPILOGUE: Store results from TMEM // ============================================= // Move accumulator from TMEM to shared memory asm volatile( "tcgen05.st.16x256b [%0], [%1];\n\t" :: "l"(/* smem_result */), "l"(/* tmem_addr */) ); __syncthreads(); // Store from shared memory to global memory // (each thread stores its portion) int tid = threadIdx.x; int elems_per_thread = (TILE_M * TILE_N) / blockDim.x; half* smem_out = (half*)(smem); // reuse shared memory for (int i = 0; i < elems_per_thread; i++) { int idx = tid * elems_per_thread + i; int row = block_m + idx / TILE_N; int col = block_n + idx % TILE_N; if (row < M && col < N) { D[row * N + col] = smem_out[idx]; } } } ``` --- ## 4.6 The Full Picture: Architecture Comparison ### Evolution of GPU Tensor Core Programming ```mermaid graph TD V["Volta 2017
wmma.mma.sync
Warp-level: 32 threads
16x16x16 tiles
FP16 only
Data: Reg to TC to Reg"] A["Ampere 2020
mma.sync + cp.async
Warp-level: 32 threads
Async global to shared copy
mbarrier sync
TF32, BF16, FP64, Sparsity
Data: Reg to TC to Reg"] H["Hopper 2022
wgmma.mma_async
Warpgroup-level: 128 threads
TMA hardware engine
Thread Block Clusters
DSMEM, FP8
Data: Smem/Reg to TC to Reg"] B["Blackwell 2024
tcgen05.mma
CTA-group level: 1-2 CTAs
TMEM new address space
FP4, MX formats
Enhanced TMA
CTA-group collaboration
Data: Smem to TC to TMEM"] V --> A --> H --> B style V fill:#6c5ce7,stroke:#333,color:#fff style A fill:#00b894,stroke:#333,color:#fff style H fill:#fdcb6e,stroke:#333,color:#000 style B fill:#e17055,stroke:#333,color:#fff ``` ### Peak Performance Across Generations | Generation | FP16 Tensor (TFLOPS) | FP8 Tensor (TFLOPS) | FP4 Tensor (TFLOPS) | HBM BW (TB/s) | |-----------|---------------------|---------------------|---------------------|---------------| | V100 (Volta) | 125 | ; | ; | 0.9 | | A100 (Ampere) | 312 | ; | ; | 2.0 | | H100 (Hopper) | 990 | 1,979 | ; | 3.35 | | B200 (Blackwell) | ~1,800 | ~3,600 | ~9,000 (sparse) | 8.0 | ### The Memory Wall: Compute Grows Faster Than Memory ```mermaid graph TD COMPUTE["Compute Growth:
~4x per generation
V100 to B200: ~72x"] --> GAP MEMORY["Memory BW Growth:
~2x per generation
V100 to B200: ~9x"] --> GAP GAP["The Gap:
Compute outpaces memory ~8:1
We must reduce data movement!"] GAP --> SOL1["1. Lower precision
FP16 to FP8 to FP4 = less data"] GAP --> SOL2["2. Larger on-chip memory
96 to 228 KB smem"] GAP --> SOL3["3. TMA
Eliminates wasted thread work"] GAP --> SOL4["4. TMEM
Eliminates register file bottleneck"] GAP --> SOL5["5. Async everything
Overlap compute and memory"] style GAP fill:#e74c3c,stroke:#333,color:#fff style SOL1 fill:#2ecc71,stroke:#333,color:#fff style SOL2 fill:#2ecc71,stroke:#333,color:#fff style SOL3 fill:#2ecc71,stroke:#333,color:#fff style SOL4 fill:#2ecc71,stroke:#333,color:#fff style SOL5 fill:#2ecc71,stroke:#333,color:#fff ``` --- # Appendix A ; PTX Quick Reference ### Memory Spaces ``` // PTX memory space qualifiers: .global // GPU HBM / GDDR (slowest, largest) .shared // Per-SM shared memory .local // Per-thread local memory (actually in global, cached) .const // Constant memory (cached, broadcast) .param // Kernel parameters .reg // Registers (fastest) // Blackwell only: // TMEM: accessed via tcgen05 instructions (not a PTX address space qualifier) ``` ### Essential Instructions ``` // ===== Arithmetic ===== add.f32 %f0, %f1, %f2; // f0 = f1 + f2 mul.f32 %f0, %f1, %f2; // f0 = f1 * f2 fma.rn.f32 %f0, %f1, %f2, %f3; // f0 = f1 * f2 + f3 (fused, round-nearest) mad.lo.u32 %r0, %r1, %r2, %r3; // r0 = r1 * r2 + r3 (integer) // ===== Memory ===== ld.global.f32 %f0, [%rd0]; // Load from global st.global.f32 [%rd0], %f0; // Store to global ld.shared.f32 %f0, [smem_addr]; // Load from shared st.shared.f32 [smem_addr], %f0; // Store to shared // ===== Async Copy (Ampere+) ===== cp.async.ca.shared.global [dst], [src], 16; // 16 bytes, global to shared cp.async.commit_group; // Commit group cp.async.wait_group N; // Wait for Nth group // ===== TMA (Hopper+) ===== cp.async.bulk.tensor.2d.shared::cluster.global.tile .mbarrier::complete_tx::bytes [smem], [tensorMap, {x, y}], [mbar]; // TMA 2D tile load // ===== mbarrier ===== mbarrier.init.shared.b64 [addr], count; // Initialize mbarrier.arrive.shared.b64 state, [addr]; // Arrive mbarrier.arrive.expect_tx.shared.b64 state, [addr], tx_count; // Arrive + expect async bytes mbarrier.try_wait.parity.shared.b64 pred, [addr], phase; // Non-blocking wait // ===== Tensor Core (Volta/Ampere) ===== wmma.load.a.sync.aligned.m16n16k16.shared.row.f16 {regs}, [addr], stride; wmma.mma.sync.aligned.m16n16k16.row.col.f32.f16.f16.f32 {d_regs}, {a_regs}, {b_regs}, {c_regs}; // ===== WGMMA (Hopper) ===== wgmma.fence.sync.aligned; wgmma.mma_async.sync.aligned.m64n256k16.f32.f16.f16 {acc_regs}, desc_a, desc_b, scale_d, scale_a, scale_b, trans_a, trans_b; wgmma.commit_group.sync.aligned; wgmma.wait_group.sync.aligned N; // ===== tcgen05 (Blackwell) ===== tcgen05.fence::before_thread_sync; tcgen05.mma.cta_group::1.kind::f16 [tmem_d], desc_a, desc_b, idesc, enable; tcgen05.commit.cta_group::1.mbarrier::arrive [mbar]; tcgen05.ld.16x256b [tmem], [smem]; tcgen05.st.16x256b [smem], [tmem]; // ===== Warp Shuffle (all generations) ===== shfl.sync.bfly.b32 %r0, %r1, lane_mask, 0x1f; // Butterfly shuffle shfl.sync.down.b32 %r0, %r1, delta, 0x1f; // Shift down shfl.sync.up.b32 %r0, %r1, delta, 0x0; // Shift up shfl.sync.idx.b32 %r0, %r1, src_lane, 0x1f; // Indexed shuffle // ===== Control Flow ===== setp.lt.f32 %p0, %f0, %f1; // p0 = (f0 < f1) @%p0 bra TARGET; // Conditional branch bar.sync 0; // __syncthreads() ``` --- # Appendix B ; Occupancy Calculator Cheat Sheet ### Quick Reference Table (SM90 ; Hopper) | Regs/Thread | Max Threads/SM | Max Warps | Occupancy | Notes | |-------------|---------------|-----------|-----------|-------| | 16 | 2048 | 64 | 100% | Very few regs ; likely register spills | | 32 | 2048 | 64 | 100% | Sweet spot for simple kernels | | 48 | 1536 | 48 | 75% | | | 64 | 1024 | 32 | 50% | Common for complex kernels | | 80 | 768 | 24 | 37.5% | | | 96 | 640 | 20 | 31.25% | | | 128 | 512 | 16 | 25% | Register-heavy kernel | | 255 | 256 | 8 | 12.5% | Maximum registers allowed | ### Formulas $$\text{Warps from registers} = \left\lfloor \frac{65536}{\text{regs/thread} \times 32} \right\rfloor$$ $$\text{Warps from shared mem} = \min\left( \left\lfloor \frac{\text{max smem per SM}}{\text{smem per block}} \right\rfloor \times \frac{\text{threads per block}}{32}, \; 64 \right)$$ $$\text{Occupancy} = \frac{\min(\text{warps from regs}, \text{warps from smem}, \text{warps from blocks})}{64}$$ --- > *"The thing that makes physics beautiful is that it's simple. The same is true of GPUs ; once you understand that everything is about hiding latency and maximizing throughput, the whole architecture falls into place like a jigsaw puzzle. Every generation, NVIDIA finds a new piece of latency to hide. Registers were the bottleneck for Tensor Cores? Invent TMEM. Threads wasted on address math? Invent TMA. Synchronization too coarse? Invent mbarrier. The physics hasn't changed ; electrons through transistors. But the engineering is magnificent."* > > ; Your friend, Dick Feynman (if he wrote GPU code) --- *End of the Four-Hour Lecture. Go build something.*