# Darshan Baslani - Full Blog Text

This file aggregates the readable source for every blog post on the site.

# Hello, Layout! ; Visualizing Memory in CuTe

- URL: https://www.dcbaslani.xyz/blog/01_hello_layout/
- Markdown: https://www.dcbaslani.xyz/blog/01_hello_layout/index.md
- Text: https://www.dcbaslani.xyz/blog/01_hello_layout/index.txt
- Published: 2025-06-15
- Topic: CUDA
- Description: Understanding CuTe Layouts: how shape and stride turn flat memory into multidimensional grids.

# Hello, Layout! ;  Visualizing Memory in CuTe

**Difficulty:** Beginner
**Prerequisites:** Basic CUDA (kernel launch, `__global__`), C/C++ pointers

## 1. The Problem (The "Why")

You have a flat `float*` pointer ;  1D memory. But your algorithm thinks in rows and columns: a matrix, a tile, a 2D block of pixels. Right now you're hand-writing index math like `ptr[row * WIDTH + col]` and praying you got the constant right. Every time the layout changes (transpose? padding? different tile size?), you rewrite the formula and re-introduce bugs.

CuTe's `Layout` solves this by making the coordinate-to-address mapping a **first-class object**. You declare the shape (how many rows, how many columns) and the stride (the distance between elements), and the Layout *computes* the flat index for you. Change from row-major to column-major? Swap the strides. Reshape from 2D to 3D? Add a mode. The algorithm code never changes.

> **B200 Note:** On Hopper and Blackwell GPUs, the TMA (Tensor Memory Accelerator) hardware needs a *descriptor* that encodes shape and stride information. Understanding Layouts is the **only** way to program TMA correctly.

## 2. The Mental Model (The Visual)

A `Layout` is a **lens** that lets you view flat, 1D memory as if it were a multidimensional grid.

Take `Shape<2, 4>` ;  that's 2 rows and 4 columns, so 8 elements total. The `Stride` decides *how* those coordinates map to physical addresses.

### Column-Major (The Default): `(2,4):(1,2)`

Stride `(1, 2)` means: move 1 row down → address +1. Move 1 column right → address +2.

```text
Physical Memory (1D):
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │
└───┴───┴───┴───┴───┴───┴───┴───┘

Layout maps coordinates → addresses:

        col 0   col 1   col 2   col 3
       ┌───────┬───────┬───────┬───────┐
row 0  │   0   │   2   │   4   │   6   │
       ├───────┼───────┼───────┼───────┤
row 1  │   1   │   3   │   5   │   7   │
       └───────┴───────┴───────┴───────┘

address = row * 1  +  col * 2
          ───────     ───────
          stride[0]   stride[1]
```

Notice the columns are stored contiguously in memory: addresses `0,1` are column 0, addresses `2,3` are column 1, etc. That's **column-major** ;  just like Fortran and MATLAB.

### Row-Major: `(2,4):(4,1)`

Stride `(4, 1)` means: move 1 row down → address +4. Move 1 column right → address +1.

```text
        col 0   col 1   col 2   col 3
       ┌───────┬───────┬───────┬───────┐
row 0  │   0   │   1   │   2   │   3   │
       ├───────┼───────┼───────┼───────┤
row 1  │   4   │   5   │   6   │   7   │
       └───────┴───────┴───────┴───────┘

address = row * 4  +  col * 1
          ───────     ───────
          stride[0]   stride[1]
```

Now rows are contiguous: addresses `0,1,2,3` are row 0. That's **row-major** ;  the C/C++ convention.

> **Key Insight:** Same shape, same data, same 8 elements. The *only* difference is the stride. That's the entire power of a Layout: **the shape says "what," the stride says "where."**

## 3. The Solution (The Code)

A complete CUDA kernel that creates Layouts and prints the coordinate-to-address mapping.

```cpp
#include <cute/layout.hpp>
#include <cute/tensor.hpp>

using namespace cute;

// Print a rank-2 layout as a grid of (row, col) -> address
template <class Shape, class Stride>
__device__ void print_layout_2d(Layout<Shape, Stride> const& layout)
{
    // Only let thread 0 print
    if (threadIdx.x != 0 || blockIdx.x != 0) return;

    printf("Layout: ");
    print(layout);
    printf("\n");

    for (int row = 0; row < size<0>(layout); ++row) {
        for (int col = 0; col < size<1>(layout); ++col) {
            printf("  (%d,%d)->%d", row, col, int(layout(row, col)));
        }
        printf("\n");
    }
    printf("\n");
}

__global__ void hello_layout_kernel()
{
    // 1. Column-major (default when stride is omitted)
    auto col_major = make_layout(make_shape(Int<2>{}, Int<4>{}));
    printf("=== Column-Major ===\n");
    print_layout_2d(col_major);

    // 2. Row-major (use LayoutRight tag)
    auto row_major = make_layout(make_shape(Int<2>{}, Int<4>{}), LayoutRight{});
    printf("=== Row-Major ===\n");
    print_layout_2d(row_major);

    // 3. Custom stride
    auto custom = make_layout(make_shape(Int<2>{}, Int<4>{}),
                              make_stride(Int<8>{}, Int<1>{}));
    printf("=== Custom Stride (2,4):(8,1) ===\n");
    print_layout_2d(custom);
}

int main()
{
    hello_layout_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

**Expected Output:**

```text
=== Column-Major ===
Layout: (_2,_4):(_1,_2)
  (0,0)->0  (0,1)->2  (0,2)->4  (0,3)->6
  (1,0)->1  (1,1)->3  (1,2)->5  (1,3)->7

=== Row-Major ===
Layout: (_2,_4):(_4,_1)
  (0,0)->0  (0,1)->1  (0,2)->2  (0,3)->3
  (1,0)->4  (1,1)->5  (1,2)->6  (1,3)->7

=== Custom Stride (2,4):(8,1) ===
Layout: (_2,_4):(_8,_1)
  (0,0)->0  (0,1)->1  (0,2)->2  (0,3)->3
  (1,0)->8  (1,1)->9  (1,2)->10  (1,3)->11
```

## 4. Step-by-Step Explanation

**Line: `auto col_major = make_layout(make_shape(Int<2>{}, Int<4>{}));`**

- `make_shape(Int<2>{}, Int<4>{})` creates a compile-time shape of 2 rows by 4 columns. `Int<N>{}` is CuTe's static integer ;  the compiler knows the value, so the stride math becomes zero-cost.
- `make_layout(shape)` with no stride argument defaults to `LayoutLeft` (column-major). CuTe computes the stride as an exclusive prefix product of the shape from left to right: stride₀ = 1, stride₁ = 1 × 2 = 2. Result: `(2,4):(1,2)`.

**Line: `auto row_major = make_layout(make_shape(Int<2>{}, Int<4>{}), LayoutRight{});`**

- The `LayoutRight{}` tag generates strides as an exclusive prefix product from *right to left*: stride₁ = 1, stride₀ = 1 × 4 = 4. Result: `(2,4):(4,1)`.

**Line: `auto custom = make_layout(make_shape(...), make_stride(Int<8>{}, Int<1>{}));`**

- Here we provide an explicit stride. Row stride of 8 means there's a gap between rows ;  this is how you express a matrix that lives inside a larger allocation (e.g., a 2×4 submatrix of an 8-column matrix).

**Line: `layout(row, col)`**

- This is the layout's core operation: it takes a 2D coordinate and returns the flat 1D index. The formula is `row × stride[0] + col × stride[1]`. For column-major: `1×row + 2×col`. For row-major: `4×row + 1×col`.

**Line: `size<0>(layout)` / `size<1>(layout)`**

- `size<0>` returns the extent of mode 0 (rows = 2). `size<1>` returns mode 1 (columns = 4). This drives the loop bounds. Because the shape is static, the compiler can fully unroll these loops.

## 5. Engineer's Notebook (Latent Space Notes)

**Analogy:** A Layout is a *spreadsheet formula* for memory. The Shape is how many rows and columns the spreadsheet has. The Stride is the formula in each cell that says "to find my data, take my row number × *this* + my column number × *that*." Change the formula, change the order data is fetched from RAM ;  but the spreadsheet looks identical to the user.

**Hardware Note:** On B200/Hopper GPUs, when you set up a TMA descriptor with `make_tma_copy`, you must provide a Layout that describes the tensor's shape and strides in global memory. If you get the strides wrong, the TMA engine will fetch garbage. The concepts from this tutorial ;  `make_shape`, `make_stride`, column-major vs. row-major ;  are *exactly* what the TMA descriptor encodes.

**CuTe Notation Cheat-Sheet:**

| Notation | Meaning |
| :--- | :--- |
| `(2,4):(1,2)` | Shape `(2,4)`, Stride `(1,2)` ;  column-major |
| `(2,4):(4,1)` | Shape `(2,4)`, Stride `(4,1)` ;  row-major |
| `_N` in output | Static (compile-time) integer |
| `N` in output | Dynamic (run-time) integer |
| `Int<N>{}` | C++ way to write a static integer |
| `LayoutLeft` | Column-major stride generation (default) |
| `LayoutRight` | Row-major stride generation |

---

# The Art of Slicing ; Partitioning Data Across Blocks and Threads

- URL: https://www.dcbaslani.xyz/blog/02_the_art_of_slicing/
- Markdown: https://www.dcbaslani.xyz/blog/02_the_art_of_slicing/index.md
- Text: https://www.dcbaslani.xyz/blog/02_the_art_of_slicing/index.txt
- Published: 2025-06-20
- Topic: CUDA
- Description: How CuTe's local_tile and local_partition replace manual index math to slice matrices across CTAs and threads.

# The Art of Slicing ;  Partitioning Data Across Blocks and Threads

**Difficulty:** Beginner
**Prerequisites:** [Tutorial 01: Hello, Layout!](https://www.dcbaslani.xyz/blog/01_hello_layout/), CUDA thread blocks (`blockIdx`, `threadIdx`)

## 1. The Problem (The "Why")

You have a big matrix ;  say 512×512 ;  sitting in global memory. Your GPU launches a grid of thread blocks, each of which should work on a small tile (e.g., 128×128). Inside each block, 256 threads need to split *that* tile further so every thread handles its own little piece.

Without CuTe, you write something like:

```cpp
int row_start = blockIdx.x * TILE_M;
int col_start = blockIdx.y * TILE_N;
int thr_row   = threadIdx.x % THR_M;
int thr_col   = threadIdx.x / THR_M;
// ... and then more math for every matrix, every stage, every layout change
```

This is fragile. Change the tile size? Rewrite the math. Transpose a matrix? Rewrite it again. Add a K-dimension for GEMM? Rewrite it *again*.

CuTe replaces all of this with two composable operations:

| Operation | Level | What it does |
| :--- | :--- | :--- |
| `local_tile` | Block (CTA) | "Give this block its tile from the big matrix." |
| `local_partition` | Thread | "Give this thread its elements from the tile." |

Both work on any shape, any stride, any number of dimensions.

> **B200 Note:** When you program the TMA engine on Hopper/Blackwell, you still need `local_tile` to select *which* tile the TMA should fetch. The TMA descriptor handles the copy, but you tell it *where* to point using the exact same tiling logic described here.

## 2. The Mental Model (The Visual)

### Step 1: The Big Picture ;  Tiling for CTAs

Imagine an 8×8 matrix divided into 4×4 tiles. Four thread blocks share the work:

```text
                       Global Matrix (8×8)
         col 0  col 1  col 2  col 3  col 4  col 5  col 6  col 7
       ┌───────────────────────────┬───────────────────────────┐
row 0  │                           │                           │
row 1  │      Block (0,0)          │      Block (0,1)          │
row 2  │      rows 0-3, cols 0-3   │      rows 0-3, cols 4-7   │
row 3  │                           │                           │
       ├───────────────────────────┼───────────────────────────┤
row 4  │                           │                           │
row 5  │      Block (1,0)          │      Block (1,1)          │
row 6  │      rows 4-7, cols 0-3   │      rows 4-7, cols 4-7   │
row 7  │                           │                           │
       └───────────────────────────┴───────────────────────────┘

  local_tile(matrix, Shape<4,4>{}, make_coord(blockIdx.x, blockIdx.y))
  → extracts one 4×4 tile for each block
```

`local_tile` applies a **tiler** (here `Shape<4,4>`) to produce tiles, then uses the block coordinate to pick the right one. Under the hood it calls `zipped_divide`, which reshapes the matrix into `((Tile), (Rest))` ;  tile elements in the first mode, tile indices in the second ;  then slices the rest-mode with your coordinate.

### Step 2: Inside a Tile ;  Partitioning for Threads

Now zoom into Block (0,0)'s 4×4 tile. We have 4 threads arranged as a 2×2 layout. Each thread gets a 2×2 chunk:

```text
            Block (0,0)'s Tile (4×4)

         col 0   col 1   col 2   col 3
       ┌─────────────────┬─────────────────┐
row 0  │  T0      T0     │  T1      T1     │
row 1  │  T0      T0     │  T1      T1     │
       ├─────────────────┼─────────────────┤
row 2  │  T2      T2     │  T3      T3     │
row 3  │  T2      T2     │  T3      T3     │
       └─────────────────┴─────────────────┘

  Thread layout: (2,2):(1,2)  →  4 threads in column-major
  Tile shape:    (4,4)
  Each thread:   (4/2, 4/2) = (2,2) elements

  local_partition(tile, Layout<Shape<2,2>>{}, threadIdx.x)
  → Thread 0 gets the top-left 2×2 subtensor
```

`local_partition` takes a **thread layout** ;  a Layout that maps coordinates to thread indices ;  and uses `zipped_divide` to split the tile into thread-sized pieces, then slices into the tile-mode with the thread's *coordinate* (derived from its flat index via the thread layout).

> **Key Insight:** `local_tile` slices the **rest** (picks which tile). `local_partition` slices the **tile** (picks which thread's elements). They are two sides of the same `zipped_divide` coin.

## 3. The Solution (The Code)

A CUDA kernel that tiles a matrix across blocks, partitions each tile across threads, and prints what each agent owns.

```cpp
#include <cute/tensor.hpp>

using namespace cute;

template <class Shape_MN, class CtaTiler, class ThreadLayout>
__global__ void partition_kernel(Shape_MN shape_mn,
                                 CtaTiler cta_tiler,
                                 ThreadLayout thr_layout)
{
    // 1. Build the full matrix as a counting tensor (value at (i,j) = flat index)
    //    In real code this would be make_tensor(make_gmem_ptr(ptr), shape, stride)
    auto matrix = make_counting_tensor(make_layout(shape_mn));          // (M,N)

    // 2. CTA-level: each block picks its tile
    auto cta_coord = make_coord(blockIdx.x, blockIdx.y);
    auto cta_tile  = local_tile(matrix, cta_tiler, cta_coord);         // (BLK_M,BLK_N)

    // 3. Thread-level: each thread picks its elements from the tile
    auto thr_tile  = local_partition(cta_tile, thr_layout, threadIdx.x); // (THR_M,THR_N)

    // 4. Print from Block (0,0), Thread 0 only
    if (blockIdx.x == 0 && blockIdx.y == 0 && threadIdx.x == 0) {
        printf("=== Full matrix layout ===\n");
        print(matrix.layout()); printf("\n\n");

        printf("=== Block (0,0) tile ===\n");
        print(cta_tile.layout()); printf("\n");
        for (int m = 0; m < size<0>(cta_tile); ++m) {
            for (int n = 0; n < size<1>(cta_tile); ++n) {
                printf("%3d ", int(cta_tile(m, n)));
            }
            printf("\n");
        }

        printf("\n=== Block (0,0), Thread 0 partition ===\n");
        print(thr_tile.layout()); printf("\n");
        printf("Thread 0 owns elements: ");
        for (int i = 0; i < size(thr_tile); ++i) {
            printf("%d ", int(thr_tile(i)));
        }
        printf("\n");
    }
}

int main()
{
    // 8×8 matrix, tiled into 4×4 blocks, 4 threads per block (2×2)
    auto shape_mn   = make_shape(Int<8>{}, Int<8>{});
    auto cta_tiler  = make_shape(Int<4>{}, Int<4>{});
    auto thr_layout = make_layout(make_shape(Int<2>{}, Int<2>{}));     // (m,n)->thr_idx

    dim3 grid(size(ceil_div(get<0>(shape_mn), get<0>(cta_tiler))),
              size(ceil_div(get<1>(shape_mn), get<1>(cta_tiler))));    // 2×2 blocks
    dim3 block(size(thr_layout));                                      // 4 threads

    partition_kernel<<<grid, block>>>(shape_mn, cta_tiler, thr_layout);
    cudaDeviceSynchronize();
    return 0;
}
```

**Expected Output:**

```text
=== Full matrix layout ===
(_8,_8):(_1,_8)

=== Block (0,0) tile ===
(_4,_4):(_1,_8)
  0   8  16  24
  1   9  17  25
  2  10  18  26
  3  11  19  27

=== Block (0,0), Thread 0 partition ===
(_2,_2):(_1,_8)
Thread 0 owns elements: 0 1 8 9
```

Block (0,0) gets the top-left 4×4 tile (addresses 0–3 down column 0, 8–11 down column 1, etc. ;  column-major). Thread 0 gets the top-left 2×2 corner of that tile.

## 4. Step-by-Step Explanation

**Line: `auto matrix = make_counting_tensor(make_layout(shape_mn));`**

- `make_counting_tensor` creates a "virtual" tensor where the value at each coordinate equals its flat index. No actual memory ;  it's a trick for visualizing the Layout's mapping. In a real kernel, replace this with `make_tensor(make_gmem_ptr(ptr), shape, stride)`.

**Line: `auto cta_tile = local_tile(matrix, cta_tiler, cta_coord);`**

- `local_tile` is CuTe's **inner partition**. Under the hood:
  1. `zipped_divide(matrix, cta_tiler)` reshapes the `(8,8)` matrix into `((4,4),(2,2))` ;  a 4×4 tile mode and a 2×2 "rest" mode (2 tiles per row, 2 tiles per column).
  2. The `cta_coord` slices the rest-mode: `(blockIdx.x, blockIdx.y)` picks one tile.
- Block (0,0) gets the tile at rest-coordinate (0,0) ;  the top-left 4×4 region.

**Line: `auto thr_tile = local_partition(cta_tile, thr_layout, threadIdx.x);`**

- `local_partition` is CuTe's **outer partition**. Under the hood:
  1. It takes the *shape* of `thr_layout` ;  here `(2,2)` ;  and uses it as a tiler on the `(4,4)` tile, producing `((2,2),(2,2))` ;  2×2 elements per thread, tiled over 2×2 threads.
  2. It converts `threadIdx.x` into a coordinate using `thr_layout`. Thread 0 → coord `(0,0)`, Thread 1 → `(1,0)`, Thread 2 → `(0,1)`, Thread 3 → `(1,1)`.
  3. That coordinate slices the tile-mode, so each thread gets the matching 2×2 block of elements.

**Line: `ceil_div(get<0>(shape_mn), get<0>(cta_tiler))`**

- `ceil_div(8, 4) = 2`. This gives us 2 blocks in the M-dimension (and 2 in N), for a total of 4 blocks.

**`cosize` (used in real kernels for shared memory):**

- `cosize(layout)` returns the size of the layout's codomain ;  the number of unique addresses needed. Use it to allocate shared memory: `__shared__ float smem[cosize_v<SmemLayout>];`.

## 5. Engineer's Notebook (Latent Space Notes)

**Analogy:** Think of partitioning as **dealing cards at a poker table**:

1. **`local_tile`** = The casino floor manager assigns a *table* (tile) to each *dealer* (CTA). "Table 0, you get the top-left quarter of the deck."
2. **`local_partition`** = The dealer at each table divides the cards among the *players* (threads). "Player 0, you get these 4 cards."
3. **`zipped_divide`** = The reorganization step that makes dealing possible. It splits the deck into `((hand), (tables))` so you can ask for any specific table's hand in one shot.

The key invariant: **every element is owned by exactly one block and exactly one thread within that block.** No overlap, no gaps (assuming tile sizes evenly divide the matrix ;  see the [predication tutorial](https://www.dcbaslani.xyz/blog/0y_predication/) for handling remainders).

**Hardware Note:** On Hopper/B200, the tile coordinate you compute with `local_tile` is the *same* coordinate you pass to `cute::copy` with a TMA atom. The TMA engine uses the descriptor's shape/stride (from Tutorial 01) and this coordinate to DMA the right tile into shared memory ;  zero address math in registers.

**The `local_tile` / `local_partition` Cheat-Sheet:**

| Function | Level | Slices into... | You provide... | You get... |
| :--- | :--- | :--- | :--- | :--- |
| `local_tile` | CTA | Rest (which tile) | Tile shape + block coord | The tile data |
| `local_partition` | Thread | Tile (which thread) | Thread layout + thread idx | That thread's elements |

**Under the hood ;  both use `zipped_divide`:**

```text
zipped_divide(data, tiler)  →  ((Tile), (Rest))
                                  │         │
                         local_partition  local_tile
                         slices here      slices here
```

---

# The Naive Copy ; Scalar vs. Vectorized Memory Movement

- URL: https://www.dcbaslani.xyz/blog/03_the_naive_copy/
- Markdown: https://www.dcbaslani.xyz/blog/03_the_naive_copy/index.md
- Text: https://www.dcbaslani.xyz/blog/03_the_naive_copy/index.txt
- Published: 2026-02-22
- Topic: CUDA
- Description: Why scalar copies leave 75% of memory bandwidth on the table, and how CuTe's auto-vectorization fixes it.

# The Naive Copy ;  Scalar vs. Vectorized Memory Movement

**Difficulty:** Beginner
**Prerequisites:** [Tutorial 01: Hello, Layout!](https://www.dcbaslani.xyz/blog/01_hello_layout/), [Tutorial 02: The Art of Slicing](https://www.dcbaslani.xyz/blog/02_the_art_of_slicing/), basic CUDA memory model

## 1. The Problem (The "Why")

You've partitioned your matrix beautifully. Every thread knows which elements it owns. Now it needs to actually *move* data ;  from global memory to shared memory, or from shared memory to registers. The simplest approach: a `for` loop that copies one `float` at a time.

```cpp
// The "obvious" way ;  one element per iteration
for (int i = 0; i < N; ++i) {
    dst[i] = src[i];
}
```

This works. It's also leaving **75% of your memory bandwidth on the table.**

Why? Because the GPU's memory bus is 128 bits wide. A single `float` is 32 bits. Every load instruction fetches 128 bits from DRAM anyway ;  but if you only asked for 32 bits, the other 96 bits are thrown away. You're paying for a 4-lane highway but driving in one lane.

The fix: **vectorized loads**. Instead of `LDG.32` (load 32 bits), the compiler emits `LDG.128` (load 128 bits = 4 floats at once). Same number of memory transactions, 4× the useful data per transaction.

CuTe handles this automatically ;  *if* your data meets two conditions. This tutorial shows you what those conditions are and how to make sure your copies always hit the fast path.

> **B200 Note:** On Hopper/Blackwell, vectorized copies are the minimum bar. The real prize is `cp.async` and TMA (Tutorials 04 and 08), which bypass registers entirely. But even those advanced paths require the same alignment and contiguity fundamentals covered here.

## 2. The Mental Model (The Visual)

### Scalar Copy: One Element at a Time

```text
Global Memory (128-bit bus):
┌─────────────────────────────────────┐
│  float₀  float₁  float₂  float₃   │  ← 128 bits fetched from DRAM
└─────────────────────────────────────┘
     ↓
   LDG.32 grabs only float₀ (32 bits)
   The other 96 bits? Wasted.

   Thread issues 4 separate loads:
   LDG.32 → float₀
   LDG.32 → float₁
   LDG.32 → float₂
   LDG.32 → float₃
   = 4 instructions, 4 transactions
```

### Vectorized Copy: Four Elements at Once

```text
Global Memory (128-bit bus):
┌─────────────────────────────────────┐
│  float₀  float₁  float₂  float₃   │  ← 128 bits fetched from DRAM
└─────────────────────────────────────┘
     ↓
   LDG.128 grabs ALL FOUR floats in one shot

   Thread issues 1 load:
   LDG.128 → float₀, float₁, float₂, float₃
   = 1 instruction, 1 transaction
```

### The Two Requirements for Vectorization

```text
Requirement 1: CONTIGUITY (stride-1)
   ✅ Contiguous:  [ f₀ | f₁ | f₂ | f₃ ]   stride = 1
                      +1   +1   +1
   ❌ Strided:     [ f₀ |    | f₁ |    ]   stride = 2
                      +2        +2           Can't bundle into one load

Requirement 2: ALIGNMENT (address divisible by vector width)
   ✅ Aligned:     addr 0x0000  →  [ f₀ f₁ f₂ f₃ ]   0 % 16 == 0
   ❌ Misaligned:  addr 0x0004  →  [ f₁ f₂ f₃ f₄ ]   4 % 16 ≠ 0
                   Can't do 128-bit load from a 32-bit boundary
```

> **The Dolly Rule:** Think of vectorization like moving boxes on a loading dock. `UniversalCopy` hand-carries one box at a time ;  always works, always slow. `AutoVectorizingCopy` is the smart worker who checks: "Are these boxes side-by-side (contiguous) and does the stack start at a dolly-sized slot (aligned)?" If yes, load them all onto the dolly in one trip. If not, fall back to one-at-a-time.

## 3. The Solution (The Code)

A CUDA kernel that copies a tile from global to shared memory using CuTe's `copy()`, then measures effective bandwidth for both scalar and vectorized paths.

```cpp
#include <cute/tensor.hpp>
#include <cute/algorithm/copy.hpp>
#include <cute/arch/copy.hpp>

#include <cstdio>
#include <chrono>

using namespace cute;

// ─── Kernel: copy a tile from global → shared memory ───
template <class CopyOp, class GmemLayout, class SmemLayout>
__global__ void copy_kernel(float const* __restrict__ g_ptr,
                            GmemLayout   gmem_layout,
                            SmemLayout   smem_layout)
{
    extern __shared__ float smem[];

    // Wrap raw pointers in CuTe Tensors
    auto g_tensor = make_tensor(make_gmem_ptr(g_ptr),  gmem_layout);  // (M, N)
    auto s_tensor = make_tensor(make_smem_ptr(smem),   smem_layout);  // (M, N)

    // Each thread gets its partition of the tile
    auto thr_g = local_partition(g_tensor, Layout<Shape<_8, _32>>{}, threadIdx.x);
    auto thr_s = local_partition(s_tensor, Layout<Shape<_8, _32>>{}, threadIdx.x);

    // ── The actual copy ──
    copy(CopyOp{}, thr_g, thr_s);
}

// ─── Benchmark harness ───
template <class CopyOp>
float benchmark_copy(const char* label, int M, int N, int iters)
{
    size_t bytes = M * N * sizeof(float);

    float *d_src;
    cudaMalloc(&d_src, bytes);
    cudaMemset(d_src, 1, bytes);  // fill with something

    // Layouts: column-major for both gmem and smem
    auto gmem_layout = make_layout(make_shape(M, N), LayoutLeft{});
    auto smem_layout = make_layout(make_shape(M, N), LayoutLeft{});

    int smem_bytes = M * N * sizeof(float);
    dim3 block(256);
    dim3 grid(1);

    // Warmup
    copy_kernel<CopyOp><<<grid, block, smem_bytes>>>(d_src, gmem_layout, smem_layout);
    cudaDeviceSynchronize();

    // Timed loop
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        copy_kernel<CopyOp><<<grid, block, smem_bytes>>>(d_src, gmem_layout, smem_layout);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0;
    cudaEventElapsedTime(&ms, start, stop);

    float gb_per_sec = (float(bytes) * iters / 1e9) / (ms / 1e3);
    printf("%-40s  %8.2f GB/s   (%6.3f ms for %d iters)\n", label, gb_per_sec, ms, iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_src);
    return gb_per_sec;
}

int main()
{
    int M = 128, N = 256, iters = 1000;

    printf("Copying %dx%d tile (%zu bytes) from Global → Shared Memory\n\n", M, N, M*N*sizeof(float));

    // 1. Scalar: UniversalCopy copies one float (32 bits) per call
    benchmark_copy<UniversalCopy<float>>(
        "UniversalCopy<float>  (scalar, 32-bit)", M, N, iters);

    // 2. Vectorized: AutoVectorizingCopy bundles up to 128 bits
    benchmark_copy<AutoVectorizingCopyWithAssumedAlignment<128>>(
        "AutoVectorizingCopy   (vector, 128-bit)", M, N, iters);

    // 3. For reference: explicit 64-bit vectorization
    benchmark_copy<AutoVectorizingCopyWithAssumedAlignment<64>>(
        "AutoVectorizingCopy   (vector,  64-bit)", M, N, iters);

    return 0;
}
```

**Expected Output (approximate ;  numbers vary by GPU):**

```text
Copying 128x256 tile (131072 bytes) from Global → Shared Memory

UniversalCopy<float>  (scalar, 32-bit)      85.20 GB/s   ( 1.539 ms for 1000 iters)
AutoVectorizingCopy   (vector, 128-bit)     310.45 GB/s   ( 0.422 ms for 1000 iters)
AutoVectorizingCopy   (vector,  64-bit)     205.30 GB/s   ( 0.638 ms for 1000 iters)
```

The vectorized 128-bit copy is roughly **3–4× faster** than scalar. The 64-bit variant lands in between.

## 4. Step-by-Step Explanation

**Line: `auto g_tensor = make_tensor(make_gmem_ptr(g_ptr), gmem_layout);`**

- `make_gmem_ptr(g_ptr)` wraps the raw pointer and tags it as global memory. CuTe uses this tag to select the right load instruction (`LDG` vs. `LDS` vs. register move). Without it, CuTe can't distinguish memory spaces.

**Line: `auto thr_g = local_partition(g_tensor, Layout<Shape<_8, _32>>{}, threadIdx.x);`**

- The thread layout `(8, 32)` means 8 rows × 32 columns = 256 threads. This is the `thr_layout` from Tutorial 02. Each thread gets a `(M/8, N/32)` = `(16, 8)` subtensor ;  128 floats to copy.

**Line: `copy(CopyOp{}, thr_g, thr_s);`**

- This is the core call. CuTe dispatches based on `CopyOp`:
  - **`UniversalCopy<float>`**: Copies one `float` at a time. The inner loop is `dst(i) = src(i)` ;  a simple scalar assignment. The compiler emits `LDG.32` + `STS.32` per element.
  - **`AutoVectorizingCopyWithAssumedAlignment<128>`**: CuTe inspects the source and destination layouts at compile time and asks three questions:

```text
   1. max_common_vector(src, dst):  How many elements are contiguous
                                    (stride-1) in BOTH tensors?
                                    For column-major (16,8):(1,16) → 16 elements.

   2. max_alignment(src, dst):      What's the natural alignment?
                                    GCD of shape and stride components.

   3. gcd(contiguous_bits, alignment_bits, MaxVecBits):
                                    The actual vector width to use.
                                    min(16 * 32, alignment, 128) = 128 bits.
```

  When the result is 128 bits and both tensors qualify, CuTe calls `recast<uint128_t>(tensor)` to reinterpret 4 floats as a single 128-bit integer, then copies those. The compiler emits `LDG.128` + `STS.128` ;  one instruction moves 4 floats.

**What happens when vectorization fails:**

If you use a strided layout (e.g., stride-2 between elements), `max_common_vector` returns 1. CuTe falls back to element-by-element copy ;  same speed as `UniversalCopy`. Your code doesn't crash, it just silently takes the slow path. This is why understanding contiguity matters.

**Line: `copy(src, dst)` (no CopyOp argument)**

- When you call `copy()` *without* specifying a copy operation, CuTe picks one for you:
  - Static layout? → `AutoVectorizingCopyWithAssumedAlignment<128>` (assumes 128-bit aligned pointers)
  - Dynamic layout? → `AutoVectorizingCopyWithAssumedAlignment<8>` (conservative ;  no alignment assumed)
- **Takeaway:** The default `copy(src, dst)` already tries to vectorize. You don't need to pass `AutoVectorizingCopy` explicitly unless you want to control `MaxVecBits`.

## 5. Engineer's Notebook (Latent Space Notes)

**Analogy:** Copying data is like **moving boxes on a loading dock**. `UniversalCopy` hand-carries one box at a time ;  it always works, but it's slow. `AutoVectorizingCopy` is the smart worker who checks: "Are these boxes side-by-side on the shelf (contiguous, stride-1)? Does the stack start at a slot that fits my dolly (aligned to 128-bit boundary)?" If yes, load them onto the dolly and move 4 boxes in one trip. If not, fall back to hand-carrying.

**The Three Laws of Auto-Vectorization:**

1. **Contiguity** ;  Both source and destination must have stride-1 elements in memory. Column-major `(M,N):(1,M)` is contiguous along mode-0. Row-major `(M,N):(N,1)` is contiguous along mode-1. Irregular strides (e.g., stride-3) → scalar fallback.
2. **Alignment** ;  The starting address must be divisible by the vector width. `cudaMalloc` returns 256-byte-aligned pointers, so global memory is fine. Shared memory offsets can break alignment if your tile size isn't a multiple of 4 floats (16 bytes).
3. **MaxVecBits** ;  CuTe caps vectorization at this value (default 128 bits). You can lower it for testing, but 128 is the sweet spot for most GPU architectures.

**The Copy Atom Hierarchy:**

| Copy Operation | Vector Width | When to Use |
| :--- | :--- | :--- |
| `UniversalCopy<T>` | `sizeof(T)` bits | Fallback. Always works, never fast. |
| `AutoVectorizingCopyWithAssumedAlignment<128>` | Up to 128 bits | Default for static layouts. The "just works" choice. |
| `AutoVectorizingCopyWithAssumedAlignment<64>` | Up to 64 bits | When you know alignment is only 8-byte. |
| `AutoVectorizingCopyWithAssumedAlignment<8>` | Up to 8 bits | Conservative ;  CuTe uses this for dynamic layouts. Essentially scalar. |

**Hardware Note:** The GPU memory bus fetches 128 bits (or 32 bytes on some architectures) per transaction regardless of how much you asked for. Scalar loads waste bus bandwidth. Vectorized loads use the full transaction. This is why `LDG.128` doesn't cost more latency than `LDG.32` ;  the memory controller does the same work either way. You're just *using* more of what it already fetched.

> **Gotcha:** CuTe's auto-vectorization is a **compile-time** decision. If your layouts are dynamic (runtime shapes/strides), CuTe can't prove contiguity at compile time and falls back to scalar. Use `Int<N>{}` (static integers) for shapes and strides whenever possible ;  it's not just about performance, it's about enabling vectorization.

**What comes next:** This tutorial copies with *one thread per partition*. But moving a 128×128 tile with 256 threads ;  where all threads cooperate on the same tile ;  requires coordinating who copies what. That's `TiledCopy` (Tutorial 04), which combines a `Copy_Atom` (the vectorization from this tutorial) with a thread layout and a value layout to orchestrate the full tile copy.

---

# The Parallel Copy ; Orchestrating Threads with TiledCopy

- URL: https://www.dcbaslani.xyz/blog/04_the_parallel_copy/
- Markdown: https://www.dcbaslani.xyz/blog/04_the_parallel_copy/index.md
- Text: https://www.dcbaslani.xyz/blog/04_the_parallel_copy/index.txt
- Published: 2026-02-24
- Topic: CUDA
- Description: How TiledCopy bundles thread layout, copy atoms, and value layout into one declarative object for coordinated, vectorized parallel copies.

# The Parallel Copy ;  Orchestrating Threads with TiledCopy

**Difficulty:** Beginner–Intermediate
**Prerequisites:** [Tutorial 01: Hello, Layout!](https://www.dcbaslani.xyz/blog/01_hello_layout/), [Tutorial 02: The Art of Slicing](https://www.dcbaslani.xyz/blog/02_the_art_of_slicing/), [Tutorial 03: The Naive Copy](https://www.dcbaslani.xyz/blog/03_the_naive_copy/)

## 1. The Problem (The "Why")

Tutorial 03 showed how one thread can vectorize its copy ;  using a dolly instead of hand-carrying boxes. But a real kernel doesn't have one worker on the loading dock; it has 32, 128, or 256 workers that all need to copy the *same tile together*, without stepping on each other's toes.

You could combine Tutorial 02's `local_partition` with Tutorial 03's `copy()`:

```cpp
// Manual approach ;  works, but fragile
auto thr_g = local_partition(g_tensor, Layout<Shape<_8, _32>>{}, threadIdx.x);
auto thr_s = local_partition(s_tensor, Layout<Shape<_8, _32>>{}, threadIdx.x);
copy(AutoVectorizingCopy{}, thr_g, thr_s);
```

This works ;  but *you* have to manually choose a thread layout, make sure it divides the tile evenly, and hope the resulting per-thread partition is contiguous enough for vectorization. Change the tile size? Redo the math. Transpose the layout? Redo it again. Switch from `LDG` to `cp.async`? Rewrite the whole thing.

CuTe's `TiledCopy` bundles all three decisions into one declarative object:

| Piece | What it controls | From Tutorial... |
| :--- | :--- | :--- |
| `Copy_Atom` | What each thread can carry in one trip (the dolly) | 03 |
| `thr_layout` | Where each thread stands on the tile (their grid position) | 02 |
| `val_layout` | How many values each thread handles per trip | New |

You declare these once. CuTe generates the partitioning, the vectorization, and the thread-to-element mapping automatically.

> **B200 Note:** On Hopper/Blackwell, `TiledCopy` is also the container for TMA and `cp.async` atoms. The `thr_layout`/`val_layout` mechanism is identical ;  only the `Copy_Atom` changes. Master this pattern now and TMA (Tutorial 08) becomes a one-line swap.

## 2. The Mental Model (The Visual)

Imagine you have a **16×8 tile** to copy and **32 threads** (one warp). You need to answer three questions:

1. **What tool does each thread use?** → `Copy_Atom` with `UniversalCopy<uint128_t>` ;  each thread moves 128 bits (4 floats) per load instruction. This is the dolly from Tutorial 03.
2. **How are threads arranged?** → `thr_layout = (4, 8)` ;  4 rows of threads, 8 columns.
3. **How many extra values per thread?** → `val_layout = (1, 1)` ;  the atom's 4 floats are enough; no extra tiling needed.

```text
Tile coverage = thr_layout × atom_values × val_layout
              = (4, 8)    × (4, 1)      × (1, 1)
              = (16, 8)   ✓ matches our tile!
```

### The Ownership Map

Each cell shows which thread (T00–T31) copies it. The atom's 4 floats go along mode-0 (down the column), so each thread owns a 4×1 vertical strip:

```text
              col 0    col 1    col 2    col 3    col 4    col 5    col 6    col 7
            ┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
   row  0   │  T00   │  T04   │  T08   │  T12   │  T16   │  T20   │  T24   │  T28   │
   row  1   │  T00   │  T04   │  T08   │  T12   │  T16   │  T20   │  T24   │  T28   │ ← T00 owns
   row  2   │  T00   │  T04   │  T08   │  T12   │  T16   │  T20   │  T24   │  T28   │   rows 0–3,
   row  3   │  T00   │  T04   │  T08   │  T12   │  T16   │  T20   │  T24   │  T28   │   col 0
            ├────────┼────────┼────────┼────────┼────────┼────────┼────────┼────────┤
   row  4   │  T01   │  T05   │  T09   │  T13   │  T17   │  T21   │  T25   │  T29   │
   row  5   │  T01   │  T05   │  T09   │  T13   │  T17   │  T21   │  T25   │  T29   │
   row  6   │  T01   │  T05   │  T09   │  T13   │  T17   │  T21   │  T25   │  T29   │
   row  7   │  T01   │  T05   │  T09   │  T13   │  T17   │  T21   │  T25   │  T29   │
            ├────────┼────────┼────────┼────────┼────────┼────────┼────────┼────────┤
   row  8   │  T02   │  T06   │  T10   │  T14   │  T18   │  T22   │  T26   │  T30   │
   row  9   │  T02   │  T06   │  T10   │  T14   │  T18   │  T22   │  T26   │  T30   │
   row 10   │  T02   │  T06   │  T10   │  T14   │  T18   │  T22   │  T26   │  T30   │
   row 11   │  T02   │  T06   │  T10   │  T14   │  T18   │  T22   │  T26   │  T30   │
            ├────────┼────────┼────────┼────────┼────────┼────────┼────────┼────────┤
   row 12   │  T03   │  T07   │  T11   │  T15   │  T19   │  T23   │  T27   │  T31   │
   row 13   │  T03   │  T07   │  T11   │  T15   │  T19   │  T23   │  T27   │  T31   │
   row 14   │  T03   │  T07   │  T11   │  T15   │  T19   │  T23   │  T27   │  T31   │
   row 15   │  T03   │  T07   │  T11   │  T15   │  T19   │  T23   │  T27   │  T31   │
            └────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘
```

Notice the two critical patterns:

- **Within each thread:** T00's 4 values (rows 0–3 of col 0) are contiguous in column-major memory. That's 4 floats = 128 bits ;  a single `LDG.128`. The atom's `uint128_t` type made this happen.
- **Across threads:** Each column has a different thread (T00, T04, T08, ...). No two threads touch the same cell. The `thr_layout = (4,8)` made this happen.

### The Three-Part Recipe

```text
┌──────────────────────────────────┐
│         Copy_Atom                │ ← The tool: "what can one thread carry?"
│  (UniversalCopy<uint128_t>,      │    Each load moves 4 floats / 128 bits.
│   float)                         │    (the dolly from Tutorial 03)
└────────────────┬─────────────────┘
                 │
    ┌────────────┴────────────┐
    │                         │
┌───┴──────────┐   ┌─────────┴──────────┐
│  thr_layout  │   │    val_layout       │
│  (4, 8)      │   │    (1, 1)           │
│              │   │                     │
│  "32 workers │   │  "no extra values   │
│   in a 4×8   │   │   beyond what the   │
│   grid"      │   │   atom handles"     │
└──────────────┘   └────────────────────┘

Tile covered = (4 thr × 4 atom × 1 val, 8 thr × 1 atom × 1 val) = (16, 8)
```

> **The Moving Crew Rule:** Tutorial 03 was about one mover and their dolly. `TiledCopy` is the **shift supervisor's clipboard** ;  a plan that assigns every mover to a grid position on the warehouse floor and specifies what tool to use. The `Copy_Atom` says what each mover carries (the dolly). The `thr_layout` says where each mover stands. The `val_layout` says how many extra trips each mover makes beyond one atom load. Together, these three things tile the entire room with no gaps and no overlaps.

## 3. The Solution (The Code)

Two kernels: one visualizes the thread-to-element ownership map, one performs an actual global → shared copy.

```cpp
#include <cute/tensor.hpp>
#include <cute/atom/copy_atom.hpp>
#include <cute/algorithm/copy.hpp>

#include <cstdio>

using namespace cute;

// ─── Kernel 1: Visualize which thread owns which tile cell ───
template <class TiledCopy, class TileLayout>
__global__ void visualize_ownership(TiledCopy tiled_copy, TileLayout tile_layout)
{
    extern __shared__ float smem[];

    // Build a shared-memory tensor to stamp thread IDs into
    auto tile = make_tensor(make_smem_ptr(smem), tile_layout);

    // Get this thread's copy slice
    auto thr_copy = tiled_copy.get_thread_slice(threadIdx.x);
    auto thr_view = thr_copy.partition_D(tile);

    // Each thread writes its ID into the cells it owns
    for (int i = 0; i < size(thr_view); ++i) {
        thr_view(i) = float(threadIdx.x);
    }
    __syncthreads();

    // Thread 0 prints the full ownership map
    if (threadIdx.x == 0) {
        printf("Tile ownership map (%d x %d), %d threads:\n\n",
               int(size<0>(tile)), int(size<1>(tile)), int(blockDim.x));

        printf("       ");
        for (int n = 0; n < size<1>(tile); ++n) printf("c%-4d ", n);
        printf("\n");

        for (int m = 0; m < size<0>(tile); ++m) {
            printf("r%-4d  ", m);
            for (int n = 0; n < size<1>(tile); ++n) {
                printf("T%02d   ", int(tile(m, n)));
            }
            printf("\n");
        }
    }
}

// ─── Kernel 2: Actual Global → Shared copy using TiledCopy ───
template <class TiledCopy, class GmemLayout, class SmemLayout>
__global__ void tiled_copy_kernel(float const* __restrict__ g_ptr,
                                  GmemLayout   gmem_layout,
                                  SmemLayout   smem_layout,
                                  TiledCopy    tiled_copy)
{
    extern __shared__ float smem[];

    auto g_tensor = make_tensor(make_gmem_ptr(g_ptr), gmem_layout);  // (M, N)
    auto s_tensor = make_tensor(make_smem_ptr(smem),  smem_layout);  // (M, N)

    // ── The key lines: partition via TiledCopy ──
    auto thr_copy = tiled_copy.get_thread_slice(threadIdx.x);
    auto thr_g    = thr_copy.partition_S(g_tensor);    // source partition
    auto thr_s    = thr_copy.partition_D(s_tensor);    // destination partition

    // ── Execute the copy ──
    copy(tiled_copy, thr_g, thr_s);

    __syncthreads();

    // Thread 0 prints a few rows to verify correctness
    if (threadIdx.x == 0) {
        printf("\nShared memory after TiledCopy (column-major, showing row by row):\n");
        for (int m = 0; m < size<0>(s_tensor); ++m) {
            printf("  row %2d: ", m);
            for (int n = 0; n < size<1>(s_tensor); ++n) {
                printf("%5.0f ", s_tensor(m, n));
            }
            printf("\n");
        }
    }
}

int main()
{
    constexpr int M = 16, N = 8;

    // ─── Build the TiledCopy ───
    // 1. Copy_Atom: each load moves 128 bits (4 floats) via uint128_t
    // 2. thr_layout: 32 threads in a 4×8 grid (column-major)
    // 3. val_layout: 1×1 ;  the atom already handles 4 floats per thread
    auto tiled_copy = make_tiled_copy(
        Copy_Atom<UniversalCopy<uint128_t>, float>{},
        Layout<Shape<_4, _8>>{},      // thr_layout: 4 rows × 8 cols = 32 threads
        Layout<Shape<_1, _1>>{}       // val_layout: atom covers everything
    );

    // Tile layouts: column-major for both
    auto tile_layout = make_layout(make_shape(Int<M>{}, Int<N>{}), LayoutLeft{});

    int smem_bytes = M * N * sizeof(float);

    // ─── Kernel 1: Visualize the ownership map ───
    printf("=== TiledCopy Ownership Map ===\n\n");
    visualize_ownership<<<1, 32, smem_bytes>>>(tiled_copy, tile_layout);
    cudaDeviceSynchronize();

    // ─── Kernel 2: Actual Global → Shared copy ───
    // Fill source: h_data[i] = i (flat index in column-major order)
    float h_data[M * N];
    for (int i = 0; i < M * N; ++i) h_data[i] = float(i);

    float* d_data;
    cudaMalloc(&d_data, sizeof(h_data));
    cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);

    printf("\n=== TiledCopy: Global -> Shared ===\n");
    tiled_copy_kernel<<<1, 32, smem_bytes>>>(d_data, tile_layout, tile_layout, tiled_copy);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

**Expected Output:**

```text
=== TiledCopy Ownership Map ===

Tile ownership map (16 x 8), 32 threads:

       c0    c1    c2    c3    c4    c5    c6    c7
r0     T00   T04   T08   T12   T16   T20   T24   T28
r1     T00   T04   T08   T12   T16   T20   T24   T28
r2     T00   T04   T08   T12   T16   T20   T24   T28
r3     T00   T04   T08   T12   T16   T20   T24   T28
r4     T01   T05   T09   T13   T17   T21   T25   T29
r5     T01   T05   T09   T13   T17   T21   T25   T29
r6     T01   T05   T09   T13   T17   T21   T25   T29
r7     T01   T05   T09   T13   T17   T21   T25   T29
r8     T02   T06   T10   T14   T18   T22   T26   T30
r9     T02   T06   T10   T14   T18   T22   T26   T30
r10    T02   T06   T10   T14   T18   T22   T26   T30
r11    T02   T06   T10   T14   T18   T22   T26   T30
r12    T03   T07   T11   T15   T19   T23   T27   T31
r13    T03   T07   T11   T15   T19   T23   T27   T31
r14    T03   T07   T11   T15   T19   T23   T27   T31
r15    T03   T07   T11   T15   T19   T23   T27   T31

=== TiledCopy: Global -> Shared ===

Shared memory after TiledCopy (column-major, showing row by row):
  row  0:     0    16    32    48    64    80    96   112
  row  1:     1    17    33    49    65    81    97   113
  row  2:     2    18    34    50    66    82    98   114
  row  3:     3    19    35    51    67    83    99   115
  row  4:     4    20    36    52    68    84   100   116
  row  5:     5    21    37    53    69    85   101   117
  row  6:     6    22    38    54    70    86   102   118
  row  7:     7    23    39    55    71    87   103   119
  row  8:     8    24    40    56    72    88   104   120
  row  9:     9    25    41    57    73    89   105   121
  row 10:    10    26    42    58    74    90   106   122
  row 11:    11    27    43    59    75    91   107   123
  row 12:    12    28    44    60    76    92   108   124
  row 13:    13    29    45    61    77    93   109   125
  row 14:    14    30    46    62    78    94   110   126
  row 15:    15    31    47    63    79    95   111   127
```

The values confirm column-major order: element 0 at (0,0), element 1 at (1,0), ..., element 16 at (0,1). The `TiledCopy` moved every element to the right shared memory cell ;  32 threads, zero conflicts, all vectorized.

## 4. Step-by-Step Explanation

**Line: `auto tiled_copy = make_tiled_copy(Copy_Atom{...}, thr_layout, val_layout);`**

This is the declaration ;  the shift supervisor's clipboard. Three ingredients:

1. **`Copy_Atom<UniversalCopy<uint128_t>, float>{}`** ;  The copy atom. `uint128_t` means each thread loads 128 bits = 4 floats in a single instruction (`LDG.128`). The `float` tells CuTe the logical element type.

2. **`Layout<Shape<_4, _8>>{}`** ;  Thread layout. 32 threads arranged as 4 rows × 8 columns in column-major order. Thread 0 → position (0,0), Thread 1 → (1,0), ..., Thread 4 → (0,1), ..., Thread 31 → (3,7).

3. **`Layout<Shape<_1, _1>>{}`** ;  Value layout. No extra values per thread beyond the atom's 4 floats. If the tile were bigger, you'd increase this to cover the extra elements (see "Bigger Tiles" below).

**Tile coverage:** `(4 threads × 4 atom_vals × 1 val, 8 threads × 1 atom_val × 1 val) = (16, 8)`.

**Line: `auto thr_copy = tiled_copy.get_thread_slice(threadIdx.x);`**

Each thread asks the `TiledCopy` for its personal *slice* ;  a lightweight object that knows which tile cells belong to this thread. This is each mover reading their assignment off the clipboard.

**Line: `auto thr_g = thr_copy.partition_S(g_tensor);`**

Partitions the source (global memory) tensor for this thread. The result is a tensor containing only the elements this thread should read. The shape is `(ValuesPerAtom, Repetitions...)`:

- For our 16×8 tile: shape `(_4)` ;  four contiguous floats loaded as one `uint128_t`. No repetitions because the tile matches the `TiledCopy` coverage exactly.
- If the tile were **32×8**: shape `(_4, _2)` ;  four floats per load × 2 rounds along mode-0. CuTe adds the repetition dimension automatically.

**Line: `auto thr_s = thr_copy.partition_D(s_tensor);`**

Same for the destination (shared memory). Its shape matches `thr_g` so that `copy()` knows where each source value lands.

**Line: `copy(tiled_copy, thr_g, thr_s);`**

Executes the copy. CuTe loops over all repetition dimensions (if any), applying the `Copy_Atom` to each batch of values. For our 16×8 tile:

- 32 threads × 4 floats each = 128 floats = entire tile
- Each thread issues one `LDG.128` (global load) and one `STS.128` (shared store)
- Total: 32 loads + 32 stores, all vectorized ;  the entire tile moves in one round

### Bigger Tiles and Repetitions

What if your tile is 128×128 but you still have 32 threads?

```cpp
auto big_copy = make_tiled_copy(
    Copy_Atom<UniversalCopy<uint128_t>, float>{},
    Layout<Shape<_4, _8>>{},      // still 32 threads
    Layout<Shape<_1, _1>>{}       // same atom
);
// TiledCopy covers (16, 8) per round.
// Tile is (128, 128) → repetitions = (128/16, 128/8) = (8, 16).
// partition_S returns shape (_4, _8, _16).
// copy() loops over the 8×16 = 128 repetitions automatically.
```

Or you can increase `val_layout` to give each thread more work per round:

```cpp
auto wide_copy = make_tiled_copy(
    Copy_Atom<UniversalCopy<uint128_t>, float>{},
    Layout<Shape<_4, _8>>{},      // 32 threads
    Layout<Shape<_2, _1>>{}       // each thread handles 2 columns of atom-loads
);
// Coverage per round: (4×4×2, 8×1×1) = (32, 8).
// partition_S for a 128×128 tile: shape (_4, _2, _4, _16).
```

Don't worry about memorizing these shapes ;  the point is that `copy()` handles the loops. You just set up the `TiledCopy` and call `copy()`.

## 5. Engineer's Notebook (Latent Space Notes)

**Analogy:** Tutorial 03 was about one mover and their dolly. `TiledCopy` is the **shift supervisor's clipboard** ;  a plan that assigns every mover to a grid position on the warehouse floor. The `Copy_Atom` says what tool each mover uses (the dolly). The `thr_layout` is where they stand. The `val_layout` is how many extra trips each mover makes. Three pieces, one plan, zero overlap.

**The `val_layout` and Vectorization:**

The atom's vector direction must align with the contiguous memory dimension, or you lose vectorization:

| Memory Layout | Atom Type | Why It Works |
| :--- | :--- | :--- |
| Column-major `(M,N):(1,M)` | `UniversalCopy<uint128_t>` with thr (4,8) | 4 floats along mode-0 are stride-1 → `LDG.128` |
| Row-major `(M,N):(N,1)` | `UniversalCopy<uint128_t>` with thr (8,4) | Swap thread arrangement so the atom's 4 values hit stride-1 mode-1 |
| Strided / dynamic | `UniversalCopy<float>` | Scalar fallback ;  always works, always slow |

If the atom's 4-element chunk lands on non-contiguous addresses, the hardware can't coalesce them and you're back to scalar speed. Always check: *"which dimension has stride 1?"* ;  and point the atom's vector width in that direction.

**The `TiledCopy` API Cheat-Sheet:**

| API Call | What It Does |
| :--- | :--- |
| `make_tiled_copy(atom, thr, val)` | Build the plan from three ingredients |
| `tiled_copy.get_thread_slice(tid)` | Each worker reads their assignment |
| `slice.partition_S(src)` | Partition source for this thread |
| `slice.partition_D(dst)` | Partition destination for this thread |
| `copy(tiled_copy, thr_src, thr_dst)` | Execute the copy (loops over repetitions) |

**`TiledCopy` vs. Manual `local_partition` + `copy`:**

| | `local_partition` + `copy` | `TiledCopy` |
| :--- | :--- | :--- |
| Thread mapping | You compute it | Declared once |
| Vectorization | Depends on partition contiguity | Follows from atom type |
| Swap to `cp.async` / TMA | Rewrite everything | Change one `Copy_Atom` |
| Repetitions for big tiles | Manual loop | Automatic |

**Hardware Note:** For global memory loads, the GPU coalesces requests from threads in the same warp that hit the same 128-byte cache line. To maximize coalescing, threads with adjacent IDs should access adjacent memory addresses. For column-major data, spread threads along mode-0 (rows) first ;  which is what `thr_layout = (T_rows, T_cols)` in column-major order does naturally: threads 0, 1, 2, 3 hit rows 0–3 of the same column, which are adjacent in memory.

> **Gotcha ;  partition shape surprise:** `partition_S` and `partition_D` return tensors shaped `(ValuesPerAtom, Repetitions...)`, NOT `(M, N)`. The first mode is the "hand" for a single atom invocation; the remaining modes are how many times the atom repeats to cover the tile. If you `print()` the partition and see an unexpected number of modes, it's because your tile is bigger than the `TiledCopy`'s single-round coverage. This is normal ;  `copy()` iterates over the repetitions automatically.

**What comes next:** Each thread's copy is now vectorized and coordinated. But when 32 threads simultaneously write to shared memory, they can collide on the *same bank*. If threads 0, 8, 16, 24 all write to addresses that map to shared memory bank 0, the hardware serializes those writes ;  a **bank conflict**. That's the topic of Tutorial 05 (Swizzling), where `composition` with a `Swizzle` layout remaps addresses to spread them across banks.

---

# Swizzling ; Avoiding Shared Memory Bank Conflicts

- URL: https://www.dcbaslani.xyz/blog/05_swizzling/
- Markdown: https://www.dcbaslani.xyz/blog/05_swizzling/index.md
- Text: https://www.dcbaslani.xyz/blog/05_swizzling/index.txt
- Published: 2026-02-26
- Topic: CUDA
- Description: How CuTe's Swizzle XORs address bits to eliminate shared memory bank conflicts with a single line of code.

# Swizzling ;  Avoiding Shared Memory Bank Conflicts

**Difficulty:** Intermediate
**Prerequisites:** [Tutorial 01: Hello, Layout!](https://www.dcbaslani.xyz/blog/01_hello_layout/), [Tutorial 04: The Parallel Copy](https://www.dcbaslani.xyz/blog/04_the_parallel_copy/), basic shared memory understanding

## 1. The Problem (The "Why")

Tutorial 04's `TiledCopy` moves data from global to shared memory ;  32 threads, 128-bit vectorized stores, zero coordination issues. Beautiful.

But there's a hidden trap. Shared memory isn't one big flat buffer. It's split into **32 banks**, each serving one 4-byte access per cycle. When two threads in the same warp access the *same bank* at the *same time* (different addresses within that bank), the hardware serializes their accesses. This is a **bank conflict**. An N-way conflict means N sequential accesses instead of 1 ;  your shared memory bandwidth drops to 1/N.

Here's the catch: bank conflicts depend on the **access pattern**, not just the data layout. Your `TiledCopy` might *write* to smem conflict-free (column-first, nicely spread across banks). But the next stage ;  an MMA reading that data row-first ;  might collide on every access.

The culprit is regularity. Column-major strides like 8, 16, 32, or 128 divide evenly into 32 banks, so different columns in the same row keep landing on the same bank. The fix: **swizzle** the shared memory layout. CuTe's `Swizzle<B, M, S>` XORs parts of the address to break this regularity ;  one line of code, zero bank conflicts.

> **B200 Note:** On Blackwell (SM100), all supported MMA swizzle modes ;  including no-swizzle (8×16B interleaved) ;  are **bank-conflict-free on both the MMA read side and the TMA write side**. Swizzling still matters, though: using no-swizzle or smaller swizzle modes can reduce **TMA achievable throughput** when populating shared memory. So on Hopper/Blackwell the swizzle is primarily about maximizing TMA write bandwidth, not avoiding MMA read conflicts.

## 2. The Mental Model (The Visual)

### How Shared Memory Banks Work

Shared memory is divided into 32 banks, each 4 bytes wide. Bank assignment is cyclic:

```text
Float index:    0    1    2    3    4   ...   31   32   33  ...
Bank:          B00  B01  B02  B03  B04 ...  B31  B00  B01  ...

                    bank = float_index % 32
```

Within a single warp (32 threads executing simultaneously):
- Each thread accesses a *different* bank → **1 cycle** (full parallelism)
- K threads access the *same* bank → **K cycles** (serialized)

### The Problem: Column-Major + Row Access = Conflict

Consider an 8×8 tile stored column-major in shared memory (stride = `(1, 8)`):

```text
WITHOUT Swizzle ;  Bank map for 8×8 column-major (1,8):

       c0    c1    c2    c3    c4    c5    c6    c7
r0     B00   B08   B16   B24   B00   B08   B16   B24    ← 2-way conflict!
r1     B01   B09   B17   B25   B01   B09   B17   B25      c0 & c4 share B00
r2     B02   B10   B18   B26   B02   B10   B18   B26      c1 & c5 share B08
r3     B03   B11   B19   B27   B03   B11   B19   B27      c2 & c6 share B16
r4     B04   B12   B20   B28   B04   B12   B20   B28      c3 & c7 share B24
r5     B05   B13   B21   B29   B05   B13   B21   B29
r6     B06   B14   B22   B30   B06   B14   B22   B30
r7     B07   B15   B23   B31   B07   B15   B23   B31
```

Reading **down a column** (the TiledCopy write path): banks 0,1,2,3,4,5,6,7 ;  all different. No conflict.

Reading **across a row** (e.g., MMA consuming data): row 0 hits banks 0,8,16,24,**0**,8,16,24 ;  **2-way conflict** on every bank! Columns 0 and 4 always collide, 1 and 5 collide, and so on.

Why? Because the column stride (8) divides evenly into 32: `(col × 8) % 32` cycles with period 4, so columns 0 and 4 produce the same bank. The regularity of the stride creates a repeating pattern that piles threads onto the same banks.

> With larger strides the problem gets worse. A stride of 32 maps *every column in the same row* to the same bank ;  an 8-way conflict for 8 columns. A stride of 128? Still hits the same bank pattern, just at a bigger scale.

### The Fix: Swizzle Breaks the Pattern

`Swizzle<3, 2, 3>` XORs high address bits into low address bits, scrambling the bank assignment so no two columns in the same row share a bank:

```text
WITH Swizzle<3,2,3> ;  Bank map for the same 8×8 tile:

       c0    c1    c2    c3    c4    c5    c6    c7
r0     B00   B08   B16   B24   B04   B12   B20   B28    ← all 8 unique!
r1     B01   B09   B17   B25   B05   B13   B21   B29
r2     B02   B10   B18   B26   B06   B14   B22   B30
r3     B03   B11   B19   B27   B07   B15   B23   B31
r4     B04   B12   B20   B28   B00   B08   B16   B24    ← all 8 unique!
r5     B05   B13   B21   B29   B01   B09   B17   B25
r6     B06   B14   B22   B30   B02   B10   B18   B26
r7     B07   B15   B23   B31   B03   B11   B19   B27
```

Read across row 0: banks 0,8,16,24,**4**,12,20,28 ;  **all 8 unique!**
Read across row 4: banks 4,12,20,28,**0**,8,16,24 ;  **all 8 unique!**
Read down column 4: banks 4,5,6,7,0,1,2,3 ;  **all 8 unique!**

Zero bank conflicts in any direction. The XOR shifted columns 4–7 by 4 banks relative to columns 0–3, breaking the collision pattern.

### The Brick Wall Analogy

```text
Without swizzle (joints aligned):        With swizzle (staggered):

┌────┬────┬────┬────┐                    ┌────┬────┬────┬────┐
│ B0 │ B8 │ B0 │ B8 │                    │ B0 │ B8 │ B4 │B12 │
├────┼────┼────┼────┤                    ├────┼────┼────┼────┤
│ B0 │ B8 │ B0 │ B8 │                    │ B4 │B12 │ B0 │ B8 │
├────┼────┼────┼────┤                    ├────┼────┼────┼────┤
│ B0 │ B8 │ B0 │ B8 │                    │ B0 │ B8 │ B4 │B12 │
└────┴────┴────┴────┘                    └────┴────┴────┴────┘
Same banks every row                     Banks shift per row
→ conflicts stack up                     → conflicts eliminated
```

Swizzle is the GPU equivalent of **staggered brick-laying**. In a brick wall, each row is offset by half a brick so the joints don't line up ;  this prevents cracks from running straight through. In shared memory, each row's addresses are XOR-shifted so the bank assignments don't repeat ;  this prevents bank conflicts from stacking up.

### How `Swizzle<B, M, S>` Works

The swizzle modifies a flat address by XOR-ing two groups of bits:

```text
Address bit layout for Swizzle<3, 2, 3>:

  bit:   7  6  5 │ 4  3  2 │ 1  0
         ────────│─────────│──────
         source  │ target  │ free
         (B=3)   │ (B=3)   │(M=2)
                 │    ↑    │
                 └────XOR──┘
                   (shift S=3)

  swizzled = addr ^ ((addr >> S) & mask)

  where mask covers B bits at the target position.
```

The three parameters:

| Parameter | Meaning | Effect |
| :--- | :--- | :--- |
| **M** (free bits) | Bottom M bits are untouched | 2^M elements stay contiguous. **M=2 → 4 floats (128 bits) stay together → vectorized loads still work!** |
| **B** (XOR width) | Number of bits to XOR | Scrambles across 2^B banks at a time. B=3 → 8-bank groups. |
| **S** (shift) | Distance between source and target bit fields | Target bits = `[M : M+B)`. Source bits = `[M+S : M+S+B)`. |

The critical parameter is **M**: it controls the granularity of the swizzle. Because bits `[0:M)` are untouched, blocks of 2^M consecutive elements remain contiguous after swizzling. With M=2, blocks of 4 floats (= 128 bits) are preserved ;  exactly what `LDG.128` / `STS.128` needs.

## 3. The Solution (The Code)

A bank conflict visualizer that prints the bank assignment map with and without swizzle, followed by a `TiledCopy` demonstrating that the swizzle is transparent to the copy.

```cpp
#include <cute/tensor.hpp>
#include <cute/swizzle.hpp>

#include <cstdio>

using namespace cute;

// ─── Kernel: Print bank assignment for every tile cell ───
template <class Layout>
__global__ void bank_conflict_visualizer(Layout smem_layout, const char* label)
{
    if (threadIdx.x != 0) return;

    int M = size<0>(smem_layout);
    int N = size<1>(smem_layout);

    // ── Print bank map ──
    printf("%s (%d x %d):\n\n", label, M, N);

    printf("       ");
    for (int n = 0; n < N; ++n) printf("c%-4d ", n);
    printf("\n");

    for (int m = 0; m < M; ++m) {
        printf("r%-4d  ", m);
        for (int n = 0; n < N; ++n) {
            int addr = smem_layout(m, n);   // flat offset (in floats)
            int bank = addr % 32;
            printf("B%02d   ", bank);
        }
        printf("\n");
    }

    // ── Count row-wise conflicts ──
    int total_conflicts = 0;
    for (int m = 0; m < M; ++m) {
        int bank_hits[32] = {};
        for (int n = 0; n < N; ++n) {
            bank_hits[smem_layout(m, n) % 32]++;
        }
        for (int b = 0; b < 32; ++b) {
            if (bank_hits[b] > 1)
                total_conflicts += bank_hits[b] - 1;
        }
    }
    printf("\nRow-wise bank conflicts: %d  (%s)\n\n",
           total_conflicts, total_conflicts == 0 ? "CLEAN" : "CONFLICTS!");
}

// ─── Kernel: TiledCopy into swizzled smem ───
template <class TiledCopy, class GmemLayout, class SmemLayout>
__global__ void copy_with_swizzle(float const* __restrict__ g_ptr,
                                  GmemLayout   gmem_layout,
                                  SmemLayout   smem_layout,
                                  TiledCopy    tiled_copy)
{
    extern __shared__ float smem[];

    auto g_tensor = make_tensor(make_gmem_ptr(g_ptr), gmem_layout);
    auto s_tensor = make_tensor(make_smem_ptr(smem),  smem_layout);

    auto thr_copy = tiled_copy.get_thread_slice(threadIdx.x);
    auto thr_g    = thr_copy.partition_S(g_tensor);
    auto thr_s    = thr_copy.partition_D(s_tensor);

    // ── Copy ;  swizzle is completely transparent ──
    copy(tiled_copy, thr_g, thr_s);

    __syncthreads();

    // Thread 0 verifies the data (reading through the swizzled layout)
    if (threadIdx.x == 0) {
        printf("Shared memory (logical view through swizzled layout):\n");
        for (int m = 0; m < size<0>(s_tensor); ++m) {
            printf("  row %d: ", m);
            for (int n = 0; n < size<1>(s_tensor); ++n) {
                printf("%5.0f ", s_tensor(m, n));
            }
            printf("\n");
        }
    }
}

int main()
{
    constexpr int M = 8, N = 8;

    // ─── 1. Bank conflict visualizer ───
    printf("=== Bank Conflict Visualizer ===\n\n");

    // Plain column-major layout: (8,8):(1,8)
    auto plain = Layout<Shape<_8, _8>, Stride<_1, _8>>{};

    // Swizzled layout: composition(swizzle, layout)
    //   composition applies the XOR to the flat offset that layout produces
    auto swizzled = composition(Swizzle<3, 2, 3>{}, plain);

    bank_conflict_visualizer<<<1, 1>>>(plain, "WITHOUT Swizzle");
    cudaDeviceSynchronize();

    bank_conflict_visualizer<<<1, 1>>>(swizzled, "WITH Swizzle<3,2,3>");
    cudaDeviceSynchronize();

    // ─── 2. TiledCopy with swizzled smem ───
    printf("=== TiledCopy + Swizzle ===\n\n");

    auto gmem_layout = Layout<Shape<_8, _8>, Stride<_1, _8>>{};

    // TiledCopy: 32 threads, 2 floats (64 bits) per atom
    auto tiled_copy = make_tiled_copy(
        Copy_Atom<UniversalCopy<uint64_t>, float>{},   // 64 bits = 2 floats
        Layout<Shape<_4, _8>>{},                        // 32 threads in 4×8
        Layout<Shape<_1, _1>>{}
    );

    float h_data[M * N];
    for (int i = 0; i < M * N; ++i) h_data[i] = float(i);

    float* d_data;
    cudaMalloc(&d_data, sizeof(h_data));
    cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);

    int smem_bytes = M * N * sizeof(float);
    copy_with_swizzle<<<1, 32, smem_bytes>>>(
        d_data, gmem_layout, swizzled, tiled_copy);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

**Expected Output:**

```text
=== Bank Conflict Visualizer ===

WITHOUT Swizzle (8 x 8):

       c0    c1    c2    c3    c4    c5    c6    c7
r0     B00   B08   B16   B24   B00   B08   B16   B24
r1     B01   B09   B17   B25   B01   B09   B17   B25
r2     B02   B10   B18   B26   B02   B10   B18   B26
r3     B03   B11   B19   B27   B03   B11   B19   B27
r4     B04   B12   B20   B28   B04   B12   B20   B28
r5     B05   B13   B21   B29   B05   B13   B21   B29
r6     B06   B14   B22   B30   B06   B14   B22   B30
r7     B07   B15   B23   B31   B07   B15   B23   B31

Row-wise bank conflicts: 8  (CONFLICTS!)

WITH Swizzle<3,2,3> (8 x 8):

       c0    c1    c2    c3    c4    c5    c6    c7
r0     B00   B08   B16   B24   B04   B12   B20   B28
r1     B01   B09   B17   B25   B05   B13   B21   B29
r2     B02   B10   B18   B26   B06   B14   B22   B30
r3     B03   B11   B19   B27   B07   B15   B23   B31
r4     B04   B12   B20   B28   B00   B08   B16   B24
r5     B05   B13   B21   B29   B01   B09   B17   B25
r6     B06   B14   B22   B30   B02   B10   B18   B26
r7     B07   B15   B23   B31   B03   B11   B19   B27

Row-wise bank conflicts: 0  (CLEAN!)

=== TiledCopy + Swizzle ===

Shared memory (logical view through swizzled layout):
  row 0:     0     8    16    24    32    40    48    56
  row 1:     1     9    17    25    33    41    49    57
  row 2:     2    10    18    26    34    42    50    58
  row 3:     3    11    19    27    35    43    51    59
  row 4:     4    12    20    28    36    44    52    60
  row 5:     5    13    21    29    37    45    53    61
  row 6:     6    14    22    30    38    46    54    62
  row 7:     7    15    23    31    39    47    55    63
```

The data is logically correct ;  `s_tensor(m, n)` returns the right value even though the physical addresses in shared memory have been scrambled. The swizzle is completely transparent to the reader: you access `(row, col)` the same way you always did, and CuTe routes to the swizzled address behind the scenes.

## 4. Step-by-Step Explanation

**Line: `auto plain = Layout<Shape<_8, _8>, Stride<_1, _8>>{};`**

The unswizzled column-major layout. `plain(m, n) = m * 1 + n * 8`. This is the address formula from Tutorial 01 ;  the same `address = coord · stride` dot product. The bank for element `(m, n)` is `(m + 8n) % 32`.

**Line: `auto swizzled = composition(Swizzle<3, 2, 3>{}, plain);`**

This is the one-line fix. `composition(f, g)` creates a new function `h(x) = f(g(x))`:

1. `plain(m, n)` converts coordinates to a flat offset: `m + 8*n`
2. `Swizzle<3, 2, 3>` XORs high bits into low bits of that offset

The result is a new layout where `swizzled(m, n)` gives a *different* flat offset ;  one that spreads banks evenly. CuTe stores and retrieves data through this remapped offset, so the logical view is unchanged but the physical addresses avoid conflicts.

**How the XOR works for `Swizzle<3, 2, 3>`:**

```text
Example: plain address 32 (row=0, col=4) in binary = 0b100000

  Step 1: Extract source bits [5:8] → 0b1 (bit 5 is set)
  Step 2: Shift right by S=3       → 0b100 (now at bit position 2)
  Step 3: XOR with original         → 0b100000 ^ 0b000100 = 0b100100 = 36

  Plain bank:    32 % 32 = 0   (same as col 0!)
  Swizzled bank: 36 % 32 = 4   (different ;  conflict eliminated)
```

The XOR takes "which region of the tile am I in?" (the high bits) and mixes it into "which bank do I hit?" (the low bits). Different regions get different scrambles, so they never collide.

**Line: `auto swizzled = composition(Swizzle<3, 2, 3>{}, plain);`** *(why these specific numbers?)*

- **M=2 (free bits):** The bottom 2 bits of the address are untouched. 2^2 = 4 consecutive floats stay contiguous = 128 bits. This preserves vectorized `STS.128` stores. If M were 0, even adjacent elements could get scrambled and vectorization would break.
- **B=3 (XOR width):** 3 bits → scramble across groups of 2^3 = 8 banks. Enough to break the 8-column pattern of our 8×8 tile.
- **S=3 (shift):** Source bits start at position M+S = 5, right above the target bits at position M = 2. No overlap between source and target.

**Line: `copy(tiled_copy, thr_g, thr_s);`**

The `TiledCopy` doesn't know about the swizzle ;  and doesn't need to. It partitions `s_tensor` based on its layout (which now includes the swizzle), and the `copy()` call stores through the swizzled addresses. Each thread's 2-float store still lands on contiguous addresses (because M=2 preserves 4-element contiguity), so vectorization is unaffected.

**Line: `s_tensor(m, n)` in the print loop**

Reading back through the swizzled layout is also transparent. `s_tensor(m, n)` computes the swizzled address, reads from that location, and returns the correct value. The logical view is identical to the plain layout ;  the scrambling only affects the physical address.

## 5. Engineer's Notebook (Latent Space Notes)

**Analogy:** Swizzle is **staggered brick-laying** for shared memory. In a brick wall, each row is offset so joints don't line up vertically ;  this prevents cracks from running straight through. In shared memory, each row's addresses are XOR-shifted so bank assignments don't repeat across columns ;  this prevents bank conflicts from stacking up. The `composition` call is the mortar that binds the swizzle to your layout: one line, and every access goes through the staggered pattern automatically.

**Choosing Swizzle Parameters:**

| Parameter | Rule of Thumb |
| :--- | :--- |
| **M** (free bits) | Set to `log2(vector_width / sizeof(element))`. For 128-bit loads on `float`: M = log2(128/32) = 2. For `half`: M = log2(128/16) = 3. |
| **B** (XOR width) | Set to `log2(num_columns_to_disambiguate)`. For 8 columns: B=3. For 16: B=4 (but you'll need the address space to support it). |
| **S** (shift) | Usually = B (non-overlapping source and target fields). This is the simplest and most common choice. |

**Common Swizzle Configurations in CUTLASS:**

| Swizzle | Use Case | Free Bits | Scramble Width |
| :--- | :--- | :--- | :--- |
| `Swizzle<3, 3, 3>` | 128-byte smem tiles, `half` elements | 8 halfs = 128 bits | 8 banks |
| `Swizzle<3, 2, 3>` | 128-byte smem tiles, `float` elements | 4 floats = 128 bits | 8 banks |
| `Swizzle<2, 3, 3>` | 64-byte smem tiles | 8 elements | 4 banks |
| `Swizzle<1, 3, 3>` | 32-byte smem tiles | 8 elements | 2 banks |
| `Swizzle<0, 0, 0>` | No swizzle (identity) | ;  | ;  |

**Why the swizzle doesn't break vectorization:**

The M "free" bits guarantee that blocks of 2^M consecutive elements remain at consecutive addresses after swizzling. For M=2, any group of 4 adjacent floats stays contiguous ;  exactly what `STS.128` needs. The swizzle only shuffles *which group of 4* goes where, not the elements within the group.

**`composition` ;  the key CuTe operation:**

`composition(f, g)` computes `f(g(x))`. When `f` is a `Swizzle` and `g` is a `Layout`, the result is a new layout-like object that maps coordinates to swizzled offsets. You can use it anywhere a layout is expected:

```cpp
// Unswizzled ;  has bank conflicts
auto smem_layout = Layout<Shape<_128, _32>, Stride<_1, _128>>{};

// Swizzled ;  bank-conflict-free, one-line change
auto smem_layout = composition(Swizzle<3, 2, 3>{},
                               Layout<Shape<_128, _32>, Stride<_1, _128>>{});

// Use it exactly like a normal layout
auto s_tensor = make_tensor(make_smem_ptr(smem), smem_layout);
```

**Hardware Note:** Shared memory bank conflicts show up in `ncu` (NVIDIA Nsight Compute) under the metric `l1tex__data_bank_conflicts_pipe_lsu_mem_shared`. If this number is non-zero, you have conflicts. The fix is almost always a swizzle on your smem layout. On Hopper/Blackwell, all MMA swizzle modes (including no-swizzle) are bank-conflict-free on the MMA read side ;  the swizzle in CUTLASS's default smem layouts for WGMMA/tcgen05 is there to maximize **TMA write throughput** when populating shared memory, not to avoid read-side bank conflicts.

> **Gotcha ;  swizzle and `cosize`:** A swizzled layout may produce offsets larger than the plain layout's maximum. Always allocate shared memory based on `cosize(swizzled_layout)`, not `size(plain_layout)`. In practice, for well-chosen parameters (where B+M+S ≤ address bits), the max offset stays within the original range, but it's good practice to use `cosize` regardless.

> **Gotcha ;  debugging swizzled smem:** If you `printf` raw smem addresses, the data looks scrambled. This is expected ;  the physical layout *is* scrambled. Always access through the CuTe tensor (using logical coordinates), and the swizzle is transparent. If you need to dump raw smem for debugging, compose with the inverse swizzle (XOR is its own inverse ;  applying the same swizzle twice gives the original address).

**What comes next:** With vectorized, parallel, bank-conflict-free copies from global to shared memory, the data movement story is complete. Tutorial 06 (Hello, MMA) shifts to the *compute* side: feeding that data into a Tensor Core instruction to trigger a hardware matrix multiply.

---

# Hello, MMA — Your First Tensor Core Instruction

- URL: https://www.dcbaslani.xyz/blog/06_hello_mma/
- Markdown: https://www.dcbaslani.xyz/blog/06_hello_mma/index.md
- Text: https://www.dcbaslani.xyz/blog/06_hello_mma/index.txt
- Published: 2026-03-03
- Topic: CUDA
- Description: How to use CuTe's TiledMMA to execute a matrix multiply-accumulate on NVIDIA Tensor Cores.

# Hello, MMA ;  Your First Tensor Core Instruction

**Difficulty:** Intermediate
**Prerequisites:** [Tutorial 01: Hello, Layout!](https://www.dcbaslani.xyz/blog/01_hello_layout/), [Tutorial 04: The Parallel Copy](https://www.dcbaslani.xyz/blog/04_the_parallel_copy/) (same partition API pattern)

## 1. The Problem (The "Why")

Tutorials 03–05 were about *moving data* ;  getting floats from global memory into shared memory, fast and conflict-free. But at some point, you need to actually *compute*. On modern NVIDIA GPUs, the fastest math unit isn't the CUDA core ;  it's the **Tensor Core**.

A Tensor Core executes a matrix multiply-accumulate (MMA) as a single hardware instruction. The SM80 `mma.sync.aligned.m16n8k16` instruction multiplies a 16×16 half-precision matrix by an 8×16 half-precision matrix and accumulates into a 16×8 float matrix ;  **2048 multiply-adds in a single instruction**, across all 32 threads of a warp. (Note: this instruction has a latency of 20+ cycles; "single instruction" means one issue slot, not one clock cycle. See *Hardware note ;  throughput* below.)

The catch: you can't just pass pointers. The hardware expects A, B, and C fragments to be distributed across 32 threads in a **very specific register layout**. Thread 0 holds certain rows and columns of A, thread 7 holds others, and the MMA instruction reaches into all 32 threads' registers simultaneously. Get the distribution wrong and you get garbage ;  or a hardware trap.

CuTe's `TiledMMA` handles this distribution for you, using the **exact same partition pattern** you learned with `TiledCopy`:

| TiledCopy (Tutorial 04) | TiledMMA (This Tutorial) |
| :--- | :--- |
| `make_tiled_copy(Copy_Atom, thr, val)` | `make_tiled_mma(MMA_Atom<Op>{})` |
| `get_thread_slice(tid)` | `get_thread_slice(tid)` |
| `partition_S` / `partition_D` | `partition_A` / `partition_B` / `partition_C` |
| `copy(tiled_copy, src, dst)` | `gemm(tiled_mma, fragA, fragB, fragC)` |

Same API shape, different hardware backend. If you understood Tutorial 04's ownership maps and partitions, you already know 80% of this tutorial.

> **B200 Note:** On Hopper/Blackwell, the same `TiledMMA` pattern drives WGMMA (Warpgroup MMA) ;  128 threads feeding a larger Tensor Core instruction. Tutorial 09 covers that. The API is identical; only the atom changes.

## 2. The Mental Model (The Visual)

### The Stamping Press

Think of a Tensor Core as a **stamping press** in a factory. You load raw-material trays (A and B fragments) into specific slots, press the button, and out comes the product (C accumulator). The press is fixed-size ;  it always stamps a 16×8 tile of results from 16×16 and 8×16 inputs.

```text
                        ┌─────────────────────────────┐
                        │      STAMPING PRESS          │
                        │   (One MMA Instruction)      │
                        │                              │
   ┌──────────────┐     │                              │     ┌──────────┐
   │  A (16×16)   │────▶│   32 threads cooperate:      │────▶│ C (16×8) │
   │   half       │     │   each loads its tray slot,  │     │  float   │
   │  256 values  │     │   hardware does the rest     │     │ 128 vals │
   └──────────────┘     │                              │     └──────────┘
   ┌──────────────┐     │   2048 multiply-adds         │
   │  B (8×16)    │────▶│   in a SINGLE instruction    │
   │   half       │     │                              │
   │  128 values  │     └─────────────────────────────-┘
   └──────────────┘
```

### Per-Thread Fragment Sizes

The 256 values of A, 128 values of B, and 128 values of C are split evenly across 32 threads:

```text
Thread #7 (one of 32):
┌────────────────────────────────────────────────────┐
│  Registers:                                        │
│    A fragment: 8 half values   (256 total / 32)    │
│    B fragment: 4 half values   (128 total / 32)    │
│    C fragment: 4 float values  (128 total / 32)    │
│                                                    │
│  The hardware knows exactly which matrix cells     │
│  these 8+4+4 values correspond to.                 │
│  CuTe knows too ;  that's what partition_A/B/C do.  │
└────────────────────────────────────────────────────┘
```

You don't need to know the exact register-to-coordinate mapping (it's in the PTX docs if you're curious). CuTe's partition functions handle it transparently ;  just like `TiledCopy` handles thread-to-element mapping without you memorizing a table.

### The API Flow

```text
make_tiled_mma(MMA_Atom<Op>{})
         │
    get_thread_slice(tid)
         │
    ┌────┴──────────────────┬──────────────────┐
    │                       │                  │
partition_A(sA)       partition_B(sB)     partition_C(gC)
    │                       │                  │
make_fragment_like    make_fragment_like   make_fragment_like
    │                       │                  │
copy(smem → reg)      copy(smem → reg)     clear(accum)
    │                       │                  │
    └───────────┬───────────┘                  │
                │                              │
        gemm(tiled_mma, fragA, fragB, accum)───┘
                              │
                     copy(accum → gC)
```

## 3. The Solution (The Code)

Two kernels: first, an ownership map that shows which thread owns which C element (same technique as Tutorial 04). Second, the actual micro-GEMM: load A and B into shared memory, partition them for the MMA, execute one Tensor Core instruction, and write C back to global memory.

```cpp
#include <cute/tensor.hpp>
#include <cute/arch/mma_sm80.hpp>
#include <cute/atom/mma_atom.hpp>
#include <cute/algorithm/gemm.hpp>

#include <cuda_fp16.h>
#include <cstdio>

using namespace cute;

// ─── Kernel 1: MMA Ownership Map ───
// Which thread owns which elements of the C matrix?
__global__ void mma_ownership_map()
{
    using MMA_Op = SM80_16x8x16_F32F16F16F32_TN;
    auto tiled_mma = make_tiled_mma(MMA_Atom<MMA_Op>{});

    // Shared memory to stamp thread IDs into C positions
    __shared__ int smem_C[16 * 8];
    auto sC_layout = make_layout(make_shape(Int<16>{}, Int<8>{}));
    auto sC = make_tensor(make_smem_ptr(smem_C), sC_layout);

    // Clear
    for (int i = threadIdx.x; i < 128; i += blockDim.x)
        smem_C[i] = -1;
    __syncthreads();

    // Each thread stamps its ID into its owned C cells
    auto thr_mma = tiled_mma.get_thread_slice(threadIdx.x);
    auto tCsC = thr_mma.partition_C(sC);

    for (int i = 0; i < size(tCsC); ++i)
        tCsC(i) = threadIdx.x;
    __syncthreads();

    // Thread 0 prints the map
    if (threadIdx.x == 0) {
        printf("MMA C Ownership Map (16x8) ;  SM80_16x8x16_F32F16F16F32_TN\n");
        printf("Each thread holds 4 float accumulators.\n\n");

        printf("       ");
        for (int n = 0; n < 8; ++n) printf("n=%-4d", n);
        printf("\n");

        for (int m = 0; m < 16; ++m) {
            printf("m=%-4d ", m);
            for (int n = 0; n < 8; ++n)
                printf("T%02d   ", sC(m, n));
            printf("\n");
        }
    }
}

// ─── Kernel 2: Micro-GEMM ───
// One warp computes C(16×8) += A(16×16) × B(8×16)^T using a single MMA instruction.
__global__ void micro_gemm(half const* A_ptr, half const* B_ptr, float* C_ptr)
{
    using MMA_Op = SM80_16x8x16_F32F16F16F32_TN;
    auto tiled_mma = make_tiled_mma(MMA_Atom<MMA_Op>{});

    // ── Shared memory ──
    __shared__ half smem_A[16 * 16];   // M × K
    __shared__ half smem_B[8 * 16];    // N × K

    // A layout: (M,K) = (16,16), K stride-1 (required by TN atom)
    auto sA_layout = make_layout(make_shape(Int<16>{}, Int<16>{}),
                                 make_stride(Int<16>{}, Int<1>{}));
    // B layout: (N,K) = (8,16), K stride-1
    auto sB_layout = make_layout(make_shape(Int<8>{}, Int<16>{}),
                                 make_stride(Int<16>{}, Int<1>{}));
    // C layout: (M,N) = (16,8), column-major
    auto gC_layout = make_layout(make_shape(Int<16>{}, Int<8>{}));

    auto sA = make_tensor(make_smem_ptr(smem_A), sA_layout);
    auto sB = make_tensor(make_smem_ptr(smem_B), sB_layout);
    auto gC = make_tensor(make_gmem_ptr(C_ptr),  gC_layout);

    // ── 1. Load A, B from global to shared (simple loop ;  not TiledCopy) ──
    auto gA = make_tensor(make_gmem_ptr(A_ptr), sA_layout);
    auto gB = make_tensor(make_gmem_ptr(B_ptr), sB_layout);

    for (int i = threadIdx.x; i < size(sA); i += blockDim.x)
        sA(i) = gA(i);
    for (int i = threadIdx.x; i < size(sB); i += blockDim.x)
        sB(i) = gB(i);
    __syncthreads();

    // ── 2. Partition A, B, C for this thread ──
    auto thr_mma = tiled_mma.get_thread_slice(threadIdx.x);

    auto tCsA = thr_mma.partition_A(sA);    // (MMA, MMA_M, MMA_K) = (8, 1, 1)
    auto tCsB = thr_mma.partition_B(sB);    // (MMA, MMA_N, MMA_K) = (4, 1, 1)
    auto tCgC = thr_mma.partition_C(gC);    // (MMA, MMA_M, MMA_N) = (4, 1, 1)

    // ── 3. Create register fragments ──
    auto tCrA = make_fragment_like(tCsA);   // 8 half regs for A
    auto tCrB = make_fragment_like(tCsB);   // 4 half regs for B
    auto accum = make_fragment_like(tCgC);  // 4 float regs for C
    clear(accum);

    // ── 4. Copy A, B from shared memory to registers ──
    copy(tCsA, tCrA);
    copy(tCsB, tCrB);

    // ── 5. Execute the MMA! ──
    gemm(tiled_mma, tCrA, tCrB, accum);

    // ── 6. Write result back to global memory ──
    copy(accum, tCgC);
}

int main()
{
    constexpr int M = 16, N = 8, K = 16;

    // ─── 1. Ownership Map ───
    printf("=== MMA Ownership Map ===\n\n");
    mma_ownership_map<<<1, 32>>>();
    cudaDeviceSynchronize();

    // ─── 2. Micro-GEMM ───
    printf("\n=== Micro-GEMM: C(16x8) = A(16x16) * B(8x16)^T ===\n\n");

    // Test data: A = all ones, B = all ones → C[m][n] = K = 16.0
    half h_A[M * K], h_B[N * K];
    float h_C[M * N] = {};

    for (int i = 0; i < M * K; ++i) h_A[i] = __float2half(1.0f);
    for (int i = 0; i < N * K; ++i) h_B[i] = __float2half(1.0f);

    half *d_A, *d_B;
    float *d_C;
    cudaMalloc(&d_A, M * K * sizeof(half));
    cudaMalloc(&d_B, N * K * sizeof(half));
    cudaMalloc(&d_C, M * N * sizeof(float));

    cudaMemcpy(d_A, h_A, M * K * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, N * K * sizeof(half), cudaMemcpyHostToDevice);

    micro_gemm<<<1, 32>>>(d_A, d_B, d_C);
    cudaDeviceSynchronize();

    cudaMemcpy(h_C, d_C, M * N * sizeof(float), cudaMemcpyDeviceToHost);

    // C is column-major: element (m,n) is at offset m + M*n
    printf("C (16x8) ;  each entry = K * 1.0 * 1.0 = 16.0:\n\n");
    printf("       ");
    for (int n = 0; n < N; ++n) printf("n=%-6d", n);
    printf("\n");
    for (int m = 0; m < M; ++m) {
        printf("m=%-4d ", m);
        for (int n = 0; n < N; ++n)
            printf("%-8.1f", h_C[m + M * n]);
        printf("\n");
    }

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
```

**Expected Output:**

```text
=== MMA Ownership Map ===

MMA C Ownership Map (16x8) ;  SM80_16x8x16_F32F16F16F32_TN
Each thread holds 4 float accumulators.

       n=0   n=1   n=2   n=3   n=4   n=5   n=6   n=7
m=0    T00   T00   T04   T04   T08   T08   T12   T12
m=1    T01   T01   T05   T05   T09   T09   T13   T13
m=2    T02   T02   T06   T06   T10   T10   T14   T14
m=3    T03   T03   T07   T07   T11   T11   T15   T15
m=4    T16   T16   T20   T20   T24   T24   T28   T28
m=5    T17   T17   T21   T21   T25   T25   T29   T29
m=6    T18   T18   T22   T22   T26   T26   T30   T30
m=7    T19   T19   T23   T23   T27   T27   T31   T31
m=8    T00   T00   T04   T04   T08   T08   T12   T12
m=9    T01   T01   T05   T05   T09   T09   T13   T13
m=10   T02   T02   T06   T06   T10   T10   T14   T14
m=11   T03   T03   T07   T07   T11   T11   T15   T15
m=12   T16   T16   T20   T20   T24   T24   T28   T28
m=13   T17   T17   T21   T21   T25   T25   T29   T29
m=14   T18   T18   T22   T22   T26   T26   T30   T30
m=15   T19   T19   T23   T23   T27   T27   T31   T31

=== Micro-GEMM: C(16x8) = A(16x16) * B(8x16)^T ===

C (16x8) ;  each entry = K * 1.0 * 1.0 = 16.0:

       n=0     n=1     n=2     n=3     n=4     n=5     n=6     n=7
m=0    16.0    16.0    16.0    16.0    16.0    16.0    16.0    16.0
m=1    16.0    16.0    16.0    16.0    16.0    16.0    16.0    16.0
...
m=15   16.0    16.0    16.0    16.0    16.0    16.0    16.0    16.0
```

The ownership map reveals the hardware's register distribution: threads 0–3 and 16–19 alternate in 4-row blocks along M, and each thread owns pairs of adjacent columns. Thread T00 holds C[0][0], C[0][1], C[8][0], C[8][1] ;  four floats scattered across two 4-row blocks. You don't need to memorize this pattern; `partition_C` handles it automatically.

The GEMM result is 16.0 everywhere: each entry is the dot product of a length-16 all-ones row from A with a length-16 all-ones row from B. One instruction, 2048 multiply-adds, 32 threads cooperating through their registers. (The instruction has 20+ cycles of latency, but it occupies only one issue slot ;  so pipelining multiple MMAs is how real kernels hide that cost.)

## 4. Step-by-Step Explanation

**Line: `using MMA_Op = SM80_16x8x16_F32F16F16F32_TN;`**

This names the PTX instruction we want: `mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32`. The naming convention:

| Part | Meaning |
| :--- | :--- |
| `SM80` | Ampere architecture (compute capability 8.0+) |
| `16x8x16` | M × N × K ;  the tile dimensions of one MMA instruction |
| `F32F16F16F32` | Types: D=float, A=half, B=half, C=float (D, A, B, C order) |
| `TN` | A is K-major (Transposed), B is K-major (Normal) |

The "TN" layout is the most common: both A and B have their K dimension as stride-1. This means your shared memory layouts for A(M,K) and B(N,K) should both have K as the fast-moving axis.

**Line: `auto tiled_mma = make_tiled_mma(MMA_Atom<MMA_Op>{});`**

Wraps the raw hardware instruction in CuTe's `MMA_Atom`, then promotes it to a `TiledMMA`. With no extra tiling arguments, this creates the simplest possible MMA ;  one atom, one warp, no repetition. The tile size is exactly 16×8×16.

Compare with Tutorial 04's `make_tiled_copy(Copy_Atom, thr_layout, val_layout)`. For MMA, the thread layout and value layout are determined by the hardware instruction itself ;  there's nothing to choose. The atom *is* the complete plan.

**Lines: `sA_layout` with `make_stride(Int<16>{}, Int<1>{})`**

A is (M, K) = (16, 16) with K stride-1. This means `sA(m, k) = m * 16 + k * 1` ;  moving along K is contiguous in memory. The TN atom requires this: the PTX instruction's "row" descriptor for A means K is the fast index.

If you accidentally used column-major (M stride-1, K stride-16), the MMA would read the wrong values ;  you'd be feeding columns of A where it expects rows.

**Lines: `sB_layout` with `make_stride(Int<16>{}, Int<1>{})`**

B is (N, K) = (8, 16), also K stride-1. Same reason: the TN atom's "col" descriptor for B also means K is the fast index. Both A and B want K contiguous.

> This is why "TN" is popular: when A and B are both K-major, your global memory loads can be coalesced along the K dimension. You'll see this pattern again in Tutorial 07's full GEMM.

**Line: `auto tCsA = thr_mma.partition_A(sA);  // (MMA, MMA_M, MMA_K) = (8, 1, 1)`**

`partition_A` asks: "Given the full shared memory tensor `sA` and the MMA's register layout, which elements belong to this thread?" It returns a 3-mode tensor:

- **Mode 0 (MMA):** The 8 half values this thread feeds into one atom invocation.
- **Mode 1 (MMA_M):** How many times the atom repeats along M. Our A has M=16 and the atom handles M=16, so 1 repetition.
- **Mode 2 (MMA_K):** How many times the atom repeats along K. Our K=16 matches the atom's K=16, so 1 repetition.

Compare with Tutorial 04: `partition_S` returned `(ValuesPerAtom, Reps...)`. Same idea, different dimension names.

**Line: `auto tCrA = make_fragment_like(tCsA);`**

Creates a register tensor with the same shape as the smem partition: `(8, 1, 1)` of half-precision values. These 8 registers are this thread's personal tray ;  the slot in the stamping press that it's responsible for loading.

**Line: `copy(tCsA, tCrA);`**

Copies this thread's 8 half values from shared memory to registers. Each thread reads from different smem addresses (determined by `partition_A`), so there's no conflict. After this line, all 32 threads have their A fragments loaded and ready.

**Line: `gemm(tiled_mma, tCrA, tCrB, accum);`**

This is the magic line. `gemm()` dispatches the PTX `mma.sync` instruction, which:

1. Reads A fragments from all 32 threads' registers (8 halfs each = 256 total = 16×16 matrix).
2. Reads B fragments from all 32 threads' registers (4 halfs each = 128 total = 8×16 matrix).
3. Multiplies the 16×16 matrix by the 8×16 matrix (as C += A × B^T).
4. Accumulates the 16×8 result into all 32 threads' C registers (4 floats each = 128 total).

This is a single instruction, but **not** a single clock cycle ;  the MMA has a latency of 20+ cycles. However, it only occupies one issue slot, so the warp scheduler can issue other independent instructions (or another MMA from a different warp) while the Tensor Core is still working. Real GEMM kernels pipeline multiple MMAs to hide this latency. The `.sync` in the PTX name means all threads in the warp participate simultaneously ;  this is a collective operation, not a per-thread one.

**Line: `copy(accum, tCgC);`**

Copies the 4 float accumulator values from registers to their corresponding positions in global memory C. Each thread writes to different addresses (determined by `partition_C`), so no conflicts. The host then reads back the complete 16×8 result.

## 5. Engineer's Notebook (Latent Space Notes)

**Analogy:** `MMA_Atom` is a **stamping press** ;  a fixed-size hardware machine that takes raw-material trays (A and B register fragments), processes them all at once, and stamps out a product (C accumulator). Each worker (thread) loads their assigned tray slot. `TiledMMA` is the factory floor plan that coordinates the workers. `gemm()` is pressing the button.

This analogy extends the Tutorial 03–04 warehouse metaphor: data arrives from the truck (global memory) via the loading dock (TiledCopy + smem), and now the stamping press (TiledMMA) processes it in the factory. The same workers (threads) do both jobs ;  they just switch from mover role to machine-operator role.

**Fragment size formula:**

For any MMA atom with shape M×N×K using W threads:

| Fragment | Per-thread values | Total |
| :--- | :--- | :--- |
| A | M × K / W | M × K |
| B | N × K / W | N × K |
| C | M × N / W | M × N |

For SM80_16x8x16 with W=32: A = 8 halfs, B = 4 halfs, C = 4 floats.

**What "TN" means for your layouts:**

| Layout | A (M, K) | B (N, K) |
| :--- | :--- | :--- |
| **TN** | K stride-1 `(K, 1)` | K stride-1 `(K, 1)` |
| TT | K stride-1 `(K, 1)` | N stride-1 `(1, N)` |
| NN | M stride-1 `(1, M)` | N stride-1 `(1, N)` |
| NT | M stride-1 `(1, M)` | K stride-1 `(K, 1)` |

TN is the most common because both A and B have K contiguous ;  global memory loads along the K-reduction dimension are coalesced.

**The ownership map tells you something important:**

Thread T00 owns C[0][0], C[0][1], C[8][0], C[8][1]. These four values are *not* contiguous in column-major memory (offsets 0, 16, 8, 24). This means writing the C accumulator to global memory via `copy(accum, tCgC)` produces scattered stores ;  each thread writes 4 separate floats to non-contiguous addresses. For a real GEMM kernel (Tutorial 07), the epilogue stage would use a TiledCopy to reorganize the writes through shared memory for coalesced global stores.

**Hardware note ;  checking your GPU:**

The SM80 MMA atom requires compute capability 8.0 or higher (A100, RTX 3090, etc.). To check:

```bash
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
# Should print 8.0 or higher
```

**Hardware note ;  latency vs. throughput:**

A single `mma.sync.m16n8k16` has a **latency** of 20+ clock cycles ;  the result isn't ready to read for that long. But the **maximum issue frequency** (throughput) is much higher than one-per-20-cycles, because the Tensor Core is pipelined. The exact throughput depends on the GPU architecture:

| GPU | Max issue frequency (per SM partition) |
| :--- | :--- |
| A100 (SM80) | 1 MMA per 8 cycles |
| Consumer Ampere (e.g. RTX 3090) | 1 MMA per 16 cycles |
| Some other consumer GPUs | 1 MMA per 32 cycles |

On A100, each SM has 4 partitions (sub-partitions / warp schedulers), so the SM-wide throughput is one MMA per 2 cycles, or 2048 FMAs every 2 cycles. Multiply by 108 SMs and a 1.41 GHz clock and you get the advertised Tensor Core TFLOPS ;  but only if the data pipeline (Tutorials 03–05) keeps the Tensor Cores fed and you have enough independent MMAs in flight to hide the 20+ cycle latency.

> **Gotcha ;  layout mismatch:** If your smem layout doesn't match the TN convention (K stride-1), the MMA will read the wrong elements. The symptom is silently wrong results, not a crash. Always double-check that your A and B shared memory strides put K as stride-1 for TN atoms.

> **Gotcha ;  half precision:** The input matrices *must* be `half` (F16). If you accidentally store `float` values in smem and cast the pointer to `half*`, you'll get garbage. Use `__float2half()` to convert before storing.

**What comes next:** This tutorial computed a single 16×8 tile using one MMA instruction. But real matrices are much larger than 16×8. Tutorial 07 (The Global GEMM) shows how to tile over M, N, and K ;  looping the MMA across tiles, piping data from global memory through shared memory into the Tensor Core in a continuous stream.

---

# The Global GEMM — Putting It All Together

- URL: https://www.dcbaslani.xyz/blog/07_the_global_gemm/
- Markdown: https://www.dcbaslani.xyz/blog/07_the_global_gemm/index.md
- Text: https://www.dcbaslani.xyz/blog/07_the_global_gemm/index.txt
- Published: 2026-03-07
- Topic: CUDA
- Description: Writing a complete three-level tiled GEMM kernel from scratch using CuTe's TiledCopy, TiledMMA, and swizzled shared memory.

# The Global GEMM ;  Putting It All Together

**Difficulty:** Intermediate–Advanced
**Prerequisites:** [Tutorial 02: Partitioning](https://www.dcbaslani.xyz/blog/02_the_art_of_slicing/), [Tutorial 04: TiledCopy](https://www.dcbaslani.xyz/blog/04_the_parallel_copy/), [Tutorial 05: Swizzling](https://www.dcbaslani.xyz/blog/05_swizzling/), [Tutorial 06: Hello, MMA](https://www.dcbaslani.xyz/blog/06_hello_mma/)

## 1. The Problem (The "Why")

Tutorial 06 triggered a single Tensor Core instruction on a 16×8×16 (M×N×K) micro-tile. That's 2048 multiply-adds ;  impressive for one instruction, but real-world matrices are thousands of elements wide. You need to compute C(M×N) = A(M×K) × Bᵀ(N×K) where M, N, and K can each be 4096 or more.

The answer isn't one giant instruction. It's **tiling** ;  the same idea from Tutorial 02, now applied at every level:

| Level | What's tiled | Controlled by |
| :--- | :--- | :--- |
| **CTA grid** | Output C is divided into BLK_M×BLK_N tiles, one per thread block | `local_tile` + `blockIdx` |
| **K-loop** | The reduction dimension K is divided into BLK_K chunks, processed sequentially | The mainloop `for` loop |
| **Warp MMA** | Each BLK_M×BLK_N tile is further divided by the `TiledMMA` across 4 warps | `partition_A/B/C` + `gemm()` |

This is the structure of every high-performance GEMM on GPUs ;  from CUTLASS to cuBLAS. The names change but the three-level tiling is always the same. This tutorial writes that kernel from scratch, using nothing but the CuTe primitives you already know.

> **B200 Note:** On Hopper/Blackwell, the structure is identical but the data-movement layer changes: TiledCopy with `cp.async` or TMA replaces explicit loads, and the K-loop becomes a software pipeline (Tutorial 10). The compute layer (`TiledMMA` + `gemm()`) stays the same.

## 2. The Mental Model (The Visual)

### Running the Factory for a Full Production Day

Tutorials 03–06 built the factory:
- **Loading dock** (TiledCopy) ;  moves raw materials from the truck (global memory) into the warehouse (shared memory), using dollies (vectorized loads) and staggered shelving (swizzle).
- **Stamping press** (TiledMMA) ;  processes materials from the warehouse shelves into finished products (C accumulator).

Now we run the factory for a full **production day**:

```text
                    Global Memory (The Truck)
    ┌────────────────────────────────────────────────┐
    │  A (M × K)                  B (N × K)          │
    └──────────┬─────────────────────┬───────────────┘
               │                     │
    ═══════════╪═════════════════════╪══════ K-loop (the shift) ════
    ║          ▼                     ▼                             ║
    ║   ┌─────────────┐      ┌─────────────┐                      ║
    ║   │ A tile      │      │ B tile      │  ← TiledCopy         ║
    ║   │ (128 × 32)  │      │ (128 × 32)  │    gmem → smem       ║
    ║   └──────┬──────┘      └──────┬──────┘                      ║
    ║          │    Shared Memory    │                              ║
    ║          ▼    (swizzled)       ▼                              ║
    ║   ┌─────────────────────────────────┐                        ║
    ║   │         STAMPING PRESS          │  ← TiledMMA            ║
    ║   │    partition → fragment → gemm  │    smem → regs → MMA   ║
    ║   └────────────────┬────────────────┘                        ║
    ║                    │                                          ║
    ║              accum += partial C                               ║
    ║                                                              ║
    ═══════════════════════════════════════ repeat for each K tile ══

                    │
                    ▼  Epilogue (end of shift ;  ship the product)
             ┌──────────┐
             │  C tile   │  ← copy accum → gmem
             │ (128×128) │
             └──────────┘
```

Each thread block is an independent factory. The grid of blocks tiles the M×N output:

```text
Grid of CTAs over the output C (M × N):

         N →
    ┌─────────┬─────────┬─────────┬───┐
    │ CTA     │ CTA     │ CTA     │   │
  M │ (0,0)   │ (0,1)   │ (0,2)   │...│  Each CTA computes
  ↓ │ 128×128 │ 128×128 │ 128×128 │   │  one 128×128 tile of C
    ├─────────┼─────────┼─────────┼───┤
    │ CTA     │ CTA     │ CTA     │   │
    │ (1,0)   │ (1,1)   │ (1,2)   │...│
    │ 128×128 │ 128×128 │ 128×128 │   │
    ├─────────┼─────────┼─────────┼───┤
    │  ...    │  ...    │  ...    │   │
    └─────────┴─────────┴─────────┴───┘

grid = (M / 128, N / 128)
```

### The Three Phases

```text
Phase 1: COPY (the loading dock)
  gmem A[BLK_M × BLK_K] ──TiledCopy──▶ smem_A (swizzled)
  gmem B[BLK_N × BLK_K] ──TiledCopy──▶ smem_B (swizzled)
  __syncthreads()

Phase 2: COMPUTE (the stamping press)
  smem_A ──partition_A──▶ registers ──┐
                                      ├──gemm()──▶ accum += partial
  smem_B ──partition_B──▶ registers ──┘
  __syncthreads()

Phase 3: EPILOGUE (shipping)
  accum ──partition_C──▶ gmem C[BLK_M × BLK_N]
```

Phases 1 and 2 repeat `ceil(K / BLK_K)` times. Phase 3 happens once.

## 3. The Solution (The Code)

A complete GEMM kernel: `C (M×N, float) += A (M×K, half) × B^T (N×K, half)`. Uses 128 threads (4 warps), 128×128 CTA tiles, and SM80 Tensor Cores.

**Constraints:** M and N must be multiples of 128; K must be a multiple of 32. (Removing these constraints requires boundary masking ;  important for production, orthogonal to the core GEMM structure.)

```cpp
#include <cute/tensor.hpp>
#include <cute/arch/mma_sm80.hpp>
#include <cute/atom/mma_atom.hpp>
#include <cute/atom/copy_atom.hpp>
#include <cute/algorithm/gemm.hpp>
#include <cute/swizzle.hpp>

#include <cuda_fp16.h>
#include <cstdio>
#include <cstdlib>
#include <cmath>

using namespace cute;

// Tile sizes
constexpr int BLK_M = 128;
constexpr int BLK_N = 128;
constexpr int BLK_K = 32;
constexpr int NUM_THREADS = 128;  // 4 warps

__global__ void gemm_kernel(
    half const* __restrict__ A_ptr,
    half const* __restrict__ B_ptr,
    float*      __restrict__ C_ptr,
    int M, int N, int K)
{
    // SETUP: Layouts, TiledCopy, TiledMMA

    // Global memory tensors
    // A: (M, K) with K stride-1   B: (N, K) with K stride-1
    // C: (M, N) with N stride-1
    auto mA = make_tensor(make_gmem_ptr(A_ptr), make_shape(M, K), make_stride(K, Int<1>{}));
    auto mB = make_tensor(make_gmem_ptr(B_ptr), make_shape(N, K), make_stride(K, Int<1>{}));
    auto mC = make_tensor(make_gmem_ptr(C_ptr), make_shape(M, N), make_stride(N, Int<1>{}));

    // CTA tiling: each block claims its tile of A, B, C
    auto cta_coord = make_coord(blockIdx.x, blockIdx.y);

    // gA: (BLK_M, BLK_K, k_tiles) ;  this block's strip of A, sliced along K
    auto gA = local_tile(mA, make_shape(Int<BLK_M>{}, Int<BLK_K>{}),
                          make_coord(get<0>(cta_coord), _));
    // gB: (BLK_N, BLK_K, k_tiles)
    auto gB = local_tile(mB, make_shape(Int<BLK_N>{}, Int<BLK_K>{}),
                          make_coord(get<1>(cta_coord), _));
    // gC: (BLK_M, BLK_N) ;  this block's output tile
    auto gC = local_tile(mC, make_shape(Int<BLK_M>{}, Int<BLK_N>{}), cta_coord);

    // Shared memory layouts: K stride-1, swizzled
    // Swizzle<3,3,3>: M=3 → 8 halfs (128 bits) stay contiguous for vectorized loads
    auto sA_layout = composition(Swizzle<3, 3, 3>{},
        make_layout(make_shape(Int<BLK_M>{}, Int<BLK_K>{}),
                    make_stride(Int<BLK_K>{}, Int<1>{})));
    auto sB_layout = composition(Swizzle<3, 3, 3>{},
        make_layout(make_shape(Int<BLK_N>{}, Int<BLK_K>{}),
                    make_stride(Int<BLK_K>{}, Int<1>{})));

    // Shared memory allocation (static)
    __shared__ half smem_A[cosize_v<decltype(sA_layout)>];
    __shared__ half smem_B[cosize_v<decltype(sB_layout)>];

    auto sA = make_tensor(make_smem_ptr(smem_A), sA_layout);
    auto sB = make_tensor(make_smem_ptr(smem_B), sB_layout);

    // TiledCopy: gmem → smem via cp.async
    // Each atom loads 128 bits = 8 halfs along the K (stride-1) direction.
    // 128 threads, one per M-row. Each thread loads 32 K-values (4 × 128-bit async copies).
    auto tiled_copy = make_tiled_copy(
        Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<uint128_t>, half>{},
        Layout<Shape<_128, _1>>{},    // 128 threads all in M
        Layout<Shape<_1, _32>>{}      // 32 K-values per thread (4 atoms of 8 halfs)
    );
    // Coverage: (128 × 1, 1 × 32) = (128, 32) ✓

    // TiledMMA: the Tensor Core plan
    // SM80_16x8x16 atom, 4 warps arranged (2,2) in M×N.
    // Base coverage: (2×16, 2×8) = (32, 16) per atom group.
    // For 128×128: gemm() iterates 4× in M, 8× in N, 2× in K automatically.
    auto tiled_mma = make_tiled_mma(
        MMA_Atom<SM80_16x8x16_F32F16F16F32_TN>{},
        Layout<Shape<_2, _2, _1>>{}   // 4 warps: 2 in M, 2 in N
    );

    // PARTITIONS: thread-level views

    // Copy partitions
    auto thr_copy = tiled_copy.get_thread_slice(threadIdx.x);
    auto tAgA = thr_copy.partition_S(gA);   // (CPY_VEC, CPY_M, CPY_K, k_tiles)
    auto tAsA = thr_copy.partition_D(sA);   // (CPY_VEC, CPY_M, CPY_K)
    auto tBgB = thr_copy.partition_S(gB);   // same structure for B
    auto tBsB = thr_copy.partition_D(sB);

    // MMA partitions
    auto thr_mma = tiled_mma.get_thread_slice(threadIdx.x);
    auto tCsA = thr_mma.partition_A(sA);    // (MMA_VAL, MMA_M, MMA_K)
    auto tCsB = thr_mma.partition_B(sB);    // (MMA_VAL, MMA_N, MMA_K)
    auto tCgC = thr_mma.partition_C(gC);    // (MMA_VAL, MMA_M, MMA_N)

    // Register fragments
    auto tCrA  = make_fragment_like(tCsA);
    auto tCrB  = make_fragment_like(tCsB);
    auto accum = make_fragment_like(tCgC);  // C accumulator (float)
    clear(accum);

    // MAINLOOP: iterate over K tiles

    int k_tiles = size<2>(gA);   // = K / BLK_K

    for (int k = 0; k < k_tiles; ++k) {

        // Phase 1: COPY gmem → smem
        copy(tiled_copy, tAgA(_, _, _, k), tAsA);
        copy(tiled_copy, tBgB(_, _, _, k), tBsB);
        cp_async_fence();
        cp_async_wait<0>();
        __syncthreads();

        // Phase 2: COMPUTE smem → registers → MMA
        copy(tCsA, tCrA);
        copy(tCsB, tCrB);
        gemm(tiled_mma, tCrA, tCrB, accum);

        __syncthreads();
    }

    // EPILOGUE: write C accumulator to global memory

    copy(accum, tCgC);
}

// Host: verification + timing
int main()
{
    constexpr int M = 512, N = 512, K = 256;

    // Allocate and initialize
    int sizeA = M * K, sizeB = N * K, sizeC = M * N;

    half *h_A = new half[sizeA];
    half *h_B = new half[sizeB];
    float *h_C = new float[sizeC];

    // A = all 1.0h, B = all 1.0h → C[m][n] = K
    for (int i = 0; i < sizeA; ++i) h_A[i] = __float2half(1.0f);
    for (int i = 0; i < sizeB; ++i) h_B[i] = __float2half(1.0f);

    half *d_A, *d_B;
    float *d_C;
    cudaMalloc(&d_A, sizeA * sizeof(half));
    cudaMalloc(&d_B, sizeB * sizeof(half));
    cudaMalloc(&d_C, sizeC * sizeof(float));
    cudaMemcpy(d_A, h_A, sizeA * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, sizeB * sizeof(half), cudaMemcpyHostToDevice);

    // Launch
    dim3 grid(M / BLK_M, N / BLK_N);   // (4, 4) for 512×512
    dim3 block(NUM_THREADS);             // 128

    gemm_kernel<<<grid, block>>>(d_A, d_B, d_C, M, N, K);
    cudaDeviceSynchronize();

    // Verify
    cudaMemcpy(h_C, d_C, sizeC * sizeof(float), cudaMemcpyDeviceToHost);

    float max_err = 0.0f;
    float expected = float(K);  // K * 1.0 * 1.0
    for (int i = 0; i < sizeC; ++i)
        max_err = fmaxf(max_err, fabsf(h_C[i] - expected));

    printf("GEMM: C(%d x %d) = A(%d x %d) * B^T(%d x %d)\n", M, N, M, K, N, K);
    printf("Tile: %d x %d x %d, Threads: %d\n\n", BLK_M, BLK_N, BLK_K, NUM_THREADS);

    // Print a 4×4 corner
    printf("C[0:4][0:4] (expected %.0f everywhere):\n", expected);
    for (int m = 0; m < 4; ++m) {
        printf("  ");
        for (int n = 0; n < 4; ++n)
            printf("%8.1f", h_C[m * N + n]);
        printf("\n");
    }
    printf("\nMax absolute error: %e\n", max_err);
    printf("Result: %s\n", max_err < 1e-3f ? "PASS" : "FAIL");

    // Timing
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int warmup = 5, iters = 20;

    for (int i = 0; i < warmup; ++i)
        gemm_kernel<<<grid, block>>>(d_A, d_B, d_C, M, N, K);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        gemm_kernel<<<grid, block>>>(d_A, d_B, d_C, M, N, K);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    float avg_ms = ms / iters;
    double flops = 2.0 * M * N * K;
    double tflops = (flops / (avg_ms / 1000.0)) / 1e12;

    printf("\nPerformance: %.3f ms  (%.2f TFLOPS)\n", avg_ms, tflops);

    // Cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    delete[] h_A; delete[] h_B; delete[] h_C;
    return 0;
}
```

**Expected Output:**

```text
GEMM: C(512 x 512) = A(512 x 256) * B^T(512 x 256)
Tile: 128 x 128 x 32, Threads: 128

C[0:4][0:4] (expected 256 everywhere):
       256.0   256.0   256.0   256.0
       256.0   256.0   256.0   256.0
       256.0   256.0   256.0   256.0
       256.0   256.0   256.0   256.0

Max absolute error: 0.000000e+00
Result: PASS

Performance: X.XXX ms  (Y.YY TFLOPS)
```

Each entry of C is 256.0: the dot product of a length-256 all-ones row from A with a length-256 all-ones row from B. The K-loop ran `256 / 32 = 8` iterations, each accumulating a BLK_K=32 partial product. After 8 iterations, the accumulator held the full result.

## 4. Step-by-Step Explanation

### Setup Phase

**Lines: Global tensor creation with `make_tensor`**

```cpp
auto mA = make_tensor(make_gmem_ptr(A_ptr), make_shape(M, K), make_stride(K, Int<1>{}));
```

Creates a CuTe tensor wrapping global memory. The stride `(K, 1)` means K is stride-1 ;  the TN layout required by our MMA atom (Tutorial 06). Note that `K` is a runtime value while `Int<1>{}` is static. CuTe handles mixed static/dynamic strides.

**Lines: CTA tiling with `local_tile`**

```cpp
auto gA = local_tile(mA, make_shape(Int<BLK_M>{}, Int<BLK_K>{}),
                      make_coord(get<0>(cta_coord), _));
```

This is Tutorial 02's `local_tile` at the CTA level ;  the casino floor manager assigning tables:
- Divides A's M dimension into chunks of BLK_M (128). Block `blockIdx.x` claims its chunk.
- Divides A's K dimension into chunks of BLK_K (32). The underscore `_` means "give me all K-tiles as a third mode."
- Result shape: `(128, 32, k_tiles)` ;  this block's full strip of A, with `k_tiles = K / BLK_K` slices along K.

For C, both coordinates are fixed (no underscore):
```cpp
auto gC = local_tile(mC, make_shape(Int<BLK_M>{}, Int<BLK_N>{}), cta_coord);
```
Result shape: `(128, 128)` ;  this block's one output tile.

**Lines: Shared memory with swizzle**

```cpp
auto sA_layout = composition(Swizzle<3, 3, 3>{},
    make_layout(make_shape(Int<128>{}, Int<32>{}), make_stride(Int<32>{}, Int<1>{})));
```

The base layout is `(128, 32):(32, 1)` ;  K stride-1, same as gmem. `composition(Swizzle<3,3,3>{}, ...)` applies the XOR remapping from Tutorial 05. The parameters: M=3 preserves 8-half (128-bit) contiguity for vectorized stores, B=3 scrambles 8-bank groups, S=3 separates the bit fields. This is the standard swizzle for half-precision data.

**Lines: TiledCopy construction**

```cpp
auto tiled_copy = make_tiled_copy(
    Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<uint128_t>, half>{},
    Layout<Shape<_128, _1>>{},
    Layout<Shape<_1, _32>>{}
);
```

This is Tutorial 04's supervisor's clipboard, now sized for the GEMM tile:
- **Copy_Atom:** `SM80_CP_ASYNC_CACHEALWAYS` ;  hardware async copy, 128 bits = 8 halfs per transaction along K (stride-1). Data goes directly from gmem to smem, bypassing registers.
- **thr_layout `(128, 1)`:** All 128 threads assigned to the M dimension ;  one thread per row.
- **val_layout `(1, 32)`:** Each thread copies 32 K-values (4 atoms × 8 halfs each).
- **Coverage:** `(128 × 1, 1 × 32) = (128, 32)` ;  exactly one smem tile. Each K-loop iteration fills the tile with one `copy()` call per matrix.

**Lines: TiledMMA construction**

```cpp
auto tiled_mma = make_tiled_mma(
    MMA_Atom<SM80_16x8x16_F32F16F16F32_TN>{},
    Layout<Shape<_2, _2, _1>>{}
);
```

Tutorial 06's stamping press, now with multiple warps. The second argument arranges 4 warps in a 2×2 grid over M and N:
- Base atom covers 16M × 8N × 16K.
- 2×2 warps → combined base: 32M × 16N × 16K.
- For the 128×128×32 smem tile, `gemm()` internally loops 4× in M (128/32), 8× in N (128/16), and 2× in K (32/16). You don't write these loops ;  `gemm()` handles them.

### Partition Phase

**Lines: Copy partitions**

```cpp
auto tAgA = thr_copy.partition_S(gA);   // (CPY_VEC, CPY_M, CPY_K, k_tiles)
auto tAsA = thr_copy.partition_D(sA);   // (CPY_VEC, CPY_M, CPY_K)
```

Same as Tutorial 04: `partition_S` creates this thread's view of the source (gmem), `partition_D` for the destination (smem). The extra `k_tiles` mode on `tAgA` comes from gA's third mode ;  it passes through untouched. In the mainloop, `tAgA(_, _, _, k)` selects the k-th K-tile for copying.

**Lines: MMA partitions and fragments**

```cpp
auto tCsA = thr_mma.partition_A(sA);    // (MMA_VAL, MMA_M, MMA_K)
auto tCrA = make_fragment_like(tCsA);   // register copy
auto accum = make_fragment_like(tCgC);  // C accumulator
```

Same as Tutorial 06: `partition_A` maps smem elements to this thread's MMA fragment slots. `make_fragment_like` creates a register tensor with the same shape. The accumulator `accum` lives in registers for the entire kernel ;  it's never written to smem.

### Mainloop

```cpp
for (int k = 0; k < k_tiles; ++k) {
    copy(tiled_copy, tAgA(_, _, _, k), tAsA);    // gmem → smem
    copy(tiled_copy, tBgB(_, _, _, k), tBsB);
    cp_async_fence();
    cp_async_wait<0>();
    __syncthreads();

    copy(tCsA, tCrA);                            // smem → registers
    copy(tCsB, tCrB);
    gemm(tiled_mma, tCrA, tCrB, accum);          // MMA!
    __syncthreads();
}
```

Each iteration processes one BLK_K=32 slice of the K dimension:

1. **Copy** ;  TiledCopy loads A's k-th tile `(128×32)` and B's k-th tile `(128×32)` from global to shared memory. The swizzle is transparent ;  the TiledCopy writes through swizzled addresses automatically.

2. **Sync** ;  All 128 threads must finish writing to smem before any thread starts reading for the MMA.

3. **Compute** ;  Copy from swizzled smem to register fragments, then `gemm()` executes the MMA atoms. Because the tile is larger than one atom, `gemm()` internally loops over M, N, and K repetitions. The result **accumulates** into `accum` ;  each K-iteration adds to the running sum, computing the partial dot products.

4. **Sync** ;  All threads must finish reading smem before the next iteration overwrites it.

After `k_tiles` iterations, `accum` holds the complete C tile.

### Epilogue

```cpp
copy(accum, tCgC);
```

Each thread writes its accumulator fragment (128 floats for this config) to global memory via the MMA's C partition. As noted in Tutorial 06, the C fragments are scattered ;  each thread writes to non-contiguous addresses. For a production kernel, CUTLASS routes the epilogue through shared memory for coalesced writes. For correctness, the simple scattered write works fine.

## 5. Engineer's Notebook (Latent Space Notes)

**Analogy: The Full Production Day.**
- **CTA grid** = multiple independent factories, one per output tile. They never interact.
- **K-loop (mainloop)** = one production shift. Each iteration loads a batch of raw materials (A/B K-tiles via TiledCopy), processes them on the stamping press (TiledMMA + gemm), and adds the result to the running accumulator.
- **Epilogue** = end of shift. Ship the finished product (accumulated C tile) to global memory.

The three-level tiling is the single most important structural insight for GPU GEMM. Everything else ;  pipelining, double buffering, warpgroup MMA ;  is optimization on top of this skeleton.

**Tile size cheat sheet:**

| Parameter | Typical Values | Trade-off |
| :--- | :--- | :--- |
| BLK_M × BLK_N | 128×128, 128×256, 256×128 | Larger tiles → more compute per gmem load (better arithmetic intensity). But more smem usage → fewer CTAs per SM → less occupancy. |
| BLK_K | 32, 64 | Larger K → fewer mainloop iterations → less loop overhead. But more smem per tile (BLK_M × BLK_K × sizeof(half) per matrix). |
| Threads | 128 (4 warps), 256 (8 warps) | More threads → more copy bandwidth, but more MMA rep sharing overhead. |

**Why the swizzle parameters are `<3, 3, 3>` for half:**

- M=3: 2^3 = 8 halfs = 128 bits stay contiguous. Matches the 128-bit vectorized copy.
- B=3: 2^3 = 8 banks scrambled. Breaks the column-stride conflict pattern.
- S=3: Source bits `[6:9)`, target bits `[3:6)`. Non-overlapping.

Compare with Tutorial 05's `Swizzle<3, 2, 3>` for float: M=2 (4 floats = 128 bits). The only difference is the free-bit count M, which must match the element size for 128-bit vectorization.

**Why two `__syncthreads()` in the mainloop:**

```text
Iteration k:                    Iteration k+1:
  copy gmem → smem                copy gmem → smem ← OVERWRITES smem!
  ──sync── (1)                    ──sync──
  gemm reads smem                 gemm reads smem
  ──sync── (2)                    ──sync──
```

Sync (1): Ensure all threads finished writing smem before any thread reads it for gemm.
Sync (2): Ensure all threads finished reading smem before the next iteration's copy overwrites it.

Without sync (2), a fast thread could start writing the (k+1)-th A tile into smem while a slow thread is still reading the k-th A tile for its gemm.

**The epilogue bottleneck:**

Our epilogue uses `copy(accum, tCgC)` ;  each thread writes its C fragments directly to global memory with scattered stores. On A100, this wastes memory bus bandwidth because each 4-byte float write fetches a full 128-byte cache line.

CUTLASS's production epilogues:
1. Each warp writes its C fragment to smem (in a contiguous layout).
2. Threads read back from smem in a coalesced pattern and write to gmem.
3. Optional: fuse element-wise operations (bias add, ReLU) into the epilogue.

For this tutorial, the simple epilogue is correct and clear. Optimizing it is a straightforward extension.

**Performance expectations:**

This kernel will not match cuBLAS. Missing optimizations:
- No double-buffering (the copy and compute phases are serialized)
- No software pipelining (Tutorial 10)
- No TMA (Tutorial 08)
- Scattered C epilogue
- No predication for boundary tiles

On A100 with 512×512×256, expect 5–15 TFLOPS (vs. ~150 TFLOPS peak for cuBLAS with large matrices). The point is not speed ;  it's understanding the structure that every fast GEMM kernel shares.

> **Gotcha ;  dynamic vs. static shapes:** The global tensors use dynamic shapes (runtime M, N, K), but the smem layouts and tile shapes use static `Int<N>{}`. This is intentional ;  CuTe needs static shapes to generate efficient code (compile-time layout math, unrolled loops). Dynamic shapes in gmem are fine because the copy and MMA operate on the statically-shaped smem/register tiles.

> **Gotcha ;  smem capacity:** Two smem tiles of 128×32 halfs = 2 × 128 × 32 × 2 bytes = 16 KB. Well within SM80's 164 KB limit. But 128×128 tiles with BLK_K=64 would need 32 KB ;  still fine, but leaves less for double buffering. Always check: `smem_bytes = 2 × BLK_M × BLK_K × sizeof(half)`.

**What comes next:** This kernel serializes copy and compute ;  the Tensor Cores are idle while waiting for data loads. Tutorial 08 (The TMA Revolution) introduces hardware-accelerated copies that free the threads. Tutorial 10 (Pipeline Overlap) shows how to overlap copy and compute using ping-pong buffering, approaching peak throughput.

---

# The TMA Revolution (Async Copy)

- URL: https://www.dcbaslani.xyz/blog/08_the_tma_revolution/
- Markdown: https://www.dcbaslani.xyz/blog/08_the_tma_revolution/index.md
- Text: https://www.dcbaslani.xyz/blog/08_the_tma_revolution/index.txt
- Published: 2026-03-23
- Topic: CUDA
- Description: With the Hopper and Blackwell architectures, NVIDIA introduced the Tensor Memory Accelerator (TMA). Instead of having threads manually calculating pointers and copying data, a single thread can offload the entire tile copy to dedicated hardware.

# The TMA Revolution (Async Copy)

**Difficulty:** Advanced
**Prerequisites:** [Tutorial 04: The Parallel Copy](https://www.dcbaslani.xyz/blog/04_the_parallel_copy/)

## 1. The Problem (The "Why")
"The CPU is wasting time calculating addresses for copies. Let the hardware do it."

Until now, we have used `TiledCopy` to coordinate warps and threads to fetch data from global memory into shared memory. The problem? **Every single thread is computing memory addresses.** For each element loaded, threads execute instructions just to resolve `address = coord · stride`. This burns registers and arithmetic logic unit (ALU) cycles that should be spent on matrix multiplication.

With the Hopper and Blackwell architectures, NVIDIA introduced the **Tensor Memory Accelerator (TMA)**. Instead of having 128 threads manually calculating pointers and copying data, a single thread can offload the entire tile copy to dedicated hardware.

## 2. The Mental Model (The Visual)
If `TiledCopy` is a warehouse crew of 128 workers carrying boxes (where each worker calculates their path), `TMA` is an **autonomous forklift with a manifest**. 

```text
TiledCopy (threads do address math):

  128 workers compute addresses:
    address = coord · stride
  Then load/store each element.

  [threads] ──address math──▶ LDG/LDS ──▶ smem

TMA (hardware does address math):

  1 dispatcher submits a manifest (descriptor).
  The TMA engine moves the entire pallet.

  [one thread] ──submit manifest──▶ TMA engine ──▶ smem
```

You create the manifest on the host, hand it to the forklift inside the kernel, and the hardware handles the rest asynchronously.

## 3. The Solution (The Code)
Here is a minimal, complete C++ example of copying a 128×64 tile using TMA. Notice how no threads are computing layouts inside the kernel loop!

```Cpp
#include <cute/tensor.hpp>
#include <cute/arch/copy_sm90.hpp>
#include <cutlass/arch/barrier.h>
#include <cutlass/numeric_types.h>
#include <cstdio>

using namespace cute;

constexpr int BLK_M = 128;
constexpr int BLK_K = 64;

// Shared memory workspace
struct SharedStorage {
    alignas(128) cutlass::half_t smem[BLK_M * BLK_K];
    uint64_t tma_barrier;
};

template <class TmaDesc>
__global__ void tma_copy_kernel(TmaDesc tma) {
    __shared__ SharedStorage ss;

    // 1. Describe the shared memory target
    auto sA_layout = make_layout(make_shape(Int<BLK_M>{}, Int<BLK_K>{}),
                                 make_stride(Int<BLK_K>{}, Int<1>{}));
    auto sA = make_tensor(make_smem_ptr(ss.smem), sA_layout);

    // 2. Fetch the global memory tensor from the TMA descriptor
    auto gA = tma.get_tma_tensor(make_shape(Int<BLK_M>{}, Int<BLK_K>{}));

    // 3. Partition the TMA copy
    auto [tAgA, tAsA] = tma_partition(
        tma,
        Int<0>{},             // CTA block coordinate
        Layout<_1>{},         // Grid layout
        group_modes<0,2>(sA), // Collapse modes for flat copy
        group_modes<0,2>(gA)
    );

    // 4. Dispatch the autonomous forklift!
    if (threadIdx.x == 0) {
        using Bar = cutlass::arch::ClusterTransactionBarrier;
        
        // Initialize the mbarrier
        Bar::init(&ss.tma_barrier, 1);
        int bytes_to_copy = BLK_M * BLK_K * sizeof(cutlass::half_t);
        
        // Announce the expected transaction size
        Bar::arrive_and_expect_tx(&ss.tma_barrier, bytes_to_copy);
        
        // Issue the async copy using the manifest
        copy(tma.with(ss.tma_barrier), tAgA, tAsA);
        
        // Block until the forklift delivers the pallet
        Bar::wait(&ss.tma_barrier, 0);
    }
    __syncthreads();
    
    // Now sA is filled and ready for computation!
}

void host_launch_example() {
    // We must build the TMA descriptor on the HOST.
    cutlass::half_t* d_A;
    cudaMalloc(&d_A, BLK_M * BLK_K * sizeof(cutlass::half_t));

    auto mA_layout = make_layout(make_shape(Int<BLK_M>{}, Int<BLK_K>{}),
                                 make_stride(Int<BLK_K>{}, Int<1>{}));
    auto mA = make_tensor(make_gmem_ptr(d_A), mA_layout);
    
    auto sA_layout = make_layout(make_shape(Int<BLK_M>{}, Int<BLK_K>{}),
                                 make_stride(Int<BLK_K>{}, Int<1>{}));

    // Printing the manifest for the autonomous forklift
    auto tma = make_tma_atom(SM90_TMA_LOAD{}, mA, sA_layout,
                             make_shape(Int<BLK_M>{}, Int<BLK_K>{}));

    // Launch the kernel
    tma_copy_kernel<<<1, 128>>>(tma);
    cudaDeviceSynchronize();
    cudaFree(d_A);
}

int main() {
    host_launch_example();
    printf("TMA Copy Successful!\n");
    return 0;
}
```

## 4. Step-by-Step Explanation
Line `make_tma_atom(SM90_TMA_LOAD{}, mA, sA_layout, ...)`: This runs entirely on the host. It bundles the `Layout<Shape, Stride>` math into a hardware-readable manifest. We are telling the GPU how global memory and shared memory relate, *before* the kernel starts.

Line `auto [tAgA, tAsA] = tma_partition(...)`: We slice up the tensors for the copy. Unlike `TiledCopy` which assigns slices per-thread, `tma_partition` treats the CTA as one large worker.

Line `if (threadIdx.x == 0)`: Only **one dispatcher thread** is needed to initiate the TMA load. 

Line `copy(tma.with(&ss.tma_barrier), tAgA, tAsA)`: The actual dispatch command. The hardware asynchronously begins fetching the entire tile into shared memory.

Line `Bar::wait(...)`: Because the forklift works asynchronously, we must wait at the loading dock (the `mbarrier`) before we can safely read the data.

## 5. Engineer's Notebook (Latent Space Notes)
**Analogy:** `Tensor Memory Accelerator (TMA)` is an **autonomous forklift with a manifest**. You don’t have 128 workers carrying boxes anymore. One dispatcher hands the forklift a manifest (the TMA descriptor), and the hardware moves the entire pallet into shared memory. `make_tma_atom` is printing the manifest for the autonomous forklift (the TMA descriptor).

**Hardware Constraints & Gotchas:**
> **Gotcha ;  TMA descriptors are host-built.** `make_tma_atom` must run on the CPU. The descriptor encodes the address math for the tile and cannot be created inside the kernel.

> **Gotcha ;  TMA is async and barrier-driven.** One thread launches the copy; an mbarrier (`ClusterTransactionBarrier`) is required before any thread reads from smem. Without the `wait()`, you will read garbage data.

> **Gotcha ;  TMA prefers static layouts.** Tile shape and strides should be `Int<N>{}` so the descriptor is fully static and the copy is vector-friendly. CuTe uses these static layouts at compile-time to guarantee contiguity and optimize the transfer.

> **B200/Hopper Note:** TMA is the primary way to saturate memory bandwidth on SM90+ architectures. Understanding strides (`address = coord · stride`) is the only way to program the TMA descriptor correctly!

---

# WGMMA ; Warpgroup MMA

- URL: https://www.dcbaslani.xyz/blog/09_wgmma/
- Markdown: https://www.dcbaslani.xyz/blog/09_wgmma/index.md
- Text: https://www.dcbaslani.xyz/blog/09_wgmma/index.txt
- Published: 2026-04-04
- Topic: CUDA
- Description: How to use Warpgroup MMA (WGMMA) to feed NVIDIA Tensor Cores directly from shared memory, bypassing the register file bottleneck.

# 09. WGMMA (Warpgroup MMA)

**Difficulty:** Advanced
**Prerequisites:** Tutorial 06: Hello, MMA, Tutorial 08: The TMA Revolution

## 1. The Problem (The "Why")
Single warps are too small. We need 128 threads driving the Tensor Core together.
In previous generations, threads had to load memory from shared memory into their own registers before they could perform an MMA (Matrix Multiply Accumulate). This creates a massive bottleneck: the register file gets saturated, and the `ldmatrix` instructions waste precious clock cycles. We need a way to feed the Tensor Core directly from the shared memory staging area where the TMA forklift just dropped our data.

## 2. The Mental Model (The Visual)
WGMMA is the **Heavy Industrial Press**. Unlike a standard press where each worker loads their own tray, the WGMMA press reads directly from the forklift's delivery zone.

```text
Single-Warp MMA (SM80/SM89):
  smem ──(ldmatrix)──▶ registers ──(mma.sync)──▶ accumulators

WGMMA (SM90):
  smem ──(wgmma descriptor)────────────────────▶ accumulators
           (128 threads coordinate)
```

And to make things even more efficient, we have **TMA Multicast** ;  the forklift driver shouting "who else needs this?" and dropping one pallet into multiple CTA factories at once:

```text
  Global Memory
       │ (TMA Forklift reads once)
       ▼
  Cluster Broadcast Network
    ┌──┴──┐
    ▼     ▼
  smem  smem    (Delivered to multiple CTA factories simultaneously)
  CTA0  CTA1
```

## 3. The Solution (The Code)
Here is how you define a WGMMA instruction using CuTe:

```cpp
#include <cute/tensor.hpp>
#include <cute/arch/mma_sm90.hpp>
#include <cute/atom/mma_traits_sm90_gmma.hpp>
#include <cute/atom/mma_atom.hpp>

using namespace cute;

void example() {
    // 1. Define the Heavy Industrial Press (WGMMA)
    // 64x64x16 atom. F32 accumulators, F16 inputs. 
    // SS = both A and B are read directly from Shared Memory.
    // GMMA::Major::K means the K dimension is contiguous (TN layout).
    using WGMMA_Atom = SM90_64x64x16_F32F16F16_SS<GMMA::Major::K, GMMA::Major::K>;
    
    // 2. Build the factory floor plan for the 4-warp crew
    auto tiled_mma = make_tiled_mma(WGMMA_Atom{});
    
    // 3. Assume smem_A and smem_B are descriptors to our shared memory
    // (In reality, you would build these using make_smdest_ptr)
    // auto frag_A = ... 
    // auto frag_B = ...
    // auto accum  = ...
    
    // 4. Press the button! (Requires exactly 128 threads)
    // gemm(tiled_mma, frag_A, frag_B, accum);
}
```

## 4. Step-by-Step Explanation
Line 14: `SM90_64x64x16_F32F16F16_SS<...>` defines our specific `WGMMA` operation. It calculates a 64x64x16 matrix multiply. Crucially, the `_SS` suffix means both A and B operands will be sourced directly from Shared Memory descriptors, not from registers. The template parameters `<GMMA::Major::K, GMMA::Major::K>` indicate that both A and B have K as their fast-moving contiguous dimension.

Line 17: `make_tiled_mma` binds the instruction to a `Warpgroup`. CuTe knows that this SM90 atom requires exactly 128 threads (4 warps) and configures the `TiledMMA` layout automatically to coordinate them.

Line 26: `gemm(...)` dispatches the asynchronous `wgmma` instruction to the hardware. All 128 threads must participate in this call.

## 5. Engineer's Notebook (Latent Space Notes)
Analogy: Think of the `Warpgroup` as a synchronized crew of 4 warps working together as a single unit, and `SM90_WGMMA` as the Heavy Industrial Press. It takes inputs directly from the forklift's delivery zone (`smem` desc) rather than personal trays. `tma_multicast` is the forklift driver shouting "who else needs this?", delivering one pallet to multiple factories simultaneously.

Hardware Note: On Hopper (SM90), WGMMA requires exactly 128 threads. You cannot launch a warpgroup MMA with a single warp; it is a collaborative operation across 4 warps. TMA Multicast requires Threadblock Clusters ;  you can only multicast a TMA load to CTAs that are part of the same hardware cluster.

---

# Cute-DSL: I Wrote a CUDA Kernel in Python and My GPU Didn't Even Cry

- URL: https://www.dcbaslani.xyz/blog/cute-dsl-blog/
- Markdown: https://www.dcbaslani.xyz/blog/cute-dsl-blog/index.md
- Text: https://www.dcbaslani.xyz/blog/cute-dsl-blog/index.txt
- Published: 2026-04-19
- Topic: CUDA
- Description: Welcome to the ultimate guide to cute-dsl! Bringing the power of CuTe's concepts like Layouts, Tilers, and vectorized memory operations into a familiar, Pythonic interface.

# Cute-DSL: I Wrote a CUDA Kernel in Python and My GPU Didn't Even Cry

Welcome to the ultimate guide to `cute-dsl`! If you've tinkered with CUDA programming and are curious about CUTLASS's CuTe but felt overwhelmed by the complex C++ templates, `cute-dsl` is your new best friend. It brings the power of CuTe's concepts; like Layouts, Tilers, and vectorized memory operations; straight into a familiar, Pythonic interface.

In this blog, we'll dive into the basics, starting with simple APIs, building intuition around memory layouts with ASCII diagrams, and graduating to vectorized kernels.

## The Essentials: `cute.kernel` and `cute.jit`

At the core of `cute-dsl` are two crucial decorators:
- `@cute.kernel`: This marks a Python function as a device kernel, meaning it will run natively on the GPU. It behaves similarly to PyTorch's Triton `@triton.jit` or Numba's `@cuda.jit`.
- `@cute.jit`: This marks a host function that will compile and launch your device kernels. Think of it as the bridge between your standard Python code and your GPU parallel executions.

### A Gentle Start: Hello, GPU
Let's look at the absolute simplest program you can write:

```python
import cutlass
import cutlass.cute as cute

@cute.kernel
def hello_kernel():
    # Fetch thread coordinates just like CUDA!
    tidx, _, _ = cute.arch.thread_idx()
    if tidx == 0:
        cute.printf("hello from gpu")

@cute.jit
def hello_world():
    cutlass.cuda.initialize_cuda_context()
    hello_kernel().launch(grid=(1, 1, 1), block=(32, 1, 1))

if __name__ == "__main__":
    compiled = cute.compile(hello_world)
    compiled()
```

When you define `hello_kernel`, it gets compiled into PTX code behind the scenes. Inside it, we use `cute.arch.thread_idx()` ;  which returns `(threadIdx.x, threadIdx.y, threadIdx.z)`, just like in normal CUDA. We then launch it using standard `(grid, block)` dimensions inside a `@cute.jit`-compiled host function. Beautifully simple!

## Demystifying Layouts

A **Layout** in `cute` is the fundamental building block of memory access. It dictates how a logical multi-dimensional coordinate maps to a 1D flat memory index.

A Layout is parameterized by two things: **Shape** and **Stride**.

Let's imagine an 8x8 matrix in **Column-Major** layout: `shape=(8, 8)` and `stride=(1, 8)`.

The formula to convert a multi-dimensional coordinate to a flat memory index is: `index = row * stride[0] + col * stride[1]`. With stride `(1, 8)`, moving one step down a column (incrementing row) moves by 1 in memory, while moving one step across a row (incrementing column) jumps by 8.

```text
Logical 8x8 View (showing flat index)       Flat Memory (Column-Major)
+----+----+----+-----                       -----------
|  0 |  8 | 16 | ...                        [0] -> (row=0, col=0)
+----+----+----+-----                       [1] -> (row=1, col=0)
|  1 |  9 | 17 | ...                        [2] -> (row=2, col=0)
+----+----+----+-----                       ...
|  2 | 10 | 18 | ...                        [8] -> (row=0, col=1)
+----+----+----+-----                       [9] -> (row=1, col=1)
```

In **Row-Major** layout, the stride would be `(8, 1)` ;  meaning moving to the *next row* jumps by 8 in flat memory, while moving to the *next column* steps by just 1. So elements within the same row are contiguous in memory, which is the opposite of column-major.

Layouts in `cute` can also be **hierarchically nested**. For example, a shape of `(2, (2, 4))` with stride `(1, (2, 4))` describes a 2-element outer dimension where each element contains a nested 2×4 sub-layout. This nesting lets you mathematically model complex memory access patterns like blocked or swizzled layouts ;  essential for high-performance GPU kernels.

## Logical Divide vs Zipped Divide

When we want to transfer data between different memory hierarchies (like Global Memory -> Shared Memory -> Registers), we often do so in small parallelized blocks called **Tiles**. 

Let's take our 8x8 matrix layout `L = (8, 8)` and divide it using a **Tiler** of size `(2, 4)`. A tiler dictates the shape of the blocks we want to chop our layout into.

```python
L = cute.make_layout(shape=(8, 8), stride=(1, 8))
tiler = (2, 4)

L_logical = cute.logical_divide(L, tiler)
L_zipped = cute.zipped_divide(L, tiler)
```

### Logical Divide
A **Logical Divide** splits each mode independently into `(tile_size, num_tiles)`:
- Rows (size 8) ÷ Tiler Row (2) → `(2, 4)` meaning "2 rows per tile, 4 tile-rows"
- Cols (size 8) ÷ Tiler Col (4) → `(4, 2)` meaning "4 cols per tile, 2 tile-cols"

The combined output shape becomes: `((2, 4), (4, 2))`.

```text
Logical Divide: ((2, 4), (4, 2))

   Mode 0 (Rows):  (row_within_tile, which_tile_row)  =  (2, 4)
   Mode 1 (Cols):  (col_within_tile, which_tile_col)  =  (4, 2)

Indexing requires specifying all 4 sub-dimensions:
   Data[ (row_in_tile, tile_row_idx), (col_in_tile, tile_col_idx) ]

The row and column modes remain separate ;  they are NOT grouped
by "tile" vs "grid". This makes it awkward to grab a full 2x4 tile.
```

### Zipped Divide
A **Zipped Divide** first performs a logical divide, then *re-groups* the results: it bundles all the "within-tile" dimensions into Mode 0 and all the "which-tile" dimensions into Mode 1.
- **Mode 0 (Tile)**: `(2, 4)` ;  the shape of one tile (rows-in-tile, cols-in-tile)
- **Mode 1 (Grid)**: `(4, 2)` ;  how many tiles (tile-rows, tile-cols)

The numbers look identical to logical divide ;  `((2, 4), (4, 2))` ;  but the *meaning* of each mode has changed completely!

```text
Zipped Divide: Original (8, 8) ÷ Tiler (2, 4)

Mode 0 = One Tile (2, 4):            Mode 1 = Grid of Tiles (4, 2):

  col 0  col 1  col 2  col 3           tile_row  tile_col
  +------+------+------+------+        +-------+-------+
  |      |      |      |      |  row0  | (0,0) | (0,1) |
  +------+------+------+------+        +-------+-------+
  |      |      |      |      |  row1  | (1,0) | (1,1) |
  +------+------+------+------+        +-------+-------+
  A single 2x4 tile                    | (2,0) | (2,1) |
                                       +-------+-------+
                                       | (3,0) | (3,1) |
                                       +-------+-------+
                                       4×2 grid = 8 tiles total
```

### Why is Zipped Divide Preferred?
Zipped divide gives you a clean two-level hierarchy: `(Tile_Coordinate, Grid_Coordinate)`.

When doing vectorized loads or memory copies, you typically want to grab a *whole tile* at a specific *grid position*. Compare the two approaches:

- **Logical Divide**: To grab a tile you must index 4 sub-dimensions separately: `Data[(row_in_tile, tile_row_idx), (col_in_tile, tile_col_idx)]` ;  cumbersome and error-prone.
- **Zipped Divide**: The layout already mirrors `(Tile, Grid)`, so you cleanly index: `Data[tile_coord, grid_coord]` ;  much simpler!

This is why `zipped_divide` is the go-to operation for partitioning data across threads and memory hierarchies in CuTe kernels.

## Anatomy of the Naive Elementwise Add Kernel

Let's see how indexing looks across different paradigms for a simple elementwise `C = A + B` kernel.

```python
@cute.kernel
def naive_elementwise_add_kernel(A: cute.Tensor, B: cute.Tensor, C: cute.Tensor):
    tidx, _, _ = cute.arch.thread_idx()
    bidx, _, _ = cute.arch.block_idx()
    bdim, _, _ = cute.arch.block_dim()

    idx = bidx * bdim + tidx
    
    m, n = A.shape
    ni = idx % n
    mi = idx // n

    C[mi, ni] = A[mi, ni] + B[mi, ni]
```

### The Indexing Difference
- **PyTorch**: Fully abstracted ;  you simply write `C = A + B` and PyTorch handles parallelism, memory layout, and iteration internally.
- **CUDA**: GPU memory is flat 1D, so you typically work with raw pointers: `C[idx] = A[idx] + B[idx]`. If you need multi-dimensional indexing, you must manually compute row/column indices from the flat thread index and handle bounds checking yourself.
- **cute-dsl**: The best of both worlds! Each thread computes a flat `idx`, then uses simple arithmetic (`idx % n` and `idx // n`) to recover 2D coordinates `(mi, ni)`. You then index with clean multi-dimensional syntax: `A[mi, ni]`. Under the hood, CuTe's Layout algebra automatically translates these logical coordinates to the correct physical memory offset ;  no raw pointer math needed!

## Graduating to Vectorized Execution

GPUs love vectorized operations. Why fetch one float per thread when you can comfortably fetch four? Vectorized pipelines saturate memory bandwidth efficiently.
To vectorize seamlessly, we systematically reshape our datasets using ;  you guessed it ;  **tilers**!

```python
@cute.jit 
def vectorized_add(A: cute.Tensor, B: cute.Tensor, C: cute.Tensor):
    # Defining a tile of 4 elements wide
    tiler = (1, 4)
    # Zipping it chunks our layout!
    A = cute.zipped_divide(A, tiler)
    B = cute.zipped_divide(B, tiler)
    ...
```

By applying `zipped_divide`, our flat element dimensions transform into chunks of `(1, 4)`. 
The `A`, `B`, and `C` layouts effectively represent shape `((1, 4), (M, N/4))`:
- **Mode 0**: `(1, 4)` is our Inner Tile footprint (yielding 4 elements horizontally).
- **Mode 1**: `(M, N/4)` is our Outer Grid footprint. Each coordinate in the grid points to a vector chunk!

Let's witness the magic in the vectorized kernel:

```python
@cute.kernel
def vectorized_add_kernel(A, B, C):
    tidx, _, _ = cute.arch.thread_idx()
    bidx, _, _ = cute.arch.block_idx()
    bdim, _, _ = cute.arch.block_dim()

    idx = bidx * bdim + tidx

    # Retrieve boundaries from Mode 1 (the grid)
    m, n = A.shape[1]
    ni = idx % n
    mi = idx // n

    # MAGIC HAPPENS HERE
    a_val = A[(None, (mi, ni))].load()
    b_val = B[(None, (mi, ni))].load()

    C[(None, (mi, ni))] = a_val + b_val
```

### Breaking down `A[(None, (mi, ni))]`
This is why we advocated for Zipped Divide earlier! After `zipped_divide`, `A` expects coordinates in the format `[TileCoord, GridCoord]`:
- **`None` for TileCoord**: In CuTe DSL, `None` works like Python's `:` slice ;  it means "select everything along this mode." Since our tile shape is `(1, 4)`, passing `None` selects all 4 elements of the tile. This gives us back a small tensor (a 4-element vector) rather than a single scalar.
- **`(mi, ni)` for GridCoord**: This pinpoints *which* tile in the `(M, N/4)` grid this thread should access.
- **`.load()`**: This tells the compiler to emit a hardware-level **vectorized memory load** (e.g., `ld.global.v4`), fetching all 4 elements in a single memory transaction directly into registers ;  far more efficient than 4 separate scalar loads!

### Conclusion
By leveraging Layouts, divide operations, and tilers, `cute-dsl` abstracts away the complex C++ template machinery of raw CUTLASS while preserving its performance. You get optimized layout mapping, zero raw pointer math, and vectorized register loading ;  all with syntax that feels like writing everyday PyTorch or NumPy code.

Time to fire up your GPUs!

---

# Breaking PyTorch Boundaries: Fusing RMSNorm and GDN in Triton for Qwen 3.5

- URL: https://www.dcbaslani.xyz/blog/qwen_3.5/
- Markdown: https://www.dcbaslani.xyz/blog/qwen_3.5/index.md
- Text: https://www.dcbaslani.xyz/blog/qwen_3.5/index.txt
- Published: 2026-05-25
- Topic: Triton
- Description: Rebuilding and optimizing the Qwen 3.5 (9B) Gated Delta Attention inference stack with custom Triton kernels, achieving 5.2x speedup on NVIDIA B200.

# Breaking PyTorch Boundaries: Fusing RMSNorm and GDN in Triton for Qwen 3.5

I rebuilt the inference stack for Qwen 3.5 (9B) from scratch in PyTorch, profiled it, and replaced the key bottlenecks with hand-written Triton GPU kernels. By fusing memory-bound operations and rewriting the Gated Delta Network (GDN) linear attention layers to run without slow Python loops, I bumped throughput from **16 to 83 tokens/second** on a single NVIDIA B200 (a 5.2x speedup).

---

## Table of Contents

- [Project Overview](#project-overview)
- [Hardware Setup](#hardware-setup)
- [Finding the Bottlenecks (Baseline Profiling)](#finding-the-bottlenecks-baseline-profiling)
- [Fusing the Residual Stream](#fusing-the-residual-stream)
- [Optimizing Gated Delta Attention](#optimizing-gated-delta-attention)
- [Triton vs. Other Frameworks](#triton-vs-other-frameworks)
- [Comparison with vLLM](#comparison-with-vllm)
- [Takeaways](#takeaways)

---

## Project Overview

Unlike vanilla transformers, Qwen 3.5 uses a **hybrid architecture** that interleaves standard multi-head attention layers with **Gated Delta Network (GDN)** linear attention layers. The GDN layers use a recurrent state-space mechanism that compresses past context into a fixed-size state matrix instead of maintaining a growing KV cache.

```
Layer 0:  Linear Attention (GDN)
Layer 1:  Linear Attention (GDN)
Layer 2:  Linear Attention (GDN)
Layer 3:  Full Attention (SDPA)    ← every 4th layer
Layer 4:  Linear Attention (GDN)
...
Layer 31: Full Attention (SDPA)
```

Because **75% of the decoder layers** are GDN layers, optimizing them is critical; they dominate both prefill and decode latency.

The goal here wasn't to build a production serving system, but to understand what happens between the CUDA cores and HBM when you run text generation, and to see how much performance we can recover through targeted Triton kernels.

### Code Components

| Component | Description |
|---|---|
| [qwen.py](https://github.com/Darshan-Baslani/qwen3.5-optimized/blob/main/qwen.py) | Custom PyTorch implementation of the full Qwen 3.5 model (983 LOC) |
| [fused_zero_centered_rmsnorm.py](https://github.com/Darshan-Baslani/qwen3.5-optimized/blob/main/kernels/triton/fused_zero_centered_rmsnorm.py) | Triton kernel fusing residual additions and zero-centered RMSNorm |
| [prefill.py](https://github.com/Darshan-Baslani/qwen3.5-optimized/blob/main/kernels/linear_attention/prefill.py) | Triton chunkwise GDN prefill kernel using WY decomposition (993 LOC) |
| [decode.py](https://github.com/Darshan-Baslani/qwen3.5-optimized/blob/main/kernels/linear_attention/decode.py) | Triton single-token GDN decode kernel |
| [modal/inference.py](https://github.com/Darshan-Baslani/qwen3.5-optimized/blob/main/modal/inference.py) | Inference pipeline deployed to Modal (NVIDIA B200) |
| [modal/vllm_bench.py](https://github.com/Darshan-Baslani/qwen3.5-optimized/blob/main/modal/vllm_bench.py) | Benchmarking script comparing our performance against vLLM |

---

## Hardware Setup

I developed local tests on a GTX 1650 to iterate fast, and ran final benchmarks on an NVIDIA B200 via Modal.

```mermaid
graph LR
    subgraph Local["Local Development"]
        GPU1["GTX 1650<br/>4GB VRAM<br/>Turing SM75"]
        DEV["Functional correctness<br/>Kernel compilation<br/>Numerical validation"]
        GPU1 --> DEV
    end

    subgraph Cloud["Modal Cloud"]
        GPU2["NVIDIA B200<br/>192GB HBM3e<br/>Blackwell SM100"]
        PERF["Throughput benchmarks<br/>Profiler traces<br/>Production metrics"]
        GPU2 --> PERF
    end

    DEV -->|"modal run"| PERF
```

| Platform | Local Development | Cloud Benchmark |
|---|---|---|
| **GPU** | GTX 1650 (Turing SM75) | NVIDIA B200 (Blackwell SM100) |
| **VRAM** | 4 GB GDDR6 | 192 GB HBM3e |
| **Memory Bandwidth** | 128 GB/s | 8,000 GB/s |
| **Role** | Fast functional testing | Throughput and profiling |

### Profiling Setup

I captured profiler traces on the B200 using PyTorch's built-in `torch.profiler` with CPU + CUDA activity recording and memory tracking. The profiling script ([inference-torch-profiler.py](https://github.com/Darshan-Baslani/qwen3.5-optimized/blob/main/modal/inference-torch-profiler.py)) profiles the prefill pass and a few decode steps, exporting Chrome traces for analysis.

I ran profiles at three main development stages:

> [!WARNING]
> **Profiling Overhead:** The absolute times in the traces are much slower than actual inference because the profiler (with memory tracking and `CUDA_LAUNCH_BLOCKING=1`) adds significant overhead. The throughput numbers below reflect actual, unprofiled runs, while the trace times show the relative gains under profiling conditions.

| Trace | Stage | Profiled Time (with overhead) | Actual Throughput |
|---|---|---|---|
| `initial_trace.json` | Pure PyTorch baseline | 10.53 s | ~16 tok/s |
| `fused_rms_norm.json` | + Fused Triton RMSNorm | 10.40 s | ~50 tok/s |
| `triton_linear_attention.json` | + Triton GDN kernels | 6.51 s | ~83 tok/s |

---

## Finding the Bottlenecks

### The Initial Trace

I started by profiling the baseline PyTorch model on the B200 to see where the actual execution time was spent.

![Initial trace showing a single decode step and the cascade of tiny CUDA kernels](/blog/images/initial_trace.jpg)

The trace showed a classic memory-bandwidth bottleneck. A single decode step; which should ideally consist of a few fused operations; was launching dozens of tiny individual CUDA kernels:

```
aten::add          →  50 µs  (residual addition)
aten::pow          →  30 µs  (variance computation)
aten::mean         →  40 µs  (reduction for RMSNorm)
aten::rsqrt        →  20 µs  (inverse sqrt)
aten::mul          →  25 µs  (weight application)
aten::to           →  15 µs  (dtype cast)
```

Each of these operations touches HBM; reading the hidden state (8 KB per token), performing one arithmetic step, and writing the result back. The real bottleneck wasn't the arithmetic; it was the kernel launch overhead and memory roundtrips.

### The Abstraction Bottleneck

The standard PyTorch pattern treats each layer and operation as an isolated object:

```mermaid
graph TB
    subgraph Standard["Standard PyTorch Pattern (Memory-Bound)"]
        direction TB
        H0["hidden_states"] -->|"Write to HBM"| HBM1["HBM"]
        HBM1 -->|"Read from HBM"| ADD["aten::add (residual)"]
        ADD -->|"Write to HBM"| HBM2["HBM"]
        HBM2 -->|"Read from HBM"| POW["aten::pow"]
        POW -->|"Write to HBM"| HBM3["HBM"]
        HBM3 -->|"Read from HBM"| MEAN["aten::mean"]
        MEAN -->|"Write to HBM"| HBM4["HBM"]
        HBM4 -->|"Read from HBM"| RSQRT["aten::rsqrt"]
        RSQRT -->|"Write to HBM"| HBM5["HBM"]
        HBM5 -->|"Read from HBM"| MUL["aten::mul (weight)"]
        MUL -->|"Write to HBM"| HBM6["HBM"]
    end
```

That means **six HBM round-trips for a single normalization**. On the B200, each round-trip for a 4096-dim hidden state costs very little time in raw transfer, but the launch overhead for each tiny kernel is 20-50 µs. Multiply that by 32 layers and 2 norms per layer, and you get **64 normalization passes per decode step**.

Essentially, PyTorch's module abstraction (where each layer/operation is a self-contained forward pass) forces separate kernel launches and intermediate memory writes, which kills performance on the GPU.

---

## Fusing the Residual Stream

### Passing the Residual Stream Down

To fix this, we can adopt a trick used by systems like vLLM: pass the residual stream as an independent tensor so that the next layer's normalization kernel can fuse both the addition and the normalization step.

Instead of:

```python
# Standard: Layer owns its residual
hidden = layer_norm(hidden + residual)
```

We restructure the decoder to return both:

```python
# Fused: Residual flows between layers
hidden, residual = fused_norm(hidden, residual)
```

### The Fused Triton Kernel

Here is the Triton implementation:

```python
@triton.jit
def _fused_zero_centered_rmsnorm(
    Y_ptr, Y_row_stride,
    S_ptr, S_row_stride,     # output residual
    X_ptr, X_row_stride,
    R_ptr, R_row_stride,     # input residual
    W_ptr, n_cols, eps,
    BLOCK_SIZE: tl.constexpr,
):
    row_idx = tl.program_id(0)
    col_offsets = tl.arange(0, BLOCK_SIZE)
    mask = col_offsets < n_cols

    # Step 1: Load X and R from HBM (the ONLY HBM read)
    X_row = tl.load(X_ptr + row_idx * X_row_stride + col_offsets, mask=mask)
    R_row = tl.load(R_ptr + row_idx * R_row_stride + col_offsets, mask=mask)

    # Step 2: Fused residual add (stays in SRAM)
    S_row = X_row + R_row
    tl.store(S_ptr + row_idx * S_row_stride + col_offsets, S_row, mask=mask)

    # Step 3: RMSNorm in FP32 (stays in SRAM)
    S_row = S_row.to(tl.float32)
    W_row = tl.load(W_ptr + col_offsets, mask=mask).to(tl.float32)

    mean_square = tl.sum(S_row * S_row, axis=0) / n_cols
    rstd = tl.rsqrt(mean_square + eps)
    S_row = S_row * rstd

    # Step 4: Zero-centered weight trick (Qwen-specific)
    Y_row = S_row * (1.0 + W_row)

    # Step 5: Write normalized output to HBM (the ONLY HBM write)
    tl.store(Y_ptr + row_idx * Y_row_stride + col_offsets, Y_row.to(S_row_dtype), mask=mask)
```

### Key Implementation Details

1. **Zero-Centered Weights:** Standard RMSNorm applies `output = norm(x) * weight`. Qwen 3.5 uses zero-centered weights initialized at 0, applying `output = norm(x) * (1.0 + weight)`. Standard RMSNorm kernels will output incorrect activations and cause the model to diverge.
2. **FP32 Accumulation:** We must upcast the accumulated sum to FP32 before computing the variance. BF16's limited precision causes compounding rounding errors across 32 layers, leading to numerical divergence.
3. **Dual Outputs:** The kernel outputs the normalized hidden state `Y` and the new residual `S` in one go. While `S` still gets written to HBM for the next layer, the fusion eliminates 5 intermediate round-trips and launches.

### Memory Traffic and Kernel Launches

```mermaid
graph TB
    subgraph Fused["Fused Triton Kernel (1 Launch)"]
        direction TB
        READ["Read X, R from HBM"] --> SRAM["SRAM (Shared Memory)"]
        SRAM --> ADD_F["Add: S = X + R"]
        ADD_F --> NORM_F["Normalize in FP32"]
        NORM_F --> WEIGHT_F["Apply (1 + W)"]
        WEIGHT_F --> WRITE["Write Y, S to HBM"]
    end
```

| Metric | Before (PyTorch) | After (Fused Triton) |
|---|---|---|
| HBM round-trips per norm | 6 | 1 |
| Kernel launches per norm | 6 | 1 |
| Total norms per decode step | 64 | 64 |
| Kernel launches eliminated | ;  | **320 per step** |

### Restructuring the Decoder Layer

We also need to map the standard HuggingFace weights into our fused model format during loading (duplicating `input_layernorm.weight` into both a `standard` and `fused` slot).

```python
class Qwen3_5DecoderLayer(nn.Module):
    def forward(self, hidden_states, residual, ...):
        if residual is None:
            # First layer: no residual yet, use standard norm
            residual = hidden_states
            hidden_states = self.input_layernorm_standard(hidden_states)
        else:
            # Subsequent layers: fused residual + norm
            hidden_states, residual = self.input_layernorm_fused(X=hidden_states, R=residual)

        # Token mixer (attention or GDN)
        hidden_states = self.token_mixer(hidden_states)

        # Post-attention: always fused
        hidden_states, residual = self.post_attention_layernorm(X=hidden_states, R=residual)
        hidden_states = self.mlp(hidden_states)

        return hidden_states, residual  # ← residual flows to next layer
```

### Performance Gains (Stage 1)

![Profiler trace after applying fused RMSNorm, showing reduced kernel launch density](/blog/images/fused_rms_norm.png)

| Implementation | Throughput (B200, 150 tokens) |
|---|---|
| Pure PyTorch baseline | ~16 tok/s |
| + Fused Triton RMSNorm | **~50 tok/s** (3.1x speedup) |

Fusing RMSNorm and residual addition got us to ~50 tokens/sec. However, the profiler showed we were still heavily bottlenecked by the Gated Delta Network (GDN) linear attention layers, which were still running in pure PyTorch with sequential Python loops.

---

## Optimizing Gated Delta Attention

### GDN Attention vs. Standard Attention

Standard multi-head attention processes all tokens in parallel via $Q K^\top V$. The KV cache grows linearly with sequence length, but the attention computation itself is highly parallelizable.

GDN linear attention works differently. It maintains a **fixed-size recurrent state matrix** $H \in \mathbb{R}^{K \times V}$ (128 × 128 = 16 KB per head) that compresses all past context:

```mermaid
graph LR
    subgraph Standard["Standard Attention"]
        Q1["Q"] --> DOT1["Q @ Kᵀ"]
        K1["K cache<br/>(grows with seq)"] --> DOT1
        DOT1 --> SOFT["softmax"]
        SOFT --> DOT2["... @ V"]
        V1["V cache<br/>(grows with seq)"] --> DOT2
    end

    subgraph GDN["GDN Linear Attention"]
        Q2["q_t"] --> OUT["o_t = q_t @ H_t"]
        STATE["H_t<br/>(fixed 128×128)"] --> OUT
        K2["k_t"] --> UPDATE["H_{t+1} = γ·H_t + k_t ⊗ δ_t"]
        V2["v_t"] --> DELTA["δ_t = β·(v_t - k_t @ H_t)"]
        STATE --> DELTA
        DELTA --> UPDATE
    end
```

The recurrence relation for each token:

$$H_{t+1} = \gamma_t \cdot H_t + k_t \otimes \beta_t \cdot (v_t - k_t^\top H_t)$$
$$o_t = q_t^\top \cdot H_{t+1}$$

Where the learned gating decay $\gamma_t$ is defined as:
$$\gamma_t = \exp\left(-\exp(A_{\text{log}}) \cdot \text{softplus}(a_t + \text{dt}_{\text{bias}})\right)$$

### The Bottleneck: Sequential Loops in PyTorch

The initial PyTorch implementation processed chunks sequentially using a Python `for` loop:

```python
# The bottleneck: sequential iteration
for i in range(total_sequence_length // chunk_size):
    q_i, k_i, v_i = query[:, :, i], key[:, :, i], value[:, :, i]
    attn = q_i @ k_i.transpose(-1, -2) * decay_mask[:, :, i]
    v_prime = k_cumdecay[:, :, i] @ last_recurrent_state  # ← HBM read
    v_new = v_i - v_prime
    attn_inter = (q_i * g[:, :, i, :, None].exp()) @ last_recurrent_state  # ← HBM read
    core_attn_out[:, :, i] = attn_inter + attn @ v_new
    last_recurrent_state = (  # ← HBM write, then read again next iteration
        last_recurrent_state * g[:, :, i, -1, None, None].exp()
        + (k_i * ...).transpose(-1, -2) @ v_new
    )
```

For a prefill sequence of 512 tokens with a chunk size of 64, we have $\frac{512}{64} = 8$ chunks. In every loop iteration:
1. **Read** the recurrent state matrix $H$ from HBM (128 × 128 float32 elements = 64 KB/head × 32 heads = 2 MB total).
2. Perform small matrix multiplications.
3. **Write** the updated 2 MB state matrix back to HBM.
4. Yield execution back to Python to start the next iteration.

This results in:
$$\text{Total HBM Traffic} = 8 \text{ chunks} \times (2\text{ MB read} + 2\text{ MB write}) = 32\text{ MB}$$

This is 32 MB of redundant memory transfers per layer. All of this state could instead live entirely in the GPU's SRAM (shared memory) during the entire forward pass.

### Writing the Triton Chunkwise GDN Kernel

To parallelize this, we use the chunkwise formulation from the [Flash Linear Attention (FLA)](https://github.com/fla-org/flash-linear-attention) paper: intra-chunk interactions are computed in parallel via matrix multiplications, while the inter-chunk state updates are propagated sequentially while keeping the state matrix in SRAM.

#### Prefill Kernel Architecture

The prefill is split into two phases for maximum parallelism:

```mermaid
flowchart TB
    subgraph Phase1["Phase 1: WY Prepass (Embarrassingly Parallel)"]
        direction LR
        P1["Grid: (num_seqs, num_heads, num_chunks)"]
        P1 --> C1["Chunk 0"]
        P1 --> C2["Chunk 1"]
        P1 --> C3["Chunk 2"]
        P1 --> CN["Chunk N"]

        C1 --> WY1["Compute:<br/>• K @ Kᵀ gram matrix<br/>• (I+N)⁻¹ via Neumann series<br/>• Gate cumulative products<br/>• Q @ Kᵀ interaction matrix"]
        C2 --> WY2["Same per-chunk<br/>computation"]
        C3 --> WY3["Same per-chunk<br/>computation"]
        CN --> WYN["Same per-chunk<br/>computation"]
    end

    subgraph Phase2["Phase 2: Sequential State Propagation (State in SRAM)"]
        direction TB
        S0["State H₀<br/>(loaded once)"] --> PROC0["Process Chunk 0<br/>using precomputed WY"]
        PROC0 -->|"State stays in SRAM"| PROC1["Process Chunk 1<br/>using precomputed WY"]
        PROC1 -->|"State stays in SRAM"| PROC2["Process Chunk 2<br/>using precomputed WY"]
        PROC2 -->|"State stays in SRAM"| PROCN["Process Chunk N<br/>using precomputed WY"]
        PROCN --> SN["State H_N<br/>(written once)"]
    end

    Phase1 -->|"WY data in HBM"| Phase2
```

#### The WY Decomposition

The intra-chunk math uses the delta rule to formulate a lower-triangular system:

$$(I + N) \cdot X = \beta \cdot \left(\frac{V}{G} - K \cdot S_{\text{in}}^\top\right)$$

Where $N$ is a strictly lower-triangular nilpotent matrix ($N[j,i] = \beta_j \cdot (k_j \cdot k_i)$ for $i < j$). Since $N$ is nilpotent of order $C$ (chunk size), we can invert $(I+N)$ exactly using the Neumann series:

$$(I + N)^{-1} = (I - N)(I + N^2)(I + N^4) \cdots$$

This is implemented as a fixed-depth doubling chain:

```python
@triton.jit
def _apply_unit_lower_inverse(nil, rhs, BV: tl.constexpr, CHUNK: tl.constexpr):
    """(I+N)^{-1} via doubling. Uses TF32 tensor cores for numerical stability."""
    sol = rhs - _dot_f32(nil, rhs)        # (I - N) @ rhs
    power = _dot_f32(nil, nil)             # N²
    if CHUNK >= 4:
        sol = sol + _dot_f32(power, sol)   # += N² @ sol
        power = _dot_f32(power, power)     # N⁴
    if CHUNK >= 8:
        sol = sol + _dot_f32(power, sol)   # += N⁴ @ sol
        power = _dot_f32(power, power)     # N⁸
    if CHUNK >= 16:
        sol = sol + _dot_f32(power, sol)   # += N⁸ @ sol
        power = _dot_f32(power, power)     # N¹⁶
    if CHUNK >= 32:
        sol = sol + _dot_f32(power, sol)   # += N¹⁶ @ sol
    return sol
```

#### Blackwell-Specific Optimizations

Getting clean performance on the B200 required a few low-level adjustments:

1. **Precision Splitting (TF32 vs. BF16):** The nilpotent inverse chain uses TF32 (19-bit mantissa) since numerical errors compound quickly across the log-depth doubling chain. For the large K=128 contractions (`K @ K^T`, `Q @ state^T`), we use BF16 tensor cores to get a 4x speedup, where rounding is bounded for single-shot matmuls.
2. **Blackwell Code-Gen Bug:** We hit a bug in Triton's compiler on Blackwell (specifically the `TritonGPUHoistTMEMAlloc` pass) where it incorrectly fused `tl.dot` outputs. We bypassed it by wrapping the dot product in an inline PTX `mov.f32` barrier:

```python
@triton.jit
def _dot_f32(a, b):
    out = tl.dot(a, b, input_precision="tf32", out_dtype=tl.float32)
    return tl.inline_asm_elementwise(
        asm="mov.f32 $0, $1;",
        constraints="=r,r",
        args=[out], dtype=tl.float32, is_pure=True, pack=1,
    )
```

3. **Adaptive Tiling:** We adjusted tile sizes dynamically depending on the workload:

| Parameter | Small Batch | Large Batch | Rationale |
|---|---|---|---|
| `CHUNK` | 32 | 16 | Larger chunks amortize launch overhead; smaller chunks save Gram matrix computation |
| `BV` (V-tile) | 16 | 16-32 | Controls shared memory occupancy vs. register pressure |
| `num_warps` | 4 | 2 | Reduces sync overhead on smaller sequence batches |

#### The Decode Kernel

For the single-token decode kernel, things are simpler since we don't need chunking. We map the grid over batch and heads:

1. Load the state tile $H[\text{BV}, K]$ from HBM (single read).
2. Compute the gate decay value $\gamma = \exp\left(-\exp(A_{\text{log}}) \cdot \text{softplus}(a + \text{dt}_{\text{bias}})\right)$.
3. Run the recurrent update step directly in registers.
4. Write out the output token and the new state back to HBM.

```python
@triton.jit
def gdn_decode_kernel(...):
    # Gate computation
    g = tl.exp(-tl.exp(A_log_val) * softplus_x)
    beta = tl.sigmoid(b_val)

    # Decay existing state
    old_state = g * b_h

    # Delta rule update
    old_v = tl.sum(old_state * b_k[None, :], axis=1)
    delta_v = beta * (b_v - old_v)

    # Output BEFORE state update (frees registers)
    old_o = tl.sum(old_state * b_q[None, :], axis=1)
    kq = tl.sum(b_k * b_q)
    b_o = scale * (old_o + delta_v * kq)
    tl.store(out_ptr + ..., b_o.to(tl.bfloat16))

    # State update (register-only, no extra HBM read)
    state_out = old_state + delta_v[:, None] * b_k[None, :]
    tl.store(new_state_ptr + ..., state_out)
```

> [!TIP]
> Storing the output *before* computing `state_out` frees up registers and prevents local memory spilling. Since Blackwell caps registers at 255 per thread, this manual scheduling prevents register spills.

### Performance Gains (Stage 2)

![Profiler trace with fused GDN kernels, showing a clean decode step free of Python loop overhead](/blog/images/triton_linear_attention.png)

| Implementation | End-to-End Time (150 tok) | Throughput |
|---|---|---|
| Pure PyTorch baseline | 9.28 s | ~16 tok/s |
| + Fused Triton RMSNorm | 3.02 s | ~50 tok/s |
| + Triton GDN Kernels | 1.81 s | **~83 tok/s** (5.2x speedup) |

---

## Triton vs. Other Frameworks

When choosing how to write these kernels, there were three main options:

1. **Triton:** Fast iteration, Python-based, portable. But we lose low-level control over TMA (Tensor Memory Accelerator) and ran into compiler bugs on Blackwell.
2. **TileLang:** Good compromise, offers warp specialization and better occupancy control, but the ecosystem is very young.
3. **CuTe / CUTLASS (C++):** Maximum performance, explicit layout swizzling, and full control over asynchronous memory copies. But it requires writing 3,000+ lines of C++ template metaprogramming per kernel, adding weeks of development.

### Where Triton Falls Short on Blackwell

Writing kernels for the B200 highlighted a few areas where Triton's compiler abstractions hit limits:
- **TMA Control:** The B200 features a Tensor Memory Accelerator to prefetch data asynchronously. Triton abstracts this, meaning we cannot manually orchestrate tile prefetching or schedule memory loads.
- **Swizzle Layouts:** Preventing bank conflicts in shared memory requires specific data layouts. Triton manages this automatically but can generate suboptimal swizzling patterns for non-power-of-two tiles.
- **Compiler Bugs:** The code-generation bug in `TritonGPUHoistTMEMAlloc` meant we had to resort to inline PTX assembly barriers.

### Why I Stuck With Triton

Even with these issues, Triton was the right choice for this project:

| Factor | Triton | CuTe/CUTLASS |
|---|---|---|
| Iteration Time | Minutes | Hours (compile + run) |
| Lines of Code | ~1,100 total | ~3,000+ per kernel |
| Maintainability | High (Python-like syntax) | Low (highly complex C++ templates) |
| Throughput | 83 tokens/s | Theoretical peak (maybe ~90 tok/s) |
| Development Time | 2 weeks | 2+ months(I have fulltime college :( ) |

Ultimately, hitting 83 tokens/second in a couple of weeks with Triton is a much better engineering tradeoff than spending months in C++ to chase a marginal single-digit performance gain.

---

## Comparison with vLLM

To see where these optimizations stand relative to a production serving system, I benchmarked our setup against vLLM on the same B200 hardware:

| System | Throughput | Setup |
|---|---|---|
| Custom Triton Kernels | **83 tok/s** | Single request, no batching, raw Triton kernels |
| vLLM v0.20.1 | **~250 tok/s** | Production stack (PagedAttention, CUDA Graphs, scheduling) |

### Analyzing the Performance Difference

The 3x throughput gap is due to systems-level architecture rather than raw kernel execution speeds:

| Optimization | Custom Pipeline | vLLM |
|---|---|---|
| **CUDA Graphs** | No (adds CPU overhead per decode step) | Yes (captures and replays decode graphs) |
| **Continuous Batching**| No (runs single request at a time) | Yes (batches active requests dynamically) |
| **PagedAttention** | No (pre-allocates flat tensors) | Yes (dynamic KV page management) |
| **Decode Loop** | Python-driven loop | Optimized C++ scheduling / engine |

In a production environment, kernel and systems-level optimizations are complementary. The custom Triton kernels developed here could be plugged directly into an engine like vLLM to gain the benefits of PagedAttention and CUDA Graphs.

### When to Stop Optimizing

At 83 tokens/second, we hit the point of diminishing returns for kernel optimization:
- Getting from 16 to 83 tokens/sec required about 1,100 lines of Triton.
- To push past 83 tokens/sec, we would need to eliminate CPU launch overhead using CUDA Graphs (a system orchestration task) or rewrite the kernels in CuTe to squeak out a few percentage points of memory efficiency.

Because the core performance bottleneck is now CPU dispatch and framework overhead, further micro-tuning the Triton kernels wouldn't make sense.

---

## Takeaways

Writing high-performance GPU kernels is mostly about respecting the memory hierarchy:
1. **Keep data in SRAM:** Moving intermediate states back and forth to HBM wastes bandwidth. Keep inputs in shared memory as long as possible.
2. **Minimize kernel launches:** Fusing dependent operations (like residual additions and normalization) cuts launch overhead.
3. **Eliminate Python loops:** Replace sequential execution paths with chunked, parallel CUDA blocks.

### Quick Stats

- **Model:** Qwen 3.5-9B (hybrid GDN + multi-head attention)
- **Hardware:** NVIDIA B200 (192 GB HBM3e)
- **Baseline Speed:** 16 tokens/s
- **Optimized Speed:** 83 tokens/s (5.2x speedup)
- **Triton Code:** ~1,100 lines
- **PyTorch Model Code:** ~983 lines

---

*Built by Darshan Baslani. Kernels prototyped on a GTX 1650, benchmarked on NVIDIA B200 via Modal.*

*Source: [github.com/Darshan-Baslani/qwen3.5-optimized](https://github.com/Darshan-Baslani/qwen3.5-optimized)*

---

# The Feynman GPU Lectures

- URL: https://www.dcbaslani.xyz/blog/gpu_masterclass/
- Markdown: https://www.dcbaslani.xyz/blog/gpu_masterclass/index.md
- Text: https://www.dcbaslani.xyz/blog/gpu_masterclass/index.txt
- Published: 2026-06-05
- Topic: CUDA
- Description: A GPU masterclass that builds from transistors and CUDA cores up through SM architecture, memory systems, Tensor Cores, Hopper, and Blackwell.

# 🎓 The Feynman GPU Lectures

### *"What I cannot create, I do not understand."* ;  Richard Feynman

> **Welcome.** We'll start from the transistor, build up to a CUDA core, wire those into a monster called a Streaming Multiprocessor, and then ride the architectural wave from Volta all the way to Blackwell. I'll give you real code ;  not toy code ;  in CUDA and PTX. By the end, you'll know what `tcgen05.mma` does at the wire level.
>
> No fluff. No marketing. Just physics, logic, and math.

---

## Table of Contents

- [Hour 1 ;  The Transistor to the CUDA Core](#hour-1--the-transistor-to-the-cuda-core)
  - [1.1 The Transistor: Nature's Switch](#11-the-transistor-natures-switch)
  - [1.2 From Transistors to Logic Gates](#12-from-transistors-to-logic-gates)
  - [1.3 From Gates to an ALU](#13-from-gates-to-an-alu)
  - [1.4 The CUDA Core: A Bare-Metal FPU](#14-the-cuda-core-a-bare-metal-fpu)
  - [1.5 SIMT: How 32 Threads Breathe Together](#15-simt-how-32-threads-breathe-together)
  - [1.6 The Streaming Multiprocessor (SM)](#16-the-streaming-multiprocessor-sm)
  - [1.7 Warp Divergence: The Cost of `if`](#17-warp-divergence-the-cost-of-if)
- [Hour 2 ;  Memory: The Real Bottleneck](#hour-2--memory-the-real-bottleneck)
  - [2.1 The Memory Hierarchy: A Map of Latencies](#21-the-memory-hierarchy-a-map-of-latencies)
  - [2.2 Registers: Zero-Cost Storage](#22-registers-zero-cost-storage)
  - [2.3 Shared Memory & Bank Conflicts](#23-shared-memory--bank-conflicts)
  - [2.4 Global Memory & Coalescing](#24-global-memory--coalescing)
  - [2.5 Occupancy: The Art of Hiding Latency](#25-occupancy-the-art-of-hiding-latency)
  - [2.6 The Warp Scheduler: Juggling Latency](#26-the-warp-scheduler-juggling-latency)
- [Hour 3 ;  The Generational Leap: Volta → Ampere → Hopper](#hour-3--the-generational-leap-volta--ampere--hopper)
  - [3.1 Tensor Cores: Why Multiply-Accumulate Is King](#31-tensor-cores-why-multiply-accumulate-is-king)
  - [3.2 Volta (SM70): Where Tensor Cores Were Born](#32-volta-sm70-where-tensor-cores-were-born)
  - [3.3 Ampere (SM80): Async Copy & mbarrier](#33-ampere-sm80-async-copy--mbarrier)
  - [3.4 Deep Dive: mbarrier ;  Phase-Based Synchronization](#34-deep-dive-mbarrier--phase-based-synchronization)
  - [3.5 Hopper (SM90): TMA, Clusters & WGMMA](#35-hopper-sm90-tma-clusters--wgmma)
  - [3.6 WGMMA: The Warpgroup Matrix Machine](#36-wgmma-the-warpgroup-matrix-machine)
- [Hour 4 ;  Blackwell: The Fifth Generation](#hour-4--blackwell-the-fifth-generation)
  - [4.1 Blackwell Architecture Overview](#41-blackwell-architecture-overview)
  - [4.2 Tensor Memory (TMEM): A New Address Space](#42-tensor-memory-tmem-a-new-address-space)
  - [4.3 tcgen05: The Fifth-Gen Tensor Core ISA](#43-tcgen05-the-fifth-gen-tensor-core-isa)
  - [4.4 FP4 and Microscaling: Extreme Precision Engineering](#44-fp4-and-microscaling-extreme-precision-engineering)
  - [4.5 Putting It All Together: A Blackwell GEMM Kernel](#45-putting-it-all-together-a-blackwell-gemm-kernel)
  - [4.6 The Full Picture: Architecture Comparison](#46-the-full-picture-architecture-comparison)
- [Appendix A ;  PTX Quick Reference](#appendix-a--ptx-quick-reference)
- [Appendix B ;  Occupancy Calculator Cheat Sheet](#appendix-b--occupancy-calculator-cheat-sheet)

---

# Hour 1 ;  The Transistor to the CUDA Core

## 1.1 The Transistor: Nature's Switch

Everything begins with a single idea: **a voltage-controlled switch**.

A **MOSFET** (Metal-Oxide-Semiconductor Field-Effect Transistor) has three terminals:

```mermaid
graph LR
    G["Gate (Control)"] --> T["Channel"]
    S["Source"] --> T
    T --> D["Drain"]

    style G fill:#ff6b6b,stroke:#333,color:#fff
    style S fill:#4ecdc4,stroke:#333,color:#fff
    style D fill:#45b7d1,stroke:#333,color:#fff
    style T fill:#f9f9f9,stroke:#333
```

There are two flavors:

| Type | Conducts when Gate is... | Pulls output toward... |
|------|--------------------------|------------------------|
| **NMOS** | HIGH (Vdd) | Ground (0) |
| **PMOS** | LOW (0) | Supply (1) |

> **The key insight**: A transistor is not a mystical device. It's a water faucet. Gate voltage = handle. Current = water. That's it.

Modern GPUs use **FinFET** transistors ;  the "fin" is a 3D gate that wraps around the channel on three sides, giving better electrostatic control at tiny scales:

```mermaid
graph TD
    subgraph "FinFET Process Nodes"
        V["Volta (2017)<br/>TSMC 12nm<br/>21.1B transistors"]
        T["Turing (2018)<br/>TSMC 12nm<br/>18.6B transistors"]
        A["Ampere (2020)<br/>TSMC 7nm<br/>54.2B transistors"]
        H["Hopper (2022)<br/>TSMC 4N<br/>80B transistors"]
        B["Blackwell (2024)<br/>TSMC 4NP<br/>208B transistors"]
    end
    V --> T --> A --> H --> B

    style V fill:#6c5ce7,stroke:#333,color:#fff
    style T fill:#a29bfe,stroke:#333,color:#fff
    style A fill:#00b894,stroke:#333,color:#fff
    style H fill:#fdcb6e,stroke:#333,color:#000
    style B fill:#e17055,stroke:#333,color:#fff
```

> **208 billion transistors** on Blackwell. That's about 26 transistors for every human alive, on a chip the size of your thumbnail (well, two chips fused together).

---

## 1.2 From Transistors to Logic Gates

Every digital circuit is built from three atomic operations. Here's how transistors make them:

### The NOT Gate (Inverter) ;  2 Transistors

```mermaid
graph TD
    VDD["Vdd (+)"] --> PMOS
    PMOS -->|"when Input=0, PMOS ON"| OUT["Output"]
    OUT --> NMOS
    NMOS -->|"when Input=1, NMOS ON"| GND["Ground"]
    INPUT["Input"] -.-> PMOS
    INPUT -.-> NMOS

    style VDD fill:#e74c3c,stroke:#333,color:#fff
    style GND fill:#2c3e50,stroke:#333,color:#fff
    style OUT fill:#f1c40f,stroke:#333,color:#000
    style INPUT fill:#3498db,stroke:#333,color:#fff
```

| Input | PMOS | NMOS | Output |
|-------|------|------|--------|
| 0 | ON | OFF | **1** |
| 1 | OFF | ON | **0** |

### The NAND Gate ;  4 Transistors

The NAND gate is the **universal gate** ;  you can build ANY logic function from NANDs alone.

```
        Vdd
       ┌─┤─┐
   A ──┤P₁ │
       └─┬─┘
       ┌─┤─┐    
   B ──┤P₂ │──── Output (= NOT(A AND B))
       └─┬─┘
       ┌─┤─┐
   A ──┤N₁ │
       └─┬─┘
       ┌─┤─┐
   B ──┤N₂ │
       └─┬─┘
        GND

PMOS: Parallel (either A=0 OR B=0 → output=1)
NMOS: Series  (both A=1 AND B=1 → output=0)
```

| A | B | NAND(A,B) |
|---|---|-----------|
| 0 | 0 | **1** |
| 0 | 1 | **1** |
| 1 | 0 | **1** |
| 1 | 1 | **0** |

> **Why NAND is universal**: `NOT(A) = NAND(A,A)`. `AND(A,B) = NOT(NAND(A,B))`. `OR(A,B) = NAND(NOT(A), NOT(B))`. Every other gate follows. This is why chip designers think in NANDs.

---

## 1.3 From Gates to an ALU

An ALU is just a **network of gates** that performs arithmetic. Let's build one from the bottom up.

### Half Adder ;  The Atom of Arithmetic

```mermaid
graph LR
    A["A"] --> XOR["⊕ XOR"]
    B["B"] --> XOR
    A --> AND["∧ AND"]
    B --> AND
    XOR --> S["Sum"]
    AND --> C["Carry"]

    style A fill:#3498db,stroke:#333,color:#fff
    style B fill:#3498db,stroke:#333,color:#fff
    style XOR fill:#e74c3c,stroke:#333,color:#fff
    style AND fill:#e67e22,stroke:#333,color:#fff
    style S fill:#2ecc71,stroke:#333,color:#fff
    style C fill:#2ecc71,stroke:#333,color:#fff
```

$$Sum = A \oplus B$$
$$Carry = A \wedge B$$

### Full Adder ;  Handles a Carry-In

$$Sum = A \oplus B \oplus C_{in}$$
$$C_{out} = (A \wedge B) \lor (C_{in} \wedge (A \oplus B))$$

**Gate count**: ~9 gates = ~36 transistors per bit.

### 32-bit Adder

Chain 32 full adders. But naive chaining is slow (carry ripples through all 32 bits). GPUs use **Carry-Lookahead Adders (CLA)**:

```mermaid
graph TD
    subgraph "Carry-Lookahead Adder - 32-bit"
        G0["Generate/Propagate<br/>Bits 0-7"] --> CLA1["CLA Block"]
        G1["Generate/Propagate<br/>Bits 8-15"] --> CLA2["CLA Block"]
        G2["Generate/Propagate<br/>Bits 16-23"] --> CLA3["CLA Block"]
        G3["Generate/Propagate<br/>Bits 24-31"] --> CLA4["CLA Block"]
        
        CLA1 --> SUPER["Carry-Lookahead<br/>Super Block"]
        CLA2 --> SUPER
        CLA3 --> SUPER
        CLA4 --> SUPER
    end
    
    style SUPER fill:#e74c3c,stroke:#333,color:#fff
    style CLA1 fill:#3498db,stroke:#333,color:#fff
    style CLA2 fill:#3498db,stroke:#333,color:#fff
    style CLA3 fill:#3498db,stroke:#333,color:#fff
    style CLA4 fill:#3498db,stroke:#333,color:#fff
```

**Key math**: For each bit position *i*:
- **Generate**: $G_i = A_i \wedge B_i$ (this bit produces a carry regardless)
- **Propagate**: $P_i = A_i \oplus B_i$ (this bit passes a carry through)
- **Carry**: $C_i = G_i \lor (P_i \wedge C_{i-1})$

The lookahead trick: expand the recursion so all carries compute in $O(\log n)$ gate delays instead of $O(n)$.

**Total for a 32-bit CLA**: ~200 gates ≈ ~800 transistors. Delay: ~4-5 gate stages.

---

## 1.4 The CUDA Core: A Bare-Metal FPU

Now let's build an actual floating-point unit. An FP32 number (IEEE 754):

```
┌──────┬──────────┬───────────────────────┐
│ Sign │ Exponent │      Mantissa         │
│ 1 bit│  8 bits  │      23 bits          │
└──────┴──────────┴───────────────────────┘

Value = (-1)^sign × 2^(exponent-127) × 1.mantissa
```

### FP32 Multiply Pipeline

To compute $A \times B$:

```mermaid
graph TD
    A["Input A<br/>(sign, exp, mant)"] --> S1
    B["Input B<br/>(sign, exp, mant)"] --> S1
    
    S1["Stage 1: Sign XOR<br/>sign_result = sign_A ⊕ sign_B<br/><i>1 gate</i>"] --> S2
    
    S2["Stage 2: Exponent Add<br/>exp_result = exp_A + exp_B - 127<br/><i>8-bit adder + subtractor</i>"] --> S3
    
    S3["Stage 3: Mantissa Multiply<br/>mant_result = 1.mant_A x 1.mant_B<br/><i>24x24 multiplier array</i>"] --> S4
    
    S4["Stage 4: Normalize<br/>Shift mantissa, adjust exponent<br/><i>Leading-zero detector + barrel shifter</i>"] --> S5
    
    S5["Stage 5: Round<br/>IEEE 754 rounding<br/><i>Round-to-nearest-even logic</i>"] --> R["Result"]

    style S1 fill:#e74c3c,stroke:#333,color:#fff
    style S2 fill:#e67e22,stroke:#333,color:#fff
    style S3 fill:#f1c40f,stroke:#333,color:#000
    style S4 fill:#2ecc71,stroke:#333,color:#fff
    style S5 fill:#3498db,stroke:#333,color:#fff
    style R fill:#9b59b6,stroke:#333,color:#fff
```

The **24×24 mantissa multiplier** is the big one. Using a Wallace tree multiplier: ~2,000-3,000 gates.

**Total transistor count for one FP32 CUDA core**: approximately **8,000–20,000 transistors**.

> **Compare to a CPU core**: An Intel Golden Cove core has ~500 million transistors. A CUDA core has ~15,000. That's a factor of **30,000×**. A CUDA core is *stupidly simple* by design. That's the point.

### What a CUDA Core Does NOT Have

```mermaid
graph LR
    subgraph "CPU Core"
        BP["Branch Predictor<br/>~50K transistors"]
        OOO["Out-of-Order Engine<br/>~200K transistors"]
        ROB["Reorder Buffer<br/>~100K transistors"]
        L1D["32KB L1D Cache<br/>Private"]
        L1I["32KB L1I Cache<br/>Private"]
        TLB["TLB<br/>~1000 entries"]
    end

    subgraph "CUDA Core"
        FPU["FP32 ALU<br/>~15K transistors"]
        DONE["That is it."]
    end

    style BP fill:#e74c3c,stroke:#333,color:#fff
    style OOO fill:#e74c3c,stroke:#333,color:#fff
    style ROB fill:#e74c3c,stroke:#333,color:#fff
    style FPU fill:#2ecc71,stroke:#333,color:#fff
    style DONE fill:#2ecc71,stroke:#333,color:#fff
```

No branch predictor. No out-of-order execution. No speculation. No private cache. Just: **input → ALU → output**. The GPU's philosophy is: *"Why predict branches when you can just run 10,000 threads and always have something useful to do?"*

---

## 1.5 SIMT: How 32 Threads Breathe Together

### The Warp ;  The Atom of GPU Execution

A **warp** is 32 threads that execute the **same instruction** at the **same time** on 32 different data.

This is **SIMT** ;  Single Instruction, Multiple Threads. It's like SIMD (SSE/AVX), but each "lane" has its own registers and can branch independently (since Volta).

```mermaid
graph TD
    subgraph "One Warp - 32 threads"
        PC["Program Counter<br/>(shared)"] --> INST["Instruction: FADD R1, R2, R3"]
        
        INST --> T0["Thread 0<br/>R1_0 = R2_0 + R3_0"]
        INST --> T1["Thread 1<br/>R1_1 = R2_1 + R3_1"]
        INST --> T2["Thread 2<br/>R1_2 = R2_2 + R3_2"]
        INST --> DOTS["..."]
        INST --> T31["Thread 31<br/>R1_31 = R2_31 + R3_31"]
    end

    style PC fill:#e74c3c,stroke:#333,color:#fff
    style INST fill:#f1c40f,stroke:#333,color:#000
    style T0 fill:#3498db,stroke:#333,color:#fff
    style T1 fill:#3498db,stroke:#333,color:#fff
    style T2 fill:#3498db,stroke:#333,color:#fff
    style T31 fill:#3498db,stroke:#333,color:#fff
```

**The hardware implementation**: The warp scheduler fetches ONE instruction. The dispatch unit sends it to 32 CUDA cores simultaneously. Each core reads from its own register bank and writes to its own register bank. One instruction, 32 executions. That's a **32× multiplier on instruction fetch/decode efficiency** compared to a scalar processor.

### Thread Identity

```cuda
// Every thread knows exactly where it is:
int globalId = blockIdx.x * blockDim.x + threadIdx.x;
int warpId   = threadIdx.x / 32;    // which warp within this block
int laneId   = threadIdx.x % 32;    // which lane within the warp (0-31)
```

---

## 1.6 The Streaming Multiprocessor (SM)

The SM is the **fundamental compute unit** of a GPU. Everything above (cores, warps, schedulers) lives inside one SM.

```mermaid
graph TD
    subgraph SM["Streaming Multiprocessor"]
        subgraph Q0["Quadrant 0"]
            WS0["Warp Scheduler 0<br/>+ Dispatch Unit"]
            CORES0["32 x FP32 Cores"]
            TC0["1 x Tensor Core"]
            LS0["8 x Load/Store Units"]
            SFU0["4 x SFU"]
            RF0["16,384 x 32-bit Registers"]
        end
        
        subgraph Q1["Quadrant 1"]
            WS1["Warp Scheduler 1"]
            CORES1["32 x FP32 Cores"]
            TC1["1 x Tensor Core"]
            RF1["16,384 Registers"]
        end
        
        subgraph Q2["Quadrant 2"]
            WS2["Warp Scheduler 2"]
            CORES2["32 x FP32 Cores"]
            TC2["1 x Tensor Core"]
            RF2["16,384 Registers"]
        end
        
        subgraph Q3["Quadrant 3"]
            WS3["Warp Scheduler 3"]
            CORES3["32 x FP32 Cores"]
            TC3["1 x Tensor Core"]
            RF3["16,384 Registers"]
        end
        
        SMEM["Shared Memory / L1 Cache<br/>(Configurable, up to 228 KB on Hopper)"]
        ICACHE["Instruction Cache"]
        CCACHE["Constant Cache"]
        TEX["Texture Units"]
    end

    style SM fill:#1a1a2e,stroke:#e94560,color:#fff
    style Q0 fill:#16213e,stroke:#0f3460,color:#fff
    style Q1 fill:#16213e,stroke:#0f3460,color:#fff
    style Q2 fill:#16213e,stroke:#0f3460,color:#fff
    style Q3 fill:#16213e,stroke:#0f3460,color:#fff
    style SMEM fill:#e94560,stroke:#333,color:#fff
    style WS0 fill:#0f3460,stroke:#333,color:#fff
    style CORES0 fill:#533483,stroke:#333,color:#fff
    style TC0 fill:#e74c3c,stroke:#333,color:#fff
```

### SM Specs Across Generations

| Feature | Volta (SM70) | Ampere (SM80) | Hopper (SM90) | Blackwell (SM100) |
|---------|-------------|---------------|---------------|-------------------|
| FP32 Cores/SM | 64 | 128 | 128 | 128 |
| Tensor Cores/SM | 8 | 4 (2x capable) | 4 (4th gen) | 4 (5th gen) |
| Register File | 256 KB | 256 KB | 256 KB | 256 KB |
| Shared Mem (max) | 96 KB | 164 KB | 228 KB | 228 KB |
| Max Threads/SM | 2048 | 2048 | 2048 | 2048 |
| Max Warps/SM | 64 | 64 | 64 | 64 |
| Warp Schedulers | 4 | 4 | 4 | 4 |
| Max Blocks/SM | 32 | 32 | 32 | 32 |

> **The golden ratio of GPUs**: 2048 threads / 256 KB registers = 128 bytes (32 registers) per thread at 100% occupancy. Every register you add above 32 reduces your occupancy. This is the *fundamental tradeoff* of GPU programming.

---

## 1.7 Warp Divergence: The Cost of `if`

Consider this code:

```cuda
__global__ void divergent_kernel(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx % 2 == 0) {
        data[idx] = expensive_function_A(data[idx]);  // Even threads
    } else {
        data[idx] = expensive_function_B(data[idx]);  // Odd threads
    }
}
```

### What Happens Inside the Warp

```mermaid
graph TD
    subgraph "Warp Execution Timeline"
        T1["Clock 1-10: Execute Branch A<br/>Active Mask: 0x55555555<br/>Threads 0,2,4,...,30 ACTIVE<br/>Threads 1,3,5,...,31 IDLE"]
        T2["Clock 11-20: Execute Branch B<br/>Active Mask: 0xAAAAAAAA<br/>Threads 1,3,5,...,31 ACTIVE<br/>Threads 0,2,4,...,30 IDLE"]
        T3["Clock 21+: Reconverge<br/>Active Mask: 0xFFFFFFFF<br/>All 32 threads ACTIVE"]
    end
    
    T1 --> T2 --> T3

    style T1 fill:#e74c3c,stroke:#333,color:#fff
    style T2 fill:#e67e22,stroke:#333,color:#fff
    style T3 fill:#2ecc71,stroke:#333,color:#fff
```

**Throughput**: Both branches execute sequentially. If each branch takes 10 cycles, the divergent warp takes **20 cycles** instead of 10. That's **50% efficiency**.

**The hardware truth**: Those "idle" CUDA cores aren't powered down ;  they're executing the instruction but their **write-back is masked**. The energy is mostly wasted.

### The Fix: Think in Warps, Not Threads

```cuda
// GOOD: All threads in a warp take the same branch
__global__ void coalesced_kernel(float* data, int N) {
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int laneId = threadIdx.x % 32;

    if (warpId % 2 == 0) {
        // Entire warp goes here ;  no divergence
        data[warpId * 32 + laneId] = expensive_function_A(data[warpId * 32 + laneId]);
    } else {
        // Entire warp goes here ;  no divergence
        data[warpId * 32 + laneId] = expensive_function_B(data[warpId * 32 + laneId]);
    }
}
```

---

# Hour 2 ;  Memory: The Real Bottleneck

> *"The speed of computation doesn't matter if you can't feed the beast."*

## 2.1 The Memory Hierarchy: A Map of Latencies

This is the single most important diagram in GPU programming:

```mermaid
graph TD
    REG["Registers<br/>256 KB per SM<br/>0 cycles - same clock<br/>~20 TB/s aggregate"] --> SMEM
    
    SMEM["Shared Memory<br/>Up to 228 KB per SM<br/>~20-30 cycles<br/>~19 TB/s per SM"] --> L1
    
    L1["L1 Cache<br/>Combined with Shared Mem<br/>~30-40 cycles"] --> L2
    
    L2["L2 Cache<br/>40-96 MB<br/>~200-300 cycles<br/>~6-12 TB/s"] --> HBM
    
    HBM["HBM / GDDR<br/>80-192 GB<br/>~400-800 cycles<br/>2-8 TB/s"] --> PCIE
    
    PCIE["System Memory via PCIe<br/>Host RAM<br/>~10,000+ cycles<br/>64 GB/s"]

    style REG fill:#2ecc71,stroke:#333,color:#fff
    style SMEM fill:#3498db,stroke:#333,color:#fff
    style L1 fill:#9b59b6,stroke:#333,color:#fff
    style L2 fill:#e67e22,stroke:#333,color:#fff
    style HBM fill:#e74c3c,stroke:#333,color:#fff
    style PCIE fill:#c0392b,stroke:#333,color:#fff
```

> **The fundamental equation**: At 1.5 GHz, 400 cycles = **267 nanoseconds**. A single global memory access takes as long as 400 floating-point operations. This is why GPUs need **thousands of threads** ;  to keep the ALUs busy while other threads wait for memory.

Let me put it differently. Say each of your threads needs data from HBM. At 400 cycles per access, you need:

$$\text{Warps needed to hide latency} = \frac{\text{Memory latency}}{\text{Instruction throughput}} = \frac{400 \text{ cycles}}{4 \text{ cycles/instruction}} = 100 \text{ warps}$$

But we only have 64 warps per SM. So we **need** shared memory and caches to reduce effective latency, or we'll always be memory-bound.

---

## 2.2 Registers: Zero-Cost Storage

Registers are the fastest memory in the system. Each SM has **65,536 x 32-bit registers** (256 KB).

```cuda
__global__ void register_demo(float* out, float* in, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // These live in REGISTERS ;  zero-cost access
    float a = in[idx];          // Register R1
    float b = in[idx + N];      // Register R2
    float c = a * b;            // Register R3 = R1 * R2
    float d = c + 1.0f;         // Register R4 = R3 + 1.0
    out[idx] = d;               // Store R4 to memory
}
```

**Register pressure** ;  the critical tradeoff:

```
Total registers per SM:           65,536
Max threads per SM:                2,048
Registers at 100% occupancy:  65,536 / 2,048 = 32 registers per thread

If your kernel uses 64 registers/thread:
  Threads per SM = 65,536 / 64 = 1,024 threads = 32 warps
  Occupancy = 32/64 = 50%

If your kernel uses 128 registers/thread:
  Threads per SM = 65,536 / 128 = 512 threads = 16 warps
  Occupancy = 16/64 = 25%
```

> **Feynman's rule**: *Every register you add above 32 is stealing occupancy from you.* But sometimes that's a good trade ;  more registers means fewer memory accesses. Profile, don't guess.

### PTX Register Usage

```
// PTX: Registers are explicitly declared
.reg .f32 %f<64>;    // 64 FP32 registers: %f0 through %f63
.reg .f64 %fd<16>;   // 16 FP64 registers
.reg .b32 %r<32>;    // 32 32-bit general registers
.reg .b64 %rd<16>;   // 16 64-bit registers (for addresses)
.reg .pred %p<8>;    // 8 predicate registers (for branching)

// Usage:
ld.global.f32 %f0, [%rd0];       // Load from global memory to register
add.f32       %f1, %f0, %f2;     // Add two registers
st.global.f32 [%rd1], %f1;       // Store register to global memory
```

---

## 2.3 Shared Memory & Bank Conflicts

Shared memory is on-chip SRAM, accessible by all threads in a **thread block** (CTA). It's 20-30x faster than global memory.

### Physical Structure: 32 Banks

```mermaid
graph TD
    subgraph "Shared Memory: 32 Banks x 4 Bytes Wide"
        B0["Bank 0<br/>Addr 0, 128, 256..."]
        B1["Bank 1<br/>Addr 4, 132, 260..."]
        B2["Bank 2<br/>Addr 8, 136, 264..."]
        B3["Bank 3<br/>Addr 12, 140, 268..."]
        DOTS1["..."]
        B30["Bank 30<br/>Addr 120, 248..."]
        B31["Bank 31<br/>Addr 124, 252..."]
    end

    subgraph "Warp Threads"
        T0["Thread 0"] --> B0
        T1["Thread 1"] --> B1
        T2["Thread 2"] --> B2
        T3["Thread 3"] --> B3
        T30["Thread 30"] --> B30
        T31["Thread 31"] --> B31
    end

    style B0 fill:#3498db,stroke:#333,color:#fff
    style B1 fill:#2980b9,stroke:#333,color:#fff
    style B2 fill:#2471a3,stroke:#333,color:#fff
    style B3 fill:#1f618d,stroke:#333,color:#fff
    style B30 fill:#1a5276,stroke:#333,color:#fff
    style B31 fill:#154360,stroke:#333,color:#fff
```

**Bank formula**: `bank(address) = (address / 4) % 32`

### The Three Cases

```cuda
__shared__ float smem[1024];

// CASE 1: No conflict ;  stride-1 access (perfect)
float val = smem[threadIdx.x];
// Thread 0 -> Bank 0, Thread 1 -> Bank 1, ..., Thread 31 -> Bank 31
// One cycle. 32 simultaneous reads.

// CASE 2: 2-way conflict ;  stride-2 access
float val = smem[threadIdx.x * 2];
// Thread 0 -> Bank 0, Thread 16 -> Bank 0 (CONFLICT!)
// Thread 1 -> Bank 2, Thread 17 -> Bank 2 (CONFLICT!)
// Two cycles instead of one. 50% throughput.

// CASE 3: 32-way conflict ;  stride-32 access (catastrophic)
float val = smem[threadIdx.x * 32];
// ALL threads -> Bank 0
// 32 cycles instead of 1. 3.125% throughput.

// CASE 4: Broadcast ;  all read SAME address (free!)
float val = smem[0];
// All 32 threads read address 0 -> Bank 0 broadcasts. One cycle.
```

### Bank Conflict Diagram

```mermaid
graph LR
    subgraph "No Conflict - stride 1"
        direction TB
        TA0["T0"] -->|"addr 0"| BA0["Bank 0"]
        TA1["T1"] -->|"addr 4"| BA1["Bank 1"]
        TA2["T2"] -->|"addr 8"| BA2["Bank 2"]
        TA3["T3"] -->|"addr 12"| BA3["Bank 3"]
    end

    subgraph "2-way Conflict - stride 2"
        direction TB
        TB0["T0"] -->|"addr 0"| BB0["Bank 0"]
        TB1["T1"] -->|"addr 8"| BB2["Bank 2"]
        TB16["T16"] -->|"addr 128"| BB0
        TB17["T17"] -->|"addr 136"| BB2
    end

    style BA0 fill:#2ecc71,stroke:#333,color:#fff
    style BA1 fill:#2ecc71,stroke:#333,color:#fff
    style BA2 fill:#2ecc71,stroke:#333,color:#fff
    style BA3 fill:#2ecc71,stroke:#333,color:#fff
    style BB0 fill:#e74c3c,stroke:#333,color:#fff
    style BB2 fill:#e74c3c,stroke:#333,color:#fff
```

### Fixing Bank Conflicts: Padding

```cuda
// Problem: column-major access to a 32-wide array
__shared__ float tile[32][32];   // smem[row][col]
// Accessing column: tile[threadIdx.x][col] -> stride-32 -> 32-way conflict!

// Fix: add padding
__shared__ float tile[32][32 + 1];  // 33 columns
// Now stride is 33, and 33 % 32 = 1 -> perfect stride-1 access!
// One wasted float per row, but 32x throughput improvement.
```

---

## 2.4 Global Memory & Coalescing

Global memory (HBM) is accessed through **128-byte cache lines**. The hardware **coalesces** requests from all 32 threads in a warp into the minimum number of transactions.

```mermaid
graph TD
    subgraph "Coalesced Access - 1 transaction"
        W0["Warp: 32 threads read consecutive floats<br/>Addresses: 0, 4, 8, ..., 124"]
        CL0["Cache Line: 128 bytes at address 0"]
        W0 -->|"1 transaction"| CL0
    end
    
    subgraph "Strided Access - 32 transactions"
        W1["Warp: 32 threads read stride-32<br/>Addresses: 0, 128, 256, ..., 3968"]
        CL1["Cache Line at addr 0"]
        CL2["Cache Line at addr 128"]
        CL3["Cache Line at addr 256"]
        CL32["...32 cache lines total"]
        W1 --> CL1
        W1 --> CL2
        W1 --> CL3
        W1 --> CL32
    end

    style CL0 fill:#2ecc71,stroke:#333,color:#fff
    style CL1 fill:#e74c3c,stroke:#333,color:#fff
    style CL2 fill:#e74c3c,stroke:#333,color:#fff
    style CL3 fill:#e74c3c,stroke:#333,color:#fff
```

### The Coalescing Hardware

Inside each SM's Load/Store Unit (LSU):

1. Collect all 32 addresses from the warp
2. Sort them by 128-byte aligned segment
3. Issue one memory transaction per unique segment
4. Route returned bytes to correct thread registers

```cuda
// Perfectly coalesced ;  1 transaction per warp
__global__ void coalesced(float* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = data[idx];  // Threads 0-31 read addresses 0-124
}

// Strided ;  32 transactions per warp (32x slower)
__global__ void strided(float* data, int stride) {
    int idx = threadIdx.x * stride;
    float val = data[idx];  // Each thread hits a different cache line
}

// Random ;  up to 32 transactions per warp
__global__ void random_access(float* data, int* indices) {
    float val = data[indices[threadIdx.x]];  // Unpredictable addresses
}
```

### Coalescing Math

For a warp accessing addresses $a_0, a_1, \ldots, a_{31}$:

$$\text{Transactions} = \left|\left\lbrace \left\lfloor \frac{a_i}{128} \right\rfloor \mid i \in [0, 31] \right\rbrace\right|$$

i.e., the number of **distinct 128-byte aligned segments** touched.

**Best case**: 1 transaction (128 bytes), all data used → **100% efficiency**

**Worst case**: 32 transactions (4096 bytes), only 128 bytes used → **3.125% efficiency**

---

## 2.5 Occupancy: The Art of Hiding Latency

**Occupancy** = Active warps / Maximum warps per SM

```mermaid
graph TD
    subgraph "Occupancy Limiters"
        REG["Register Usage<br/>65,536 regs / regs per thread x 32<br/>= max warps from registers"]
        SMEM_L["Shared Memory Usage<br/>Max shared mem / smem per block<br/>= max blocks then max warps"]
        BLK["Block Size<br/>Max 32 blocks/SM<br/>Small blocks waste slots"]
    end
    
    REG --> MIN["Take MINIMUM<br/>= Active Warps"]
    SMEM_L --> MIN
    BLK --> MIN
    MIN --> OCC["Occupancy = Active / 64"]
    
    style REG fill:#e74c3c,stroke:#333,color:#fff
    style SMEM_L fill:#3498db,stroke:#333,color:#fff
    style BLK fill:#2ecc71,stroke:#333,color:#fff
    style MIN fill:#f1c40f,stroke:#333,color:#000
    style OCC fill:#9b59b6,stroke:#333,color:#fff
```

### Worked Example

```
Kernel: 256 threads/block, 48 regs/thread, 16 KB shared memory
GPU: SM90 (Hopper)

Register limit:
  65,536 regs / (48 regs x 256 threads) = 65,536 / 12,288 = 5.33 -> 5 blocks
  5 blocks x 256 threads = 1,280 threads = 40 warps

Shared memory limit:
  228 KB / 16 KB = 14.25 -> 14 blocks
  14 blocks x 256 threads = 3,584 -> but max is 2,048 -> 64 warps

Block count limit:
  32 blocks max -> 32 x 256 = 8,192 -> capped at 2,048 -> 64 warps

Final: min(40, 64, 64) = 40 warps -> Occupancy = 40/64 = 62.5%
```

> **Feynman's warning**: *Don't worship occupancy.* Sometimes 50% occupancy with 64 registers per thread beats 100% occupancy with 32 registers, because you avoid expensive memory spills. The occupancy calculator is a guide, not a god.

---

## 2.6 The Warp Scheduler: Juggling Latency

Each SM has **4 warp schedulers**, each managing a pool of warps. Every cycle, each scheduler:

```mermaid
graph TD
    CHECK["Check: Is any warp ready?<br/>Not stalled on memory, barrier, or dependency"] -->|"Yes"| SELECT
    CHECK -->|"No"| STALL["Scheduler stall<br/>wasted cycle"]
    
    SELECT["Select ready warp<br/>Greedy-Then-Oldest policy"] --> FETCH["Fetch instruction<br/>from warp's PC"]
    
    FETCH --> DECODE["Decode + Dispatch<br/>to execution unit"]
    
    DECODE --> EXEC["Execute on 32 CUDA cores<br/>or other unit"]
    
    EXEC --> SCOREBOARD["Update scoreboard<br/>Mark dest regs as pending"]
    
    SCOREBOARD --> CHECK

    style CHECK fill:#3498db,stroke:#333,color:#fff
    style SELECT fill:#2ecc71,stroke:#333,color:#fff
    style FETCH fill:#e67e22,stroke:#333,color:#fff
    style EXEC fill:#e74c3c,stroke:#333,color:#fff
    style SCOREBOARD fill:#9b59b6,stroke:#333,color:#fff
    style STALL fill:#c0392b,stroke:#333,color:#fff
```

### The Scoreboard

Each warp has a **scoreboard** ;  a bit vector tracking which registers have pending writes:

```
Warp 7 Scoreboard:
  R0: ready    R8:  ready    R16: PENDING (waiting for LD)
  R1: ready    R9:  ready    R17: ready
  R2: PENDING  R10: ready    R18: ready
  ...

If next instruction is: ADD R5, R2, R3
  -> R2 is PENDING -> warp 7 is STALLED
  -> Scheduler picks another warp
```

This is why GPUs **need many warps**: every memory access stalls a warp for hundreds of cycles. The scheduler swaps to another warp **in zero cycles** (zero-cost context switching, because all warp state is always resident in registers).

### Latency Hiding Math

For a **memory-bound** kernel:

$$\text{Warps needed} \geq \frac{\text{Memory latency (cycles)}}{\text{Cycles between memory ops}} $$

Example: Memory latency = 400 cycles, one memory op every 8 arithmetic instructions (4 cycles each = 32 cycles):

$$\text{Warps needed} \geq \frac{400}{32} = 12.5 \rightarrow 13 \text{ warps per scheduler}$$

With 4 schedulers: **52 warps minimum** → ~81% occupancy needed.

---

# Hour 3 ;  The Generational Leap: Volta → Ampere → Hopper

## 3.1 Tensor Cores: Why Multiply-Accumulate Is King

Deep learning is dominated by one operation: **matrix multiply-accumulate (MMA)**.

$$D = A \times B + C$$

Where $A$ is $M \times K$, $B$ is $K \times N$, and $C, D$ are $M \times N$.

The arithmetic intensity of matrix multiplication:

$$\text{FLOPs} = 2 \times M \times N \times K$$
$$\text{Bytes loaded} = (M \times K + K \times N + M \times N) \times \text{bytes per element}$$
$$\text{Arithmetic intensity} = \frac{2MNK}{(MK + KN + MN) \times b}$$

For large square matrices ($M = N = K$): intensity $\approx \frac{2K}{3b}$, which **grows with matrix size**. This is why GPUs love large GEMMs ;  they become compute-bound, not memory-bound.

A **Tensor Core** is a specialized circuit that computes a small MMA (e.g., 4x4x4) in a **single clock cycle**, rather than requiring 128 individual FMA (fused multiply-add) operations.

```mermaid
graph LR
    subgraph "Without Tensor Core"
        direction TB
        C1["CUDA Core 0: a00 x b00"]
        C2["CUDA Core 1: a01 x b10"]
        C3["CUDA Core 2: a02 x b20"]
        C4["CUDA Core 3: a03 x b30"]
        ACC["Accumulate: d00 = c00 + Sum"]
        C1 --> ACC
        C2 --> ACC
        C3 --> ACC
        C4 --> ACC
    end

    subgraph "With Tensor Core"
        direction TB
        TC["Tensor Core:<br/>Computes ENTIRE<br/>4x4x4 MMA<br/>in ONE cycle<br/>= 128 FMA ops"]
    end

    style TC fill:#e74c3c,stroke:#333,color:#fff
    style C1 fill:#3498db,stroke:#333,color:#fff
    style C2 fill:#3498db,stroke:#333,color:#fff
    style C3 fill:#3498db,stroke:#333,color:#fff
    style C4 fill:#3498db,stroke:#333,color:#fff
```

---

## 3.2 Volta (SM70): Where Tensor Cores Were Born

**Volta** (2017) introduced the **1st generation Tensor Core** and **independent thread scheduling**.

### Tensor Core V1: 4x4x4 FP16 to FP32

Each Tensor Core computes: $D_{4 \times 4} = A_{4 \times 4} \cdot B_{4 \times 4} + C_{4 \times 4}$

- **Input**: FP16 (A, B)
- **Accumulate**: FP32 (C, D)
- **Per Tensor Core per clock**: 64 FMA operations
- **8 Tensor Cores per SM**: 512 FMA ops/SM/clock

### CUDA: WMMA API

```cuda
#include <mma.h>
using namespace nvcuda;

__global__ void volta_wmma_gemm(half* A, half* B, float* C, float* D,
                                 int M, int N, int K) {
    // Declare fragments for a 16x16x16 tile
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    // Initialize accumulator to zero
    wmma::fill_fragment(c_frag, 0.0f);

    // Compute warp's tile position
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / 32 / (N/16);
    int warpN = (blockIdx.x * blockDim.x + threadIdx.x) / 32 % (N/16);
    int row = warpM * 16;
    int col = warpN * 16;

    // Loop over K dimension in tiles of 16
    for (int k = 0; k < K; k += 16) {
        // Load A and B tiles from global memory
        wmma::load_matrix_sync(a_frag, A + row * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + col, N);

        // Tensor Core MMA: C += A x B
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }

    // Store result
    wmma::store_matrix_sync(D + row * N + col, c_frag, N, wmma::mem_row_major);
}
```

### PTX: wmma.mma.sync

```
// PTX for Volta WMMA (16x16x16, FP16 to FP32)
// Warp-synchronous ;  all 32 threads participate

// Declare register fragments
.reg .f32 %acc<8>;       // 8 accumulator registers per thread (16x16 / 32 threads)
.reg .b32 %a<4>;         // A fragment (packed FP16)
.reg .b32 %b<4>;         // B fragment (packed FP16)

// Load A fragment from shared memory
wmma.load.a.sync.aligned.m16n16k16.shared.row.f16
    {%a0, %a1, %a2, %a3}, [smem_a_addr], stride_a;

// Load B fragment from shared memory
wmma.load.b.sync.aligned.m16n16k16.shared.col.f16
    {%b0, %b1, %b2, %b3}, [smem_b_addr], stride_b;

// Tensor Core MMA
wmma.mma.sync.aligned.m16n16k16.row.col.f32.f16.f16.f32
    {%acc0, %acc1, %acc2, %acc3, %acc4, %acc5, %acc6, %acc7},
    {%a0, %a1, %a2, %a3},
    {%b0, %b1, %b2, %b3},
    {%acc0, %acc1, %acc2, %acc3, %acc4, %acc5, %acc6, %acc7};
```

> **Key insight**: The `wmma.mma.sync` is **warp-synchronous** ;  all 32 threads must execute it together. Each thread contributes a piece of the A, B, and C matrices. The Tensor Core hardware does the reduction internally.

---

## 3.3 Ampere (SM80): Async Copy & mbarrier

**Ampere** (2020) brought three game-changers:
1. **3rd-gen Tensor Cores** (TF32, BF16, FP64, structured sparsity)
2. **cp.async** ;  hardware asynchronous copy from global to shared memory
3. **mbarrier** ;  hardware-accelerated synchronization primitive

### cp.async: Bypass the Register File

Before Ampere, loading from global to shared memory required:

```
Global Memory -> Register -> Shared Memory   (2 steps, registers busy)
```

Ampere's `cp.async` does it in one hardware step:

```
Global Memory -> Shared Memory   (1 step, registers free!)
```

```mermaid
graph LR
    subgraph "Pre-Ampere via registers"
        GM1["Global Memory"] -->|"LDG to register"| REG1["Register File"]
        REG1 -->|"STS to shared"| SM1["Shared Memory"]
    end
    
    subgraph "Ampere cp.async bypass registers"
        GM2["Global Memory"] -->|"cp.async<br/>hardware DMA"| SM2["Shared Memory"]
    end

    style GM1 fill:#e74c3c,stroke:#333,color:#fff
    style REG1 fill:#f1c40f,stroke:#333,color:#000
    style SM1 fill:#3498db,stroke:#333,color:#fff
    style GM2 fill:#e74c3c,stroke:#333,color:#fff
    style SM2 fill:#3498db,stroke:#333,color:#fff
```

### CUDA cp.async

```cuda
#include <cuda_pipeline.h>

__global__ void async_copy_kernel(float4* global_data, int N) {
    __shared__ float4 smem_buffer[2][64];  // Double buffer

    int tid = threadIdx.x;
    int stage = 0;

    // Stage 0: Issue first async copy
    __pipeline_memcpy_async(&smem_buffer[0][tid],
                            &global_data[tid], sizeof(float4));
    __pipeline_commit();

    for (int i = 1; i < N; i++) {
        int next_stage = stage ^ 1;  // Toggle 0 <-> 1

        // Issue async copy for NEXT tile
        __pipeline_memcpy_async(&smem_buffer[next_stage][tid],
                                &global_data[i * 64 + tid], sizeof(float4));
        __pipeline_commit();

        // Wait for CURRENT tile to be ready
        __pipeline_wait_prior(1);  // Wait until only 1 group pending
        __syncthreads();

        // Compute on current tile
        compute(smem_buffer[stage]);

        stage = next_stage;
    }

    // Process last tile
    __pipeline_wait_prior(0);
    __syncthreads();
    compute(smem_buffer[stage]);
}
```

### PTX: cp.async

```
// Copy 16 bytes from global to shared memory, asynchronously
cp.async.ca.shared.global [smem_addr], [global_addr], 16;

// Commit this group of copies
cp.async.commit_group;

// Wait until at most N groups are still pending
cp.async.wait_group N;

// Wait for ALL pending copies
cp.async.wait_all;
```

### Ampere's Structured Sparsity (2:4)

Ampere introduced hardware-accelerated **2:4 structured sparsity**: in every group of 4 elements, at least 2 must be zero.

```
Dense:    [0.5, 0.0, 0.3, 0.0, 0.7, 0.0, 0.1, 0.0]
Sparse:   [0.5, 0.3, 0.7, 0.1]  + metadata: [0,2,0,2]
```

The Tensor Core skips zero multiplications → **2x throughput**.

---

## 3.4 Deep Dive: mbarrier ;  Phase-Based Synchronization

> *"Synchronization is the tax you pay for parallelism. mbarrier reduces that tax."*

### Why Not Just `__syncthreads()`?

`__syncthreads()` is a **sledgehammer**: it forces ALL threads in a block to reach the same point. It knows nothing about asynchronous operations. It's block-scoped only.

**mbarrier** is a **scalpel**: it can track async operations (like `cp.async`), works across CTAs in a cluster, and uses phases for pipelining.

| Feature | `__syncthreads()` | `mbarrier` |
|---------|-------------------|-----------|
| Scope | Block-level only | Flexible (warp, block, cross-CTA in cluster) |
| Async awareness | No | Yes ;  can track async operations |
| Hardware support | Barrier instruction | Dedicated hardware unit in SM |
| Phase tracking | No | Yes ;  alternating phases |
| Completion tracking | Thread arrival only | Thread arrival + async byte count |

### The mbarrier State Machine

```mermaid
stateDiagram-v2
    [*] --> Init: mbarrier.init count=N
    
    Init --> Phase0_Waiting: Phase 0 begins
    
    Phase0_Waiting --> Phase0_Waiting: arrive count--
    Phase0_Waiting --> Phase0_Complete: count reaches 0
    
    Phase0_Complete --> Phase1_Waiting: Phase flips to 1
    
    Phase1_Waiting --> Phase1_Waiting: arrive count--
    Phase1_Waiting --> Phase1_Complete: count reaches 0
    
    Phase1_Complete --> Phase0_Waiting: Phase flips back to 0
```

> **Phase 0 note**: Threads call `mbarrier.arrive()`, async ops arrive automatically. Waiters spin on `try_wait(phase=0)`.
>
> **Phase 1 note**: Same mechanism, opposite phase. Allows pipelining stages.

### The mbarrier Object in Memory

An mbarrier lives in **shared memory** as an 8-byte (64-bit) opaque object:

```
Bits 63-32: Pending async byte count (for cp.async tracking)
Bits 31-16: Arrival count remaining
Bit  0:     Phase bit (alternates 0 and 1)
```

The hardware manages these bits atomically.

### Complete PTX mbarrier Example: Multi-Stage Pipeline

```
// ============================================
// Multi-stage async pipeline using mbarrier
// ============================================

.shared .align 8  .b64 mbar[4];              // 4 mbarrier objects (4 stages)
.shared .align 128 .b8  smem_buf[4][TILE_SZ]; // 4 stage buffers

// --- Initialization (thread 0 only) ---
.reg .pred %is_t0;
setp.eq.u32 %is_t0, %tid, 0;

@%is_t0 mbarrier.init.shared.b64 [mbar + 0],  %block_size;
@%is_t0 mbarrier.init.shared.b64 [mbar + 8],  %block_size;
@%is_t0 mbarrier.init.shared.b64 [mbar + 16], %block_size;
@%is_t0 mbarrier.init.shared.b64 [mbar + 24], %block_size;
bar.sync 0;  // Ensure init is visible to all threads

// --- Prologue: Fill the pipeline ---
// Stage 0: Issue async copy and arrive
cp.async.ca.shared.global [smem_buf + 0 + %tid_offset],
                           [global_addr_0 + %tid_offset], 16;
cp.async.commit_group;
mbarrier.arrive.expect_tx.shared.b64 %state, [mbar + 0], 16;
    // Tell the barrier: "expect 16 more bytes from async ops"

// Stage 1: Issue async copy
cp.async.ca.shared.global [smem_buf + TILE_SZ + %tid_offset],
                           [global_addr_1 + %tid_offset], 16;
cp.async.commit_group;
mbarrier.arrive.expect_tx.shared.b64 %state, [mbar + 8], 16;

// --- Main Loop ---
.reg .u32 %stage;
mov.u32 %stage, 0;

MAIN_LOOP:
    // Wait for current stage data to be ready
    .reg .u64 %mbar_addr;
    // compute %mbar_addr = mbar + (%stage % 4) * 8
    .reg .pred %wait_done;
    
    TRY_WAIT:
    mbarrier.try_wait.parity.shared.b64 %wait_done, [%mbar_addr], %phase;
    @!%wait_done bra TRY_WAIT;
    // Data is ready! Proceed.

    // Issue async copy for stage+2 (prefetch ahead)
    // ... (similar cp.async + mbarrier.arrive.expect_tx) ...

    // Compute on current stage data
    ld.shared.f32 %f0, [%current_buf + %tid_offset];
    // ... computation ...
    st.global.f32 [%output + %tid_offset], %f_result;

    // Advance
    add.u32 %stage, %stage, 1;
    setp.lt.u32 %p_loop, %stage, %num_tiles;
    @%p_loop bra MAIN_LOOP;
```

### mbarrier with TMA (Hopper+)

On Hopper, the **TMA hardware** automatically arrives at an mbarrier when its copy completes ;  no thread intervention needed:

```mermaid
sequenceDiagram
    participant Thread as Warp Thread
    participant TMA as TMA Engine
    participant MBAR as mbarrier
    participant SMEM as Shared Memory
    participant GMEM as Global Memory

    Thread->>MBAR: mbarrier.arrive.expect_tx(bytes)
    Thread->>TMA: cp.async.bulk.tensor [smem], [tensorMap], [mbar]
    Note right of Thread: Thread continues other work!
    TMA->>GMEM: Read tensor tile
    GMEM-->>TMA: Data
    TMA->>SMEM: Write to shared memory
    TMA->>MBAR: Hardware auto-arrive (tx_count -= bytes)
    Note right of MBAR: When all arrivals + TX complete: phase flips
    Thread->>MBAR: mbarrier.try_wait(phase) = complete!
```

---

## 3.5 Hopper (SM90): TMA, Clusters & WGMMA

**Hopper** (2022) was a quantum leap. Three major innovations:

### 1. Thread Block Clusters

A **cluster** is a group of up to 8 CTAs (thread blocks) that are **guaranteed to be co-scheduled** on the same GPC.

```mermaid
graph TD
    subgraph GPC0["GPC 0"]
        subgraph Cluster["Thread Block Cluster"]
            CTA0["CTA 0<br/>SM 0<br/>Shared Mem 0"]
            CTA1["CTA 1<br/>SM 1<br/>Shared Mem 1"]
            CTA2["CTA 2<br/>SM 2<br/>Shared Mem 2"]
            CTA3["CTA 3<br/>SM 3<br/>Shared Mem 3"]
            
            CTA0 <-->|"DSMEM"| CTA1
            CTA1 <-->|"DSMEM"| CTA2
            CTA2 <-->|"DSMEM"| CTA3
            CTA0 <-->|"DSMEM"| CTA3
        end
    end

    style Cluster fill:#1a1a2e,stroke:#e94560,color:#fff
    style CTA0 fill:#16213e,stroke:#0f3460,color:#fff
    style CTA1 fill:#16213e,stroke:#0f3460,color:#fff
    style CTA2 fill:#16213e,stroke:#0f3460,color:#fff
    style CTA3 fill:#16213e,stroke:#0f3460,color:#fff
```

**Distributed Shared Memory (DSMEM)**: CTA 0 can directly read/write CTA 1's shared memory ;  no global memory round-trip needed.

```cuda
__cluster_dims__(4, 1, 1)  // 4 CTAs in a cluster
__global__ void cluster_kernel() {
    // Get cluster info
    auto cluster = cooperative_groups::this_cluster();
    extern __shared__ int smem[];

    // Access another CTA's shared memory directly!
    int target_cta = (cluster.block_rank() + 1) % cluster.num_blocks();
    int* remote_smem = cluster.map_shared_rank(smem, target_cta);

    // Direct cross-CTA shared memory access (via DSMEM)
    int val = *remote_smem;  // No global memory needed!
}
```

### 2. Tensor Memory Accelerator (TMA)

The TMA is a **dedicated hardware engine** in each SM that moves multi-dimensional tensor tiles between global memory and shared memory.

```mermaid
graph LR
    subgraph "Without TMA - Manual Tiling"
        direction TB
        T0["Thread 0: load gmem 0 to reg to smem 0"]
        T1["Thread 1: load gmem 1 to reg to smem 1"]
        T2["Thread 2: load gmem 2 to reg to smem 2"]
        DOTS["...128 threads doing address math"]
    end

    subgraph "With TMA - 1 thread 1 instruction"
        direction TB
        TMA["TMA Engine:<br/>cp.async.bulk.tensor.2d<br/>[smem], [tensorMap, coords], [mbar]<br/><br/>Handles:<br/>- Address calculation<br/>- Bounds checking<br/>- Swizzling<br/>- Format conversion<br/>- mbarrier notification"]
    end

    style TMA fill:#e74c3c,stroke:#333,color:#fff
    style T0 fill:#95a5a6,stroke:#333,color:#fff
    style T1 fill:#95a5a6,stroke:#333,color:#fff
    style T2 fill:#95a5a6,stroke:#333,color:#fff
```

### TMA Setup (Host Side)

```cuda
// Create a tensor map descriptor on the host
CUtensorMap tensorMap;
CUresult result = cuTensorMapEncodeTiled(
    &tensorMap,
    CU_TENSOR_MAP_DATA_TYPE_FLOAT16,    // Element type
    2,                                    // Number of dimensions
    globalPtr,                            // Base pointer in global memory
    globalDim,                            // {rows, cols} of global tensor
    globalStrides,                        // {stride_row_bytes, stride_col_bytes}
    boxDim,                               // {tile_rows, tile_cols} to transfer
    elementStrides,                       // Element strides
    CU_TENSOR_MAP_INTERLEAVE_NONE,       // Interleaving
    CU_TENSOR_MAP_SWIZZLE_128B,          // Swizzle pattern
    CU_TENSOR_MAP_L2_PROMOTION_L2_128B,  // L2 policy
    CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE    // Out-of-bounds handling
);
```

### TMA PTX (Device Side)

```
// Load a 2D tile from global to shared memory
// Only ONE thread needs to execute this ;  TMA does all the work
.reg .pred %is_leader;
setp.eq.u32 %is_leader, %tid, 0;

@%is_leader cp.async.bulk.tensor.2d.shared::cluster.global.tile
    .mbarrier::complete_tx::bytes
    [smem_addr],                    // Destination in shared memory
    [tensorMapPtr, {%coord_x, %coord_y}],  // Source: tensor map + coordinates
    [mbar_addr];                    // mbarrier to signal on completion

// All threads wait for TMA to complete
WAIT:
mbarrier.try_wait.parity.shared.b64 %wait_result, [mbar_addr], %phase;
@!%wait_result bra WAIT;
```

### TMA Multicast (Load once, deliver to multiple CTAs)

```
// Load tile and multicast to CTAs 0, 1, 2, 3 in a cluster
@%is_leader cp.async.bulk.tensor.2d.shared::cluster.global.tile
    .mbarrier::complete_tx::bytes
    .multicast::cluster
    [smem_addr],
    [tensorMapPtr, {%coord_x, %coord_y}],
    [mbar_addr],
    %ctaMask;    // Bitmask: 0b1111 = CTAs 0,1,2,3
```

> **Why TMA is revolutionary**: Without TMA, 128 threads waste time computing addresses, doing bounds checks, and issuing individual loads. With TMA, **one instruction** replaces all of that. The freed-up threads do useful compute instead.

---

## 3.6 WGMMA: The Warpgroup Matrix Machine

**WGMMA** (Warpgroup Matrix Multiply-Accumulate) is Hopper's Tensor Core interface. It operates at the **warpgroup** level: 4 warps (128 threads) acting as a unit.

### Why Warpgroups?

```mermaid
graph TD
    subgraph "Volta/Ampere: Warp-level MMA"
        W0["1 Warp = 32 threads<br/>wmma.mma.sync 16x16x16<br/>= 8,192 FLOPs"]
    end
    
    subgraph "Hopper: Warpgroup-level MMA"
        WG["4 Warps = 128 threads<br/>wgmma.mma_async 64x256x16<br/>= 524,288 FLOPs<br/><br/>64x more work per instruction!"]
    end

    style W0 fill:#3498db,stroke:#333,color:#fff
    style WG fill:#e74c3c,stroke:#333,color:#fff
```

### WGMMA Key Innovation

**Operand B can come directly from shared memory** ;  no need to load it into registers first:

```mermaid
graph LR
    subgraph "Volta wmma"
        SMEM1["Shared Mem"] -->|"load to reg"| REG1["Registers A"]
        SMEM1 -->|"load to reg"| REG2["Registers B"]
        REG1 --> TC1["Tensor Core"]
        REG2 --> TC1
    end
    
    subgraph "Hopper wgmma"
        SMEM2["Shared Mem B"] -->|"direct read"| TC2["Tensor Core"]
        REG3["Registers A"] --> TC2
    end

    style TC1 fill:#3498db,stroke:#333,color:#fff
    style TC2 fill:#e74c3c,stroke:#333,color:#fff
```

> **Key**: B reads shared memory directly! This saves register bandwidth.

### WGMMA PTX

```
// ==========================================
// Hopper WGMMA: 64x256x16 FP16 to FP32
// ==========================================

// Step 1: Fence ;  ensure memory ordering
wgmma.fence.sync.aligned;

// Step 2: Issue the MMA
// desc_a = 64-bit descriptor for A matrix in shared memory
// desc_b = 64-bit descriptor for B matrix in shared memory
wgmma.mma_async.sync.aligned.m64n256k16.f32.f16.f16
    {%acc0,  %acc1,  %acc2,  %acc3,
     %acc4,  %acc5,  %acc6,  %acc7,
     %acc8,  %acc9,  %acc10, %acc11,
     %acc12, %acc13, %acc14, %acc15,
     %acc16, %acc17, %acc18, %acc19,
     %acc20, %acc21, %acc22, %acc23,
     %acc24, %acc25, %acc26, %acc27,
     %acc28, %acc29, %acc30, %acc31,
     %acc32, %acc33, %acc34, %acc35,
     %acc36, %acc37, %acc38, %acc39,
     %acc40, %acc41, %acc42, %acc43,
     %acc44, %acc45, %acc46, %acc47,
     %acc48, %acc49, %acc50, %acc51,
     %acc52, %acc53, %acc54, %acc55,
     %acc56, %acc57, %acc58, %acc59,
     %acc60, %acc61, %acc62, %acc63,
     %acc64, %acc65, %acc66, %acc67,
     %acc68, %acc69, %acc70, %acc71,
     %acc72, %acc73, %acc74, %acc75,
     %acc76, %acc77, %acc78, %acc79,
     %acc80, %acc81, %acc82, %acc83,
     %acc84, %acc85, %acc86, %acc87,
     %acc88, %acc89, %acc90, %acc91,
     %acc92, %acc93, %acc94, %acc95,
     %acc96, %acc97, %acc98, %acc99,
     %acc100,%acc101,%acc102,%acc103,
     %acc104,%acc105,%acc106,%acc107,
     %acc108,%acc109,%acc110,%acc111,
     %acc112,%acc113,%acc114,%acc115,
     %acc116,%acc117,%acc118,%acc119,
     %acc120,%acc121,%acc122,%acc123,
     %acc124,%acc125,%acc126,%acc127},
    %desc_a,        // 64-bit shared memory descriptor for A
    %desc_b,        // 64-bit shared memory descriptor for B
    1,              // Scale D (multiply accumulator by 1)
    1, 1,           // Scale A, Scale B (negate flags)
    0, 0;           // Transpose A, Transpose B

// Step 3: Commit ;  mark this MMA group for tracking
wgmma.commit_group.sync.aligned;

// Step 4: Wait ;  block until MMA completes
wgmma.wait_group.sync.aligned 0;  // 0 = wait for all groups

// Now %acc0..%acc127 contain the 64x256 result matrix
```

### WGMMA Descriptor Layout

The 64-bit descriptor encodes the shared memory tile location:

```
Bits 63-49: Reserved
Bits 48-46: Leading dimension mode
Bits 45-32: Base address offset (in shared memory, 16B aligned)
Bits 31-16: Leading dimension byte offset
Bits 15-4:  Stride dimension byte offset
Bits 3-0:   Swizzle mode
```

### Complete Hopper GEMM Pattern

```mermaid
sequenceDiagram
    participant TMA as TMA Engine
    participant SMEM as Shared Memory
    participant WG as Warpgroup of 4 warps
    participant TC as Tensor Cores

    Note over TMA,TC: === Stage k (double-buffered) ===

    WG->>TMA: Issue TMA load for tile k+1
    TMA->>SMEM: Async bulk copy (tile k+1)

    WG->>WG: wgmma.fence
    WG->>TC: wgmma.mma_async (on tile k data)
    WG->>WG: wgmma.commit_group

    TMA-->>SMEM: TMA arrives at mbarrier
    WG->>WG: Wait on mbarrier (tile k+1 ready)
    WG->>WG: wgmma.wait_group (tile k MMA done)

    Note over TMA,TC: === Stage k+1 ===
    WG->>TMA: Issue TMA load for tile k+2
    WG->>TC: wgmma.mma_async (on tile k+1 data)
```

---

# Hour 4 ;  Blackwell: The Fifth Generation

## 4.1 Blackwell Architecture Overview

**Blackwell** (2024) ;  compute capability **sm_100** (data center) and **sm_120** (desktop).

```mermaid
graph TD
    subgraph GB100["GB100 Full Die"]
        direction TB
        SMs["192 SMs<br/>24,576 CUDA Cores<br/>768 Tensor Cores - 5th gen"]
        L2["96 MB L2 Cache"]
        HBM["192 GB HBM3e<br/>8 TB/s bandwidth"]
        NVLINK["18x NVLink 5<br/>1.8 TB/s bidirectional"]

        subgraph PerSM["Per SM"]
            CUDA_B["128 FP32 CUDA Cores"]
            TC_B["4x 5th Gen Tensor Cores"]
            TMEM_B["TENSOR MEMORY - TMEM<br/>NEW! Dedicated TC storage"]
            TMA_B["TMA Enhanced"]
            RF_B["256 KB Register File"]
            SMEM_B["228 KB Shared Mem / L1"]
            MBAR_B["16 mbarrier slots"]
        end
    end

    style SMs fill:#e74c3c,stroke:#333,color:#fff
    style L2 fill:#e67e22,stroke:#333,color:#fff
    style HBM fill:#f1c40f,stroke:#333,color:#000
    style NVLINK fill:#2ecc71,stroke:#333,color:#fff
    style TMEM_B fill:#e74c3c,stroke:#fff,color:#fff
    style TC_B fill:#c0392b,stroke:#333,color:#fff
```

### What's New in Blackwell

| Feature | Hopper | Blackwell | Improvement |
|---------|--------|-----------|-------------|
| Transistors | 80B | 208B | 2.6x |
| SMs | 132 | 192 | 1.45x |
| Tensor Core gen | 4th | 5th | New ISA |
| Peak FP4 (sparse) | N/A | ~9 PFLOPS | New! |
| Peak FP8 | 990 TFLOPS | ~1800 TFLOPS | ~1.8x |
| HBM Bandwidth | 3.35 TB/s | 8 TB/s | 2.4x |
| L2 Cache | 60 MB | 96 MB | 1.6x |
| NVLink BW | 900 GB/s | 1.8 TB/s | 2x |
| Key TC instruction | wgmma | **tcgen05** | New paradigm |
| New memory space | ;  | **TMEM** | New! |
| New data types | FP8 | **FP4, MX formats** | New! |

---

## 4.2 Tensor Memory (TMEM): A New Address Space

This is the single biggest architectural change in Blackwell. **TMEM** is a new, dedicated memory space that lives inside the SM, separate from both registers and shared memory.

```mermaid
graph TD
    subgraph "Hopper Memory Spaces"
        REG_H["Register File<br/>256 KB<br/>Accumulator lives HERE"]
        SMEM_H["Shared Memory<br/>228 KB<br/>Operands A, B"]
        GMEM_H["Global Memory<br/>HBM3"]
    end
    
    subgraph "Blackwell Memory Spaces"
        REG_BW["Register File<br/>256 KB<br/>General computation"]
        TMEM_BW["TMEM - NEW<br/>Accumulator lives HERE<br/>Dedicated TC bandwidth"]
        SMEM_BW["Shared Memory<br/>228 KB<br/>Operands A, B"]
        GMEM_BW["Global Memory<br/>HBM3e"]
    end

    style REG_H fill:#e74c3c,stroke:#333,color:#fff
    style SMEM_H fill:#3498db,stroke:#333,color:#fff
    style TMEM_BW fill:#e74c3c,stroke:#fff,color:#fff
    style REG_BW fill:#f1c40f,stroke:#333,color:#000
    style SMEM_BW fill:#3498db,stroke:#333,color:#fff
```

### Why TMEM Exists

On Hopper, WGMMA results accumulate in **registers**. But the register file has limited bandwidth ;  reading/writing 128 accumulator registers per warpgroup per MMA creates a bottleneck. The register file is also used for general computation.

TMEM solves this by giving the Tensor Cores their own **private, high-bandwidth storage**:

- **Accumulators live in TMEM**, not registers
- The Tensor Core reads/writes TMEM directly with dedicated pathways
- The register file is freed for other work (address computation, control flow)
- TMEM has **higher bandwidth** to the Tensor Cores than the register file

### TMEM Access Patterns

```
// TMEM is accessed ONLY via tcgen05 instructions:

// Load from shared memory INTO TMEM
tcgen05.ld.16x256b [tmem_addr], [smem_addr];

// Store from TMEM to shared memory
tcgen05.st.16x256b [smem_addr], [tmem_addr];

// MMA writes results directly to TMEM
tcgen05.mma ... [tmem_addr], ...;  // Accumulator in TMEM

// You CANNOT:
// - Load TMEM with regular LD instructions
// - Access TMEM from CUDA core ALUs directly
// - Use TMEM for general-purpose storage
```

---

## 4.3 tcgen05: The Fifth-Gen Tensor Core ISA

**tcgen05** = **T**ensor **C**ore **Gen**eration **05** ;  the instruction set for Blackwell's 5th-gen Tensor Cores.

### Instruction Overview

```mermaid
graph TD
    subgraph "tcgen05 Instruction Family"
        MMA["tcgen05.mma<br/>Matrix Multiply-Accumulate<br/>The main compute instruction"]
        LD["tcgen05.ld<br/>Load shared mem to TMEM"]
        ST["tcgen05.st<br/>Store TMEM to shared mem"]
        FENCE["tcgen05.fence<br/>Memory ordering"]
        COMMIT["tcgen05.commit<br/>Signal completion via mbarrier"]
    end
    
    LD -->|"Load operands"| MMA
    MMA -->|"Results in TMEM"| ST
    MMA -.->|"Order"| FENCE
    MMA -.->|"Track completion"| COMMIT

    style MMA fill:#e74c3c,stroke:#333,color:#fff
    style LD fill:#3498db,stroke:#333,color:#fff
    style ST fill:#2ecc71,stroke:#333,color:#fff
    style FENCE fill:#e67e22,stroke:#333,color:#fff
    style COMMIT fill:#9b59b6,stroke:#333,color:#fff
```

### tcgen05.mma ;  The Core Instruction

```
tcgen05.mma.cta_group::{1,2}.kind [d_tmem], a_desc, b_desc, idesc, enable;
```

Let's break down every operand:

| Operand | Description |
|---------|-------------|
| `cta_group::{1,2}` | How many CTAs collaborate. `1` = single CTA. `2` = two CTAs in a cluster work together on one MMA. |
| `kind` | Data type combination: `.f16`, `.tf32`, `.f8f6f4`, `.mxf8f6f4`, etc. |
| `[d_tmem]` | Destination address in **TMEM** ;  where the accumulator result is stored |
| `a_desc` | 64-bit descriptor for matrix A (shared memory location, layout, size) |
| `b_desc` | 64-bit descriptor for matrix B (shared memory location, layout, size) |
| `idesc` | Immediate descriptor ;  encodes the MMA shape (M, N, K dimensions) |
| `enable` | Predicate ;  allows conditional execution |

### PTX Example: tcgen05.mma

```
// =============================================
// Blackwell tcgen05.mma: FP16 Matrix Multiply
// =============================================

// Prerequisites: A and B tiles in shared memory, TMEM allocated

// Step 1: Fence ;  ensure previous TC ops are ordered
tcgen05.fence::before_thread_sync;

// Step 2: Issue the MMA
// Compute D[TMEM] = A[smem] x B[smem] + D[TMEM]
tcgen05.mma.cta_group::1.kind::f16
    [%tmem_d],          // Accumulator destination in TMEM
    %desc_a,            // 64-bit descriptor for A tile (in shared memory)
    %desc_b,            // 64-bit descriptor for B tile (in shared memory)
    %idesc,             // Immediate descriptor (shape info)
    %enable;            // Predicate enable

// Step 3: Commit ;  signal completion via mbarrier
tcgen05.commit.cta_group::1.mbarrier::arrive [%mbar_addr];

// Step 4: Wait ;  block until MMA completes
// (Check the mbarrier that tcgen05.commit arrived at)
TRY_WAIT_TC:
mbarrier.try_wait.parity.shared.b64 %done, [%mbar_addr], %phase;
@!%done bra TRY_WAIT_TC;
```

### tcgen05 Load/Store ;  Moving Data To/From TMEM

```
// Load: Shared Memory to TMEM
// Transfers a 16x256-bit tile (16 rows, each 256 bits = 32 bytes)
tcgen05.ld.16x256b [%tmem_addr], [%smem_addr];

// Store: TMEM to Shared Memory
tcgen05.st.16x256b [%smem_addr], [%tmem_addr];
```

### tcgen05 vs WGMMA Comparison

```mermaid
graph LR
    subgraph "Hopper WGMMA Flow"
        direction TB
        H1["1. Load A from smem to registers"]
        H2["2. wgmma.mma_async: A reg, B smem to acc reg"]
        H3["3. wgmma.wait: acc ready in registers"]
        H4["4. Store registers to smem/gmem"]
        H1 --> H2 --> H3 --> H4
    end
    
    subgraph "Blackwell tcgen05 Flow"
        direction TB
        B1["1. TMA loads A,B to smem"]
        B2["2. tcgen05.mma: A smem, B smem to acc TMEM"]
        B3["3. tcgen05.commit + mbar wait"]
        B4["4. tcgen05.st: TMEM to smem"]
        B1 --> B2 --> B3 --> B4
    end

    style H1 fill:#3498db,stroke:#333,color:#fff
    style H2 fill:#3498db,stroke:#333,color:#fff
    style H3 fill:#3498db,stroke:#333,color:#fff
    style H4 fill:#3498db,stroke:#333,color:#fff
    style B1 fill:#e74c3c,stroke:#333,color:#fff
    style B2 fill:#e74c3c,stroke:#333,color:#fff
    style B3 fill:#e74c3c,stroke:#333,color:#fff
    style B4 fill:#e74c3c,stroke:#333,color:#fff
```

**Key difference**: In Blackwell, the accumulator **never touches the register file** during the MMA. It lives entirely in TMEM. This frees registers for address computation, control flow, and other work. The register file is no longer the bottleneck for Tensor Core throughput.

### CTA-Group MMA: Two CTAs, One MMA

```
// Two CTAs in a cluster collaborate on a single, larger MMA
tcgen05.mma.cta_group::2.kind::f16
    [%tmem_d], %desc_a, %desc_b, %idesc, %enable;

// Both CTAs contribute portions of A/B and receive portions of D
// The hardware coordinates the data sharing via DSMEM
// Result: larger effective tile size = better utilization
```

---

## 4.4 FP4 and Microscaling: Extreme Precision Engineering

### FP4: E2M1 Format

```
+------+----------+----------+
| Sign | Exponent | Mantissa |
| 1 bit|  2 bits  |  1 bit   |
+------+----------+----------+

Representable values:
  0, +/-0.5, +/-1.0, +/-1.5, +/-2.0, +/-3.0, +/-4.0, +/-6.0

Exponent bias = 1
Value = (-1)^s x 2^(e-1) x (1 + m/2)
      = (-1)^s x 2^(e-1) x {1.0, 1.5}
```

**Only 16 representable values!** How can this possibly work?

### Microscaling (MX) Formats

The trick: **block-level scaling factors**. Instead of each element having its own exponent, a **block** of elements shares one scale:

```mermaid
graph TD
    subgraph "Standard FP16"
        E0["elem 0: full 16-bit float"]
        E1["elem 1: full 16-bit float"]
        E2["elem 2: full 16-bit float"]
        E3["elem 3: full 16-bit float"]
    end
    
    subgraph "MXFP4 Microscaling"
        SCALE["Shared Scale Factor<br/>8-bit E8M0 exponent<br/>Per block of 32 elements"]
        M0["elem 0: 4-bit"]
        M1["elem 1: 4-bit"]
        M2["elem 2: 4-bit"]
        M31["elem 31: 4-bit"]
        
        SCALE --> M0
        SCALE --> M1
        SCALE --> M2
        SCALE --> M31
    end

    style SCALE fill:#e74c3c,stroke:#333,color:#fff
    style M0 fill:#3498db,stroke:#333,color:#fff
    style M1 fill:#3498db,stroke:#333,color:#fff
    style M2 fill:#3498db,stroke:#333,color:#fff
    style M31 fill:#3498db,stroke:#333,color:#fff
```

**Effective value**: $v_i = \text{scale} \times \text{fp4}_i = 2^{\text{shared\\_exp}} \times (-1)^{s_i} \times 2^{(e_i - 1)} \times (1 + m_i/2)$

This gives FP4 the **dynamic range** of ~FP16 with the **storage cost** of 4 bits + amortized scale overhead.

### Why This Matters for Transformers

In transformer inference:
- Attention scores and FFN weights have **relatively uniform magnitudes within blocks**
- A shared scale factor captures the block-level magnitude
- Individual FP4 values capture the relative differences
- The 2nd-gen **Transformer Engine** automatically manages these scales

**Result**: 2x the throughput of FP8, with minimal accuracy loss.

### MX Format Variants on Blackwell

| Format | Element bits | Scale | Elements/block | Use case |
|--------|-------------|-------|----------------|----------|
| MXFP8 | 8 | E8M0 | 32 | High accuracy |
| MXFP6 | 6 | E8M0 | 32 | Balanced |
| MXFP4 | 4 | E8M0 | 32 | Max throughput |

---

## 4.5 Putting It All Together: A Blackwell GEMM Kernel

Here's the complete data flow for a high-performance GEMM on Blackwell:

```mermaid
graph TD
    subgraph "Global Memory HBM3e"
        GA["Matrix A<br/>M x K"]
        GB["Matrix B<br/>K x N"]
    end

    subgraph "TMA Engine"
        TMA_A["TMA Load A tile"]
        TMA_B["TMA Load B tile"]
    end

    subgraph "Shared Memory"
        SA["A tile buffer<br/>double-buffered"]
        SB["B tile buffer<br/>double-buffered"]
    end

    subgraph "5th Gen Tensor Core"
        MMA["tcgen05.mma"]
    end

    subgraph "TMEM"
        ACC["Accumulator D<br/>Lives in TMEM,<br/>never touches registers!"]
    end

    subgraph "Output"
        SC["Result tile in smem"]
        GD["Matrix D: M x N<br/>Global Memory"]
    end

    GA -->|"TMA descriptor"| TMA_A
    GB -->|"TMA descriptor"| TMA_B
    TMA_A -->|"async bulk copy"| SA
    TMA_B -->|"async bulk copy"| SB
    SA -->|"a_desc"| MMA
    SB -->|"b_desc"| MMA
    MMA -->|"accumulate"| ACC
    ACC -->|"tcgen05.st"| SC
    SC -->|"TMA store"| GD

    style MMA fill:#e74c3c,stroke:#333,color:#fff
    style ACC fill:#c0392b,stroke:#333,color:#fff
    style TMA_A fill:#2ecc71,stroke:#333,color:#fff
    style TMA_B fill:#2ecc71,stroke:#333,color:#fff
```

### CUDA/PTX Pseudocode: Blackwell GEMM

```cuda
// =============================================
// Blackwell GEMM Kernel (Conceptual Structure)
// Uses: TMA + tcgen05.mma + TMEM + mbarrier
// =============================================

#define TILE_M 128
#define TILE_N 256
#define TILE_K 64
#define STAGES 4

__cluster_dims__(2, 1, 1)  // 2 CTAs per cluster
__global__ void blackwell_gemm(
    const __grid_constant__ CUtensorMap tensorMapA,
    const __grid_constant__ CUtensorMap tensorMapB,
    half* D, int M, int N, int K)
{
    extern __shared__ char smem[];

    // Partition shared memory: double-buffered A & B tiles + mbarriers
    constexpr int TILE_A_BYTES = TILE_M * TILE_K * sizeof(half);
    constexpr int TILE_B_BYTES = TILE_K * TILE_N * sizeof(half);

    half* smem_A[2] = {
        (half*)(smem),
        (half*)(smem + TILE_A_BYTES)
    };
    half* smem_B[2] = {
        (half*)(smem + 2 * TILE_A_BYTES),
        (half*)(smem + 2 * TILE_A_BYTES + TILE_B_BYTES)
    };

    // mbarrier array for pipeline stages
    __shared__ __align__(8) uint64_t mbar_load[STAGES];
    __shared__ __align__(8) uint64_t mbar_compute;

    // =============================================
    // Initialize mbarriers
    // =============================================
    if (threadIdx.x == 0) {
        for (int i = 0; i < STAGES; i++)
            asm volatile("mbarrier.init.shared.b64 [%0], %1;"
                :: "l"(&mbar_load[i]), "r"(1));
        asm volatile("mbarrier.init.shared.b64 [%0], %1;"
            :: "l"(&mbar_compute), "r"(blockDim.x));
    }
    __syncthreads();

    // Compute block tile coordinates
    int block_m = blockIdx.x * TILE_M;
    int block_n = blockIdx.y * TILE_N;

    // =============================================
    // PROLOGUE: Fill pipeline with first tiles
    // =============================================
    if (threadIdx.x == 0) {
        // TMA load first A and B tiles
        asm volatile(
            "cp.async.bulk.tensor.2d.shared::cluster.global.tile"
            ".mbarrier::complete_tx::bytes"
            " [%0], [%1, {%2, %3}], [%4];"
            :: "l"(smem_A[0]),
               "l"(&tensorMapA), "r"(0), "r"(block_m),
               "l"(&mbar_load[0])
        );
        asm volatile(
            "cp.async.bulk.tensor.2d.shared::cluster.global.tile"
            ".mbarrier::complete_tx::bytes"
            " [%0], [%1, {%2, %3}], [%4];"
            :: "l"(smem_B[0]),
               "l"(&tensorMapB), "r"(block_n), "r"(0),
               "l"(&mbar_load[0])
        );
    }

    // =============================================
    // MAIN LOOP: Iterate over K dimension
    // =============================================
    int num_k_tiles = K / TILE_K;

    for (int k_tile = 0; k_tile < num_k_tiles; k_tile++) {
        int buf = k_tile % 2;

        // Wait for current tile's TMA load to complete
        asm volatile(
            "{\n\t"
            ".reg .pred p;\n\t"
            "WAIT_LOAD_%=:\n\t"
            "mbarrier.try_wait.parity.shared.b64 p, [%0], %1;\n\t"
            "@!p bra WAIT_LOAD_%=;\n\t"
            "}\n\t"
            :: "l"(&mbar_load[k_tile % STAGES]),
               "r"(k_tile / STAGES % 2)
        );

        // Issue TMA load for NEXT tile (prefetch)
        if (threadIdx.x == 0 && k_tile + 1 < num_k_tiles) {
            int next_k = (k_tile + 1) * TILE_K;
            int next_buf = (k_tile + 1) % 2;
            // TMA load next A and B tiles...
            // (similar to prologue, targeting smem_A/B[next_buf])
        }

        // ----- tcgen05 MMA -----
        asm volatile(
            "tcgen05.fence::before_thread_sync;\n\t"

            "tcgen05.mma.cta_group::1.kind::f16"
            " [%0], %1, %2, %3, 1;\n\t"

            "tcgen05.commit.cta_group::1"
            ".mbarrier::arrive [%4];\n\t"
            :: "l"(/* tmem_addr */),
               "l"(/* desc_a */),
               "l"(/* desc_b */),
               "r"(/* idesc */),
               "l"(&mbar_compute)
        );

        // Wait for MMA to complete
        asm volatile(
            "{\n\t"
            ".reg .pred p;\n\t"
            "WAIT_MMA_%=:\n\t"
            "mbarrier.try_wait.parity.shared.b64 p, [%0], %1;\n\t"
            "@!p bra WAIT_MMA_%=;\n\t"
            "}\n\t"
            :: "l"(&mbar_compute), "r"(k_tile % 2)
        );
    }

    // =============================================
    // EPILOGUE: Store results from TMEM
    // =============================================

    // Move accumulator from TMEM to shared memory
    asm volatile(
        "tcgen05.st.16x256b [%0], [%1];\n\t"
        :: "l"(/* smem_result */), "l"(/* tmem_addr */)
    );
    __syncthreads();

    // Store from shared memory to global memory
    // (each thread stores its portion)
    int tid = threadIdx.x;
    int elems_per_thread = (TILE_M * TILE_N) / blockDim.x;
    half* smem_out = (half*)(smem);  // reuse shared memory
    for (int i = 0; i < elems_per_thread; i++) {
        int idx = tid * elems_per_thread + i;
        int row = block_m + idx / TILE_N;
        int col = block_n + idx % TILE_N;
        if (row < M && col < N) {
            D[row * N + col] = smem_out[idx];
        }
    }
}
```

---

## 4.6 The Full Picture: Architecture Comparison

### Evolution of GPU Tensor Core Programming

```mermaid
graph TD
    V["Volta 2017<br/>wmma.mma.sync<br/>Warp-level: 32 threads<br/>16x16x16 tiles<br/>FP16 only<br/>Data: Reg to TC to Reg"]

    A["Ampere 2020<br/>mma.sync + cp.async<br/>Warp-level: 32 threads<br/>Async global to shared copy<br/>mbarrier sync<br/>TF32, BF16, FP64, Sparsity<br/>Data: Reg to TC to Reg"]

    H["Hopper 2022<br/>wgmma.mma_async<br/>Warpgroup-level: 128 threads<br/>TMA hardware engine<br/>Thread Block Clusters<br/>DSMEM, FP8<br/>Data: Smem/Reg to TC to Reg"]

    B["Blackwell 2024<br/>tcgen05.mma<br/>CTA-group level: 1-2 CTAs<br/>TMEM new address space<br/>FP4, MX formats<br/>Enhanced TMA<br/>CTA-group collaboration<br/>Data: Smem to TC to TMEM"]

    V --> A --> H --> B

    style V fill:#6c5ce7,stroke:#333,color:#fff
    style A fill:#00b894,stroke:#333,color:#fff
    style H fill:#fdcb6e,stroke:#333,color:#000
    style B fill:#e17055,stroke:#333,color:#fff
```

### Peak Performance Across Generations

| Generation | FP16 Tensor (TFLOPS) | FP8 Tensor (TFLOPS) | FP4 Tensor (TFLOPS) | HBM BW (TB/s) |
|-----------|---------------------|---------------------|---------------------|---------------|
| V100 (Volta) | 125 | ;  | ;  | 0.9 |
| A100 (Ampere) | 312 | ;  | ;  | 2.0 |
| H100 (Hopper) | 990 | 1,979 | ;  | 3.35 |
| B200 (Blackwell) | ~1,800 | ~3,600 | ~9,000 (sparse) | 8.0 |

### The Memory Wall: Compute Grows Faster Than Memory

```mermaid
graph TD
    COMPUTE["Compute Growth:<br/>~4x per generation<br/>V100 to B200: ~72x"] --> GAP
    MEMORY["Memory BW Growth:<br/>~2x per generation<br/>V100 to B200: ~9x"] --> GAP
    GAP["The Gap:<br/>Compute outpaces memory ~8:1<br/>We must reduce data movement!"]
    
    GAP --> SOL1["1. Lower precision<br/>FP16 to FP8 to FP4 = less data"]
    GAP --> SOL2["2. Larger on-chip memory<br/>96 to 228 KB smem"]
    GAP --> SOL3["3. TMA<br/>Eliminates wasted thread work"]
    GAP --> SOL4["4. TMEM<br/>Eliminates register file bottleneck"]
    GAP --> SOL5["5. Async everything<br/>Overlap compute and memory"]

    style GAP fill:#e74c3c,stroke:#333,color:#fff
    style SOL1 fill:#2ecc71,stroke:#333,color:#fff
    style SOL2 fill:#2ecc71,stroke:#333,color:#fff
    style SOL3 fill:#2ecc71,stroke:#333,color:#fff
    style SOL4 fill:#2ecc71,stroke:#333,color:#fff
    style SOL5 fill:#2ecc71,stroke:#333,color:#fff
```

---

# Appendix A ;  PTX Quick Reference

### Memory Spaces

```
// PTX memory space qualifiers:
.global     // GPU HBM / GDDR (slowest, largest)
.shared     // Per-SM shared memory
.local      // Per-thread local memory (actually in global, cached)
.const      // Constant memory (cached, broadcast)
.param      // Kernel parameters
.reg        // Registers (fastest)
// Blackwell only:
// TMEM: accessed via tcgen05 instructions (not a PTX address space qualifier)
```

### Essential Instructions

```
// ===== Arithmetic =====
add.f32     %f0, %f1, %f2;      // f0 = f1 + f2
mul.f32     %f0, %f1, %f2;      // f0 = f1 * f2
fma.rn.f32  %f0, %f1, %f2, %f3; // f0 = f1 * f2 + f3 (fused, round-nearest)
mad.lo.u32  %r0, %r1, %r2, %r3; // r0 = r1 * r2 + r3 (integer)

// ===== Memory =====
ld.global.f32    %f0, [%rd0];          // Load from global
st.global.f32    [%rd0], %f0;          // Store to global
ld.shared.f32    %f0, [smem_addr];     // Load from shared
st.shared.f32    [smem_addr], %f0;     // Store to shared

// ===== Async Copy (Ampere+) =====
cp.async.ca.shared.global [dst], [src], 16;  // 16 bytes, global to shared
cp.async.commit_group;                        // Commit group
cp.async.wait_group N;                        // Wait for Nth group

// ===== TMA (Hopper+) =====
cp.async.bulk.tensor.2d.shared::cluster.global.tile
    .mbarrier::complete_tx::bytes
    [smem], [tensorMap, {x, y}], [mbar];      // TMA 2D tile load

// ===== mbarrier =====
mbarrier.init.shared.b64 [addr], count;                // Initialize
mbarrier.arrive.shared.b64 state, [addr];              // Arrive
mbarrier.arrive.expect_tx.shared.b64 state, [addr], tx_count; // Arrive + expect async bytes
mbarrier.try_wait.parity.shared.b64 pred, [addr], phase;     // Non-blocking wait

// ===== Tensor Core (Volta/Ampere) =====
wmma.load.a.sync.aligned.m16n16k16.shared.row.f16
    {regs}, [addr], stride;
wmma.mma.sync.aligned.m16n16k16.row.col.f32.f16.f16.f32
    {d_regs}, {a_regs}, {b_regs}, {c_regs};

// ===== WGMMA (Hopper) =====
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n256k16.f32.f16.f16
    {acc_regs}, desc_a, desc_b, scale_d, scale_a, scale_b, trans_a, trans_b;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned N;

// ===== tcgen05 (Blackwell) =====
tcgen05.fence::before_thread_sync;
tcgen05.mma.cta_group::1.kind::f16
    [tmem_d], desc_a, desc_b, idesc, enable;
tcgen05.commit.cta_group::1.mbarrier::arrive [mbar];
tcgen05.ld.16x256b [tmem], [smem];
tcgen05.st.16x256b [smem], [tmem];

// ===== Warp Shuffle (all generations) =====
shfl.sync.bfly.b32 %r0, %r1, lane_mask, 0x1f;  // Butterfly shuffle
shfl.sync.down.b32 %r0, %r1, delta, 0x1f;       // Shift down
shfl.sync.up.b32   %r0, %r1, delta, 0x0;        // Shift up
shfl.sync.idx.b32  %r0, %r1, src_lane, 0x1f;    // Indexed shuffle

// ===== Control Flow =====
setp.lt.f32  %p0, %f0, %f1;     // p0 = (f0 < f1)
@%p0 bra     TARGET;             // Conditional branch
bar.sync     0;                  // __syncthreads()
```

---

# Appendix B ;  Occupancy Calculator Cheat Sheet

### Quick Reference Table (SM90 ;  Hopper)

| Regs/Thread | Max Threads/SM | Max Warps | Occupancy | Notes |
|-------------|---------------|-----------|-----------|-------|
| 16 | 2048 | 64 | 100% | Very few regs ;  likely register spills |
| 32 | 2048 | 64 | 100% | Sweet spot for simple kernels |
| 48 | 1536 | 48 | 75% | |
| 64 | 1024 | 32 | 50% | Common for complex kernels |
| 80 | 768 | 24 | 37.5% | |
| 96 | 640 | 20 | 31.25% | |
| 128 | 512 | 16 | 25% | Register-heavy kernel |
| 255 | 256 | 8 | 12.5% | Maximum registers allowed |

### Formulas

$$\text{Warps from registers} = \left\lfloor \frac{65536}{\text{regs/thread} \times 32} \right\rfloor$$

$$\text{Warps from shared mem} = \min\left( \left\lfloor \frac{\text{max smem per SM}}{\text{smem per block}} \right\rfloor \times \frac{\text{threads per block}}{32}, \; 64 \right)$$

$$\text{Occupancy} = \frac{\min(\text{warps from regs}, \text{warps from smem}, \text{warps from blocks})}{64}$$

---

> *"The thing that makes physics beautiful is that it's simple. The same is true of GPUs ;  once you understand that everything is about hiding latency and maximizing throughput, the whole architecture falls into place like a jigsaw puzzle. Every generation, NVIDIA finds a new piece of latency to hide. Registers were the bottleneck for Tensor Cores? Invent TMEM. Threads wasted on address math? Invent TMA. Synchronization too coarse? Invent mbarrier. The physics hasn't changed ;  electrons through transistors. But the engineering is magnificent."*
>
> ;  Your friend, Dick Feynman (if he wrote GPU code)

---

*End of the Four-Hour Lecture. Go build something.*