# Hello, MMA — Your First Tensor Core Instruction

- Source: https://www.dcbaslani.xyz/blog/06_hello_mma/
- Published: 2026-03-03
- Topic: CUDA
- Description: How to use CuTe's TiledMMA to execute a matrix multiply-accumulate on NVIDIA Tensor Cores.

---

# Hello, MMA ;  Your First Tensor Core Instruction

**Difficulty:** Intermediate
**Prerequisites:** [Tutorial 01: Hello, Layout!](https://www.dcbaslani.xyz/blog/01_hello_layout/), [Tutorial 04: The Parallel Copy](https://www.dcbaslani.xyz/blog/04_the_parallel_copy/) (same partition API pattern)

## 1. The Problem (The "Why")

Tutorials 03–05 were about *moving data* ;  getting floats from global memory into shared memory, fast and conflict-free. But at some point, you need to actually *compute*. On modern NVIDIA GPUs, the fastest math unit isn't the CUDA core ;  it's the **Tensor Core**.

A Tensor Core executes a matrix multiply-accumulate (MMA) as a single hardware instruction. The SM80 `mma.sync.aligned.m16n8k16` instruction multiplies a 16×16 half-precision matrix by an 8×16 half-precision matrix and accumulates into a 16×8 float matrix ;  **2048 multiply-adds in a single instruction**, across all 32 threads of a warp. (Note: this instruction has a latency of 20+ cycles; "single instruction" means one issue slot, not one clock cycle. See *Hardware note ;  throughput* below.)

The catch: you can't just pass pointers. The hardware expects A, B, and C fragments to be distributed across 32 threads in a **very specific register layout**. Thread 0 holds certain rows and columns of A, thread 7 holds others, and the MMA instruction reaches into all 32 threads' registers simultaneously. Get the distribution wrong and you get garbage ;  or a hardware trap.

CuTe's `TiledMMA` handles this distribution for you, using the **exact same partition pattern** you learned with `TiledCopy`:

| TiledCopy (Tutorial 04) | TiledMMA (This Tutorial) |
| :--- | :--- |
| `make_tiled_copy(Copy_Atom, thr, val)` | `make_tiled_mma(MMA_Atom<Op>{})` |
| `get_thread_slice(tid)` | `get_thread_slice(tid)` |
| `partition_S` / `partition_D` | `partition_A` / `partition_B` / `partition_C` |
| `copy(tiled_copy, src, dst)` | `gemm(tiled_mma, fragA, fragB, fragC)` |

Same API shape, different hardware backend. If you understood Tutorial 04's ownership maps and partitions, you already know 80% of this tutorial.

> **B200 Note:** On Hopper/Blackwell, the same `TiledMMA` pattern drives WGMMA (Warpgroup MMA) ;  128 threads feeding a larger Tensor Core instruction. Tutorial 09 covers that. The API is identical; only the atom changes.

## 2. The Mental Model (The Visual)

### The Stamping Press

Think of a Tensor Core as a **stamping press** in a factory. You load raw-material trays (A and B fragments) into specific slots, press the button, and out comes the product (C accumulator). The press is fixed-size ;  it always stamps a 16×8 tile of results from 16×16 and 8×16 inputs.

```text
                        ┌─────────────────────────────┐
                        │      STAMPING PRESS          │
                        │   (One MMA Instruction)      │
                        │                              │
   ┌──────────────┐     │                              │     ┌──────────┐
   │  A (16×16)   │────▶│   32 threads cooperate:      │────▶│ C (16×8) │
   │   half       │     │   each loads its tray slot,  │     │  float   │
   │  256 values  │     │   hardware does the rest     │     │ 128 vals │
   └──────────────┘     │                              │     └──────────┘
   ┌──────────────┐     │   2048 multiply-adds         │
   │  B (8×16)    │────▶│   in a SINGLE instruction    │
   │   half       │     │                              │
   │  128 values  │     └─────────────────────────────-┘
   └──────────────┘
```

### Per-Thread Fragment Sizes

The 256 values of A, 128 values of B, and 128 values of C are split evenly across 32 threads:

```text
Thread #7 (one of 32):
┌────────────────────────────────────────────────────┐
│  Registers:                                        │
│    A fragment: 8 half values   (256 total / 32)    │
│    B fragment: 4 half values   (128 total / 32)    │
│    C fragment: 4 float values  (128 total / 32)    │
│                                                    │
│  The hardware knows exactly which matrix cells     │
│  these 8+4+4 values correspond to.                 │
│  CuTe knows too ;  that's what partition_A/B/C do.  │
└────────────────────────────────────────────────────┘
```

You don't need to know the exact register-to-coordinate mapping (it's in the PTX docs if you're curious). CuTe's partition functions handle it transparently ;  just like `TiledCopy` handles thread-to-element mapping without you memorizing a table.

### The API Flow

```text
make_tiled_mma(MMA_Atom<Op>{})
         │
    get_thread_slice(tid)
         │
    ┌────┴──────────────────┬──────────────────┐
    │                       │                  │
partition_A(sA)       partition_B(sB)     partition_C(gC)
    │                       │                  │
make_fragment_like    make_fragment_like   make_fragment_like
    │                       │                  │
copy(smem → reg)      copy(smem → reg)     clear(accum)
    │                       │                  │
    └───────────┬───────────┘                  │
                │                              │
        gemm(tiled_mma, fragA, fragB, accum)───┘
                              │
                     copy(accum → gC)
```

## 3. The Solution (The Code)

Two kernels: first, an ownership map that shows which thread owns which C element (same technique as Tutorial 04). Second, the actual micro-GEMM: load A and B into shared memory, partition them for the MMA, execute one Tensor Core instruction, and write C back to global memory.

```cpp
#include <cute/tensor.hpp>
#include <cute/arch/mma_sm80.hpp>
#include <cute/atom/mma_atom.hpp>
#include <cute/algorithm/gemm.hpp>

#include <cuda_fp16.h>
#include <cstdio>

using namespace cute;

// ─── Kernel 1: MMA Ownership Map ───
// Which thread owns which elements of the C matrix?
__global__ void mma_ownership_map()
{
    using MMA_Op = SM80_16x8x16_F32F16F16F32_TN;
    auto tiled_mma = make_tiled_mma(MMA_Atom<MMA_Op>{});

    // Shared memory to stamp thread IDs into C positions
    __shared__ int smem_C[16 * 8];
    auto sC_layout = make_layout(make_shape(Int<16>{}, Int<8>{}));
    auto sC = make_tensor(make_smem_ptr(smem_C), sC_layout);

    // Clear
    for (int i = threadIdx.x; i < 128; i += blockDim.x)
        smem_C[i] = -1;
    __syncthreads();

    // Each thread stamps its ID into its owned C cells
    auto thr_mma = tiled_mma.get_thread_slice(threadIdx.x);
    auto tCsC = thr_mma.partition_C(sC);

    for (int i = 0; i < size(tCsC); ++i)
        tCsC(i) = threadIdx.x;
    __syncthreads();

    // Thread 0 prints the map
    if (threadIdx.x == 0) {
        printf("MMA C Ownership Map (16x8) ;  SM80_16x8x16_F32F16F16F32_TN\n");
        printf("Each thread holds 4 float accumulators.\n\n");

        printf("       ");
        for (int n = 0; n < 8; ++n) printf("n=%-4d", n);
        printf("\n");

        for (int m = 0; m < 16; ++m) {
            printf("m=%-4d ", m);
            for (int n = 0; n < 8; ++n)
                printf("T%02d   ", sC(m, n));
            printf("\n");
        }
    }
}

// ─── Kernel 2: Micro-GEMM ───
// One warp computes C(16×8) += A(16×16) × B(8×16)^T using a single MMA instruction.
__global__ void micro_gemm(half const* A_ptr, half const* B_ptr, float* C_ptr)
{
    using MMA_Op = SM80_16x8x16_F32F16F16F32_TN;
    auto tiled_mma = make_tiled_mma(MMA_Atom<MMA_Op>{});

    // ── Shared memory ──
    __shared__ half smem_A[16 * 16];   // M × K
    __shared__ half smem_B[8 * 16];    // N × K

    // A layout: (M,K) = (16,16), K stride-1 (required by TN atom)
    auto sA_layout = make_layout(make_shape(Int<16>{}, Int<16>{}),
                                 make_stride(Int<16>{}, Int<1>{}));
    // B layout: (N,K) = (8,16), K stride-1
    auto sB_layout = make_layout(make_shape(Int<8>{}, Int<16>{}),
                                 make_stride(Int<16>{}, Int<1>{}));
    // C layout: (M,N) = (16,8), column-major
    auto gC_layout = make_layout(make_shape(Int<16>{}, Int<8>{}));

    auto sA = make_tensor(make_smem_ptr(smem_A), sA_layout);
    auto sB = make_tensor(make_smem_ptr(smem_B), sB_layout);
    auto gC = make_tensor(make_gmem_ptr(C_ptr),  gC_layout);

    // ── 1. Load A, B from global to shared (simple loop ;  not TiledCopy) ──
    auto gA = make_tensor(make_gmem_ptr(A_ptr), sA_layout);
    auto gB = make_tensor(make_gmem_ptr(B_ptr), sB_layout);

    for (int i = threadIdx.x; i < size(sA); i += blockDim.x)
        sA(i) = gA(i);
    for (int i = threadIdx.x; i < size(sB); i += blockDim.x)
        sB(i) = gB(i);
    __syncthreads();

    // ── 2. Partition A, B, C for this thread ──
    auto thr_mma = tiled_mma.get_thread_slice(threadIdx.x);

    auto tCsA = thr_mma.partition_A(sA);    // (MMA, MMA_M, MMA_K) = (8, 1, 1)
    auto tCsB = thr_mma.partition_B(sB);    // (MMA, MMA_N, MMA_K) = (4, 1, 1)
    auto tCgC = thr_mma.partition_C(gC);    // (MMA, MMA_M, MMA_N) = (4, 1, 1)

    // ── 3. Create register fragments ──
    auto tCrA = make_fragment_like(tCsA);   // 8 half regs for A
    auto tCrB = make_fragment_like(tCsB);   // 4 half regs for B
    auto accum = make_fragment_like(tCgC);  // 4 float regs for C
    clear(accum);

    // ── 4. Copy A, B from shared memory to registers ──
    copy(tCsA, tCrA);
    copy(tCsB, tCrB);

    // ── 5. Execute the MMA! ──
    gemm(tiled_mma, tCrA, tCrB, accum);

    // ── 6. Write result back to global memory ──
    copy(accum, tCgC);
}

int main()
{
    constexpr int M = 16, N = 8, K = 16;

    // ─── 1. Ownership Map ───
    printf("=== MMA Ownership Map ===\n\n");
    mma_ownership_map<<<1, 32>>>();
    cudaDeviceSynchronize();

    // ─── 2. Micro-GEMM ───
    printf("\n=== Micro-GEMM: C(16x8) = A(16x16) * B(8x16)^T ===\n\n");

    // Test data: A = all ones, B = all ones → C[m][n] = K = 16.0
    half h_A[M * K], h_B[N * K];
    float h_C[M * N] = {};

    for (int i = 0; i < M * K; ++i) h_A[i] = __float2half(1.0f);
    for (int i = 0; i < N * K; ++i) h_B[i] = __float2half(1.0f);

    half *d_A, *d_B;
    float *d_C;
    cudaMalloc(&d_A, M * K * sizeof(half));
    cudaMalloc(&d_B, N * K * sizeof(half));
    cudaMalloc(&d_C, M * N * sizeof(float));

    cudaMemcpy(d_A, h_A, M * K * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, N * K * sizeof(half), cudaMemcpyHostToDevice);

    micro_gemm<<<1, 32>>>(d_A, d_B, d_C);
    cudaDeviceSynchronize();

    cudaMemcpy(h_C, d_C, M * N * sizeof(float), cudaMemcpyDeviceToHost);

    // C is column-major: element (m,n) is at offset m + M*n
    printf("C (16x8) ;  each entry = K * 1.0 * 1.0 = 16.0:\n\n");
    printf("       ");
    for (int n = 0; n < N; ++n) printf("n=%-6d", n);
    printf("\n");
    for (int m = 0; m < M; ++m) {
        printf("m=%-4d ", m);
        for (int n = 0; n < N; ++n)
            printf("%-8.1f", h_C[m + M * n]);
        printf("\n");
    }

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
```

**Expected Output:**

```text
=== MMA Ownership Map ===

MMA C Ownership Map (16x8) ;  SM80_16x8x16_F32F16F16F32_TN
Each thread holds 4 float accumulators.

       n=0   n=1   n=2   n=3   n=4   n=5   n=6   n=7
m=0    T00   T00   T04   T04   T08   T08   T12   T12
m=1    T01   T01   T05   T05   T09   T09   T13   T13
m=2    T02   T02   T06   T06   T10   T10   T14   T14
m=3    T03   T03   T07   T07   T11   T11   T15   T15
m=4    T16   T16   T20   T20   T24   T24   T28   T28
m=5    T17   T17   T21   T21   T25   T25   T29   T29
m=6    T18   T18   T22   T22   T26   T26   T30   T30
m=7    T19   T19   T23   T23   T27   T27   T31   T31
m=8    T00   T00   T04   T04   T08   T08   T12   T12
m=9    T01   T01   T05   T05   T09   T09   T13   T13
m=10   T02   T02   T06   T06   T10   T10   T14   T14
m=11   T03   T03   T07   T07   T11   T11   T15   T15
m=12   T16   T16   T20   T20   T24   T24   T28   T28
m=13   T17   T17   T21   T21   T25   T25   T29   T29
m=14   T18   T18   T22   T22   T26   T26   T30   T30
m=15   T19   T19   T23   T23   T27   T27   T31   T31

=== Micro-GEMM: C(16x8) = A(16x16) * B(8x16)^T ===

C (16x8) ;  each entry = K * 1.0 * 1.0 = 16.0:

       n=0     n=1     n=2     n=3     n=4     n=5     n=6     n=7
m=0    16.0    16.0    16.0    16.0    16.0    16.0    16.0    16.0
m=1    16.0    16.0    16.0    16.0    16.0    16.0    16.0    16.0
...
m=15   16.0    16.0    16.0    16.0    16.0    16.0    16.0    16.0
```

The ownership map reveals the hardware's register distribution: threads 0–3 and 16–19 alternate in 4-row blocks along M, and each thread owns pairs of adjacent columns. Thread T00 holds C[0][0], C[0][1], C[8][0], C[8][1] ;  four floats scattered across two 4-row blocks. You don't need to memorize this pattern; `partition_C` handles it automatically.

The GEMM result is 16.0 everywhere: each entry is the dot product of a length-16 all-ones row from A with a length-16 all-ones row from B. One instruction, 2048 multiply-adds, 32 threads cooperating through their registers. (The instruction has 20+ cycles of latency, but it occupies only one issue slot ;  so pipelining multiple MMAs is how real kernels hide that cost.)

## 4. Step-by-Step Explanation

**Line: `using MMA_Op = SM80_16x8x16_F32F16F16F32_TN;`**

This names the PTX instruction we want: `mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32`. The naming convention:

| Part | Meaning |
| :--- | :--- |
| `SM80` | Ampere architecture (compute capability 8.0+) |
| `16x8x16` | M × N × K ;  the tile dimensions of one MMA instruction |
| `F32F16F16F32` | Types: D=float, A=half, B=half, C=float (D, A, B, C order) |
| `TN` | A is K-major (Transposed), B is K-major (Normal) |

The "TN" layout is the most common: both A and B have their K dimension as stride-1. This means your shared memory layouts for A(M,K) and B(N,K) should both have K as the fast-moving axis.

**Line: `auto tiled_mma = make_tiled_mma(MMA_Atom<MMA_Op>{});`**

Wraps the raw hardware instruction in CuTe's `MMA_Atom`, then promotes it to a `TiledMMA`. With no extra tiling arguments, this creates the simplest possible MMA ;  one atom, one warp, no repetition. The tile size is exactly 16×8×16.

Compare with Tutorial 04's `make_tiled_copy(Copy_Atom, thr_layout, val_layout)`. For MMA, the thread layout and value layout are determined by the hardware instruction itself ;  there's nothing to choose. The atom *is* the complete plan.

**Lines: `sA_layout` with `make_stride(Int<16>{}, Int<1>{})`**

A is (M, K) = (16, 16) with K stride-1. This means `sA(m, k) = m * 16 + k * 1` ;  moving along K is contiguous in memory. The TN atom requires this: the PTX instruction's "row" descriptor for A means K is the fast index.

If you accidentally used column-major (M stride-1, K stride-16), the MMA would read the wrong values ;  you'd be feeding columns of A where it expects rows.

**Lines: `sB_layout` with `make_stride(Int<16>{}, Int<1>{})`**

B is (N, K) = (8, 16), also K stride-1. Same reason: the TN atom's "col" descriptor for B also means K is the fast index. Both A and B want K contiguous.

> This is why "TN" is popular: when A and B are both K-major, your global memory loads can be coalesced along the K dimension. You'll see this pattern again in Tutorial 07's full GEMM.

**Line: `auto tCsA = thr_mma.partition_A(sA);  // (MMA, MMA_M, MMA_K) = (8, 1, 1)`**

`partition_A` asks: "Given the full shared memory tensor `sA` and the MMA's register layout, which elements belong to this thread?" It returns a 3-mode tensor:

- **Mode 0 (MMA):** The 8 half values this thread feeds into one atom invocation.
- **Mode 1 (MMA_M):** How many times the atom repeats along M. Our A has M=16 and the atom handles M=16, so 1 repetition.
- **Mode 2 (MMA_K):** How many times the atom repeats along K. Our K=16 matches the atom's K=16, so 1 repetition.

Compare with Tutorial 04: `partition_S` returned `(ValuesPerAtom, Reps...)`. Same idea, different dimension names.

**Line: `auto tCrA = make_fragment_like(tCsA);`**

Creates a register tensor with the same shape as the smem partition: `(8, 1, 1)` of half-precision values. These 8 registers are this thread's personal tray ;  the slot in the stamping press that it's responsible for loading.

**Line: `copy(tCsA, tCrA);`**

Copies this thread's 8 half values from shared memory to registers. Each thread reads from different smem addresses (determined by `partition_A`), so there's no conflict. After this line, all 32 threads have their A fragments loaded and ready.

**Line: `gemm(tiled_mma, tCrA, tCrB, accum);`**

This is the magic line. `gemm()` dispatches the PTX `mma.sync` instruction, which:

1. Reads A fragments from all 32 threads' registers (8 halfs each = 256 total = 16×16 matrix).
2. Reads B fragments from all 32 threads' registers (4 halfs each = 128 total = 8×16 matrix).
3. Multiplies the 16×16 matrix by the 8×16 matrix (as C += A × B^T).
4. Accumulates the 16×8 result into all 32 threads' C registers (4 floats each = 128 total).

This is a single instruction, but **not** a single clock cycle ;  the MMA has a latency of 20+ cycles. However, it only occupies one issue slot, so the warp scheduler can issue other independent instructions (or another MMA from a different warp) while the Tensor Core is still working. Real GEMM kernels pipeline multiple MMAs to hide this latency. The `.sync` in the PTX name means all threads in the warp participate simultaneously ;  this is a collective operation, not a per-thread one.

**Line: `copy(accum, tCgC);`**

Copies the 4 float accumulator values from registers to their corresponding positions in global memory C. Each thread writes to different addresses (determined by `partition_C`), so no conflicts. The host then reads back the complete 16×8 result.

## 5. Engineer's Notebook (Latent Space Notes)

**Analogy:** `MMA_Atom` is a **stamping press** ;  a fixed-size hardware machine that takes raw-material trays (A and B register fragments), processes them all at once, and stamps out a product (C accumulator). Each worker (thread) loads their assigned tray slot. `TiledMMA` is the factory floor plan that coordinates the workers. `gemm()` is pressing the button.

This analogy extends the Tutorial 03–04 warehouse metaphor: data arrives from the truck (global memory) via the loading dock (TiledCopy + smem), and now the stamping press (TiledMMA) processes it in the factory. The same workers (threads) do both jobs ;  they just switch from mover role to machine-operator role.

**Fragment size formula:**

For any MMA atom with shape M×N×K using W threads:

| Fragment | Per-thread values | Total |
| :--- | :--- | :--- |
| A | M × K / W | M × K |
| B | N × K / W | N × K |
| C | M × N / W | M × N |

For SM80_16x8x16 with W=32: A = 8 halfs, B = 4 halfs, C = 4 floats.

**What "TN" means for your layouts:**

| Layout | A (M, K) | B (N, K) |
| :--- | :--- | :--- |
| **TN** | K stride-1 `(K, 1)` | K stride-1 `(K, 1)` |
| TT | K stride-1 `(K, 1)` | N stride-1 `(1, N)` |
| NN | M stride-1 `(1, M)` | N stride-1 `(1, N)` |
| NT | M stride-1 `(1, M)` | K stride-1 `(K, 1)` |

TN is the most common because both A and B have K contiguous ;  global memory loads along the K-reduction dimension are coalesced.

**The ownership map tells you something important:**

Thread T00 owns C[0][0], C[0][1], C[8][0], C[8][1]. These four values are *not* contiguous in column-major memory (offsets 0, 16, 8, 24). This means writing the C accumulator to global memory via `copy(accum, tCgC)` produces scattered stores ;  each thread writes 4 separate floats to non-contiguous addresses. For a real GEMM kernel (Tutorial 07), the epilogue stage would use a TiledCopy to reorganize the writes through shared memory for coalesced global stores.

**Hardware note ;  checking your GPU:**

The SM80 MMA atom requires compute capability 8.0 or higher (A100, RTX 3090, etc.). To check:

```bash
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
# Should print 8.0 or higher
```

**Hardware note ;  latency vs. throughput:**

A single `mma.sync.m16n8k16` has a **latency** of 20+ clock cycles ;  the result isn't ready to read for that long. But the **maximum issue frequency** (throughput) is much higher than one-per-20-cycles, because the Tensor Core is pipelined. The exact throughput depends on the GPU architecture:

| GPU | Max issue frequency (per SM partition) |
| :--- | :--- |
| A100 (SM80) | 1 MMA per 8 cycles |
| Consumer Ampere (e.g. RTX 3090) | 1 MMA per 16 cycles |
| Some other consumer GPUs | 1 MMA per 32 cycles |

On A100, each SM has 4 partitions (sub-partitions / warp schedulers), so the SM-wide throughput is one MMA per 2 cycles, or 2048 FMAs every 2 cycles. Multiply by 108 SMs and a 1.41 GHz clock and you get the advertised Tensor Core TFLOPS ;  but only if the data pipeline (Tutorials 03–05) keeps the Tensor Cores fed and you have enough independent MMAs in flight to hide the 20+ cycle latency.

> **Gotcha ;  layout mismatch:** If your smem layout doesn't match the TN convention (K stride-1), the MMA will read the wrong elements. The symptom is silently wrong results, not a crash. Always double-check that your A and B shared memory strides put K as stride-1 for TN atoms.

> **Gotcha ;  half precision:** The input matrices *must* be `half` (F16). If you accidentally store `float` values in smem and cast the pointer to `half*`, you'll get garbage. Use `__float2half()` to convert before storing.

**What comes next:** This tutorial computed a single 16×8 tile using one MMA instruction. But real matrices are much larger than 16×8. Tutorial 07 (The Global GEMM) shows how to tile over M, N, and K ;  looping the MMA across tiles, piping data from global memory through shared memory into the Tensor Core in a continuous stream.
