Latest Thoughts

I write to clear my mind and share what I learn.

CUDA

The TMA Revolution (Async Copy)

With the Hopper architecture, NVIDIA introduced the Tensor Memory Accelerator (TMA), carried forward into Blackwell. Instead of having threads manually calculate pointers and copy data, a single thread can offload the entire tile copy to dedicated hardware.

CUDA

The Global GEMM — Putting It All Together

Writing a complete three-level tiled GEMM kernel from scratch using CuTe's TiledCopy, TiledMMA, and swizzled shared memory.
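The tiling the post builds in CuTe can be previewed with a plain-Python blocked matmul (a toy sketch of mine, not the post's kernel): the outer tile loops correspond to CTA-level tiles, and the innermost loops to the per-thread work.

```python
def tiled_matmul(A, B, tile=2):
    """Blocked GEMM: the same loop structure a tiled CUDA kernel
    maps onto blocks (outer tile loops) and threads (inner loops)."""
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile):              # row tiles   (CTA level)
        for j0 in range(0, N, tile):          # col tiles   (CTA level)
            for k0 in range(0, K, tile):      # K tiles staged through shared memory
                for i in range(i0, min(i0 + tile, M)):
                    for j in range(j0, min(j0 + tile, N)):
                        for k in range(k0, min(k0 + tile, K)):
                            C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert tiled_matmul(A, B, tile=1) == [[19.0, 22.0], [43.0, 50.0]]
```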

CUDA

Hello, MMA — Your First Tensor Core Instruction

How to use CuTe's TiledMMA to execute a matrix multiply-accumulate on NVIDIA Tensor Cores.

CUDA

Swizzling — Avoiding Shared Memory Bank Conflicts

How CuTe's Swizzle XORs address bits to eliminate shared memory bank conflicts with a single line of code.
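A toy model of the idea (pure Python, with a `bank` helper of my own and the standard 32-bank assumption, not CuTe's actual Swizzle API): XOR-ing the column bits with the row index turns a 32-way bank conflict on a column read into a conflict-free access.

```python
BANKS = 32  # NVIDIA shared memory has 32 four-byte-wide banks

def bank(row, col, width=32, swizzle=False):
    """Bank hit by element (row, col) of a row-major 32x32 float tile."""
    c = (col ^ row) if swizzle else col   # the swizzle: XOR column bits with the row
    return (row * width + c) % BANKS

# A warp reading one *column* of the tile, one element per thread:
naive = {bank(t, 5) for t in range(32)}
assert len(naive) == 1            # all 32 threads hit one bank: 32-way conflict

swizzled = {bank(t, 5, swizzle=True) for t in range(32)}
assert len(swizzled) == 32        # accesses spread across all 32 banks
```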

CUDA

The Parallel Copy — Orchestrating Threads with TiledCopy

How TiledCopy bundles thread layout, copy atoms, and value layout into one declarative object for coordinated, vectorized parallel copies.

CUDA

The Naive Copy — Scalar vs. Vectorized Memory Movement

Why scalar copies leave 75% of memory bandwidth on the table, and how CuTe's auto-vectorization fixes it.
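The 75% figure follows from load-instruction width, sketched here under the assumption of 4-byte scalar loads versus 16-byte vectorized (`float4`-style) loads:

```python
# Per-thread bytes moved by one load instruction:
scalar_bytes = 4    # one 32-bit float   (e.g. LDG.32)
vector_bytes = 16   # one float4         (e.g. LDG.128)

# At the same instruction issue rate, a scalar copy moves only a
# quarter of the data per instruction that a vectorized copy does:
utilization = scalar_bytes / vector_bytes
assert utilization == 0.25        # 25% used, 75% left on the table
```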

CUDA

The Art of Slicing — Partitioning Data Across Blocks and Threads

How CuTe's local_tile and local_partition replace manual index math to slice matrices across CTAs and threads.
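A one-dimensional toy of the index math these primitives replace (the function names are borrowed from CuTe for illustration; the real versions slice multidimensional tensors in C++):

```python
def local_tile(total, tile, block_idx):
    """Row range a block owns: the manual index math that
    CuTe's local_tile hides behind a tensor slice."""
    start = block_idx * tile
    return list(range(start, min(start + tile, total)))

def local_partition(tile_rows, num_threads, thread_idx):
    """Strided per-thread partition of a block's tile."""
    return tile_rows[thread_idx::num_threads]

rows = local_tile(total=16, tile=8, block_idx=1)       # block 1 owns rows 8..15
assert rows == [8, 9, 10, 11, 12, 13, 14, 15]
assert local_partition(rows, num_threads=4, thread_idx=1) == [9, 13]
```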

CUDA

Hello, Layout! — Visualizing Memory in CuTe

Understanding CuTe Layouts: how shape and stride turn flat memory into multidimensional grids.
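The core idea, shape plus stride mapping coordinates to flat offsets, fits in a few lines of Python (a sketch of mine, not CuTe's C++ API):

```python
def coord_to_offset(coord, stride):
    """Map a multidimensional coordinate to a flat memory offset:
    the inner product of coordinate and stride."""
    return sum(c * s for c, s in zip(coord, stride))

# A 4x8 column-major layout: shape (4, 8), stride (1, 4).
# Element (i, j) lives at offset i*1 + j*4.
assert coord_to_offset((2, 3), (1, 4)) == 14

# The same shape with row-major strides (8, 1):
assert coord_to_offset((2, 3), (8, 1)) == 19
```

Changing the stride reinterprets the same flat buffer as a different grid, which is exactly what makes layouts composable.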

CUDA

Beating PyTorch: Writing a Faster Softmax Kernel in CUDA

Writing a Softmax kernel in CUDA that outperforms PyTorch's implementation.
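For reference, the numerically stable algorithm any Softmax kernel must reproduce (a plain-Python baseline, not the CUDA kernel from the post):

```python
import math

def softmax(x):
    """Numerically stable softmax: subtract the max before
    exponentiating so exp() cannot overflow."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

out = softmax([1.0, 2.0, 3.0])
assert abs(sum(out) - 1.0) < 1e-12
assert out[2] > out[1] > out[0]
```

A GPU version performs the same three passes (max, sum, normalize) as parallel reductions, which is where the optimization work lives.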

Machine Learning

Stable Diffusion 1.5: How I Optimized It

A detailed worklog on optimizing Stable Diffusion 1.5 for performance.

Logic

Propositional Logic

A deep dive into the fundamental building blocks of mathematical logic.

Machine Learning

Raw Dawgging Linear Regression

Understanding Linear Regression by building it from the ground up.
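As a taste of the from-scratch approach, a minimal batch-gradient-descent fit of y = w*x + b (my own sketch; the post's exact formulation may differ):

```python
def fit_linear(xs, ys, lr=0.05, steps=2000):
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2)
        dw = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        db = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        w -= lr * dw
        b -= lr * db
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]        # generated by y = 2x + 1
w, b = fit_linear(xs, ys)
assert abs(w - 2.0) < 1e-2 and abs(b - 1.0) < 1e-2
```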