Latest Thoughts
I write to clear my mind and share what I learn.
The TMA Revolution (Async Copy)
With the Hopper architecture (carried forward in Blackwell), NVIDIA introduced the Tensor Memory Accelerator (TMA). Instead of having every thread manually calculate pointers and copy its share of the data, a single thread can offload the entire tile copy to dedicated hardware.
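TMA proper is programmed through cuTensorMap descriptors (or CuTe's `make_tma_copy`), but the core idea — one thread issues the whole tile copy and the block waits on a barrier — can be sketched with libcu++'s portable `cuda::memcpy_async` API. This is an illustrative sketch, not the TMA path itself; the tile size and kernel name are assumptions.

```cuda
// Sketch of the "one thread issues the copy" pattern using libcu++
// (cuda::memcpy_async + cuda::barrier). Requires sm_80 or newer for
// hardware-accelerated async copies. TILE and tile_scale are made up
// for illustration.
#include <cuda/barrier>

constexpr int TILE = 128;

__global__ void tile_scale(const float* __restrict__ in,
                           float* __restrict__ out) {
    __shared__ alignas(16) float smem[TILE];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    if (threadIdx.x == 0) {
        init(&bar, blockDim.x);  // every thread in the block participates
    }
    __syncthreads();

    if (threadIdx.x == 0) {
        // A single thread hands the whole tile copy to the async-copy
        // hardware; no per-thread pointer arithmetic for the load.
        cuda::memcpy_async(smem, in + blockIdx.x * TILE,
                           sizeof(float) * TILE, bar);
    }
    bar.arrive_and_wait();  // block until the tile has landed in smem

    out[blockIdx.x * TILE + threadIdx.x] = smem[threadIdx.x] * 2.0f;
}
```

The real TMA instruction (`cp.async.bulk.tensor`) goes further: the hardware also understands multidimensional tensor shapes and swizzled shared-memory layouts, which is what CuTe's TMA atoms wrap.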
[CUDA] The Global GEMM — Putting It All Together
Writing a complete three-level tiled GEMM kernel from scratch using CuTe's TiledCopy, TiledMMA, and swizzled shared memory.
[CUDA] Hello, MMA — Your First Tensor Core Instruction
How to use CuTe's TiledMMA to execute a matrix multiply-accumulate on NVIDIA Tensor Cores.
[CUDA] Swizzling — Avoiding Shared Memory Bank Conflicts
How CuTe's Swizzle XORs address bits to eliminate shared memory bank conflicts with a single line of code.
[CUDA] The Parallel Copy — Orchestrating Threads with TiledCopy
How TiledCopy bundles thread layout, copy atoms, and value layout into one declarative object for coordinated, vectorized parallel copies.
[CUDA] The Naive Copy — Scalar vs. Vectorized Memory Movement
Why scalar copies leave 75% of memory bandwidth on the table, and how CuTe's auto-vectorization fixes it.
[CUDA] The Art of Slicing — Partitioning Data Across Blocks and Threads
How CuTe's local_tile and local_partition replace manual index math to slice matrices across CTAs and threads.
[CUDA] Hello, Layout! — Visualizing Memory in CuTe
Understanding CuTe Layouts: how shape and stride turn flat memory into multidimensional grids.
[CUDA] Beating PyTorch: Writing a Faster Softmax Kernel in CUDA
Writing a CUDA Softmax kernel that outperforms PyTorch's implementation.
[Machine Learning] Stable Diffusion 1.5: How I Optimized It
A detailed worklog on optimizing Stable Diffusion 1.5 for performance.
[Logic] Propositional Logic
A deep dive into the fundamental building blocks of mathematical logic.
[Machine Learning] Raw Dawgging Linear Regression
Understanding Linear Regression by building it from the ground up.