Latest Thoughts
I write to clear my mind and share what I learn.
The TMA Revolution (Async Copy)
With the Hopper architecture (carried forward in Blackwell), NVIDIA introduced the Tensor Memory Accelerator (TMA). Instead of having every thread manually calculate pointers and copy its share of the data, a single thread can offload the entire tile copy to dedicated hardware.
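TMA proper is programmed through cuTensorMap descriptors (or CuTe's `make_tma_copy`), but the core idea — one thread issues the whole tile copy and the block waits on a barrier — can be sketched with libcu++'s portable `cuda::memcpy_async` API. This is an illustrative sketch, not the TMA path itself; the tile size and kernel name are assumptions.

```cuda
// Sketch of the "one thread issues the copy" pattern using libcu++
// (cuda::memcpy_async + cuda::barrier). Requires sm_80 or newer for
// hardware-accelerated async copies. TILE and tile_scale are made up
// for illustration.
#include <cuda/barrier>

constexpr int TILE = 128;

__global__ void tile_scale(const float* __restrict__ in,
                           float* __restrict__ out) {
    __shared__ alignas(16) float smem[TILE];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    if (threadIdx.x == 0) {
        init(&bar, blockDim.x);  // every thread in the block participates
    }
    __syncthreads();

    if (threadIdx.x == 0) {
        // A single thread hands the whole tile copy to the async-copy
        // hardware; no per-thread pointer arithmetic for the load.
        cuda::memcpy_async(smem, in + blockIdx.x * TILE,
                           sizeof(float) * TILE, bar);
    }
    bar.arrive_and_wait();  // block until the tile has landed in smem

    out[blockIdx.x * TILE + threadIdx.x] = smem[threadIdx.x] * 2.0f;
}
```

The real TMA instruction (`cp.async.bulk.tensor`) goes further: the hardware also understands multidimensional tensor shapes and swizzled shared-memory layouts, which is what CuTe's TMA atoms wrap.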
[CUDA] The Global GEMM — Putting It All Together
Writing a complete three-level tiled GEMM kernel from scratch using CuTe's TiledCopy, TiledMMA, and swizzled shared memory.
[CUDA] Hello, MMA — Your First Tensor Core Instruction
How to use CuTe's TiledMMA to execute a matrix multiply-accumulate on NVIDIA Tensor Cores.
[CUDA] Swizzling — Avoiding Shared Memory Bank Conflicts
How CuTe's Swizzle XORs address bits to eliminate shared memory bank conflicts with a single line of code.
[CUDA] The Parallel Copy — Orchestrating Threads with TiledCopy
How TiledCopy bundles thread layout, copy atoms, and value layout into one declarative object for coordinated, vectorized parallel copies.
[CUDA] The Naive Copy — Scalar vs. Vectorized Memory Movement
Why scalar copies leave 75% of memory bandwidth on the table, and how CuTe's auto-vectorization fixes it.
[CUDA] The Art of Slicing — Partitioning Data Across Blocks and Threads
How CuTe's local_tile and local_partition replace manual index math to slice matrices across CTAs and threads.
[CUDA] Hello, Layout! — Visualizing Memory in CuTe
Understanding CuTe Layouts: how shape and stride turn flat memory into multidimensional grids.
[CUDA] Beating PyTorch: Writing a Faster Softmax Kernel in CUDA
Writing a CUDA Softmax kernel that outperforms PyTorch's implementation.
[Machine Learning] Stable Diffusion 1.5: How I Optimized It
A detailed worklog on optimizing Stable Diffusion 1.5 for performance.
[Logic] Propositional Logic
A deep dive into the fundamental building blocks of mathematical logic.
[Machine Learning] Raw Dawgging Linear Regression
Understanding Linear Regression by building it from the ground up.