# Blog — Darshan Baslani

Technical blog posts on CUDA programming, GPU kernel optimization, CuTe, Triton, and machine learning systems.

## Posts

- [Hello, Layout! ; Visualizing Memory in CuTe](https://www.dcbaslani.xyz/blog/01_hello_layout/)
  - Description: Understanding CuTe Layouts: how shape and stride turn flat memory into multidimensional grids.
  - Topic: CUDA
  - Published: 2025-06-15
  - Markdown: https://www.dcbaslani.xyz/blog/01_hello_layout/index.md
  - Text: https://www.dcbaslani.xyz/blog/01_hello_layout/index.txt
- [The Art of Slicing ; Partitioning Data Across Blocks and Threads](https://www.dcbaslani.xyz/blog/02_the_art_of_slicing/)
  - Description: How CuTe's local_tile and local_partition replace manual index math to slice matrices across CTAs and threads.
  - Topic: CUDA
  - Published: 2025-06-20
  - Markdown: https://www.dcbaslani.xyz/blog/02_the_art_of_slicing/index.md
  - Text: https://www.dcbaslani.xyz/blog/02_the_art_of_slicing/index.txt
- [The Naive Copy ; Scalar vs. Vectorized Memory Movement](https://www.dcbaslani.xyz/blog/03_the_naive_copy/)
  - Description: Why scalar copies leave 75% of memory bandwidth on the table, and how CuTe's auto-vectorization fixes it.
  - Topic: CUDA
  - Published: 2026-02-22
  - Markdown: https://www.dcbaslani.xyz/blog/03_the_naive_copy/index.md
  - Text: https://www.dcbaslani.xyz/blog/03_the_naive_copy/index.txt
- [The Parallel Copy ; Orchestrating Threads with TiledCopy](https://www.dcbaslani.xyz/blog/04_the_parallel_copy/)
  - Description: How TiledCopy bundles thread layout, copy atoms, and value layout into one declarative object for coordinated, vectorized parallel copies.
  - Topic: CUDA
  - Published: 2026-02-24
  - Markdown: https://www.dcbaslani.xyz/blog/04_the_parallel_copy/index.md
  - Text: https://www.dcbaslani.xyz/blog/04_the_parallel_copy/index.txt
- [Swizzling ; Avoiding Shared Memory Bank Conflicts](https://www.dcbaslani.xyz/blog/05_swizzling/)
  - Description: How CuTe's Swizzle XORs address bits to eliminate shared memory bank conflicts with a single line of code.
  - Topic: CUDA
  - Published: 2026-02-26
  - Markdown: https://www.dcbaslani.xyz/blog/05_swizzling/index.md
  - Text: https://www.dcbaslani.xyz/blog/05_swizzling/index.txt
- [Hello, MMA — Your First Tensor Core Instruction](https://www.dcbaslani.xyz/blog/06_hello_mma/)
  - Description: How to use CuTe's TiledMMA to execute a matrix multiply-accumulate on NVIDIA Tensor Cores.
  - Topic: CUDA
  - Published: 2026-03-03
  - Markdown: https://www.dcbaslani.xyz/blog/06_hello_mma/index.md
  - Text: https://www.dcbaslani.xyz/blog/06_hello_mma/index.txt
- [The Global GEMM — Putting It All Together](https://www.dcbaslani.xyz/blog/07_the_global_gemm/)
  - Description: Writing a complete three-level tiled GEMM kernel from scratch using CuTe's TiledCopy, TiledMMA, and swizzled shared memory.
  - Topic: CUDA
  - Published: 2026-03-07
  - Markdown: https://www.dcbaslani.xyz/blog/07_the_global_gemm/index.md
  - Text: https://www.dcbaslani.xyz/blog/07_the_global_gemm/index.txt
- [The TMA Revolution (Async Copy)](https://www.dcbaslani.xyz/blog/08_the_tma_revolution/)
  - Description: With the Hopper and Blackwell architectures, NVIDIA introduced the Tensor Memory Accelerator (TMA). Instead of having threads manually calculate pointers and copy data, a single thread can offload the entire tile copy to dedicated hardware.
  - Topic: CUDA
  - Published: 2026-03-23
  - Markdown: https://www.dcbaslani.xyz/blog/08_the_tma_revolution/index.md
  - Text: https://www.dcbaslani.xyz/blog/08_the_tma_revolution/index.txt
- [WGMMA ; Warpgroup MMA](https://www.dcbaslani.xyz/blog/09_wgmma/)
  - Description: How to use Warpgroup MMA (WGMMA) to feed NVIDIA Tensor Cores directly from shared memory, bypassing the register file bottleneck.
  - Topic: CUDA
  - Published: 2026-04-04
  - Markdown: https://www.dcbaslani.xyz/blog/09_wgmma/index.md
  - Text: https://www.dcbaslani.xyz/blog/09_wgmma/index.txt
- [Cute-DSL: I Wrote a CUDA Kernel in Python and My GPU Didn't Even Cry](https://www.dcbaslani.xyz/blog/cute-dsl-blog/)
  - Description: Welcome to the ultimate guide to cute-dsl, which brings the power of CuTe's concepts like Layouts, Tilers, and vectorized memory operations into a familiar, Pythonic interface.
  - Topic: CUDA
  - Published: 2026-04-19
  - Markdown: https://www.dcbaslani.xyz/blog/cute-dsl-blog/index.md
  - Text: https://www.dcbaslani.xyz/blog/cute-dsl-blog/index.txt
- [Breaking PyTorch Boundaries: Fusing RMSNorm and GDN in Triton for Qwen 3.5](https://www.dcbaslani.xyz/blog/qwen_3.5/)
  - Description: Rebuilding and optimizing the Qwen 3.5 (9B) Gated Delta Attention inference stack with custom Triton kernels, achieving 5.2x speedup on NVIDIA B200.
  - Topic: Triton
  - Published: 2026-05-25
  - Markdown: https://www.dcbaslani.xyz/blog/qwen_3.5/index.md
  - Text: https://www.dcbaslani.xyz/blog/qwen_3.5/index.txt
- [The Feynman GPU Lectures](https://www.dcbaslani.xyz/blog/gpu_masterclass/)
  - Description: A GPU masterclass that builds from transistors and CUDA cores up through SM architecture, memory systems, Tensor Cores, Hopper, and Blackwell.
  - Topic: CUDA
  - Published: 2026-06-05
  - Markdown: https://www.dcbaslani.xyz/blog/gpu_masterclass/index.md
  - Text: https://www.dcbaslani.xyz/blog/gpu_masterclass/index.txt