A curious mind exploring the intersection of
Computation and Machine Intelligence.
I am a student, a thinker, and an engineer. I craft elegant systems and learn new things (most of the time).
Who I Am
I am a student of everything around me. My roots are humble. I come from a Tier 3 city in Gujarat and a college far from the spotlight, where the path wasn’t already paved for me. I realized early on that knowledge would not be handed to me; I would have to take it. Everything I know—from the fundamentals of systems to the complexities of AI—is the result of relentless self-learning.
My strength is not loud; it lives in my curiosity, in my willingness to keep going when things get complex. I believe that the best work comes from a place of genuine curiosity. While I have a deep background in technical problem-solving, my true passion lies in simplifying complexity.
Currently, I am refining this craft as a Master’s student in Data Science and a Machine Learning Engineer in training. Because I had to teach myself the foundations, I don't just use tools—I deconstruct them. Whether I am optimizing algorithms or architecting systems, I am driven by a single goal: to understand the "why" behind the "how."
I am a student: of code, of math, of systems, of life. Work occupies most of my time, not out of pressure but out of passion. And when I’m not working, I consume ideas: podcasts, philosophy, blogs. I keep learning because it keeps me alive.
I take inspiration from legendary figures like Elon Musk, Srinivasa Ramanujan, Isaac Newton, Albert Einstein, Vikram Sarabhai, and many more :)
Latest Thoughts
I write to clear my mind and share what I learn.
Swizzling: Avoiding Shared Memory Bank Conflicts
How CuTe's Swizzle XORs address bits to eliminate shared memory bank conflicts with a single line of code.
CUDA | The Parallel Copy: Orchestrating Threads with TiledCopy
How TiledCopy bundles thread layout, copy atoms, and value layout into one declarative object for coordinated, vectorized parallel copies.
CUDA | The Naive Copy: Scalar vs. Vectorized Memory Movement
Why scalar copies leave 75% of memory bandwidth on the table, and how CuTe's auto-vectorization fixes it.
CUDA | The Art of Slicing: Partitioning Data Across Blocks and Threads
How CuTe's local_tile and local_partition replace manual index math to slice matrices across CTAs and threads.
CUDA | Hello, Layout! Visualizing Memory in CuTe
Understanding CuTe Layouts: how shape and stride turn flat memory into multidimensional grids.
CUDA | Beating PyTorch: Writing a Faster Softmax Kernel in CUDA
How I wrote a Softmax kernel in CUDA that outperforms PyTorch's implementation.
Machine Learning | Stable Diffusion 1.5: How I Optimized It
A detailed worklog on optimizing Stable Diffusion 1.5 for performance.
Logic | Propositional Logic
A deep dive into the fundamental building blocks of mathematical logic.
Machine Learning | Raw Dawgging Linear Regression
Understanding Linear Regression by building it from the ground up.