double dissent

From many to one

In the last post we used GPUs to add two vectors, element by element. It’s what we call an embarrassingly parallel problem, but you might be thinking: is this all that GPUs are suited for? Doing independent things in parallel with no coordination? One operation often arising in LLM inference is summing all elements of a vector. On a CPU, you would do this with a simple loop. How would you parallelize it? Each valu…
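
The teaser's code is cut off here, but a sequential sum is presumably a single accumulation loop. As a minimal sketch, here is that loop together with one common GPU answer, a shared-memory tree reduction; the kernel below is an illustrative assumption, not necessarily the approach the post takes, and the names (sum_cpu, sum_kernel) are hypothetical.

    #include <cuda_runtime.h>

    // Sequential baseline: one accumulator, one pass over the data.
    float sum_cpu(const float* x, int n) {
        float acc = 0.0f;
        for (int i = 0; i < n; ++i) acc += x[i];
        return acc;
    }

    // Illustrative parallel version: each block reduces its slice in
    // shared memory with a tree reduction, then atomically folds its
    // partial sum into *out. Assumes blockDim.x is a power of two and
    // that *out is zero-initialized before launch.
    __global__ void sum_kernel(const float* x, float* out, int n) {
        extern __shared__ float partial[];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        partial[tid] = (i < n) ? x[i] : 0.0f;
        __syncthreads();
        // Halve the number of active threads at each step.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) partial[tid] += partial[tid + stride];
            __syncthreads();
        }
        if (tid == 0) atomicAdd(out, partial[0]);
    }

A launch like sum_kernel<<<blocks, 256, 256 * sizeof(float)>>>(d_x, d_out, n) passes the block size as the dynamic shared-memory size, so each block can stage its slice before reducing it.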

Jan 02

  • How to add two vectors, fast

    Dec 27

    I’ve been building, from scratch, a Llama 3.2 inference engine aptly named forward (purely as an excuse to stop procrastinating and learn modern C++ and CUDA). Were I reasonable, I would not have decided to build a tensor library first, then a neural network library on top of it, and finally the Llama 3.2 architecture as the cherry on top. But here we are. The first milestone was getting Llama to …

  • You are not the code

    Dec 20

    It was 2016. I was listening to Justin Bieber's Sorry (and probably so were you, don't judge), coding away at my keyboard in an office building in Canary Wharf, London. We were a small team working on a complex internal tool for power users. Our tools were Clojure and ClojureScript, which we wielded with mastery and pride. The great thing about building for advanced users is that they are as obses…

  • Reproducing U-Net

    Jun 23

    While getting some deep learning research practice, I decided to do some archaeology by reproducing some older deep learning papers. Then I recalled (back in my fast.ai course days) Jeremy Howard discussing U-Net and the impression it made on me—possibly the first time I encountered skip connections. So I read the paper and wondered: Is it possible to reproduce their results only from the paper, a…
