MLOps
Distributed Training Finally Stopped Making Me Cry (Mostly)
I still remember the first time I tried to shard a 70B parameter model across a cluster of GPUs. It was 2 AM, I was three coffees deep, and the error logs.
I still remember the first time I tried to shard a 70B parameter model across a cluster of GPUs. It was 2 AM, I was three coffees deep, and the error logs.