← Home

Everyone talks about training. Nobody talks about the cooling.

14 years ago I was building modular GPU-clustered liquid-cooled processing farms. Back then it was Ethereum mining. Now it's AI inference. The physics is the same.

Power density

The conversation in boardrooms is about models and training runs. Nobody asks about the cooling until the rack shuts itself down. But that's where the real engineering lives. Power density per rack. Heat dissipation. Airflow vs liquid. These aren't afterthoughts. They're the foundation.

When you put 40kW in a single rack, air cooling stops being viable. You can push more fans. You can raise the floor. You can crank the CRAC units. At some point the math just doesn't work. The air can't carry the heat away fast enough.

The transition to liquid

Direct-to-chip liquid cooling changes the equation. Water has roughly 4,000 times the volumetric heat capacity of air. You're not fighting physics anymore. You're working with it.

I went through this transition years ago with GPU mining rigs. Custom cooling loops. Manifolds. Quick-disconnect fittings so you could swap a card without draining the loop. Leak detection sensors under every rack. The lessons were expensive and occasionally wet.

The same engineering applies to modern AI inference clusters. The GPUs changed. The TDPs went up. But the thermodynamics didn't. Heat still flows from hot to cold. Coolant still needs flow rate and delta-T. Pumps still fail at the worst possible time.

What people get wrong

Most data center conversations about cooling start with "what vendor do we use?" That's the wrong first question. The right first question is "what's our power density target per rack and what's our cooling budget in kW?" Everything else follows from those two numbers.

I've seen projects where the compute hardware was specced perfectly and the cooling was an afterthought. Those projects hit thermal throttling within weeks. The GPUs were fine. The cooling design wasn't.

Both eras

My experience spans both the mining era and the AI era. The hardware changed. The scale changed. The thermodynamics didn't. When someone tells me they're planning a dense GPU cluster, my first question isn't about the model they're running. It's about the cooling plant.