Recently the good folks at Groq released a formidable demo showing a 70 billion parameter language model running inference at 300 tokens per second at batch size 1. This immediately elicited two responses from the community:
- Wow, this is so fast and the best thing ever
- This sucks because it has to run on an entire cluster
In fact, the truth is somewhere in between. Let's take a look.
Configuration. Groq's marketing is somewhat facetious here. The 'LPU' is just their old 14 nm SRAM-based systolic array processor, which sits somewhere between Graphcore's SRAM-based processor array and the big systolic arrays found in Gaudi and TPU. The LPU has 230 Mbytes of SRAM per chip, and software tricks are used to shard the model across many, many LPUs for inference. A 70B parameter model in int8 is 70 Gbytes of weights; if we assume 10 Mbytes of activation memory per token, a 4K context adds roughly another 40 Gbytes, for a total of about 110 Gbytes. At 230 Mbytes per chip, that's roughly 480 chips, which rounds up to eight racks (512 devices). Needless to say, that is a rather voluminous configuration.
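For the curious, here's the sizing arithmetic as a quick Python sketch. Every input (the 10 Mbytes of activations per token, the 230 Mbytes of SRAM per chip, 64 chips per rack) is the rough assumption from the paragraph above, not a published Groq spec.

```python
import math

# Back-of-envelope memory and device-count estimate. All inputs are the
# rough assumptions from the text, not published Groq specifications.
GB = 1e9
MB = 1e6

params        = 70e9            # 70B parameters
weight_bytes  = params * 1      # int8 -> 1 byte per parameter
act_per_token = 10 * MB         # assumed activation/KV footprint per token
context       = 4096            # 4K context

total_bytes = weight_bytes + act_per_token * context
print(f"weights + activations: {total_bytes / GB:.0f} GB")        # ~111 GB

sram_per_chip = 230 * MB
chips_needed  = total_bytes / sram_per_chip
print(f"chips needed: {chips_needed:.0f}")                        # ~482

chips_per_rack = 64             # assumed rack density (512 devices / 8 racks)
racks = math.ceil(chips_needed / chips_per_rack)
print(f"full racks: {racks} ({racks * chips_per_rack} devices)")  # 8 racks, 512 devices
```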
Power. At first glance Groq is absolutely boned here, requiring 512 300 W devices (~150 kW) to do their inference. Fortunately, batch size 1 inference doesn't really stress the ALUs even with all of Groq's bandwidth - a forward pass is about 140 Gflops per token, so 300 tokens/sec is roughly 42 Tflops across the entire cluster, a tiny fraction of its total throughput - so the actual power per chip will be quite low.
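A minimal sketch of that utilization argument: 2 FLOPs per parameter per token is the standard forward-pass estimate, and the per-chip peak throughput below is a placeholder assumption, not a Groq spec.

```python
# Rough utilization estimate: batch-1 decoding barely touches the ALUs.
# 2 FLOPs per parameter per token is the standard forward-pass estimate;
# the per-chip peak below is an assumed placeholder, not a Groq spec.
params        = 70e9
flops_per_tok = 2 * params              # ~140 GFLOPs per generated token
tokens_per_s  = 300

useful_flops = flops_per_tok * tokens_per_s      # ~42 TFLOP/s actually needed
print(f"useful compute: {useful_flops / 1e12:.0f} TFLOP/s")

peak_per_chip = 188e12                  # assumed per-chip peak (placeholder)
n_chips       = 512
cluster_peak  = peak_per_chip * n_chips

print(f"ALU utilization: {100 * useful_flops / cluster_peak:.3f}%")  # ~0.04%
```

Even if the assumed per-chip peak is off by a factor of a few, the utilization stays well under a percent, which is why the power draw in practice is nowhere near 512 chips at full tilt.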
Economics. This is where it gets really spicy. Deriders will remark that the Groq device is $20K, but that's in quantity 1 from Mouser for a card built by Bittware, a company notorious for high markups. Groq's chip is fabbed on a '14 nm process' - give or take, the die probably costs $60, call it double that after packaging and testing. Now, here's the magic - because it is SRAM based, there is no external DRAM to buy, so the boards are very simple; all in, I would estimate that a Groq board costs under $500 to build.
Suddenly, we're looking cost competitive. The boards for 512 devices come out to about $250,000; figure double that once we account for the host servers. Half a million for 10x the batch-1 performance of an octal A100 ($180,000) is suddenly pretty good. Of course, we're being facetious here, because that $180,000 is Nvidia's retail price - an octal A100 costs about $50,000 to build, and Groq gets to pay wholesale prices while you don't.
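Putting the cost argument in one place - a sketch where every dollar figure (board build cost, host overhead, A100 pricing, the 10x speedup) is the rough estimate from the text, not a quoted price:

```python
# Cost-comparison sketch. Every figure here (board cost, host multiplier,
# A100 prices, the 10x speedup) is the rough estimate from the text.
n_boards    = 512
board_cost  = 500                       # assumed all-in build cost per board
host_factor = 2.0                       # assume host servers roughly double it

groq_cluster = n_boards * board_cost * host_factor
print(f"Groq cluster build cost: ${groq_cluster:,.0f}")          # ~$512,000

octal_a100_retail = 180_000             # what you pay Nvidia
octal_a100_build  = 50_000              # roughly what it costs to build
speedup           = 10                  # claimed batch-1 advantage

# Cost of one "octal-A100-equivalent" of batch-1 throughput:
per_equiv = groq_cluster / speedup
print(f"Groq, per A100-box equivalent:  ${per_equiv:,.0f}")      # ~$51,200
print(f"Octal A100 at retail:           ${octal_a100_retail:,}")
print(f"Octal A100 at build cost:       ${octal_a100_build:,}")
```

Against Nvidia's retail price, Groq looks great; against Nvidia's build cost, it's roughly a wash - which is the whole point of the comparison.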
Conclusions. What are our real conclusions here? I'd say there are two, maybe three:
- You have to cut the margins out. Nvidia systems are competitive at or near cost, but at their current ludicrous 10x margins you could design and tape out a custom chip for less than the cost of a medium-sized cluster. Nvidia is responding to this - their 'cloud and custom' group will allow the sale of semi-custom SKUs to hyperscalers without compromising their annual reports too much, by breaking out the margin-reduced parts into a different business unit.
- Old nodes are OK. Because inference is so bandwidth-bound and memory controllers don't scale well anymore, you can squeeze a lot of life out of old nodes. That's a big deal because it lets you second-source your design, which puts pressure on the foundry to keep prices down. In contrast, pricing on leading-edge TSMC nodes is utter chaos right now because capacity is completely booked.
- Engineering is silly. Groq took an ancient accelerator designed for high-speed inference of ResNets and, through a monumental feat of system design and clever presentation, adapted it to achieve a world record in the hottest field in venture capital right now. In my eyes, that's pretty darn cool.