Wednesday, February 21, 2024

About that Groq demo

Recently the good folks at Groq released a formidable demo showing a 70 billion parameter language model running inference at 300 tokens per second at batch size 1. This immediately elicited two responses from the community:

  1. Wow, this is so fast and the best thing ever
  2. This sucks because it has to run on an entire cluster

In fact, the truth is somewhere in between. Let's take a look.

Configuration. Groq's marketing is somewhat facetious here. The 'LPU' is just their old 14 nm SRAM-based systolic array processor, which is in turn somewhere in between the Graphcore SRAM-based processor array and the big systolic arrays found in Gaudi and TPU. The LPU has 230 Mbytes of SRAM per chip, and some software tricks are used to shard the model across many, many LPUs for inference. If we assume 10 Mbytes activation memory per token, 4K context on a 70B parameter model in int8 comes out to about 110 Gbytes of memory, which requires eight racks (512 devices) to hold. Needless to say, that is a rather voluminous configuration.
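The sizing above is easy to check. A minimal back-of-envelope sketch, using only the figures quoted in this paragraph (int8 weights, 10 Mbytes of activations per token, 4K context, 230 Mbytes of SRAM per LPU):

```python
import math

# Back-of-envelope check of the Groq configuration. All inputs come
# straight from the text; nothing here is a measured figure.
PARAMS = 70e9                     # 70B parameters
weight_bytes = PARAMS * 1         # int8: 1 byte per parameter
act_bytes = 10e6 * 4096           # 10 MB/token * 4K context
total_gb = (weight_bytes + act_bytes) / 1e9

SRAM_PER_CHIP_GB = 0.230          # 230 MB of SRAM per LPU
chips = math.ceil(total_gb / SRAM_PER_CHIP_GB)

print(f"total memory: {total_gb:.0f} GB")   # ~111 GB
print(f"minimum chips: {chips}")            # 483, deployed as 512 (8 racks)
```

The deployed count of 512 rather than 483 presumably reflects rounding up to whole racks plus headroom.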

Power. At first glance Groq is absolutely boned here, requiring 512 devices at 300 W each (roughly 150 kW) to do their inference. Fortunately, batch size 1 inference doesn't really stress the ALUs even with all of Groq's bandwidth (a forward pass is about 140 Gflops, so 300 tokens/sec is a tiny fraction of the cluster's total throughput), so the actual power draw per chip will be quite low.
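To put a number on how lightly loaded the ALUs are: the 140 Gflop forward pass is from the paragraph above, while the ~188 Tflop/s per-chip peak is an assumed round figure for illustration, not a quoted Groq spec.

```python
# Fraction of the cluster's peak compute actually consumed at 300 tok/s.
FLOP_PER_TOKEN = 140e9                       # from the text
TOKENS_PER_SEC = 300
consumed = FLOP_PER_TOKEN * TOKENS_PER_SEC   # ~42 Tflop/s across the cluster

PEAK_PER_CHIP = 188e12                       # assumed, for illustration
cluster_peak = 512 * PEAK_PER_CHIP
utilization = consumed / cluster_peak
print(f"ALU utilization: {utilization:.4%}") # well under 0.1%
```

At utilization that low, dynamic power in the ALUs is a rounding error; the chips mostly burn SRAM and I/O idle power.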

Economics. This is where it gets really spicy. Deriders will remark that the Groq device is $20K, but that's in quantity 1 from Mouser for a card built by Bittware, a company notorious for high markups. Groq's chip is fabbed on a '14 nm process' - give or take, the die probably costs $60, maybe 2x that after packaging and testing. Now, here's the magic: because it is SRAM-based, the boards are very simple. All in, I would estimate that a Groq board costs under $500 to build.

Suddenly, we're looking cost competitive. The boards for 512 devices come out to $250,000, figure double that once we account for the host servers. Half a million for 10x the performance of an octal A100 ($180,000) is suddenly pretty good. Of course, we're being facetious here, because an octal A100 costs about $50,000 to build, but Groq gets to pay wholesale prices and you don't.
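The price-performance comparison above can be sketched using only this paragraph's own numbers ($500 boards, doubled for host servers, 10x the throughput of an octal A100 at $180K retail or ~$50K at cost):

```python
# Perf-per-dollar of the hypothetical Groq cluster vs an octal A100.
groq_system = 2 * (512 * 500)            # ~$512K: boards plus host servers
speedup = 10                             # per the text: 10x octal A100

vs_retail = speedup * 180_000 / groq_system
vs_build_cost = speedup * 50_000 / groq_system
print(f"perf/$ vs retail A100: {vs_retail:.1f}x")      # ~3.5x in Groq's favor
print(f"perf/$ vs at-cost A100: {vs_build_cost:.1f}x") # ~1.0x, a wash
```

Which is the punchline: against Nvidia's retail pricing the cluster wins handily, but against Nvidia's build cost it's roughly a wash.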

Conclusions. What are our real conclusions here? I'd say there are two, maybe three:

  1. You have to cut the margins out. Nvidia systems are competitive at or near cost, but at their current ludicrous 10x margins you could design and tape out a custom chip for less than the cost of a medium-sized cluster. Nvidia is responding to this - their 'cloud and custom' group will allow the sale of semi-custom SKUs to hyperscalers without compromising their annual reports too much, by breaking out the margin-reduced parts into a different business unit.
  2. Old nodes are OK. Because inference is so bandwidth bound and memory controllers don't scale well any more, you can squeeze a lot of life out of old nodes. That's a big deal because that lets you second source your design, which puts pressure on the foundry to keep prices down. In contrast, pricing on leading edge TSMC nodes is utter chaos right now because capacity is completely booked.
  3. Engineering is silly. Groq took an ancient accelerator designed for high speed inference of ResNets, and through a monumental feat of system design and clever presentation was able to adapt it to achieve a world record in the hottest field in venture capital right now. In my eyes, that's pretty darn cool.

Saturday, January 6, 2024

How to buy an AI server on eBay

A Quick Primer

A neural network is, in essence, a stack of linear projections with some nonlinear functions interspersed in the stack. Merely stacking linear projections is boring, because the composition of two linear functions is another linear function, but it turns out that inserting even the simplest nonlinearity into the stack works some magic - the composite function "learns" locally linear regions of a complex nonlinear function because the interspersed nonlinearity changes the subspace it projects into. The important takeaway is that computationally, neural networks are a sequence of matrix-vector multiplies.
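The "stack of linear projections with nonlinearities" above can be sketched in a few lines; this is a generic two-layer MLP with arbitrary shapes, not any particular model.

```python
import numpy as np

# A neural network as a sequence of matrix-vector multiplies,
# with a nonlinearity (ReLU) interspersed between the projections.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 8))   # first linear projection
W2 = rng.standard_normal((4, 16))   # second linear projection
x = rng.standard_normal(8)          # input vector

h = np.maximum(W1 @ x, 0.0)         # ReLU: the interspersed nonlinearity
y = W2 @ h                          # second projection; output has shape (4,)
print(y.shape)
```

Without the ReLU, `W2 @ (W1 @ x)` would collapse to a single linear map `(W2 @ W1) @ x`, which is exactly why merely stacking projections is boring.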

To train a neural network means to fit it to some set of known input-output pairs. This is typically done through some variant of first order gradient descent on a loss function L which describes how well we've done the fitting. L and grad L are evaluated with respect to the neural network parameter weights; we typically approximate grad L by the gradient on some subset of the inputs (stochastic gradient descent). At each step the weights W are replaced by W' = W - gamma * grad L; the step size gamma is a key factor in convergence speed.
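A minimal sketch of such a training loop, fitting a single scalar weight; note the weights step against the gradient (W' = W - gamma * grad L), and the gradient is estimated on a random minibatch. All names and constants here are illustrative.

```python
import numpy as np

# Fit y = 2x with one scalar weight via stochastic gradient descent
# on the mean squared error L = mean((w*x - y)^2).
rng = np.random.default_rng(0)
x = rng.standard_normal(256)
y = 2.0 * x

w, gamma = 0.0, 0.1
for _ in range(100):
    idx = rng.integers(0, 256, size=32)       # stochastic: sample a minibatch
    xb, yb = x[idx], y[idx]
    grad = 2.0 * np.mean((w * xb - yb) * xb)  # gradient of L on the minibatch
    w -= gamma * grad                         # W' = W - gamma * grad L
print(round(w, 2))                            # converges to ~2.0
```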

The widely accepted fastest-converging choice of gamma uses a heuristic called "AdamW", which looks at the first (time average) and second (time average of squares) moments of the gradient to compute a per-parameter step size. AdamW stores the gradient, first moment, second moment, weights, and an fp16 copy of the weights (matrix multiplier units typically work on fp16/bf16 inputs), a total of 18 bytes per parameter. This results in a large memory footprint during training - a 7B parameter network requires 126GB, a 34B parameter model 612GB, and a 175B parameter model a whopping 3150GB. In practice this number is even higher - intermediate activations take up space, especially at large batch size.
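The 18 bytes per parameter, itemized (the fp32-for-everything-but-the-working-copy breakdown is inferred from the total, not stated explicitly above):

```python
# AdamW training state: fp32 gradient, first moment, second moment, and
# master weights (4 B each), plus the fp16 working copy of the weights (2 B).
BYTES_PER_PARAM = 4 + 4 + 4 + 4 + 2   # = 18

footprints = {p: p * BYTES_PER_PARAM / 1e9 for p in (7e9, 34e9, 175e9)}
for p, gb in footprints.items():
    print(f"{p/1e9:.0f}B params -> {gb:.0f} GB")
# 7B -> 126 GB, 34B -> 612 GB, 175B -> 3150 GB
```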

Now obviously 3150GB of optimizer states aren't going to fit in a single node, and even 612GB is troublesome. In order to evade having to build ever higher capacity nodes (which is electrically hard), we instead partition the parameters, states, and inputs across nodes, do our computations as much as possible on each node, then synchronize the states across nodes by passing messages. This is nothing new - it's exactly how large scientific simulations on supercomputers work. The ML people have a name for this - sharding - and the scheme used to do the computations is called 'fully sharded data parallel'.
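A toy sketch of the sharding idea: partition the parameters across "nodes", compute locally on each shard, then synchronize by combining partial results. Real systems use MPI or NCCL collectives; plain Python lists stand in for message passing here.

```python
import numpy as np

# Shard a parameter vector across 4 simulated nodes and reconstruct a
# global quantity (here, a sum) from per-node partial results.
params = np.arange(12.0)
n_nodes = 4
shards = np.array_split(params, n_nodes)   # each node owns 3 parameters

local = [float(s.sum()) for s in shards]   # do as much as possible per node
global_sum = sum(local)                    # the "all-reduce" synchronization
print(global_sum == params.sum())          # same answer, sharded memory
```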

In order to reassemble the gradient, inter-GPU communication is required. Between nodes, this is done through a network which is assumed to be "slow", and painstaking steps are taken to mask the slowness. Internal to a node, the communications are assumed to be "fast", which is where the confusion begins: Nvidia enterprise GPUs can directly pass messages to each other, as well as to Nvidia-branded network controllers, if and only if all devices exist on the same PCI-e controller. If not, the sender has to first write to a CPU buffer, which is then read by the receiver.

Now, there's technically nothing wrong with this: PCI-e 3.0 is good for 128 gbits/sec, the intersocket connection is good for 300+ gbits/sec, and memory is good for many tbits/sec. The issue is all of this back and forth adds latency, and current frameworks generally make no effort to hide intra-node latency (in fact, they may exploit low intra-node latencies to optimize overall performance). This means we need to carefully arrange the GPUs and NICs to be on the same PCI-e controller in order to guarantee expected performance with existing libraries.

With that being said, let's take a look at some good and bad servers on eBay that are marketed as "good for machine learning":

Gigabyte G482

At first glance, this is looking pretty good. For about half of retail, you get a current-generation octal GPU system in 4U. It looks pretty well-engineered, and the vendor's website even says you can use it for AI!

However, this is not a good AI server. In this case, the vendor is as much to blame as the listing. Let's take a look at the block diagram:

As you can see, groups of four GPUs are connected to each socket. An unfortunate quirk of Epyc is that each socket contains multiple PCI-e controllers, meaning that at best, pairs of GPUs can directly communicate with each other. GPUs from the two hives of four have to cross the intersocket connection to send and receive data. Finally, it's also one slot short - the left CPU doesn't have a slot for a NIC.

This is a fine server for inference if the model fits on one or two GPUs, but then what are you doing running your high capacity inference at home? 🙃

Asus ESC4000s

These are old 2U HPC nodes which make no pretense to being good at AI (they're also not good miners, since you're spending hundreds of extra watts per node running the fans, chipset, and CPUs). Despite their humble origins and low price, these make solid inference servers for latency-sensitive applications - load them up with 4x A5000 and you'll be happily running two instances of llama-2-70B at 20 tokens/sec each. And trust me, quad GPU nodes are way easier to deal with than octal GPU nodes, which tend to be mechanically fragile and difficult to ship and power.
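A sanity check on that 20 tokens/sec figure: batch-1 decoding is bandwidth bound, so throughput is roughly aggregate memory bandwidth divided by bytes streamed per token. The ~4-bit quantization, ~50% achieved-bandwidth efficiency, and two-GPUs-per-instance split are my assumptions, not the author's.

```python
# Rough bandwidth-bound estimate for llama-2-70B on a pair of A5000s.
MODEL_BYTES = 70e9 * 0.5        # ~35 GB of ~4-bit weights, read once per token
BW_PER_GPU = 768e9              # A5000 memory bandwidth, GB/s
EFFICIENCY = 0.5                # achievable fraction of peak, assumed

tok_per_sec = EFFICIENCY * 2 * BW_PER_GPU / MODEL_BYTES
print(round(tok_per_sec, 1))    # ~22, in the ballpark of the quoted 20
```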

Supermicro SYS-4028

SYS-4028 was an extremely popular AI system in the era of smaller convolutional networks. However, the most common listings come with X9DRG-O-PCIE, which is an older-generation board designed for conventional HPC:

From the 40-lane-per-socket allocation, X9DRG-O-PCIE assigns two groups of 16 lanes each to two PLX switches, which each host two GPUs. So far so good, we get four GPUs able to communicate with each other via P2P. The remaining 8 lanes go to an x8 slot on the same root complex, which is enough for a FDR Infiniband card. That's a pretty robust setup for HPC, where some of the computations might need to happen on the host - hives of four GPUs can communicate over RDMA, and get 32 GB/sec back to the socket. Unfortunately, it's suboptimal for large scale machine learning, since the hives need to communicate over shared memory (and over QPI between the sockets, too!). Instead, what you want is this:

A rather insane layout, to be sure. All eight GPUs (128 lanes) oversubscribe 32 lanes back to one socket, with the remaining 8 lanes from that socket available for connectivity. You wouldn't dream of running your physics simulations on such a system, but the layout is optimal for machine learning: hives of 4 can DMA each other through the 96-lane switch, and the two hives can DMA each other through the root complex.

Happily, X10DRG-O-PCIE is available used for about $700. When all is said and done, $1600 for the pinnacle of PCI-e based machine learning systems is not bad.