Isopack: January 2024

A Quick Primer

A neural network is, in essence, a stack of linear projections with some nonlinear functions interspersed in the stack. Merely stacking linear projections is boring, because the composition of two linear functions is another linear function, but it turns out that inserting even the simplest nonlinearity into the stack works some magic - the composite function "learns" locally linear regions of a complex nonlinear function because the interspersed nonlinearity changes the subspace it projects into. The important takeaway is that computationally, neural networks are a sequence of matrix-vector multiplies.

To train a neural network means to fit it to some set of known input value-function pairs. This is typically done through some variant of first order gradient descent on a loss function L which describes how well we've done the fitting. L and grad L are evaluated with respect to the neural network parameter weights; we typically approximate grad L by the gradient on some subset of the inputs (stochastic gradient descent). At each step the weights W are replaced by W' = W + gamma * grad L; the step size gamma is a key factor in convergence speed.

The widely accepted fastest-converging choice of gamma uses a heuristic called "AdamW", which looks at the first (time average) and second (time average of squares) moments of the gradient to compute gamma. AdamW stores the gradient, first moment, second moment, weight, and "fp16 copy of the weights" (matrix multiplier units typically work on fp16/bf16 inputs), a total of 18 bytes per parameter. This results in a large memory footprint during training - a 7B parameter network requires 126GB, a 34B parameter model 612GB, and 175B parameter model requires a whopping 3150GB. In practice this number is even higher - intermediate activations take up space, especially at large batch size.

Now obviously 3150GB of optimizer states aren't going to fit in a single node, and even 612GB is troublesome. In order to evade having to build ever higher capacity nodes (which is electrically hard), we instead partition the parameters, states, and inputs across nodes, do our computations as much as possible on each node, then synchronize the states across nodes by passing messages. This is nothing new - it's exactly how large scientific simulations on supercomputers work. The ML people have a name for this - sharding - and the scheme used to do the computations is called 'fully sharded data parallel'.

In order to reassemble the gradient, inter-GPU communication is required. Between nodes, this is done through a network which is assumed to be "slow", and painstaking steps are taken to mask the slowness. Internal to a node, the communications are assumed to be "fast", which is where the confusion begins: Nvidia enterprise GPUs can directly pass messages to each other, as well as to Nvidia-branded network controllers, if and only if all devices exist on the same PCI-e controller. If not, the sender has to first write to a CPU buffer, which is then read by the receiver.

Now, there's technically nothing wrong with this: PCI-e 3.0 is good for 128 gbits/sec, the intersocket connection is good for 300+ gbits/sec, and memory is good for many tbits/sec. The issue is all of this back and forth adds latency, and current frameworks generally make no effort to hide intra-node latency (in fact, they may exploit low intra-node latencies to optimize overall performance). This means we need to carefully arrange the GPUs and NICs to be on the same PCI-e controller in order to guarantee expected performance with existing libraries.

With that being said, let's take a look at some good and bad servers on eBay that are marketed as "good for machine learning":

Gigabyte G482

At first glance, this is looking pretty good. For about half of retail, you get a current-generation octal GPU system in 4U. It looks pretty well-engineered, and the vendor's website even says you can use it for AI!

However, this is not a good AI server. In this case, the vendor is as much to blame as the listing. Let's take a look at the block diagram:

As you can see, groups of four GPUs are connected to each socket. An unfortunate quirk of Epyc is each socket contains multiple PCI-e controllers, meaning at best, pairs of GPUs can directly communicate with each other. GPUs from the two hives of four have to cross the intersocket connection to send and receive data. Finally, it's also one slot short - the left CPU doesn't have a slot for a NIC.

This is a fine server for inference if the model fits on one or two GPUs, but then what are you doing running your high capacity inference at home? 🙃

Asus ESC4000's

These are old 2U HPC nodes which make no pretense to being good at AI (they're also not good miners, since you're spending hundreds of extra watts per node running the fans, chipset, and CPUs). Despite their humble origins and low price, these make solid inference servers for latency-sensitive applications - load them up with 4x A5000 and you'll be happily running two instances of llama-2-70B at 20 tokens/sec each. And trust me, quad GPU nodes are way easier to deal with than octal GPU nodes, which tend to be mechanically fragile and difficult to ship and power.

Supermicro SYS-4028

SYS-4028 was an extremely popular AI system in the era of smaller convolutional networks. However, the most common listings come with X9DRG-O-PCIE, which is an older-generation board designed for conventional HPC:

From the 40-lane-per-socket allocation, X9DRG-O-PCIE assigns two groups of 16 lanes each to two PLX switches, which each host two GPUs. So far so good, we get four GPUs able to communicate with each other via P2P. The remaining 8 lanes go to an x8 slot on the same root complex, which is enough for a FDR Infiniband card. That's a pretty robust setup for HPC, where some of the computations might need to happen on the host - hives of four GPUs can communicate over RDMA, and get 32 GB/sec back to the socket. Unfortunately, it's suboptimal for large scale machine learning, since the hives need to communicate over shared memory (and over QPI between the sockets, too!). Instead, what you want is this:

A rather insane layout, to be sure. All eight GPUs (128 lanes) oversubscribe 32 lanes back to one socket, with the remaining 8 lanes from that socket available for connectivity. You wouldn't dream of running your physics simulations on such a system, but the layout is optimal for machine learning: hives of 4 can DMA each other through the 96-lane switch, and the two hives can DMA each other through the root complex.

Happily, X10DRG-O-PCIE is available used for about $700. When all is said and done $1600 for what is the pinnacle of PCI-e based machine learning systems is not bad.

Saturday, January 6, 2024

How to buy an AI server on eBay