Tuesday, December 19, 2023

Sapphire Rapids for AI: a mini-review

Intel often touts the performance of their 4th-generation Xeon Scalable "Sapphire Rapids" processors for generative AI applications, but there are surprisingly few meaningful benchmarks, even from Intel itself. The official Intel slides are as follows:

Oof. Three ResNets, two DLRMs, and BERT-large. Come on guys, this is 2023 and no one is buying hardware to run ResNet50. Let's try benchmarking some real workloads instead.

Benchmarked Hardware

Xeon Platinum 8461V ($4,491) on Supermicro X13SEI, default power limits, 256 GB JEDEC DDR5-4800

Stable Diffusion 2.1

Everyone's favorite image generator. SD-2.1 runs on a $200 GPU, so perhaps it doesn't make sense to test performance on a $4,500 Xeon, but as Intel says, every server needs a CPU, so the Xeon is, in some sense, "free". We use an OpenVINO build of SD-2.1, which contains Intel-specific optimizations but, critically, is not quantized, distilled, or otherwise compressed - it should require roughly the same FLOPs as the vanilla SD models. Generation is at 512x512 for 20 steps.
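For reference, here's a minimal sketch of the kind of pipeline involved, using optimum-intel's OpenVINO integration. The model ID and prompt are illustrative; this isn't the exact harness I used for the numbers below.

# pip install "optimum[openvino]" diffusers
# Minimal CPU-only SD-2.1 run via OpenVINO; illustrative, not the benchmark script.
from optimum.intel import OVStableDiffusionPipeline

# export=True converts the PyTorch weights to OpenVINO IR at load time
pipe = OVStableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", export=True
)

image = pipe(
    "a photo of an astronaut riding a horse on mars",  # placeholder prompt
    height=512,
    width=512,
    num_inference_steps=20,
).images[0]
image.save("astronaut.png")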


Not too shabby. 6.27 it/s puts us roughly in the ballpark of an RTX 3060. On the other hand, you could run your inference on an A4000 (which is licensed for datacenter use) and get higher performance for just $1,100, so the Xeon isn't exactly winning on price here.

llama.cpp inference


Inferencing your LLM on a CPU was a bad idea until last week. Microsoft's investments in OpenAI make it almost impossible to compete, price-wise, with gpt-3.5-turbo: since Microsoft is a cloud provider, OpenAI gets to pay deeply discounted rates instead of the 3-10x markup you would have to pay for cloud infrastructure. You can't escape by building your own datacenter either: without the high occupancy of a cloud datacenter, you still end up paying overhead for idle servers. This left 7B-sized models as the only ones worth self-hosting, but 7B models are so small that you can run them on a $200 GPU, obviating the need for a huge CPU. (Obviously, there are security-related reasons to self-host, but by and large, LLM applications are not security-sensitive.)

Fortunately for Intel, a medium-sized MoE model with good performance, Mixtral-8x7B, appeared last week. MoEs are unique in that they have the memory footprint of a large model but the compute (really, bandwidth) requirements of a small model. That sounds like a perfect fit for CPUs, which have tons of memory but limited bandwidth.

It turns out that llama.cpp, an open-source hobbyist project for running LLMs on CPUs, is the fastest CPU implementation. At batch size 1 we get about 18 tokens per second in Q4, which is a perfectly usable result (prompt eval time is poor, but I think that's an MoE limitation in llama.cpp that should be fixed shortly?).
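If you want to reproduce something like this, a rough sketch via the llama-cpp-python bindings (which wrap the same llama.cpp code) is below; the GGUF filename and thread count are assumptions, not my exact setup.

# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",  # hypothetical local Q4 quant
    n_ctx=4096,
    n_threads=48,  # one thread per physical core on the 8461V
)

out = llm(
    "Explain mixture-of-experts models in two sentences.",
    max_tokens=128,
)
print(out["choices"][0]["text"])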

Falcon-7B LLM fine-tuning with TRL


This is probably the most interesting benchmark. Full fine-tuning of a 7B-parameter LLM needs over 128 GB of memory, requiring the use of unobtainium 4x A100 or 4x A6000 cloud instances, or putting $15K into specialized on-premises hardware that will be severely underutilized (given that you are unlikely to be fine-tuning all the time). The big Xeon achieves over 200 tokens per second on this benchmark (with numactl -C 0-47 I got about 240 tokens per second on this particular dataset at batch size 16).
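The training loop itself is unremarkable; a sketch of a CPU full fine-tune with TRL's 0.7-era SFTTrainer is below. The dataset, sequence length, and most hyperparameters are placeholders for illustration rather than the exact configuration I benchmarked.

# Run under numactl -C 0-47 as noted above.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_name = "tiiuae/falcon-7b"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example dataset; see the size comparison further down.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

args = TrainingArguments(
    output_dir="falcon-7b-sft",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    bf16=True,     # bf16 matmuls hit the AMX units on Sapphire Rapids
    use_cpu=True,  # requires a recent transformers release
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,  # assumed, not the benchmarked value
)
trainer.train()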

It's worth noting that other 7B LLMs fine-tune more slowly. I don't think this is because Falcon is architecturally different; rather, Falcon was a somewhat off-brand model that shipped with its own bespoke FlashAttention implementation.

200 tokens per second is pretty decent, allowing 3 epochs of fine-tuning on a 10M-token dataset in about a day and a half (for reference, openassistant-guanaco, a high-quality subset of the guanaco dataset, is about 5M tokens, and a page of text is about 450 tokens).
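For anyone checking the math, that day-and-a-half figure comes straight from the throughput:

# 3 epochs over a 10M-token dataset at ~200 tokens/second
seconds = 3 * 10_000_000 / 200
hours = seconds / 3600
print(f"{hours:.1f} hours")  # ~41.7 hours, i.e. roughly a day and a half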

Conclusions: usable vs useful?

First, without a doubt, the three results presented above are usable. A single-socket Sapphire Rapids machine is able to generate images, run inference on a state-of-the-art LLM, and fine-tune a 7B-parameter LLM on millions of tokens, all at speeds that are unlikely to have you tearing your hair out.

On the other hand, is it useful?

In the datacenter, we could see a case for image generation, which runs briskly on the Xeon and has a light memory footprint. The problem is that the image generation workload uses all 48 cores for three seconds at a time, which is a poor fit for oversubscribed virtualized environments. On a workstation, the thought of having 48 cores but no GPU is ludicrous, and even AMD GPUs are going to tie the Xeon in Stable Diffusion.

The LLM inference use case is a bit different, being primarily bound by bandwidth and memory capacity. The dynamics here are driven entirely by market conditions, not raw performance or technological supremacy. For example, Sapphire Rapids CPUs are available as spot instances from hyperscalers; GPUs are not. Nvidia also chooses to charge $100 per GB of VRAM on its datacenter parts, but then again, Intel seems to think its cores are worth $100 each even in bulk. Software support for the Xeon is poor - while llama.cpp is fast, it is primarily a single-user library and has minimal (no?) support for batched serving.

The training benchmark is the most interesting one of all, because it is an example of a workload that will not run at all on most GPU instances; in fact, there are genuine technological reasons why GPUs have trouble reaching hundreds of gigabytes of memory. Once again, the Xeon is held back by Intel's high pricing - a 1S Xeon system costs about $6,000, compared to about $15,000 for a 3x RTX A6000 machine, which is significantly faster.

Finally, let's take a look at Intel's most important claim: "the CPU is free because you aren't allowed to not have one". Disregarding hyperscalers, which work on a different cost model, a company upgrading to Sapphire Rapids in 2023 is likely still on Skylake. Going from a 20-core Skylake part to a 32-core Sapphire Rapids part represents a 2x boost in general application performance and a ~6x boost in AI performance, except for bandwidth-limited LLM inference, where the gains are closer to 2x. Getting 2x more work done with your datacenter, plus a competitive option to run AI-based analytics or serve AI applications on top of that, is a pretty compelling reason to buy a new server, and for a lot of IT departments that's all you need to make the sale.