Isopack

About that Groq demo

2024-02-21T05:57:00.000-08:00

Recently the good folks at Groq released a formidable demo showing a 70 billion parameter language model inferencing at 300 tokens per second at batch size 1. This immediately elicited two responses from the community:

Wow, this is so fast and the best thing ever
This sucks because it has to run on an entire cluster

In fact, the truth is somewhere in between. Let's take a look.

Configuration. Groq's marketing is somewhat facetious here. The 'LPU' is just their old 14 nm SRAM-based systolic array processor, which is in turn somewhere in between the Graphcore SRAM-based processor array and the big systolic arrays found in Gaudi and TPU. The LPU has 230 Mbytes of SRAM per chip, and some software tricks are used to shard the model across many, many LPUs for inference. If we assume 10 Mbytes activation memory per token, 4K context on a 70B parameter model in int8 comes out to about 110 Gbytes of memory, which requires eight racks (512 devices) to hold. Needless to say, that is a rather voluminous configuration.
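The sizing above is easy to check by hand. Here is a minimal sketch using only the figures from the paragraph; the function names are made up for illustration, and the 64-devices-per-rack figure is an assumption read off "eight racks (512 devices)".

```python
import math

def model_memory_gb(params_billion, bytes_per_param, context_tokens, act_mb_per_token):
    """Weights plus per-token activation memory, in (decimal) gigabytes."""
    weights_gb = params_billion * bytes_per_param          # 1e9 params * bytes / 1e9
    activations_gb = context_tokens * act_mb_per_token / 1000
    return weights_gb + activations_gb

def chips_needed(total_gb, sram_mb_per_chip):
    """Minimum number of chips whose SRAM can hold total_gb."""
    return math.ceil(total_gb * 1000 / sram_mb_per_chip)

DEVICES_PER_RACK = 64  # assumption: 512 devices / 8 racks

total_gb = model_memory_gb(70, 1, 4096, 10)   # 70B params, int8, 4K context, 10 MB/token
chips = chips_needed(total_gb, 230)           # 230 MB of SRAM per LPU
racks = math.ceil(chips / DEVICES_PER_RACK)

print(f"{total_gb:.0f} GB -> at least {chips} chips -> {racks} racks")
```

The raw minimum comes out a little under 512 devices; rounding up to whole racks gives the eight-rack figure.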

Power. At first glance Groq is absolutely boned here, requiring 512 300W devices (about 150 kW) to do their inference. Fortunately, batch size 1 inference doesn't really stress the ALUs even with all of Groq's bandwidth (a forward pass is about 140 GFLOP per token, so 300 tokens/sec works out to roughly 42 TFLOP/s, a tiny fraction of the total throughput of the cluster) so the actual power per chip will be quite low.
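For the skeptical, the power and compute numbers can be sketched as below. This assumes the usual rule of thumb of 2 floating point operations (one multiply, one add) per parameter per generated token; the constant names are just for illustration.

```python
DEVICES = 512
WATTS_PER_DEVICE = 300
PARAMS = 70e9
TOKENS_PER_SEC = 300

# Worst case: every device running flat out at its rated power.
peak_power_kw = DEVICES * WATTS_PER_DEVICE / 1000

# Assumption: ~2 FLOP per parameter per token for a forward pass.
flop_per_token = 2 * PARAMS
required_tflops = flop_per_token * TOKENS_PER_SEC / 1e12
per_chip_tflops = required_tflops / DEVICES

print(f"peak power: {peak_power_kw:.1f} kW")
print(f"compute needed: {required_tflops:.0f} TFLOP/s total, "
      f"{per_chip_tflops:.3f} TFLOP/s per chip")
```

Less than a tenth of a TFLOP/s per chip is very little work for an accelerator, which is why the actual draw should sit well below the rated figure.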

Economics. This is where it gets really spicy. Deriders will remark that the Groq device is $20K, but that's in quantity 1 from Mouser for a card built by Bittware, a company notorious for high markups. Groq's chip is fabbed on a '14 nm process' - give or take, the die probably costs $60, 2x that for packaging and testing. Now, here's the magic - because it is SRAM based, the boards are very simple; all in, I would estimate that a Groq board costs under $500 to build.

Suddenly, we're looking cost competitive. The boards for 512 devices come out to $250,000, figure double that once we account for the host servers. Half a million for 10x the performance of an octal A100 ($180,000) is suddenly pretty good. Of course, we're being facetious here, because an octal A100 costs about $50,000 to build, but Groq gets to pay wholesale prices and you don't.
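The back-of-envelope economics, as a sketch. All inputs are the estimates from the text (the $500 board cost is the upper bound, and the 10x speedup is taken at face value); none of them are measured prices.

```python
DEVICES = 512
BOARD_COST_USD = 500            # estimated upper bound on build cost per board
HOST_MULTIPLIER = 2             # "figure double" to account for host servers
A100_BOX_PRICE_USD = 180_000    # retail octal A100
SPEEDUP_VS_A100_BOX = 10

boards_usd = DEVICES * BOARD_COST_USD
cluster_usd = boards_usd * HOST_MULTIPLIER
price_ratio = cluster_usd / A100_BOX_PRICE_USD
perf_per_dollar_ratio = SPEEDUP_VS_A100_BOX / price_ratio

print(f"boards: ${boards_usd:,}, cluster: ${cluster_usd:,}")
print(f"{price_ratio:.1f}x the price for {SPEEDUP_VS_A100_BOX}x the speed "
      f"= {perf_per_dollar_ratio:.1f}x the performance per dollar")
```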

Conclusions. What are our real conclusions here? I'd say there are two, maybe three:

You have to cut the margins out. Nvidia systems are competitive at or near cost, but at their current ludicrous 10x margins you could design and tape out a custom chip for less than the cost of a medium sized cluster. Nvidia is responding to this - their 'cloud and custom' group will allow the sale of semi-custom SKUs to hyperscalers without compromising their annual reports too much, by breaking out the margin reduced parts into a different business unit.
Old nodes are OK. Because inference is so bandwidth bound and memory controllers don't scale well any more, you can squeeze a lot of life out of old nodes. That's a big deal because that lets you second source your design, which puts pressure on the foundry to keep prices down. In contrast, pricing on leading edge TSMC nodes is utter chaos right now because capacity is completely booked.
Engineering is silly. Groq took an ancient accelerator designed for high speed inference of ResNets, and through a monumental feat of system design and clever presentation was able to adapt it to achieve a world record in the hottest field in venture capital right now. In my eyes, that's pretty darn cool.

How to buy an AI server on eBay

2024-01-06T04:12:00.000-08:00

A Quick Primer

A neural network is, in essence, a stack of linear projections with some nonlinear functions interspersed in the stack. Merely stacking linear projections is boring, because the composition of two linear functions is another linear function, but it turns out that inserting even the simplest nonlinearity into the stack works some magic - the composite function "learns" locally linear regions of a complex nonlinear function because the interspersed nonlinearity changes the subspace it projects into. The important takeaway is that computationally, neural networks are a sequence of matrix-vector multiplies.

To train a neural network means to fit it to some set of known input-output pairs. This is typically done through some variant of first order gradient descent on a loss function L which describes how well we've done the fitting. L and grad L are evaluated with respect to the neural network parameter weights; we typically approximate grad L by the gradient on some subset of the inputs (stochastic gradient descent). At each step the weights W are replaced by W' = W - gamma * grad L (the step moves against the gradient, since we are minimizing L); the step size gamma is a key factor in convergence speed.
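To make the update concrete, here is a toy sketch of minibatch stochastic gradient descent fitting a single weight. The problem (y = 3x), the step size, and the function name are all made up for illustration; real training does the same thing with billions of weights. Note that the step moves against the gradient so that the loss goes down.

```python
import random

def sgd_fit(xs, ys, gamma=0.1, steps=500, batch_size=8, seed=0):
    """Fit y = w * x by minimizing mean squared error with minibatch SGD."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        # Approximate grad L using a random subset of the inputs.
        batch = rng.sample(range(len(xs)), batch_size)
        # For L = mean((w*x - y)^2), dL/dw = mean(2 * (w*x - y) * x).
        grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        # Step against the gradient.
        w = w - gamma * grad
    return w

data_rng = random.Random(1)
xs = [data_rng.uniform(-1, 1) for _ in range(256)]
ys = [3.0 * x for x in xs]  # the "unknown" function we are fitting

w = sgd_fit(xs, ys)
print(f"fitted w = {w:.4f}")
```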

The widely accepted fastest-converging choice of gamma uses a heuristic called "AdamW", which looks at the first (time average) and second (time average of squares) moments of the gradient to compute gamma. AdamW stores the gradient, first moment, second moment, weight, and "fp16 copy of the weights" (matrix multiplier units typically work on fp16/bf16 inputs), a total of 18 bytes per parameter (four 4-byte values plus one 2-byte copy). This results in a large memory footprint during training - a 7B parameter network requires 126GB, a 34B parameter model 612GB, and a 175B parameter model requires a whopping 3150GB. In practice this number is even higher - intermediate activations take up space, especially at large batch size.
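The footprint numbers fall straight out of the per-parameter byte count. A small sketch, assuming the 18 bytes break down as four fp32 values plus one fp16 copy (the only split that adds up); activations are deliberately left out.

```python
# Assumed breakdown of the 18 bytes per parameter.
BYTES_PER_PARAM = {
    "gradient": 4,
    "first moment": 4,
    "second moment": 4,
    "fp32 weight": 4,
    "fp16 weight copy": 2,
}

def training_state_gb(params_billion):
    """Weight, gradient, and optimizer state in GB (activations not included)."""
    return params_billion * sum(BYTES_PER_PARAM.values())

for size in (7, 34, 175):
    print(f"{size}B parameters -> {training_state_gb(size)} GB")
```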

Now obviously 3150GB of optimizer states aren't going to fit in a single node, and even 612GB is troublesome. In order to evade having to build ever higher capacity nodes (which is electrically hard), we instead partition the parameters, states, and inputs across nodes, do our computations as much as possible on each node, then synchronize the states across nodes by passing messages. This is nothing new - it's exactly how large scientific simulations on supercomputers work. The ML people have a name for this - sharding - and the scheme used to do the computations is called 'fully sharded data parallel'.

In order to reassemble the gradient, inter-GPU communication is required. Between nodes, this is done through a network which is assumed to be "slow", and painstaking steps are taken to mask the slowness. Internal to a node, the communications are assumed to be "fast", which is where the confusion begins: Nvidia enterprise GPUs can directly pass messages to each other, as well as to Nvidia-branded network controllers, if and only if all devices exist on the same PCI-e controller. If not, the sender has to first write to a CPU buffer, which is then read by the receiver.

Now, there's technically nothing wrong with this: a PCI-e 3.0 x16 link is good for about 128 Gbit/s, the intersocket connection is good for 300+ Gbit/s, and memory is good for many Tbit/s. The issue is all of this back and forth adds latency, and current frameworks generally make no effort to hide intra-node latency (in fact, they may exploit low intra-node latencies to optimize overall performance). This means we need to carefully arrange the GPUs and NICs to be on the same PCI-e controller in order to guarantee expected performance with existing libraries.
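For reference, the PCI-e figure comes from the per-lane signalling rate. A quick sketch, using the PCI-e 3.0 parameters of 8 GT/s per lane and 128b/130b encoding; it ignores packet and protocol overhead, so real transfers come in somewhat lower.

```python
def pcie_gbit_per_s(lanes, gt_per_s=8.0, encoding=128 / 130):
    """Raw link bandwidth in Gbit/s, per direction, before protocol overhead."""
    return lanes * gt_per_s * encoding

x16 = pcie_gbit_per_s(16)
print(f"PCI-e 3.0 x16: {x16:.1f} Gbit/s ({x16 / 8:.2f} GB/s) per direction")
```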

With that being said, let's take a look at some good and bad servers on eBay that are marketed as "good for machine learning":

Gigabyte G482

At first glance, this is looking pretty good. For about half of retail, you get a current-generation octal GPU system in 4U. It looks pretty well-engineered, and the vendor's website even says you can use it for AI!

However, this is not a good AI server. In this case, the vendor is as much to blame as the listing. Let's take a look at the block diagram:

As you can see, groups of four GPUs are connected to each socket. An unfortunate quirk of Epyc is each socket contains multiple PCI-e controllers, meaning at best, pairs of GPUs can directly communicate with each other. GPUs from the two hives of four have to cross the intersocket connection to send and receive data. Finally, it's also one slot short - the left CPU doesn't have a slot for a NIC.

This is a fine server for inference if the model fits on one or two GPUs, but then what are you doing running your high capacity inference at home? 🙃

Asus ESC4000's

These are old 2U HPC nodes which make no pretense to being good at AI (they're also not good miners, since you're spending hundreds of extra watts per node running the fans, chipset, and CPUs). Despite their humble origins and low price, these make solid inference servers for latency-sensitive applications - load them up with 4x A5000 and you'll be happily running two instances of llama-2-70B at 20 tokens/sec each. And trust me, quad GPU nodes are way easier to deal with than octal GPU nodes, which tend to be mechanically fragile and difficult to ship and power.

Supermicro SYS-4028

SYS-4028 was an extremely popular AI system in the era of smaller convolutional networks. However, the most common listings come with X9DRG-O-PCIE, which is an older-generation board designed for conventional HPC:

From the 40-lane-per-socket allocation, X9DRG-O-PCIE assigns two groups of 16 lanes each to two PLX switches, which each host two GPUs. So far so good, we get four GPUs able to communicate with each other via P2P. The remaining 8 lanes go to an x8 slot on the same root complex, which is enough for a FDR Infiniband card. That's a pretty robust setup for HPC, where some of the computations might need to happen on the host - hives of four GPUs can communicate over RDMA, and get 32 GB/sec back to the socket. Unfortunately, it's suboptimal for large scale machine learning, since the hives need to communicate over shared memory (and over QPI between the sockets, too!). Instead, what you want is this:

A rather insane layout, to be sure. All eight GPUs (128 lanes) oversubscribe 32 lanes back to one socket, with the remaining 8 lanes from that socket available for connectivity. You wouldn't dream of running your physics simulations on such a system, but the layout is optimal for machine learning: hives of 4 can DMA each other through the 96-lane switch, and the two hives can DMA each other through the root complex.
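One way to see the difference between the two boards is to count how many GPU pairs can talk directly. The sketch below uses a deliberately crude, hypothetical model: each GPU is tagged with the socket whose root complex it hangs off, and two GPUs are treated as P2P-capable only if the tags match. It is not output from any real topology tool.

```python
from itertools import combinations

def p2p_pairs(root_complex_of):
    """Return the GPU pairs that share a root complex (and so can DMA directly)."""
    gpus = sorted(root_complex_of)
    return [(a, b) for a, b in combinations(gpus, 2)
            if root_complex_of[a] == root_complex_of[b]]

# X9DRG-O-PCIE style: a hive of four GPUs per socket.
x9_layout = {gpu: ("cpu0" if gpu < 4 else "cpu1") for gpu in range(8)}
# X10DRG-O-PCIE style: all eight GPUs behind switches on one socket.
x10_layout = {gpu: "cpu0" for gpu in range(8)}

total_pairs = len(list(combinations(range(8), 2)))
for name, layout in (("X9", x9_layout), ("X10", x10_layout)):
    print(f"{name}: {len(p2p_pairs(layout))} of {total_pairs} GPU pairs are P2P-capable")
```

With the hives split across sockets, fewer than half of the pairs can reach each other without bouncing through host memory.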

Happily, X10DRG-O-PCIE is available used for about $700. When all is said and done $1600 for what is the pinnacle of PCI-e based machine learning systems is not bad.

Sapphire Rapids for AI: a mini-review

2023-12-19T21:59:00.000-08:00

Intel often touts the performance of their 4th-generation Xeon Scalable "Sapphire Rapids" for generative AI applications, but there are surprisingly few meaningful benchmarks, even from Intel itself. The official Intel slides are as such:

Oof. Three ResNets, two DLRM's, and BERT-large. Come on guys, this is 2023 and no one is buying hardware to run ResNet50. Let's try benchmarking some real workloads instead.

Benchmarked Hardware

Xeon Platinum 8461V ($4,491) on Supermicro X13SEI, default power limits, 256 GB JEDEC DDR5-4800

Stable Diffusion 2.1

Everyone's favorite image generator. SD-2.1 runs on a $200 GPU, so perhaps it doesn't make sense to test performance on a $4,500 Xeon, but as Intel says, every server needs a CPU so the Xeon is in some sense, "free". We use an OpenVINO build of SD-2.1, which contains Intel-specific optimizations, but critically, is not quantized, distilled, or otherwise compressed - it should have comparable FLOPS to the vanilla SD models. Generation is at 512x512 for 20 steps.

Not too shabby. 6.27 it/s puts us somewhere around the ballpark of an RTX3060. On the other hand, you could run your inference on an A4000 (which is licensed for datacenter use) and get higher performance for just $1100, so the Xeon isn't exactly winning on price here.

llama.cpp inference

Inferencing your LLM on a CPU was a bad idea until last week. Microsoft's investments into OpenAI make it almost impossible to compete, price-wise, with gpt-3.5-turbo: since Microsoft is a cloud provider, OpenAI gets to pay deeply discounted rates over the 3-10x markup you would have to pay for cloud infrastructure. You can't escape by building your own datacenter either: without the high occupancy of a cloud datacenter, you still end up paying overhead for idle servers. This left 7B-sized models as the only meaningful ones to self-host, but 7B models are so small that you can run them on a $200 GPU, obviating the need for a huge CPU. (Obviously, there are security-related reasons to self-host, but by and large the bulk of LLM applications are not security-sensitive).

Fortunately for Intel, a medium-sized MoE model, Mixtral-8x7B, with good performance appeared last week. MoE's are unique in that they have the memory footprint of a large model, but the compute (really, bandwidth) requirements of a small model. That sounds like a perfect fit for CPUs, which have tons of memory but limited bandwidth.

It ends up being that llama.cpp, an open-source hobbyist implementation of LLMs on CPUs, is the fastest CPU implementation. At batch size 1 we get about 18 tokens per second in Q4, which is a perfectly usable result (prompt eval time is poor, but I think that's an MoE limitation in llama.cpp which should be fixed shortly?).

Falcon-7B LLM fine-tuning with TRL

This is probably the most interesting benchmark. Full fine-tuning of a 7B parameter LLM needs over 128 GB of memory, requiring the use of unobtanium 4x A100 or 4x A6000 cloud instances or putting $15K into specialized on-premises hardware that will be severely underutilized (given that you are unlikely to be fine-tuning all the time). The big Xeon is able to achieve over 200 tokens per second on this benchmark (with numactl -C 0-47 I was able to achieve about 240 tokens per second on this particular dataset at batch size 16).

It's worth noting that other 7B LLM's perform slower. I don't think this is because Falcon is architecturally different; rather, Falcon was a somewhat off-brand implementation using a bespoke implementation of FlashAttention.

200 tokens per second is pretty decent, allowing 3 epochs of fine tuning on a 10M token dataset in a bit under two days, or about a day and a half at the 240 tokens per second measured with numactl pinning (for reference, openassistant-guanaco, a high quality subset of the guanaco dataset, is about 5M tokens, and a page of text is about 450 tokens).
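The wall-clock estimate is simple division. A sketch, with a made-up helper name, using the throughput figures from the benchmark:

```python
def finetune_hours(dataset_tokens, epochs, tokens_per_sec):
    """Wall-clock hours to push dataset_tokens * epochs tokens through training."""
    return dataset_tokens * epochs / tokens_per_sec / 3600

for rate in (200, 240):
    hours = finetune_hours(10e6, 3, rate)
    print(f"{rate} tokens/s: {hours:.1f} hours ({hours / 24:.2f} days)")
```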

Conclusions: usable vs useful?

First, without a doubt the three results presented above are usable. A single-socket Sapphire Rapids machine is able to generate images, run inference on a state of the art LLM, and fine-tune a 7B-parameter LLM on millions of tokens, all at speeds which are unlikely to have you tear your hair out.

On the other hand, is it useful?

In the datacenter, we could see a case for image generation, which runs briskly on the Xeon and has a light memory footprint. The problem is, the image generation workload uses all 48 cores for three seconds, which is a poor fit for oversubscribed virtualized environments. On a workstation, the thought of having 48 cores but no GPU is ludicrous, and even AMD GPUs are going to tie the Xeon in Stable Diffusion.

The LLM inferencing use case is a bit different, being primarily bound by bandwidth and memory capacity. The dynamics here are entirely enforced by market conditions, not raw performance or technology supremacy. For example, Sapphire Rapids CPUs are available as spot instances from hyperscalers; GPUs are not. Nvidia also chooses to charge $100 per GB of VRAM on its datacenter parts, but conversely, Intel seems to think its cores are worth $100 each even in bulk. Software support for the Xeon is poor - while llama.cpp is fast, it is primarily a single-user library and has minimal (no?) support for batched serving.

The training benchmark is the most interesting one of all, because it is an example of a workload that will not run at all on most GPU instances, and in fact, there are technology limitations as to why GPUs have trouble reaching 100s of gigabytes of memory. Once again, the Xeon is held back here by Intel's high pricing - a 1S Xeon system costs about $6,000, compared to about $15,000 for a 3x RTX A6000 machine which is significantly faster.

Finally, let's take a look at Intel's most important claim: "the CPU is free because you aren't allowed to not have one". Disregarding hyperscalers which work on a different cost model, a company upgrading to Sapphire Rapids in 2023 is likely still on Skylake. Going from a 20-core Skylake part to a 32-core Sapphire Rapids part represents a 2x boost in general application performance and a ~6x boost in AI performance, except for bandwidth-limited LLM inference where the gains are closer to 2x. Getting 2x more work done with your datacenter and a competitive option to run AI-based analytics or serve AI applications on top of that is a pretty compelling reason to buy a new server, and for a lot of IT departments that's all you need to make the sale.

Assorted video decoding tidbits

2023-04-18T06:57:00.003-07:00

Some form of hardware acceleration is all-but-mandatory for playing 4K video, especially high bitrate H.265. Nowadays H.265 decoding is commonplace (you need to go all the way back to 2015 to find a CPU or GPU that can't decode HEVC Main10), but presumably as the world transitions to AV1 the same will apply.

Hardware accelerated decode in web browsers on Linux

It works! Well, sort of. On my test machine with an i3-12100F and an ancient Polaris 12 (AMD) GPU running the open source drivers, 1080p H.264 content is properly decoded by UVD but 4K60 VP9 (the famous '4K Costa Rica' demo clip on Youtube) is not. CPU usage seems a bit high in either case, about half a core in the former and 1-2 cores in the latter.

Decode on integrated graphics, display connected to discrete graphics

An esoteric use case. I ran into the H.265 version (!); I had an M4000 I wanted to use for Solidworks and the M4000 does not support HEVC decode, but the iGPU on the i5-12600 it was paired with does.

Unfortunately, this doesn't seem to work. iGPU usage remained zero and the M4000 ran in some sort of weird hybrid decoding mode. Performance, however, was acceptable.

Decode in integrated graphics, multiple GPUs and displays

Can we fix the above by plugging the monitor into the iGPU? The behavior is strange:

Both GPUs are now at 30% usage, but the CPU usage has gone through the roof. Very bad indeed, but possibly fixable with enough effort.

Remarkably, even in this bugged state the Costa Rica clip runs at 60 fps.

VLC + H.265, but the decoding is supposed to be done on the integrated graphics

Unfortunately, the iGPU chooses not to participate, but the hybrid decoding seems to work fine, playing back 4K24 high bitrate video with about 12% CPU usage. It's worth noting however that this is 12% of a 4.8GHz hex core Alder Lake, which is like..an entire laptop CPU from not that long ago or two whole 3GHz Skylake cores.

Heavy decode on integrated graphics

Surprisingly good. On my Kaby Lake laptop, the CPU is able to keep up with about 25% usage while remaining throttled to ~1.6 GHz on battery - the fixed function hardware really does the heavy lifting here and keeps the power consumption down.

Hardware accelerated decode, but there are many cores

The test system was an Epyc 7702 with an RTX 3060, by all accounts close to the state of the art. I didn't expect problems here, and didn't find any; the 3060 ran at heavy usage on the Costa Rica clip and the CPU was basically idle.

It's unclear what the actual CPU usage was; Task Manager lacks the granularity to deal with so many cores since even 1% is almost an entire core.

"Phones are getting better": buying a camera in 2023

2023-02-21T05:56:00.002-08:00

It's a tough time to be a camera manufacturer. The mighty ISOCELL HP2 now rules the mobile space, sporting 200 million (!) 0.6um pixels binnable as 12.5M 2.4um pixels. Subelectron read noise, backside illumination, very deep wells, and sophisticated readout schemes allow virtually unlimited dynamic range while not compromising light sensitivity. Practically speaking, the out-of-the-box performance of a state-of-the-art mobile camera greatly exceeds that of any ILC in challenging-but-well-lit conditions: the phone has access to live gyro data for stack alignment, more processing than any ILC could dream of, and is backed by hundreds of millions of dollars of software R&D. It also has access to color science that, no doubt, has been statistically developed to be perceived as "good looking" across a wide demographic of viewers - I'm a firm believer that the best photos are the ones that make other people happy.

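We can do some math to see just how screwed ILC's are. The Galaxy S23 Ultra ships with a 23mm f1.7 equivalent lens and a sensor measuring 9.83mm x 7.37mm, for a total sensor area of 72mm2. A full frame sensor measures 864mm2. For a fixed field of view, total light gathered scales with sensor area and inversely with the square of the f-number, so the equivalent f-number scales with the square root of the area ratio (the crop factor). That gives the following equivalency: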

ISOCELL HP2 (1/1.3"): f1.7
4/3": f2.9
APS-C: f3.9
Full frame: f5.9

At the wide end, ILC's are looking pretty dead: 24mm f5.6 is a reasonable aperture and focal length to shoot at on full frame, and the same performance can be achieved with a phone. There's some argument that the FF sensor has higher native DR, but the phone has what is more or less a hardware implementation of multi-shot HDR which makes up for the difference. Plus the phone is, you know, a phone, and fits in your pocket.

Astute readers will note that 23mm is awful wide, and it's true - the effective sensor area of the phone decreases if you want a tighter focal length. Taking a look at a 2x crop (46mm equivalent), the sensor area of the phone drops to a rather shabby 18mm2, so the equivalency is now:

4/3": f5.9
APS-C: f7.8
Full frame: f11.7

The ILC suddenly looks much more compelling - if you forced me to shoot at 45mm f11 all day I'd abandon photography and take up basket weaving.
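Both equivalence lists above can be reproduced with a few lines. This is a sketch under stated assumptions: the full frame figure comes from the square root of the sensor area ratio, and the smaller formats are then derived with the conventional crop factors of 1.5 (APS-C) and 2.0 (4/3"), which are my assumption about how the lists were built. The names are illustrative.

```python
import math

FULL_FRAME_AREA_MM2 = 36.0 * 24.0          # 864 mm2
PHONE_W_MM, PHONE_H_MM = 9.83, 7.37        # ISOCELL HP2 sensor dimensions
PHONE_F_NUMBER = 1.7
# Assumption: conventional crop factors relative to full frame.
CROP_FACTOR = {"Full frame": 1.0, "APS-C": 1.5, '4/3"': 2.0}

def equivalent_f_numbers(phone_crop=1.0):
    """f-number each format needs to match the phone's total light gathering.

    phone_crop=2.0 models cropping the phone sensor to a 2x tighter view,
    which cuts its effective area by 4x.
    """
    phone_area = PHONE_W_MM * PHONE_H_MM / phone_crop ** 2
    ff_equiv = PHONE_F_NUMBER * math.sqrt(FULL_FRAME_AREA_MM2 / phone_area)
    return {name: round(ff_equiv / crop, 1) for name, crop in CROP_FACTOR.items()}

print("23mm equivalent:", equivalent_f_numbers())
print("46mm equivalent:", equivalent_f_numbers(phone_crop=2.0))
```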

This makes shopping a whole lot easier: phones obsolescing the moderate-wide-end means that zooms which include 50mm are pretty useless. Suddenly, 50mm primes look real interesting again, especially since we are now spoiled for choice in the 50mm space. 4/3" cameras, which looked pretty dead for a while, also suddenly look viable: subjects you would shoot with a 50mm prime are often DOF-limited which means the larger sensors can't take advantage of faster apertures.

Things get trickier at longer focal lengths, because you are less likely to be DOF-limited. Wait, what?! Don't telephotos have a shallower depth of field? Well, it turns out for moderate focus distances, the DOF of a lens is proportional to the f-number, and inversely proportional to the square of the magnification. More likely than not, telephoto subjects are large, and since the sensor area is constant in a given camera the DOF actually increases if you stand far away enough to fit the subject on the sensor.
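To put numbers on that, here is a sketch using the standard close-to-moderate distance approximation, total DOF = 2 * N * c * (1 + m) / m^2, where N is the f-number, c the circle of confusion, and m the magnification. The 0.03mm circle of confusion (a common full frame convention) and the two subject sizes are assumptions for illustration.

```python
def depth_of_field_mm(f_number, coc_mm, magnification):
    """Approximate total DOF, valid when the subject is well inside the hyperfocal distance."""
    m = magnification
    return 2 * f_number * coc_mm * (1 + m) / m ** 2

SENSOR_WIDTH_MM = 36.0   # full frame
COC_MM = 0.03            # assumed circle of confusion for full frame

# Magnification is set by how large the subject is relative to the sensor,
# not by the focal length you used to frame it.
for label, subject_width_mm in (("head shot", 360.0), ("large animal", 3600.0)):
    m = SENSOR_WIDTH_MM / subject_width_mm
    dof = depth_of_field_mm(2.8, COC_MM, m)
    print(f"{label}: m = {m:.2f}, DOF at f2.8 = {dof:.0f} mm")
```

A subject ten times larger, framed the same way, gets roughly a hundred times the depth of field at the same f-number.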

Telephotos provide a compelling argument to buy a full-frame body. Regardless of the sensor format telephotos are going to stay a constant size because they are dominated by their large front elements, and in the types of lighting you might want to use a telephoto lens you are often struggling for light. That premium 35-100/2.8 for 4/3" looks real nice until you remember it is the optical equivalent of a 70-200/5.6 on full frame, a lens so sad that they don't even make one.

Finally, there's image stabilization. I would argue that IS is mandatory for a good experience, especially for new users: for stationary subjects IS gives you something like three stops of improvement, allowing stabilized cameras and lenses to beat un-stabilized cameras with sensors ten times the size. The importance of that cannot be emphasized enough: that 18mm2 of cropped phone sensor can gather as much light as a 180mm2 (nearly 4/3 sized) sensor with an f1.7 lens on it factoring in IS. This unfortunately throws a wrench in many otherwise-sound budget combos: short primes didn't ship with IS until quite recently, and many budget bodies are un-stabilized.

With all that said, here are some buying suggestions:

The $500 "I'm poor" combo

The situation is dire. Long ago, I would have recommended an old 17-50mm f2.8 stabilized zoom from a third party and a used entry-level DSLR. Unfortunately, you'd be insane to tell a new user to buy an entry-level DSLR in 2023 (people expect features like "working live view" and "4K video") and the third-party zooms don't work with most mirrorless cameras. What we really want is a stabilized 40-50mm f2-2.8 equivalent (that's 40-50mm equivalent focal length, f2-2.8 real aperture) on a body that supports 4K video and PDAF, but inexplicably, that combination does not exist, even in the micro-4/3 world (which has had IBIS for a long time).

Consolation prize: any of the 24MP Nikon DSLR's, plus a Sigma 17-50/2.8 OS, used, but I wouldn't recommend it.

The $1000 combo

This used to be a downright reasonable price point, but inflation and feature creep have somewhat diluted it. Fortunately, the long-lived life cycles of Sony cameras help you here: a6500's are regularly available, used, for $600-700 leaving you with $300 for a lens. By some crazy miracle of third-party lenses you can fit two autofocus lenses into $300 - the Rokinon 35/2.8 is cheap and small, and the other can be 'to taste'.

The drawback here is that as a new system, long lenses for Sony tend to be rather inaccessible, but the same can be said of any other mirrorless-only system and the starter offerings for the Canon/Nikon ecosystem are very poor in comparison. It's also worth noting that there's nothing good at this price point new.

Recommendation: the most beat-up a6500 you can find, a used Rokinon 35/2.8, and one other lens, or save up and buy a second lens which costs more than $150 :)

The photographer's special: a D800, 50mm 1.8G, and 70-200 VR1. You lose a lot of features (stabilization, eye AF, 4K video, touchscreen) but optically the D800 is as good as they get, and the two lenses will let you take pictures none of your friends can. Highly recommended if you've spent some time behind a camera before - otherwise, the transition from phone to optical viewfinder may be a bit jarring.

The dubious alternative: an a7R ii and Rokinon 45/1.8. The a7R ii checks every box: stabilization, full frame, BSI, 4K video, but still manages to be a poor user experience nonetheless thanks to its ill-thought-out controls and menus. If you're fine with that, you get unsurpassable (as in, the sensor is limited by the laws of physics) optical performance for $1000.

The $1500 combo

Things start getting a little weird here. The a6500 is a really good camera, and it's nice and compact too. The E-mount ecosystem matured quickly, with a ton of off-brand companies making decent prime lenses at ludicrously good prices. I would argue if you're content shooting short-to-moderate focal lengths, you are better off staying in the E-mount ecosystem - you can buy an a7iii and a nice starter prime for $1500, then (quickly) build out the system from there.

If you want to shoot long lenses, Sony no longer looks so sweet. The 70-200 options from the big three are comparable: the Sony GM Mark I is a native lens which is about $1500, the same price as the adaptable-without-penalties Nikon -FL but optically inferior. The EF-mount Mark III is the same price but worse than the FL (and better than the GM); the EF-mount Mark II is $500 less and probably superior to the Nikon VR2. The Nikon VR1 is incredibly cheap for a modern pro lens, but the corners are dubious which is a disaster for some people and a non-issue for others.

Above 200mm, Sony is out - the 200-600 is a very good budget 600mm option but pretty pathetic at 300 and 400mm. It's also very expensive: if you accept the extending barrel, the 150-600 options from third parties are $700 cheaper. Among Canon and Nikon, Nikon wins on the budget end (the Z6 was a very usable camera, the original EOS R was not), but Canon just announced some new releases so we should expect prices to move down across the stack.

Recommendation: a7iii plus your favorite primes (or a7R ii and your favorite primes if you don't shoot video at all)

Recommendation: Z6 Mark I, FTZ, 40mm f2, and 70-200 VR1. This comes out to $1800, and the VR1 recommendation is going to make a lot of people angry, but you'll be getting beautiful images for years to come (or until you drop the extra $1000 on the FL).

"I have money, help me spend it"

I'm a Nikon shooter so I'll just provide a Nikon kit. This assumes you have plenty of cash, but you aren't interested in wasting it.

Body: Z6 Mark I. The Mark II is not worth the extra cash (unless you really need the second slot). Also, an FTZ to go with it.

Short lens: 40mm f2. There are so many options here and they're all good, so really, pick your poison.

Telephoto: Unfortunately, save up for the FL. It's not that much better than the VR2, but it fixes the focus breathing on the VR2 that was problematic at portrait focal lengths. The FL also performs exceptionally with teleconverters, so pick up a TC14 and TC20 III and get your 280mm f4 and 400mm f5.6 for "free".

Long lens: A sane man would recommend the 200-500 f5.6. I would also put a vote in for the 300/2.8 AF-S, which is only about $700 used and works well with teleconverters to get to 600mm. A madman would go all in and buy a 600/4 but seriously, don't do it - shooting with that lens is a serious commitment.

Ultrawide: Get fucked. You absolutely want a native ultrawide since the short flange distance on the Z bodies makes them much better, but Nikon refuses to provide a 16-35/2.8 which doesn't suck. I guess the 17-28/2.8 will have to do? The runner up would be the 14-30/4, but f4 isn't f2.8.

Sony will give you the same thing, trading a better ultrawide for a worse long lens. Canon is...honestly superior, I think, but I am not familiar enough with Canon to make a $5000 recommendation.

The best little telescope that'll never get built

2022-09-16T02:30:00.001-07:00

I really love the IMX183 from Sony. For a very good price, you get 20 million BSI pixels with photon-counting levels of read noise and negligible dark current - truly a miracle of the economies of scale. The sensor is also large enough to have good light gathering capabilities, yet small enough to work with compact optics.

The problem is taking advantage of all 20 million of those pixels. 20 MP isn't that much - 5400-pixel horizontal resolution leaves you with precious little room to crop to 4K so actually resolving 20MP is important. You can't really buy a telescope that resolves 2.4um pixels - in the center anything diffraction limited will work, but if you want a fast, wide system getting good performance in the corners isn't happening. Obviously, the answer is to design your own telescope.

Now, I have zero interest in making telescopes - grinding your own lenses is ass and generally a poor use of time. If I wanted to spend time doing repetitive tasks I'd pick up embroidering or something. As such the system is designed to be reasonably manufacturable by an overseas vendor.

The telescope is a Houghton-type design with integrated field flattener, a focal length of 150mm, an entrance pupil of 76mm (3"), and a 6-degree field of view covering a 16mm sensor. The optical performance is pristine:
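The first-order numbers for this design are quick to derive. A sketch, assuming 550nm (green) light for the diffraction calculation and the 2.4um pixel pitch mentioned earlier; the Airy disk diameter uses the usual 2.44 * wavelength * f-ratio.

```python
import math

FOCAL_LENGTH_MM = 150.0
APERTURE_MM = 76.0
SENSOR_DIAGONAL_MM = 16.0
PIXEL_UM = 2.4
WAVELENGTH_UM = 0.55   # assumption: green light

f_ratio = FOCAL_LENGTH_MM / APERTURE_MM
fov_deg = 2 * math.degrees(math.atan(SENSOR_DIAGONAL_MM / 2 / FOCAL_LENGTH_MM))
airy_diameter_um = 2.44 * WAVELENGTH_UM * f_ratio
# 206265 arcseconds per radian; pixel size converted from um to mm.
plate_scale_arcsec = 206265 * (PIXEL_UM / 1000) / FOCAL_LENGTH_MM

print(f"f/{f_ratio:.2f}, {fov_deg:.1f} degree field")
print(f"Airy disk: {airy_diameter_um:.2f} um across vs {PIXEL_UM} um pixels")
print(f"plate scale: {plate_scale_arcsec:.1f} arcsec/pixel")
```

The diffraction spot comes out close to one pixel across, which is why there is so little room for aberrations anywhere in the field.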

Normally, I would not vouch for a form like this - the telescope is much longer than its focal length, the image plane is inside the tube, and two full-aperture refractive elements are needed. However, in this case, we are building a small, tightly-integrated system: polishing a 3" lens is easy, the difference between a 6" and 10" OTA is negligible from a practical standpoint, and the IMX183 can easily be fit into the tube.

The glasses aren't great - the automatic glass substitution tool came up with BSM28 and BSM9 which are moderately costly, infrequently-melted glasses. The chemical properties are poor, basically akin to those of ED glass, but 77mm clear filters are cheap and readily available. The elements don't have any serious manufacturing red flags, though the curvatures are a bit steep compared to your usual refractor doublet.

Image a DLP down through a microscope objective and look at it with a camera through the same objective

2022-03-20T19:58:00.002-07:00

For some reason it's taken me a while to actually do this, but it's easy to expose high resolution film or resist with a DLP. All you need is a camera, a beam splitter, a DLP, and some microscope parts.

The DLP is an ordinary DLP projector with the lens removed - I used a TI dev kit because the firmware offers nice features like "disable the LEDs" which can be used to, for example, replace the blue LED with a violet one for exposing resist. Unless you have a really exotic objective it is probably best to use a g-line (430nm) LED rather than a 405nm LED - the 430nm parts are hard to find but objectives will be quite dispersive in the far violet unless they are specifically designed to be apochromatic from violet to red. Your best bet is probably to leave the green LED in place for focusing and replace the blue LED with a violet one. A regular pico-projector will work just fine, but is less convenient to work with. You do save about $800 at current street prices though.

The light from the DLP reflects off a beam splitter and passes through the tube lens, which is just a 200mm a(po)chromat. A 200mm doublet would probably work fine here, but you can get better corner performance from a triplet or a quadruplet - the DLP is not very large though, so it might not matter. The tube lens collimates the light and sends it down the microscope objective, which focuses it on the target.

The camera looks down the other beam splitter path and sees the light scattered off the target. Unfortunately, this technique only works with optically smooth targets - otherwise, the camera sees the surface texture of the target and not the imaged patterns displayed on the DLP.

Parfocalizing the camera and the DLP is easier than it seems - the object side of the tube lens is something like f/50 so the depth of field is very large. Roughly speaking, there is only a small range of distances where the objective comes into focus at all. Either by reading the datasheet or by trial and error it is possible to roughly set the backfocus, then the camera and target position are adjusted for best focus. Once a starting position is found, the backfocus can be adjusted to minimize spherical aberration (microscope objectives gain some spherical aberration if the conjugates aren't the right ones).
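To see where a figure like f/50 comes from, here is a small sketch using the paraxial relation between objective NA, magnification, and the f-number on the tube-lens side. The objective and DLP values below are hypothetical placeholders, not the parts used in this build, and the magnification is assumed to be the rated value with the objective's design tube lens.

```python
def image_side_f_number(magnification, objective_na):
    """Approximate f-number on the tube-lens (DLP/camera) side of an objective.

    Paraxial relation: the NA on the tube-lens side is objective NA divided by
    magnification, and the f-number is roughly 1 / (2 * NA).
    """
    return magnification / (2.0 * objective_na)


def projected_pixel_um(dlp_pitch_um, magnification):
    """Size of one DLP mirror after demagnification onto the target."""
    return dlp_pitch_um / magnification


# Hypothetical values, not taken from the build described above.
MAGNIFICATION = 50.0   # objective magnification with its design tube lens
OBJECTIVE_NA = 0.5     # objective numerical aperture
DLP_PITCH_UM = 7.6     # micromirror pitch, assumed

f_number = image_side_f_number(MAGNIFICATION, OBJECTIVE_NA)
pixel_um = projected_pixel_um(DLP_PITCH_UM, MAGNIFICATION)
print(f"tube-lens side is about f/{f_number:.0f}")
print(f"one DLP pixel lands as {pixel_um * 1000:.0f} nm on the target")
```

A beam that slow is very forgiving of small errors in DLP and camera position, which is why the parfocal adjustment is mostly about the objective and target.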

The pixel scale in the capture is 560nm/pixel, so the Windows clock is only a few microns long. Performance is as expected but it is always entertaining to use the Windows desktop on a 1mm wide screen :)

The State of the CPU Market, early 2022

2022-01-30T22:32:00.001-08:00

It's been a year of shakeups in the high-end CPU market, what with catastrophic supply chain shortages, the rise of AMD, and Pat Gelsinger's spearheading of Intel's return to competency. Now that the dust has mostly settled, it's interesting to look at what's hot, and what's not.

The 5950X is still king...

Alder Lake i9 really gives AMD a run for the money, but the 5950X is still the king of workstation CPUs, especially now that you can buy it. It has aggressive boost clocks which allow it to beat every Skylake and Cascade Lake Xeon Platinum (including the 28-core flagships), consistent internal design which scales well in every application, consistent and manageable 142W power limit, and bonus server features like ECC support. ADL is good, but the 250W PL2 really kills it for workstation use (a good 12900K build requires serious effort to get right), and because of the heterogeneous internal layout it fails to scale on some operating systems and in some applications.

Scaling is really important in this era of 16, 32, and 64-core processors; many applications completely fail to scale past 16 cores, and even those that do exhibit much less than linear returns. As a result, those 16 highly clocked cores punch above their weight when it comes to real-world results - a 4.x GHz Zen 3 core isn't actually twice as fast as a 2.8 GHz Skylake core, but 16 4.x GHz Zen 3 cores can still outperform 28 Skylake cores because the 12 extra cores are doing less work.
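One way to put numbers on this is Amdahl's law, which is my choice of model here and not something measured on these chips. The sketch below compares 16 fast cores against 28 slow ones; the 1.5x per-core speed ratio and the parallel fractions are assumptions picked for illustration.

```python
def amdahl_speedup(cores, parallel_fraction):
    """Amdahl's law: speedup over one core for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def throughput(cores, per_core_speed, parallel_fraction):
    """Relative throughput: single-core speed times the Amdahl speedup."""
    return per_core_speed * amdahl_speedup(cores, parallel_fraction)


# Assumed numbers: a fast core is 1.5x a slow core (hypothetical ratio).
FAST_CORES, FAST_SPEED = 16, 1.5
SLOW_CORES, SLOW_SPEED = 28, 1.0

for p in (1.0, 0.99, 0.95, 0.90):
    fast = throughput(FAST_CORES, FAST_SPEED, p)
    slow = throughput(SLOW_CORES, SLOW_SPEED, p)
    winner = "16 fast cores" if fast > slow else "28 slow cores"
    print(f"parallel fraction {p:.2f}: fast={fast:.1f} slow={slow:.1f} -> {winner}")
```

Under these assumptions the 28-core part only wins when the workload is almost perfectly parallel; a few percent of serial work is enough to hand the result to the 16 faster cores.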

...Except when it's not

The elephant in the room, of course, is single-threaded performance. Alder Lake outruns Zen 3 by about 20%, which is an enormous leap in x86 performance. For all intents and purposes, ADL is an 8-core CPU with 8 bonus cores that you can't really rely on. If you commit to the 8-core life (which encompasses a lot of applications), Alder Lake suddenly looks a lot more enticing, because 8 Golden Cove cores are 20% faster than 8 Zen 3 cores.

Of course, part of the reason why ADL can do this is because Intel 10 ESF ("7 nm") is a high performance node designed to scale to aggressive clocks at high voltages, and TSMC N7 is a SoC node designed for lower clocks and voltages. The price you pay is that those 8 Golden Cove cores draw twice as much power as 8 Zen 3 cores to perform 20% more work, which isn't very good engineering.

In the end, Zen 3 and Alder Lake are mostly complementary products. If your workflow is interactive content creation, gaming, or design work, ADL is right for you. If you're building a machine mostly to handle long renders and simulations, the 5950X is the best processor under $2000 for the job.

What about HEDT?

HEDT is a curious thing. In the beginning, desktop and server parts were cut from the same silicon. Starting with Nehalem, Intel experimented with bifurcating their designs into laptop (Lynnfield) and server (Gulftown) variants with rather drastically differing designs - Lynnfield was a SoC with an on-die PCIe root complex offering 16 lanes, Gulftown was a traditional design with an off-package PCIe controller offering 48 lanes. The bifurcation makes sense - laptops rarely need more than 16 PCIe lanes, whereas servers need dozens, or even hundreds, of lanes for accelerators, storage, and networking.

The bifurcation really took off starting with Sandy Bridge; Intel aggressively marketed the 2600K, which was cut from the mobile silicon. Sandy Bridge-E, the server based variant, filled a niche, but the platform was expensive and to top it off, unlocked processors based on the full 8-core SB-EP die were never released.

Since then, HEDT has come and gone - it hit a real low during the Broadwell-EP era, but experienced a resurgence with Skylake-X, which competed against the dubious-and-not-really-recommendable Threadripper 1 and 2. Unfortunately, Ice Lake-SP gives off Broadwell-EP vibes - namely, the process it is built on does not have the frequency headroom required to make a compelling desktop platform. This leaves AMD relatively unchallenged in the high end space:

The 24-core 3960X is currently a dubious choice over the 5950X - supply is poor, power consumption is high, and performance is not that much better. If you need balanced performance with good PCIe it's not a bad choice, but there are cheaper (Skylake-X, used Skylake-SP) and faster (the other Threadrippers) offerings in the category.
The 32-core 3970X is a good processor for most applications. Thanks to the blessing (or curse) of multicore scaling, it comes within striking distance of the 3990X in most applications at half the price, while offering the full suite of Threadripper features.
The 64-core behemoth 3990X is...not a very good choice, mostly due to extreme pricing ("$1 per X") and really bad scaling. Fortunately, it wields a very competent implementation of turbo, so it is never slower than the 3970X.
Threadripper Pro ("Epyc-W") is everything you've ever wanted, but is expensive and platform options are limited.

There are also a few interesting choices in the server space, with the usual caveats (long POST times, no audio, no RGB):

Dual Rome or Milan 64-core processors offer unmatched multithreaded performance, but not much can take advantage of 256 threads.
Dual 32-core Epycs are an interesting choice, offering performance comparable to a 3990X but with four times the aggregate memory bandwidth for all your sparse matrix needs.
Dual low-end Ice Lake (e.g. 5320, 6330) offers AVX-512 support and high memory bandwidth at a price and performance comparable to those of a 3990X, but may be more available. Unfortunately, 2P ICL motherboards are rather expensive.

As far as used options go, Haswell-EP and older are finally ready to retire, unless you really need RDIMM support. A pair of 14-core Haswell processors performs worse than a 5950X at twice the power, with all the caveats of 2P platforms attached. Otherwise:

Dual Skylake-SP is an OK choice, simply because Skylake Xeons are entering liquidation and Epyc Rome is not. Technologically, Skylake has no redeeming features over Rome, but the fact that you can pick up a pair of 24-core Platinums for slightly more than $1000 is interesting. It's worth noting only the 2P configuration is interesting; 1P Xeon is generally slower than a 5950X.
Epyc Naples is bad. Don't do it. Threadripper 1 falls in the same category; the only time you'd consider either of these is if you found a motherboard in the trash or something.

Summary

"My application scales indefinitely with core count"

No, it doesn't. But for this class of trivially-parallelizable application (rendering, map/reduce, dense matrix), the 5950X is a safe bet. The most extreme cases can benefit from one of the high core count platforms (Threadripper, Epyc, Xeon Scalable), but be careful to benchmark the applications first - the 5950X wields a considerable clock speed advantage over the enterprise platforms which often swings things in its favor.

"I only need 8 cores"

The 12700K is probably your friend here, it's strictly faster than the 5800X. This category encompasses most content creation and all of CAD (minus simulation).

"Give me PCIe or give me death!"

This encompasses all of machine learning, plus anything which streams data in and out of nonvolatile storage. The 3960X is perfect for you, but in case it's out of stock (which it probably is), the winner is...the 10980XE, which is fast enough to feed your accelerators and generally available. Of course, die-hard accelerator enthusiasts are going to look to more exotic platforms, and there, the platform dictates the choice of CPU.

"I'm out of RAM"

If your application requires more than 256GB of memory, Epyc-W is the CPU for you. Unfortunately, it is rather expensive, so the second place prize, and the bang-for-the-buck prize, goes to a pair of used 24-core Xeon Scalable, which gets you pretty darn close to Epyc-W for $1200.

A Small Astrograph with a Large Payload

2021-09-21T07:36:00.002-07:00

Building a large telescope is hard; designing a small telescope is hard. What exactly do I mean by that? Well, there are parts of the telescope that don't scale well with size, for example, the instrument payload, the filters, or the focusing actuators. More often than not, a design which works well on a 1m-class instrument fails to scale down to a 300mm-class instrument because the payload is incompatible with the mechanics, or is so large that it fills the clear aperture of the instrument.

A small telescope should also be...small. A good example of this is the remarkable unpopularity of equatorially-mounted Newtonians; a parabolic mirror with a 3-element corrector offers fast focal ratios and good performance, but an f/4 Newtonian is four times longer than it is wide, which gets unwieldy even for a 300mm diameter instrument.

The Argument for Cassegrain Focus

Prime focus instruments are popular as survey instruments in professional observatories. However, they fail to meet the needs of small instruments because of:

Excessive central obscuration. A 5-position, 2" filter wheel is about 200mm in diameter. In order to maintain a reasonable central obstruction, a 400mm clear aperture instrument is required which is marginally "small". Any larger-diameter instrumentation requires a 0.6m+ class instrument which is outside of the scope of many installations.
Unreasonable length. The fastest commercially available paraboloids are about f/3. Anything faster is special-order and very expensive. An f/3 prime focus system is actually longer than 3 times its diameter because of the equipment required to support the instrument payload.
Challenging focusing. For a very large system, actuating the instrument is the correct method for focusing because even the secondary mirror will be several tons. For a small system, reliably actuating 10+ kg of payload with no tilt or slip in a cost-effective fashion is rather unpleasant.
Too fast. A short prime focus system is necessarily very fast, complicating filter selection. A very fast system also performs poorly combined with scientific sensors with large pixels.

The commercially available prime focus instruments (Celestron RASA/Starizona Hyperstar, Hubble HNA, Sharpstar HNT) are designed for use with small, moderately-cooled CMOS cameras, possibly with a filter wheel in case of the Newtonian configurations. The RASA is wholly unsuited for narrowband imaging because a filter wheel would almost cover the entire aperture.

A Cassegrain system solves these issues by (1) allowing for moving-secondary focusing (2) roughly decoupling focal ratio from tube length and (3) moving the focal plane to be outside of the light path.

The 50% Central Obstruction

A 50% CO sounds bad, but by area the light loss is 25%, or less than half a stop. A 300mm nominal instrument with a 50% CO has the light gathering capacity of a 260mm system, which is pretty reasonable. The 50% CO also makes sizing the system an interesting exercise, because at some point the payload will be smaller than the secondary and prime focus makes sense again.
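The arithmetic behind those figures is short enough to sketch. This treats the obstruction as a simple circular blockage and ignores spider vanes and diffraction effects, which is an assumption on my part.

```python
import math


def obstruction_stats(aperture_mm, co_fraction):
    """Light loss from a central obstruction given as a fraction of diameter."""
    area_loss = co_fraction ** 2                 # blocked area scales with diameter squared
    transmitted = 1.0 - area_loss
    equivalent_aperture_mm = aperture_mm * math.sqrt(transmitted)
    stops_lost = -math.log2(transmitted)
    return area_loss, equivalent_aperture_mm, stops_lost


area_loss, equivalent_mm, stops = obstruction_stats(300, 0.5)
print(f"area loss {area_loss:.0%}, equivalent aperture {equivalent_mm:.0f} mm, "
      f"{stops:.2f} stops lost")
```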

The Design

The Busack Medial Cassegrain is a really nice telescope that this design draws inspiration from, but it requires two full-aperture elements, each with two polished sides, which makes it ill-suited to mass production. Instead, we build the system as a Schmidt corrector, an f/2 spherical mirror, and a 4E/3G integrated corrector. There's really nothing to it - by allowing the CO to grow and using the corrector to deal with the increasing aberrations, an f/4 SCT is entirely within the realm of possibility. There's a ton of freedom in the basic design; the present example makes the following tradeoffs:

f/4 overall system allowing for the use of an f/2 primary (which we know is cheaply manufacturable based on existing SCT's). f/4 also allows for the use of commodity narrowband filters.
400mm overall tube length (not counting back focus) is a good balance between mechanical length and aberrations. 50mm between the corrector and secondary allows ample space for an internally-mounted focus actuator.
160mm back focus allows for generous amounts of instrumentation including filters, tip-tilt correction, and even deformable mirrors.
Integrated Schmidt corrector allows for good performance with no optical compromises.
Corrector lenses are under 90mm in diameter and made from BK7 and SF11 glass, all easily fabricated using modern computer-controlled polishing.

The total length of the system could also be shortened, and the corrector diameters reduced, by increasing the primary-secondary separation and reducing the back focus, depending on instrument needs. Overall performance is quite good, achieving 4um spot sizes in the center and a high MTF across the field.
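For a feel of how the tradeoffs listed above interact, here is a paraxial two-mirror sketch. It is not the actual prescription: it assumes a 300mm aperture, measures back focus from the primary vertex, and ignores the optical power of the Schmidt plate and the corrector lenses entirely.

```python
def cassegrain_layout(aperture_mm, primary_f_number, system_f_number,
                      back_focus_mm):
    """Paraxial Cassegrain layout from f-ratios and back focus.

    The secondary sits a distance p in front of the prime focus and re-images
    it a distance q behind itself, with q = m * p, p = f1 - separation and
    q = separation + back_focus.
    """
    f1 = aperture_mm * primary_f_number            # primary focal length
    m = system_f_number / primary_f_number         # secondary magnification
    separation = (m * f1 - back_focus_mm) / (m + 1)
    p = f1 - separation
    # On-axis beam diameter at the secondary (minimum size, zero field).
    min_secondary_mm = aperture_mm * p / f1
    system_focal_mm = aperture_mm * system_f_number
    return system_focal_mm, m, separation, min_secondary_mm


# 300mm aperture is an assumption; f/2, f/4 and 160mm come from the list above.
focal_mm, mag, sep_mm, sec_mm = cassegrain_layout(300, 2.0, 4.0, 160)
print(f"focal length {focal_mm:.0f} mm, secondary magnification {mag:.1f}x")
print(f"primary-secondary separation {sep_mm:.0f} mm")
print(f"on-axis secondary diameter {sec_mm:.0f} mm ({sec_mm / 300:.0%} of aperture)")
```

Under these assumptions the separation lands near what a 400mm tube with a 50mm corrector-to-secondary gap would suggest, and the on-axis secondary is already over 40% of the aperture before any field coverage is added, which is roughly why the CO ends up around 50%.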

Actually Building It?!

Obviously, you are not going to make a 300mm Schmidt corrector and a four-element, 90mm correction assembly at home. This design is probably buildable via standard optical supply chains (the hardest part would be getting someone who is neither Celestron nor Meade to build Schmidt correctors). The correction assembly should also be further improved - there are a huge number of choices for its configuration and the 'correct' one is probably the one that is most manufacturing-friendly.

Shoot me an e-mail in case you are crazy and want to do something with the prescription for this design!

GCA 6100C Wafer Stepper Part 2: the stages

2021-07-09T00:13:00.004-07:00

The modern wafer scanner is a truck-sized contraption full of magnets, springs, and slabs of granite capable of accelerating at several g's while maintaining single-digit nanometer positioning accuracy. The motion systems contained within painstakingly try to optimize for dynamic performance by using active vibration dampening, voice coils, linear motors, and air bearings, all to increase the value of the machine for its owner (who spent a good fraction of a billion dollars on it).

As it turns out, an 80's stepper is none of these things. Scanners are immensely complex because they are dynamic systems - as the wafer moves in one direction, the reticle moves in the other direction, perfectly synchronized but four times faster. In contrast, steppers are allowed time to settle between steps, which allows for much more leeway in the motion system design. Throughput requirements were also lower; compare the 35 6" wph of an old stepper to the 230 12" wph of a modern scanner.
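Wafers per hour understates the gap, since the wafers also got bigger. A quick sketch of the exposed area per hour, treating wafers as full circles with no edge exclusion (a simplification of mine):

```python
import math


def wafer_area_cm2(diameter_inch):
    """Area of a circular wafer, ignoring flats, notches and edge exclusion."""
    radius_cm = diameter_inch * 2.54 / 2
    return math.pi * radius_cm ** 2


def area_throughput_cm2_per_hour(wafers_per_hour, diameter_inch):
    return wafers_per_hour * wafer_area_cm2(diameter_inch)


stepper = area_throughput_cm2_per_hour(35, 6)     # old stepper, from the text
scanner = area_throughput_cm2_per_hour(230, 12)   # modern scanner, from the text
ratio = scanner / stepper
print(f"stepper: {stepper:.0f} cm^2/h, scanner: {scanner:.0f} cm^2/h, "
      f"ratio {ratio:.1f}x")
```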

Old stepper stages are an instructive exercise in the design of a basic precision motion system; in fact, Dr. Trumper used to give this exact stage out as a controls exercise in 2.171. The GCA stages are also particularly interesting from a hardware perspective - they are carefully designed to achieve 40nm positioning accuracy using fairly commodity parts. The only precision parts seem to be the slides for the coarse stage, and even those are ground, not scraped.

The stage architecture

System overview

GCA steppers use a stacked stage architecture. Coarse positioning is done by two conventional mechanical bearing stages stacked on top of each other. Fine positioning is done by a single two-axis flexure stage. Rotational positioning, which only happens for alignment, is done using a simple open-loop, limited travel stage mounted on the fine stage. Focusing, which is done by changing the Z spacing between the lens and the wafer, is done by moving the optical column up and down with a linkage mechanism.

The position feedback system

The fine position feedback on GCA steppers is implemented through a two-axis HP 5501A heterodyne interferometer. Briefly, a stabilized HeNe laser is Zeeman split through a powerful magnet to create two adjacent lines separated by a few MHz with different polarizations. One of these lines is separated with a polarizing beam splitter and reflected off a moving mirror; this line is Doppler shifted due to the velocity of the moving mirror and beat against the stationary component to generate a signal. This signal is compared against a stationary REF signal to derive velocity and position measurements. Heterodyne interferometers are the preferred choice for metrology due to their insensitivity to ambient effects and power fluctuations.
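The measurement principle can be sketched in a few lines: the Doppler shift moves the measurement beat frequency away from the reference, and counting the difference in cycles gives displacement. The 2 MHz split frequency, the 632.8nm wavelength, and the double-pass factor for a plane-mirror interferometer are my assumptions here; this is the textbook relation, not the 5501A's actual electronics.

```python
WAVELENGTH_NM = 632.8    # nominal HeNe wavelength, assumed
SPLIT_HZ = 2.0e6         # Zeeman split frequency, hypothetical ("a few MHz")
PASSES = 2               # a plane-mirror interferometer hits the mirror twice


def doppler_shift_hz(velocity_m_s, passes=PASSES):
    """Frequency shift of the measurement beam for a mirror moving at velocity."""
    return 2 * passes * velocity_m_s / (WAVELENGTH_NM * 1e-9)


def displacement_nm(meas_cycles, ref_cycles, passes=PASSES):
    """Each net cycle of (MEAS - REF) is lambda / (2 * passes) of mirror travel."""
    return (meas_cycles - ref_cycles) * WAVELENGTH_NM / (2 * passes)


# Simulate a mirror moving at 10 mm/s for 1 ms (10 um of travel).
velocity = 0.010
duration = 1.0e-3
ref_cycles = SPLIT_HZ * duration
meas_cycles = (SPLIT_HZ + doppler_shift_hz(velocity)) * duration
travel_nm = displacement_nm(meas_cycles, ref_cycles)
print(f"Doppler shift {doppler_shift_hz(velocity):.0f} Hz, "
      f"recovered travel {travel_nm:.1f} nm")
```

Because only the difference between two AC signals matters, a drop in laser power changes the amplitude of both but not the cycle count, which is the insensitivity mentioned above. The sign of the shift depends on the direction of motion and on which polarization is sent to the moving mirror.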

The 5501A is the de facto choice for interferometric metrology; its successor the 5517 is still available from Keysight. A description of the system as found in the GCA steppers is as follows:

The laser points towards the rear of the stepper; a 10707A beam bender and a 10701A 50% beam splitter generate the two axes of excitation. The X and Y stages have identical measurement assemblies; the Y assembly is located to the rear of the stepper (behind the column) and the X assembly is located inside the laser housing. Both assemblies use a plane-mirror interferometer which differentially measures the wafer position against the optical column; the stationary mirror is a corner cube mounted to the column and the moving mirror is a 6" long dielectric quartz block mirror mounted to the wafer stage. The flats are precision shimmed to ensure orthogonality (since it is the orthogonality of the flats which determines the closed-loop orthogonality of the motion).

There are two additional position sensors in the system. The first is a sensor to measure the position of the fine stage relative to the coarse stage. Literature indicates that this is an LVDT, but on the 6100C it appears to be implemented as two photodiodes outputting a sin/cos type signal. The second is a brushed tachometer on each of the coarse stage drive motors, which is used for loop closure by the stock controller.

The coarse stage

The purpose of the coarse stage is to position the fine stage to within 0.001" of its final position. The stage is built as a pair of stacked plain-bearing stages; these stages are driven by brushed DC motors with brushed tachometers for velocity feedback. The motors go through a right-angle gearbox comprising a bevel gear and several spur gear stages before being coupled by a flexible coupling to a long drive shaft which turns a pinion positioned near the center of each stage. This pinion drives a brass rack mounted to the stage which generates the final motions.

The fine stage

The fine stage is constructed as a parallel two-axis flexure stage with a few hundred microns of travel on each axis. The flexures are constructed from discrete parts; the stage is made from cast iron and the flexures themselves are constructed from blue spring steel. Actuation is by moving-coil voice coil motors with samarium-cobalt magnets, and position is read directly from the interferometer system.

The theta stage

The theta stage is a limited travel stage based on a tangent arm design. A (very small) Faulhaber Minimotor is coupled into a high reduction gearbox, which drives a worm gear that turns a segment of a worm wheel. The worm wheel pushes on a linkage which rotates the wafer stage about a pivot point.

Rotation control is entirely open-loop - the wafer is rotated once during the alignment process based on the fiducials observed through the alignment microscopes. A slow open-loop system is acceptable given that the speed of rotational alignment does not significantly affect wafer throughput.

The Z mechanism

The focusing mechanism is a limited-travel (according to literature, about 600um) flexure mechanism. The entire optical column is suspended on two large spring steel plates; a stiff spring counterbalances the weight of the column. A voice coil motor (identical to the fine stage VCMs) actuates a linkage mechanism which moves the column up and down.

Adjusting the mechanism is a bit subtle. The white rod sticking out is actually a tensioning mechanism for the counterbalance; it is possible to aggressively tension the spring to stiffen the assembly for transport. The cap at the end of the rod can be removed to reveal a nut and a piece of threaded rod with a flathead in it. You want to hold the rod in place with a screwdriver and crank on the nut with a wrench until the column just barely 'floats' in place.

Incidentally, this mechanism also reveals a fairly severe weakness of the focusing system - it is extremely undamped. Any disturbances on the column cause the whole assembly to ring like a bell, with the only source of damping being the resistance of the VCM. I think (though there is some information to the contrary) that 6000-series GCA steppers focused once per wafer, relying on wafer leveling to keep the image in resist in focus between fields. Otherwise if the focusing had to be highly dynamic there could be problems.

GCA 6100C Wafer Stepper Part 1: Intro and Maximus 1000 Light Source

2021-05-23T05:07:00.002-07:00

The yellow lights make it look more legitimate

I have always wanted to expose a wafer. I'd written off making my own transistors long ago (nothing that fits in a house is good for feature sizes small enough for interesting logic, and I'm not a good enough analog engineer to design interesting analog), but there are many useful optical and mechanical parts that can be made lithographically.

The usual route to home lithography is a microscope and a DLP, but the resultant ~2mm field sizes are not sufficient for mechanical parts and stitching a 20mm field out of 2mm subfields is very taxing on your motion system. Contact aligners are simple and perform well, but getting submicron resolution for interesting optical parts out of a contact aligner is challenging (the masks also get quite expensive).

The natural solution is to start with a stepper lens (which is basically a giant microscope objective with very bad color correction). There are a few variants - 1:10 lenses with a 10x10mm field, 1:5 lenses with a 14x14mm field, and 1:4 lenses, which weigh several hundred kg and have a 20x20mm field. Stepper lenses also come in several colors: g-line (436nm), i-line (365nm), and DUV (~250nm).
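To make the field-size argument concrete, here is a small sketch of two pieces of arithmetic: how many microscope-sized subfields it takes to tile a target field, and how large the pattern has to be on the reticle for each lens family. The field sizes and ratios are the ones quoted above; the tiling assumes square fields with no overlap, which is optimistic.

```python
import math


def subfields_needed(target_mm, subfield_mm):
    """Square subfields needed to tile a square target field, no overlap."""
    per_side = math.ceil(target_mm / subfield_mm)
    return per_side ** 2


def reticle_field_mm(wafer_field_mm, reduction):
    """Side length of the pattern on the reticle for a 1:reduction lens."""
    return wafer_field_mm * reduction


tiles = subfields_needed(20, 2)
print(f"a 20mm field from 2mm subfields takes {tiles} exposures")

# (reduction ratio, wafer-side field in mm) as quoted in the text.
for reduction, field_mm in ((10, 10), (5, 14), (4, 20)):
    side = reticle_field_mm(field_mm, reduction)
    print(f"1:{reduction} lens, {field_mm}x{field_mm}mm field -> "
          f"{side}x{side}mm on the reticle")
```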

I wound up with a 1:5 g-line lens; the 1:5 lenses strike a good balance between performance and unwieldiness. I also had a set of stages pulled from a DNA sequencer good for a couple microns of resolution. The rough plan was to stack a fine stage on top of these and use a direct-viewing technique to perform alignment. However, the project quickly went south when I realized building an exposure tool entailed buying the parts out of an...exposure tool. Conveniently, a circa 1985 GCA DSW 6100C showed up for more or less scrap value near me, so one rigging operation later I was the proud owner of a genuine submicron stepper.

The DSW family of steppers are true classics; GCA Mann practically invented the commercial stepper in the late 70's. The GCA steppers remained more or less unchanged until the company's demise; everything from the g-line DSW 4800 to the AutoStep 200 shared a stage design, alignment system, and mechanical construction (unfortunately, they also all shared a terrible 70's-grade electronics package!). A number of GCA tools still survive in university fabs, mostly converted to manual operation. Briefly, the design consists of:

A cast-iron base with a cast-iron 'bridge' holding the optical column.
A stacked stage consisting of two coarse mechanical bearing stages driven by servomotors, two fine flexure stages constructed as a single unit driven by voice coil actuators, and an open-loop, limited-travel rotation stage driven by DC motors.
Feedback provided by an HP 5500-series interferometer that meters the displacement between two mirrors mounted to the optical column and two flats mounted to the fine stage.
A reticle alignment stage consisting of a small flexure actuator and fine-pitch screws for adjustment.
A focusing system using a photoelectric height sensor and a linkage mechanism that adjusts the entire optical column height (!) with a travel range of around 1mm.
An alignment system using two fixed microscopes to align the origin and rotation of the wafer.
A high-pressure mercury arc lamp with a homogenizer and filter (MAXIMUS) to illuminate the reticle with Kohler illumination of the appropriate wavelength.

My copy showed up in an interesting state of disrepair - the laser and alignment microscopes were missing (why anyone would want the alignment microscopes is beyond me), and the Maximus made rattling noises. The first step was to repair the light source.

Inside the Maximus 1000

Life before LEDs was bad. Arc lamps produce a concentrated point of light a few mm across, and turning that into uniform illumination across a 4" reticle is challenging. Now, normal people use an elliptical collector, a condenser lens, and a fly's-eye homogenizer to produce uniform illumination, but not GCA.

Instead, the inside of the Maximus looks like this:

The arc lamp goes in the center; the four identical assemblies each collect 1/4 of the arc lamp output.

The top left is a condenser lens assembly. The diagonal mirror is a cold mirror (it dumps IR into a heatsink not shown); the round filter below it is a narrowband filter for the design wavelength (in this case, 365nm). So far, reasonable. But, where you would usually see a homogenizer after the filter, there is instead a focusing lens. This lens focuses the lamp output into four fiber lightguides, which bundle into a single lightguide on the other end. The output of this lightguide is then imaged onto the reticle in the usual fashion by an illumination lens. This arrangement, while very complex, has a neat benefit: the characteristics of the illumination are solely determined by the light guide. The NA of the fibers sets the illumination field size, and the diameter of the output bundle determines the NA of the illumination. The illumination is perfectly uniform, since every fiber perfectly illuminates the whole field; missing fibers will only result in a slight overall loss of intensity.

As luck would have it, practically every screw in the Maximus was loose, and the bulb was snapped in half. The rebuild took a couple hours, and was greatly improved by removing the head from the stepper - dealing with loose lenses is much easier when you are not six feet off the ground (if by some chance you are reading this and also servicing a GCA stepper, removing the Maximus is easy - just pull the four socket head screws at the base of the condenser, un-route the shutter cables and lamp cables, and the unit lifts right off).

I haven't had a chance to check performance yet, as the bulb needs replacement. The Maximus uses Ushio USH-350DP bulbs. Of critical note: the USH-350DP is a two-screw-terminal bulb designed for aligners. The Maximus uses a screw-on "bullet" on one end to convert it to a plug-in type; if you are changing bulbs, don't throw out the plug!

Additional GCA resources

Here is a collection of various official GCA manuals scraped off the internet, mostly from university sites. The information in the manuals is helpful for understanding how the system works. If you are intent on actually using the stock GCA controller, the manuals are pretty much mandatory, since the PDP-based software is not very user friendly.
Here are various pieces of documentation from third parties (once again, mostly academic fabs). Additionally, there are several good DSW guides:

UCSB has an i-line 6300
Cornell has a g-line 6300
UMD has a g-line 4800

The K39 mini-ITX case

2020-10-11T06:28:00.005-07:00

Intro: the tiny and elusive K39

The K39 is the world's smallest case with discrete GPU support. No one really knows where it came from; there are several K39 variants listed on Chinese shopping site Taobao. There's even one listed on Amazon, with Prime shipping to boot, but at $148 with no PSU, it's of questionable value.

K39 specifications

The K39 is an odd case; it throws away…every feature…to achieve its small size. It has no drive bays, no external ports, and of course no lighting. Instead, it relies on onboard storage and rear I/O ports (though it is worth noting there are obscure K39 variants with a single USB port and a single 3.5mm jack). However, it is incredibly small - at 3.9 to 4.5L depending on the variant, it is even smaller than the 5.0L NFC SkyReach 4 Mini. Like the S4 Mini, it is limited to short (180mm) video cards. It also supports standard flex-ATX PSUs, though there are many caveats...

The K39 power supply

While the K39 can mount generic flex-ATX PSUs, it is, for all intents and purposes, a case with a built-in PSU. The K39 PSUs are very cheap 80Plus Gold rated units, built from recycled server parts.

In order to make the PSU more compact, the stock cable harnesses are cut short and soldered into a modular breakout board. Innovatively, the PSU uses thin, high-temperature silicone wire, which is both flexible and capable of carrying high currents. Due to cable routing restrictions, the modular cables are more or less essential for operation.

While it is possible to buy a K39 with no power supply, there isn't really a reason to do so. The modular supplies cost much less than the competition, and are available up to 600W, which is much more than the case can dissipate.

The build

The actual computer inside this case is strange and sort of terrible. It uses a long-discontinued ASRock X99 board, a 120W 14-core Xeon, and an R9 Nano. The ASRock board has gained some sort of strange cult following and costs as much now as it did new (or maybe there are a ton of people who need 128GB on an ITX board?). The Xeons are very cheap, but not very fast, with any cost savings over a Ryzen 3000 CPU immediately nulled by high motherboard prices. The Nano was never very good, it performs somewhere between a 1060 6GB and a 1070, but is crippled by its 4GB of VRAM. A dubious perk is that it is almost the fastest short card supported by macOS; the AXRX 5700 ITX can only be imported from China for a huge amount of money.

However, the components serve their purpose as being a maximum challenge testbed for the build. The X99e-ITX/ac is a very hard board to build around, especially with a 47mm cooler restriction. My previous thin X99 build had an 83mm clearance which was still quite difficult to work with - it required a discontinued Cooler Master cooler, custom waterjet stainless brackets, and a machined-down 120mm fan. The Nano's TDP is also at the top end of short cards; the other contenders (the 1070, RX 5700, and 2070) have similar ratings.

Build notes

There's surprisingly little to say here. The K39 variants are all a little awkward to work with because they require a complete disassembly for component installation. On this flavor, the front panel comes off to reveal a freestanding motherboard tray. The I/O shield pops into the outer shell, then the tray with installed motherboard and riser slides in. The PSU and PSU cables go in next, followed by wiring, the GPU, and finally the front panel. This is where the super-flexible PSU cables come in handy; it would be impossible to route normal cables in the case.

The standard PSUs actually have a SATA power connector on them, but there is no room in the case for a 2.5" drive. My understanding is that folks who have 2.5" drives in the case use foam tape to affix them to one of the side panels.

Cooling

I was targeting a laptop-like acoustic profile on this build; that is, quiet when idle and loud and hot under load. I had originally wanted to use a 1U Dynatron vapor chamber active cooler. Unfortunately, the Dynatron was more or less unusable; it ran extremely hot (60C! at idle) and was amazingly loud. Even with a 50mV undervolt and a custom fan curve that ran minimum speed until 75C, the blower would randomly spool up with even one core active.

It was clear that some more "engineering" was needed. Fortunately, 47mm just barely clears a 1U passive cooler (29mm) with a 15mm thick fan stacked on top. To find a fan, I took to the trusty old technique of disassembling stock coolers; stock coolers are often laughed at, but to get sufficient cooling performance out of a small, cheap heatsink requires a serious fan. I ended up using a 70mm, 8.4W fan out of an FM2+ stock cooler:

The fan required a bit of minor machining (it had mounting feet that put it over the height limit). Some brackets were drawn up and printed…

…and the whole thing was put together with some screws and 3M VHB tape.

The small 40mm fan is critical; without it, the CPU would cook the SSD enough to severely throttle, meaning a lengthy cooldown period was necessary after heavy loads. It also cools the PCH by about 15C, which is not too bad for such an anemic fan.

The cooler bolts into place neatly, and the VHB seems to handle the high temperatures just fine.

Performance

We'll start with the bad news: the 2683 v3 is no longer fast. It does score a healthy 180 fps on KeyShot, but that's merely the performance of a 9900K, a CPU with six fewer cores. On the other hand, it does perform like a $350 CPU for $120, so if rendering is all you do it's not a bad choice.

There are a couple of ways to tweak performance. One tweak specific to X99 + Xeon is to undervolt the CPU by 50mV. Furthermore, most boards allow custom-tweaked fan curves:

This is pretty necessary with a small, noisy fan; most boards idle too high by default. With a tweaked fan curve and small undervolt, the CPU idles at a slightly-warm-but-not-concerning 50C:

And a fairly nominal 66W:

Load is much more interesting. Most boards have some manner of power tuning available. On this particular board, the electrical design current (EDC) was settable, but unfortunately, the limit did not seem to correspond to actual amps. Thankfully, it was monotonic...setting an EDC of 80 resulted in a load power consumption of about 180W:

Delta-over-idle of 114W corresponded to an all-core speed of 2.3GHz and a load temperature of 78C.

Importantly, neither the SSD nor the DIMMs overheated, though the memory does get quite warm:

Removing the current limit results in an even 200W of power consumption, representing a 134W delta-over-idle.

At this point, the fans get really loud, but temperatures are still in control:

The RAM is now looking uncomfortably hot - some heatspreaders might be warranted...
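The delta-over-idle figures quoted above are plain subtraction from the 66W idle reading. A minimal sketch for checking them (the names are mine, and treating the delta as the heat the CPU cooler has to move is an approximation, since some of it is lost in the VRMs and PSU):

```python
# Measured figures quoted in the text, in watts.
IDLE_W = 66.0
LOAD_EDC80_W = 180.0     # load power with EDC set to 80
LOAD_UNLIMITED_W = 200.0 # load power with the current limit removed


def delta_over_idle(load_w, idle_w=IDLE_W):
    """Extra power drawn under load relative to idle."""
    return load_w - idle_w


delta_limited = delta_over_idle(LOAD_EDC80_W)        # 114 W
delta_unlimited = delta_over_idle(LOAD_UNLIMITED_W)  # 134 W
print(delta_limited, delta_unlimited)
```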

Conclusion

We learned today that it is possible to dissipate 135W with a 47mm cooler. We also learned the importance of ambient airflow - the 40mm fan doesn't move much air, but was absolutely critical for success. In addition, we learned that Haswell Xeons have underwhelming performance in 2020, though for the price they are pretty solid. Fortunately, a lot of this is still applicable to the upcoming Ryzen 5000 CPUs; 135W is perfect for getting stock performance out of a 105W Ryzen 5000. True masochists might also consider the EPC621D4I; with careful tuning, a 28-core Xeon Platinum may be possible.

Cambridge Technology 6230H galvo short teardown

2020-04-25T23:38:00.002-07:00

Cambridge Technology's galvos are popular as the de facto high performance galvanometer scanner. The highest performing models are moving magnet scanners; these are conceptually similar to a single phase brushless motor. However, in galvo duty the rotor never completes a revolution, instead scanning back and forth with a maximum range of around 40 mechanical degrees (+/- 20 degrees). This range is small enough that rotor position is not part of the torque generation loop (and in fact, with a single set of coils, it is impossible to control the stator current phase); instead, galvos operate as current-amplitude-to-torque converters.

The galvo in this teardown is a 6230H, a mid-sized model still in production. The rotor (second from the left, bottom row) is a radially magnetized, single-piece sintered neodymium magnet with a very long aspect ratio. This aspect ratio maximizes the torque-to-inertia ratio of the rotor - torque scales as LR*R = LR^2, whereas rotor inertia scales as MR^2 = LR^2*R^2 = LR^4, so torque-to-inertia falls off as 1/R^2. I'm not sure why further optimizations weren't made to the shaft; for example, a hollow shaft and/or a shaft made of an exotic alloy would have reduced inertia further, and the CT galvos are not particularly cost-sensitive products.
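The scaling argument can be checked numerically. This is only a sketch of the proportionalities stated above (torque ~ L*R^2, inertia ~ L*R^4), with all constants dropped, so the absolute values mean nothing and only ratios are meaningful:

```python
def torque_to_inertia(radius, length=1.0):
    """Relative torque-to-inertia figure for a solid cylindrical rotor.

    Uses only the proportionalities from the text; units are arbitrary.
    """
    torque = length * radius**2   # air-gap area (L*R) times lever arm (R)
    inertia = length * radius**4  # mass (L*R^2) times R^2
    return torque / inertia


# Doubling the radius costs a factor of four in torque-to-inertia...
penalty = torque_to_inertia(1.0) / torque_to_inertia(2.0)

# ...while making the rotor longer costs nothing in this simple model.
same = torque_to_inertia(1.0, length=1.0) / torque_to_inertia(1.0, length=5.0)
print(penalty, same)
```

This is why a long, skinny rotor wins: length adds torque and inertia in equal proportion, while radius adds inertia faster than torque.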

The stator (top left) is epoxied into the galvo housing with (hopefully thermally conductive) epoxy. To avoid saturation, the stator is a complex air-cored winding, similar to what is found in high performance servomotors. Not having stator iron has the added benefit of greatly reducing stator inductance, which could limit the electrical step response of the system. A coreless stator means the short-term current of the system is only limited by the fusing of the stator windings, and, in practice, by demagnetization of the rotor PM's due to off-axis current at the ends of travel (since the stator field does not rotate to stay in sync with the rotor field).

The real voodoo in the Cambridge galvos is the position sensor, consisting of the quadrature photodiode assembly in the bottom left. This is used in conjunction with the IR illuminator (bottom row, third from the left) to measure the rotor position with impressive accuracy. 8uRad short term repeatability equates to 16b+ of angle data over 40 degrees of travel, and a linearity of 99.9% open loop is very nearly 10 bits with no additional calibration.
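The bit counts follow from treating the repeatability and linearity figures as the size of one resolvable step. A quick sketch of that arithmetic (the interpretation of "bits" as log2 of range over step is my assumption about how the numbers were derived):

```python
import math

TRAVEL_DEG = 40.0          # full mechanical travel from the text
REPEATABILITY_RAD = 8e-6   # short term repeatability from the text
LINEARITY_ERROR = 1 - 0.999  # 99.9% linearity means 0.1% error

# Number of distinguishable positions across the travel, expressed in bits.
repeatability_bits = math.log2(math.radians(TRAVEL_DEG) / REPEATABILITY_RAD)

# Error as a fraction of full scale, expressed in bits.
linearity_bits = math.log2(1.0 / LINEARITY_ERROR)

print(round(repeatability_bits, 2), round(linearity_bits, 2))
```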

Overall, no surprises - this is a state of the art galvo and the design and construction show it. The motor part is nothing fancy (I'm sure you could copy it with a little help from China), but the position sensor would require quite the R&D to duplicate, especially since the little photodiode "slices" look like custom parts.

Extended Schmidt-Cassegrain

2019-12-07T16:20:00.004-08:00

Schmidt-Cassegrain Telescopes (SCT's) are incredibly cheap on the secondhand market - 8" OTA's are around $350, 10" (Meade)/11" (Celestron) OTA's under $1000, and even the mighty C14 can be had for under $2000 on a good day. Compared to, say, a Ritchey-Chretien telescope (a Cassegrain with a hyperbolic primary) this is an incredible deal - a generic 10" GSO RC sells for about $2500 new and is rarely available on the used market. These low prices are largely due to decades of experience in mass production combined with huge volumes for the visual market.

Unfortunately, if you're interested in photographic rather than visual use, SCT's have traditionally been a questionable choice. SCT's are only corrected on-axis; off axis, they suffer from both coma and field curvature - very severe field curvature at that, thanks to the folded design. Now, RC's (and their relative, the Meade ACF telescopes) are really not that much better - they are coma-free, but they replace coma with severe astigmatism and still suffer from field curvature.

Despite the generally poor reputation, the SCT design has a key merit - it uses spherical mirrors. A spherical mirror with a stop placed at the center of curvature has just two aberrations: spherical aberration and field curvature. This is because the mirror "looks the same" from the point of view of every field angle, so all of the asymmetric aberrations are inherently absent. The Schmidt corrector on an SCT corrects for spherical aberration, which means the only remaining aberration is field curvature, right?

Obviously not, because SCT's are full of coma. This is because the corrector (stop) on a commercial SCT is intentionally moved closer than one radius to the primary in order to shorten the overall length of the system and improve handling. This results in performance like so:

Performance is poor - at the corners of a full-frame field (21mm, about 1 degree), RMS spot sizes reach 108 microns, and this system has virtually no contrast past 10LP/mm (Nyquist for 25um pixels) in the corners.

However, let's move the corrector out to ~660mm from the primary, and add a single concave lens. Performance now looks like this:

Performance is pretty decent, with 21 micron RMS spots in the corners at best average focus, greatly reduced coma, and good contrast even at 50 lp/mm.

The weakly diverging lens (130mm radius of curvature, e.g. Newport KPC067) in front of the image plane serves to cancel out the inherent field curvature of the system and substantially improves performance. It also extends the focal length slightly (from 2500mm to 3000mm), which is not great unless your camera has a KAF-1001E or similar large-pixel sensor. It has the additional disadvantage of adding some chromatic aberration, as it is a singlet with nonzero net optical power - for one-shot color imaging it might be best to just leave it out and accept the field curvature. For narrowband (or even LRGB) imaging, just refocus for each wavelength and it should work fine.
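For a sense of what the focal length change does to the field, the angle subtended by the 21mm full-frame corner can be computed directly. This is just thin-lens geometry using the 2500mm and 3000mm figures quoted above; the function name is mine:

```python
import math

CORNER_RADIUS_MM = 21.0  # image height at the full-frame corner, from the text


def full_field_deg(focal_length_mm, image_radius_mm=CORNER_RADIUS_MM):
    """Corner-to-corner field of view in degrees for a given focal length."""
    return 2.0 * math.degrees(math.atan(image_radius_mm / focal_length_mm))


fov_stock = full_field_deg(2500.0)     # roughly 0.96 degrees
fov_extended = full_field_deg(3000.0)  # roughly 0.80 degrees
print(round(fov_stock, 2), round(fov_extended, 2))
```

So the "about 1 degree" figure holds at 2500mm, and the field shrinks by roughly a sixth once the diverging lens stretches the focal length.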

The verdict: yes, it does work. The only other option for correcting an SCT is the Starizona SCT corrector, which is a great product; however, at $599 for the full-frame version it only makes sense on one of the larger instruments. This scheme is obviously more labor intensive - actually implementing it is closer to "build a new telescope using the glass from a commercial SCT" than "modifying a commercially available telescope". Your best bet if you actually build this would be a secondhand C8 ($300), a Chinese focuser ($100) and an a7S ($600 used), a combination which gives a well-corrected ~1 degree field of view.

Freefly Systems ARC200 teardown

2019-03-06T01:29:00.000-08:00

Most motor controllers are bad. They range from not really doing motor control (any hobby controller, eBike controllers) to being electrically questionable (small VESCs) or having flaky (SimonK and BLHeli, which in addition, don't do real motor control) or confusing (Sevcon) firmware, to being mechanically questionable (most 'servo drives').

On the surface, the ARC200 doesn't really distinguish itself; nominally, a '200A 48V inverter with FOC' isn't much different from any of the large VESC variants out there. However, none of the big VESCs are great (questionable layout, too much electrolytic capacitance, terrible connector choices, too expensive), and the ARC200 sits at a comfortable price point above the bad Chinese controllers but well below the VESCs and industrial servo drives. In addition, I am good friends with several engineers at Freefly, and had high hopes that their involvement in the product would make it not bad.

What Makes a Motor Controller?

When most people evaluate a motor controller, they immediately jump to the power stage. However, the power stage is but a small part of the system, and with power MOSFETs getting cheaper and better, device selection and layout are becoming less and less critical.

All DC-operated inverters share the following equally important building blocks:

Microcontroller and firmware: You can't really screw up the implementation of a microprocessor, but boy is it possible to screw up an implementation of FOC. The core algorithms are probably right (you quickly notice swapped variables or extra factors of -1), but managing the rest of the state machine is much harder. Startup conditions, integral windup, throttle bounds, and interrupt priority are a few of the many ways to go wrong. The firmware is probably the hardest to test, as some edge operating conditions are difficult to trigger on the bench.

On the other hand, writing good motor control firmware is no different from writing any other kind of software, but most hardware engineers (and many software engineers!) don't have formal training in writing robust code.

Low voltage power supply (LVPS): The LVPS is a DC-DC converter that takes the DC link voltage and generates the 12-15V and 3.3-5V rails to power the gate drive and logic, respectively. The LVPS is a somewhat tricky part to design; typically, it is built using an off-the-shelf SMPS controller IC. Commercial SMPS controllers are "black boxes" that usually expect relatively clean DC input and a slow load. Inverter applications, in contrast, are very noisy, since the DC link is full of transients from the power stage. Cheaper or very low-voltage controllers will usually run the logic off an LDO, possibly with a resistor in series. This suffers from a similar problem; in fact, an unfiltered LDO is guaranteed to pass input transients to the output, as no linear regulator is capable of handling sub-microsecond input changes.

LVPS failures will usually take out the entire inverter, since the failing logic and gate drive rails will put the entire control stage in an invalid state for several milliseconds, leading to desaturation or shoot-through of the power stage. In addition, failure of the LVPS to regulate (due to an input transient) will often damage the control stage by passing the transient to the output.

Isolation and gate drive: The gate drivers turn the power MOSFETs on and off. There are various levels of sophistication; the dumbest gate drivers are just a pair of complementary BJTs, the smartest ones integrate capacitive isolation, desat detection, and fault signaling. Closely related is how the gate drives are powered. Most low-voltage controllers power all six off a single 12 or 15V rail, and use bootstrap capacitors and diodes to generate the high-side voltages. This adds the obvious benefit of simplicity, but makes layout a little tricky (the 12V rail has to fan out to all six gate drives) and has the rather large disadvantage of connecting the logic and power grounds. Circulating ground currents can then potentially upset the microcontroller and LVPS.

In contrast, high voltage inverters always fully isolate the gate drives, for safety reasons. This has the neat benefit of completely separating the control and power grounds (indeed, most 300V+ controllers require a separate low voltage power supply as input), but adds a ton of complexity.

I/O: Control inputs are dangerous, because they often run over a long wire. Small controllers rarely isolate their inputs, which means analog and serial control cables contain a wire directly connected to the ground of the microcontroller. Needless to say, attaching a large antenna to logic ground and having it pick up every switching transient in the system often leads to poor performance.

Industrial servo drives almost universally have optically isolated inputs. In this case, the control signal drives the LED in an optoisolator, removing the need to bring logic ground out of the controller. Annoyingly, most hobby controllers marketed as 'opto' don't actually have optoisolated inputs; 'opto' in this case only means that the controller does not provide a 5V accessory power supply.

Power stage: And now, we finally get to the power stage. The choice of power device is practically a non-issue in this day and age, but layout still requires some consideration, especially in small controllers (where cost and compactness considerations sometimes lead to electrical compromises). In particular, very high-current, low-voltage controllers have trouble finding space for enough copper to carry the full phase current, and sometimes run into package current limitations as well.

Surprisingly enough, high-voltage inverters are much easier to lay out. The smaller ones (400V, up to ~50A) are covered by fully integrated 'smart power modules', and the larger ones (up to ~300A) are covered by sixpack IGBT modules (which contain 3 half bridges sharing a DC link, but no gate drive).

Selecting capacitors is also somewhat of a black art. Electrolytic capacitors are mostly resistors at high frequencies, and a poor implementation of an electrolytic DC link capacitor can be worse than nothing (the energy stored helps blow up the devices in case of a failure). On the other hand, insufficient DC link capacitance leads to high ripple current in the DC link cables (which could be long, and therefore potential sources of EMI) and high voltage transients (switching spikes aside, the average DC link voltage + half the peak to peak ripple needs to remain under the voltage rating of the devices). For large high-voltage inverters, the cost and weight of the capacitors is often equal to that of the switching devices.

Electromechanical components: Connector selection is a matter of taste and application. The automotive and aerospace industries have stringent isolation, water-resistance, and even coloration requirements. An inverter for an OEM robotics application might value weight and compactness over waterproofing, and an industrial controller would use common connectors found in automation. There are clear wrong choices (pin headers come to mind), but never a single "best" connector. The same goes for housings and thermal management.

The Actual Teardown

Phew, that was a lot of intro. This teardown has an associated Flickr album containing high-res images, because nothing sucks more than not being able to see the part number on an IC. For convenience, some of the images will be reproduced here, but are limited to Blogger's 1600x1200 resolution.

Overview, Microcontroller, and LVPS

The ARC200 is constructed as a two-board stack; the top board contains the LVPS, microcontroller, and additional logic, and the bottom board carries the power devices, gate drive, and capacitors.

The microprocessor is an STM32F746NGH6, which is a serious part that costs almost $12 in quantity. For better or for worse, this is the largest microcontroller I have seen in a motor control application (it beats out the F446RE in my own designs). In addition, the microcontroller is connected to a 512MB SPI Flash, so there is plenty of room for expansion and future features if so desired. The main logic rail is generated by a LM5116, which is a 100V-capable buck controller IC.

Visible towards the right (near the I/O connectors) are a MCP2562 CAN transceiver, a LMV612 dual op-amp, and a TLP2361 dual optocoupler. The CAN transceiver's purpose is obvious (though it is worth noting CAN is not included in the available external interfaces), the LMV612 probably serves to buffer analog throttle and the TLP2361 probably buffers various forms of digital throttle. Also visible are some small DIP switches, which probably shouldn't be flipped.

Additional Logic

The backside of the logic board reveals several additional components. A number of LMV612's likely provide additional analog functionality. To the left, a TPS542941 buck regulator generates the 5 and 3.3 V rails, augmented by a healthy number of ceramic capacitors. To the right, the capacitors and MOSFETs (low side, high side) for the LVPS are visible - the LM5116 does not integrate switches. A diode likely provides reverse polarity protection. On the upper right, there is an ST M24128 - a somewhat strange choice given the amount of FLASH available on the F7, but perhaps it saves flash wear?

On the bottom left, a tiny Rigado BMD-350 provides BLE connectivity. This is a trick to avoid having to FCC certify the 2.4GHz part of the controller, as well as making layout easier (implementing RF SoC's can be tricky).

Also visible all around the board are the short board stacking headers that connect the logic deck to the bottom board.

Gate Drives

The gate drive uses a standard bootstrapped design with some twists. Because level-shifted gate drive IC's top out at 4A (and even those are somewhat fragile) and fully isolated drives are very expensive, the gate drive stage uses FAN3122 9A discrete drivers, bootstrap diodes, and Silabs SI8620BB digital isolators. In addition to providing high-side control, the isolators serve to somewhat protect the microcontroller from transients in the power stage. A column of tiny linear regulators powers the isolators.

A number of diodes provide various functionalities including bootstrapping, turn-on/turn-off time separation, and gate protection.

Also visible here are the DC link capacitors, six 180uF 63V parts. The lack of ceramic capacitors is disappointing, but realistically, at the RMS currents the ARC200 targets, 100V/200uF of ceramics would have been required, an expensive proposition at any time and especially now during the ceramic capacitor shortage.

Power Stage

The power stage uses four TPH2R608NH devices per switch. These are very economically priced parts, only 67 cents in full reel quantity for a 75V 2 mohm FET with reasonable gate charge. In fact, they are cheap enough that to the right, three more of them are used to provide anti-spark functionality. Current sensing is done with three (!?) huge 250uohm shunts. Appearances are deceiving; the SOIC device next to each shunt is not a shunt amplifier, but rather another SI8620BB. Instead, current sense amplification is likely implemented using several of the many LMV612's on the logic deck.

Mechanical Design

The thermal path to the case is provided by several thick thermal pads. Heatsinking of the power stage is done entirely through the top of the packages; this is often worse than through vias in the board, but allows for more design flexibility (and the Toshiba FETs in particular have pretty thin top epoxy). The springs in the top image are used to keep the logic board pressed down against the power board, which could have implications for reliability in high-g applications. Silicone sealing is visible along the entire seam of the enclosure (the nature of this sealing means once an ARC200 is disassembled, it will be difficult to waterproof it again).

Additional Commentary

This is a pretty good controller. The logic stage is intense, featuring three buck converters, SPI flash memory, and possibly the largest microcontroller I have ever encountered in a motor control application. Freefly's implementation of sensorless FOC is class-leading, and the computing power of the F7 leaves room for potential new algorithms (such as HFI for salient motors). I generally don't believe in sensorless control (magnetic encoders are cheap and easy), but that kind of feature is perhaps relevant at this level. The LVPS is nothing particularly exciting, but the LM5116 is a good chip and the switching devices used in the buck converter are beefy to the point of being excessive.

The gate drive stage seems solid but perhaps a bit unusual, as it utilizes bootstrapped discrete drivers in combination with external RF-based isolators. I would like to do a further analysis of the circuit at some point, as some aspects are not immediately clear (for example, there appear to be six dual-channel isolators for six drivers, and VDD2 on the isolators appears to come from the bootstrapped supply).

The power stage is very solid - the switches use four 2 mohm devices each for a total resistance of 500 uohms per switch. This is much less than the resistance of the motor and at this level, switching losses are a huge part of the total losses, meaning adding more FETs doesn't necessarily improve performance by much.
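The per-switch resistance and the resulting conduction loss are easy to sanity check. A rough sketch, assuming the nominal 2 mohm figure holds (real Rds(on) rises with temperature), that the phase current always flows through exactly one switch of each half bridge, and using the 140Arms figure mentioned later purely as an example operating point:

```python
R_FET_OHM = 2e-3       # nominal per-device resistance from the text
FETS_PER_SWITCH = 4    # devices in parallel per switch
I_PHASE_RMS = 140.0    # example current only; not a measured figure


def parallel_resistance(r_each, count):
    """Resistance of `count` identical resistances in parallel."""
    return r_each / count


r_switch = parallel_resistance(R_FET_OHM, FETS_PER_SWITCH)  # 0.5 mohm

# Either the high or low switch carries the phase current at any instant,
# so conduction loss per phase is I^2 * R of a single switch.
conduction_loss_per_phase = I_PHASE_RMS**2 * r_switch
conduction_loss_total = 3 * conduction_loss_per_phase
print(r_switch, conduction_loss_per_phase, conduction_loss_total)
```

Under these assumptions conduction accounts for only about 10W per phase, which is consistent with the point that adding more FETs yields diminishing returns once switching losses dominate.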

For the price, I would have liked to see more isolation - it seems possible to build a 5-channel isolated supply for around $10 in 1ku. A fully isolated design is not inherently superior at 48V, but it greatly eases integration if there are multiple inverters in the system. That being said, testing the isolated supply poses its own challenges; at that point, the LVPS approaches the rest of the inverter in complexity. I would also have liked to see an all-ceramic DC link capacitor, but I don't know how that would have affected the retail cost of the controller - a ceramic capacitor suitable for 140Arms operation might very well be an incredibly expensive object.

'unitepower.com' 48V 1800W brushless motor teardown

2019-01-24T23:12:00.001-08:00

A friend recently acquired a '1800W 48V' brushless scooter motor and I decided to have a look inside, ostensibly for the purpose of doing thermal testing.

The rotor measures 63mm (diameter) by 73mm (stack height). This gives an air-gap-area*radius metric of 289 cm^3, which is not too shabby; for comparison, the Sonata HSG, which is a 60Nm motor, is 415 cm^3. The laminations are about .56mm thick, which is not great for high speed performance and is probably a huge reason why these motors are not very efficient.
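For anyone wanting to reproduce the size metric: the quoted 289 cm^3 appears to correspond to D^2*L, which is proportional to air-gap area times radius but omits the factor of pi/2. A small sketch showing both forms (which convention the Sonata HSG figure uses is not stated, so I am assuming it is the same one):

```python
import math

ROTOR_DIAMETER_CM = 6.3  # 63mm from the text
STACK_HEIGHT_CM = 7.3    # 73mm from the text


def d2l(diameter_cm, length_cm):
    """Classic D^2*L sizing metric, in cm^3."""
    return diameter_cm**2 * length_cm


def area_times_radius(diameter_cm, length_cm):
    """Literal air-gap surface area (pi*D*L) times radius (D/2), in cm^3."""
    return math.pi * diameter_cm * length_cm * (diameter_cm / 2.0)


metric_d2l = d2l(ROTOR_DIAMETER_CM, STACK_HEIGHT_CM)
metric_literal = area_times_radius(ROTOR_DIAMETER_CM, STACK_HEIGHT_CM)
print(round(metric_d2l, 1), round(metric_literal, 1))
```

Either form works for comparing motors as long as the same one is used on both sides.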

The stator is surprisingly well-made. The fill factor is OK, and the concentrated windings mean there aren't a ton of end-turn copper losses. The stator also has a pretty high iron-to-copper ratio, which is good for peak torque and not good for efficiency. The large volume of iron in the stator is probably another contributor to the high losses - at peak efficiency, copper losses are less than half of the total losses.

The motor has hall sensors, installed using the standard in-the-slots technique:

The housing is mediocre. The end caps are cast aluminum, and the stator is pressed into a piece of steel tube, which adds quite a bit of weight. The motor is also much longer than it needs to be - out of the 177mm of total length, only 72mm contribute to torque production.

Motor specifications:

Type: Surface PM machine
Pole Pairs: 3
Resistance (line-to-line): 73 mOhms
Inductance (line-to-line): .415 mH
Flux linkage [derived]: 0.036 Vs

Back EMF:

Full of harmonics, but reasonably sinusoidal.

Thermal testing:

Thermal testing was done by passing DC current through a pair of phases while watching the temperature of the end turns with a Flir A65 thermal camera. We initially set a temperature cutoff of 110C, but backed down to 95C after noticing some degradation in either the enamel or epoxy in the stator at around 100C (this is pretty terrible; good wire can operate at 200C!).

Performance with no additional cooling proved to be rather poor; at 28A (which is the RMS current at 40 peak phase amps), the stator hit 95C and started overheating.

Performance with active cooling (a Sunon PMB1297PYBX-AY 12V blower) proved to be much better; we were able to achieve 33Arms (46 peak amps) with a stator temperature that stabilized at around 90C.

Note that this is a best-case operating scenario; at stall, there are no iron losses. Further testing at speed is planned for a later date.
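To put the thermal results in terms of watts, the copper loss during these tests can be estimated from the line-to-line resistance in the spec list. This is a sketch under assumptions: the 73 mOhm figure is taken to be a room temperature (25C) measurement, and the hot resistance uses the textbook copper temperature coefficient of about 0.393% per degree C. Neither was measured here.

```python
R_LINE_TO_LINE_OHM = 0.073  # from the motor specifications above
ALPHA_COPPER = 0.00393      # per deg C, textbook value (assumption)
T_REF_C = 25.0              # assumed temperature of the resistance measurement


def copper_loss_w(current_a, winding_temp_c=T_REF_C):
    """I^2*R loss with DC current through two phases in series."""
    r_hot = R_LINE_TO_LINE_OHM * (1.0 + ALPHA_COPPER * (winding_temp_c - T_REF_C))
    return current_a**2 * r_hot


loss_passive_cold = copper_loss_w(28.0)        # about 57 W
loss_passive_hot = copper_loss_w(28.0, 95.0)   # about 73 W
loss_blower_hot = copper_loss_w(33.0, 90.0)    # about 100 W
print(round(loss_passive_cold), round(loss_passive_hot), round(loss_blower_hot))
```

By this estimate the bare motor sheds somewhere around 60-70W at its temperature limit, and the blower raises that to roughly 100W.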

Conclusion:

This is "a lot of motor" - it can produce huge amounts of peak torque. Unfortunately, terrible efficiency, non-existent high-speed performance, and a dubiously low temperature cutoff all serve to severely limit its applications. Even for its advertised application (small electric scooters) it is a poor choice, as 70% peak efficiency means around 20% of the battery pack is wasted (versus a 90% efficient machine).
These motors are very close to being good - better wire and thinner laminations, both of which wouldn't drastically increase costs, would go a long way to making them more useful. Maybe in the future, we will see an updated version with these improvements, but for now, I would steer away from these motors.
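The battery penalty quoted in the conclusion comes from comparing delivered mechanical energy at the two efficiencies. A minimal sketch, assuming the motor runs at its peak efficiency the whole time (which flatters both machines) and ignoring controller losses:

```python
def relative_range(efficiency, reference_efficiency):
    """Mechanical energy delivered per pack, relative to a reference motor."""
    return efficiency / reference_efficiency


# 70% efficient motor versus a 90% efficient one on the same battery pack.
range_fraction = relative_range(0.70, 0.90)
range_lost = 1.0 - range_fraction          # about 0.22
extra_pack_needed = 0.90 / 0.70 - 1.0      # about 0.29 for equal range
print(round(range_lost, 3), round(extra_pack_needed, 3))
```

So "around 20%" is about right for lost range; matching the better motor's range would take a pack nearly 30% larger.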

Feiyu Tech A1000 Gimbal Teardown

2019-01-09T04:45:00.000-08:00

The Feiyu Tech A1000 is a midsize handheld gimbal for compact cameras and small mirrorless cameras. I recently acquired one and took a look inside, with the ultimate goal of operating the gimbal without the handle, which contains the batteries and some electronics.

The gimbal consists of a "main unit", which is attached to a handle by the means of a threaded collar. The handle contains the controls for the gimbal, as well as the batteries; it is not possible by default to turn the gimbal on without the handle.

The first task was to disassemble the handle. My hunch was that the handle supplied 7.4V (2S Li-Ion) to the inverters in the main unit, and sent pan and tilt commands via serial to a microcontroller in the main unit that ran the stabilization loop and talked to the inverters and IMU via I2C.

The Feiyu gimbals are remarkably easy to take apart - everything was held together with screws with not a plastic snap in sight. Removing the four Phillips screws from the top of the handle released the connector board:

The contacts for the spring-pins on the main unit are just pads on a matte black (!) PCB; the top board is just connectors with no active components.

Removing the four socket cap screws on the side of the handle reveals the bulk of the circuitry:

The module is an NRF51822 carrier module...with some sort of bonus wire on it to act as an antenna. Completely not OK - the reason manufacturers use carriers is to avoid having to undergo additional FCC certification, and adding the extra antenna defeats this. The chip below it (next to the USB port) is a Silabs USB to UART bridge. This is a notable difference from the smaller Feiyu gimbals, which put the UART bridge inside the USB adapter and run serial over the physical USB connector.

The backside isn't too exciting - a buck converter provides power for the electronics on the board (and possibly logic power for the inverters as well). The connectors are all neatly labeled, a nice touch.

Moving on to the inverters (we look inside one motor, but the other three are nearly identical):

The microprocessor is an STM32F303, a popular choice for gimbals. Two shunts are present - no cost-cutting one-shunt techniques were used here.

The power stage is an MPS6536 integrated brushless driver IC. The position sensor is not on the board; presumably, it is on the other side of the motor.

The connector board on the main unit reveals something surprising: unpopulated pads for a NRF51822 module are present.

Presumably at some point during development, a handle-less version was in fact planned, but was aborted before it reached full production.

Some further analysis:

The handle does have a microcontroller in it - it is possible but unlikely that the NRF51822 (which contains a Cortex-M0) is used for the stabilization loop. However, the only data lines running up into the main unit carry 115200 baud serial on them; standard async serial is not easily daisy-chained, and very few IMU's speak UART. Most likely, one of the inverter microcontrollers also does stabilization (this is the case on the Feiyu Tech wearable gimbals, which have no handle).

Tiny Camera on a Big Lens

2018-12-13T02:07:00.002-08:00

With the release of the Nikon Z6 and Z7, Nikon shooters at long last have a way to add stabilization to unstabilized lenses. This presents some nifty opportunities - a decade's worth of fast AF-S primes are now all stabilized, and some very desirable zooms such as the 14-24 and the Sigma 24-35 ART also gain stabilization.

Much more interestingly, the original AF-S supertelephotos all gain stabilization. The VR versions command a $2000 premium over their unstabilized counterparts, so clearly there are substantial (one Z6 per lens!) savings to be had here. The situation is not as magical as it first seems though - small angular motions transform into huge shifts at the sensor, so in-body stabilization is not as effective for long lenses as lens-based stabilization.
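To put rough numbers on "small angular motions transform into huge shifts", here is a minimal sketch of the geometry. It assumes a lens focused near infinity (so the image moves by about f·tan(θ)) and a pixel pitch of roughly 6 µm, which is my approximation for a 24 MP full-frame sensor; the 0.05 degree wobble is an arbitrary illustrative value, not a measurement.

```python
import math

# Assumption: ~36 mm sensor width / 6000 pixels for a 24 MP full-frame body.
PIXEL_PITCH_UM = 6.0


def sensor_shift_um(focal_length_mm, angle_deg):
    """Image displacement at the sensor (micrometres) for a small camera rotation."""
    return focal_length_mm * math.tan(math.radians(angle_deg)) * 1000.0


# Same (illustrative) angular wobble, normal lens versus supertelephoto.
for focal_length in (50, 500):
    shift = sensor_shift_um(focal_length, 0.05)
    print(f"{focal_length} mm: {shift:.0f} um, about {shift / PIXEL_PITCH_UM:.0f} px")
```

The shift scales linearly with focal length, so the sensor has to travel ten times further on a 500mm lens than on a 50mm lens to cancel the same wobble.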

I don't have a Z-series camera, but I do own an A7ii (which has a very similar sensor resolution and stabilization system) and an AF-S 500mm f4, and had been contemplating a Z6, so I was interested in testing the effectiveness of IBIS when used with really long lenses.

Testing stabilization is a little tricky, because there is inherently a human factor involved (some people are really good at keeping cameras stable, some less so). For these tests, I settled on a compromise which I felt would be representative of my shooting situations:

Lens and camera mounted on a gimbal head on a tripod - I think a setup like this will always have some kind of support underneath it, be it tripod or monopod; other than maybe the 500FL no one is going to be handholding a big supertelephoto prime for very long.
Gimbal head locks loosened - if I'm shooting with a long lens, I'm probably also following something that moves. Realistically, the scenario in this test would only show up for slow-moving wildlife or portraiture; any real "action" will require 1/500 or faster anyway to stop subject motion.
Camera triggered by pressing shutter button - in the same line of reasoning as above, I wanted to be able to keep my hand on the grip at all times.

Results

Blogger is not really set up for hosting huge images, so the test results are externally hosted here. 100.png, 200.png, 400.png, and 800.png are, respectively, 1/100, 1/200, 1/400, and 1/800 shutter speeds without stabilization; is100.png, is200.png, is400.png, and is800.png are the same speeds with stabilization.

Analysis

IBIS is effective, even for very long focal lengths. At 1/100 for example, the worst frame (out of 4) with IS off looked like this:

Completely unusable, by most standards. In contrast, the worst frame with IS on looked like this:

Still a bit soft, but this would be usable for smaller output sizes, especially with some careful postprocessing.

Stabilization also helps, but much less visibly, at 1/200:

Off:

On:

However, stabilization is not magic. While the 1/100 shots with IS on are usable, they are still not quite as sharp as a 1/800 image:

The 1/200 shots get pretty close, but are still a bit blurrier (the difference would likely not be perceptible with a softer lens).

Conclusion:

What did we learn? Well, it seems for at least one shooting scenario (lens supported but not completely locked down), IBIS does make a difference, allowing for at least 2 stops of stabilization. Anecdotally at least, this puts it on par with lens-based stabilization. It's a little hard to tell - lens-based stabilization is supposed to be good for 3+ stops, but there's precious little subject matter which needs a big telephoto prime and moves slowly enough to be shot at 1/30.

We also learned that fast shutter speeds are necessary to extract maximum performance from a telephoto prime. While sensor-based stabilization allows for usable shots at slow shutter speeds, reliably achieving the maximum optical performance of the lens still requires 1/(focal length) or shorter exposures.
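For reference, the stop arithmetic used above is just a base-2 logarithm of the shutter speed ratio. The sketch below assumes the unstabilized baseline is about 1/400, which is my reading of the 1/(focal length) rule for a 500mm lens rounded to a tested speed.

```python
import math


def stops_between(slow_shutter_s, fast_shutter_s):
    """Number of stops separating two shutter speeds (each stop doubles exposure time)."""
    return math.log2(slow_shutter_s / fast_shutter_s)


def rule_of_thumb_shutter_s(focal_length_mm):
    """The 1/(focal length) rule: longest exposure expected to be sharp unstabilized."""
    return 1.0 / focal_length_mm


print(stops_between(1 / 100, 1 / 400))   # usable stabilized shot versus the assumed baseline
print(stops_between(1 / 100, 1 / 800))   # stabilized shot versus the sharpest tested speed
print(rule_of_thumb_shutter_s(500))      # seconds, i.e. 1/500
```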

The other question is how much more stable the viewfinder image is with IBIS. There are some scenarios where it is possible to shoot handheld, at least for a little while, and having IS is quite useful for framing purposes. Unfortunately, this is much harder to test, and I expect the answer to be quite negative, given how much the viewfinder image moves.

Field Weakening, Part 2

2018-07-29T03:00:00.001-07:00

Recall in a previous post we had found an analytic solution to the field weakening problem. Unfortunately, the model is useless in practice; high currents (which are needed to cancel large amounts of PM flux) result in much lower inductances (which serves to decrease the amount of flux being canceled), resulting in numbers which are implausible and wrong.

However, while back EMF depends on the inductances, flux linkage, currents, and speed, torque is independent of speed - the same $(I_d, I_q)$ will always produce the same torque, no matter what speed the motor is at. Furthermore, we already know the relationship between torque and the axis currents from stall testing, and we can use this data as a black box to look up torque outputs from $I_d$ and $I_q$ inputs.

We are going to make an additional huge assumption: at high speeds, the current is low. This is not necessarily true, but for motors designed to be aggressively field weakened, the achievable current is likely low due to the high inductances. This assumption means we can use the voltage equations to compute the back EMF for most of the field weakened operating regime. Of course, there will be a transition around base speed where this assumption doesn't hold, but we can "fix that in post".

Armed with this, we can write a simple C++ program (source, executable, sample input) to search the entire space of $I_d$ and $I_q$ values. The program is not particularly good or fast, but the brute-force approach makes it very robust and trivially extensible to a saturated motor (just override the Vs2() function in the MotorModel class with a lookup table based one). In contrast, Newton's-method based approaches seem to fail if the voltage surface is too complex.
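The linked C++ source is the real thing; what follows is only a minimal Python sketch of the same brute-force idea, not a port of it. Everything numeric here is hypothetical (inductances, flux linkage, pole pairs, current limit), the torque function is the textbook linear IPM expression standing in for the stall-test lookup table, `vs2()` is a resistance-free linear stand-in for the voltage model, and the phase voltage limit of bus/sqrt(3) is my assumption.

```python
import math

# Hypothetical motor parameters, for illustration only.
LD, LQ = 0.5e-3, 1.5e-3        # d- and q-axis inductance, H
LAMBDA = 0.05                  # PM flux linkage, Wb
POLE_PAIRS = 4
I_MAX = 200.0                  # stator current limit, A
V_MAX = 160.0 / math.sqrt(3)   # assumed phase voltage limit from a 160 V bus


def vs2(i_d, i_q, w_e):
    """Squared stator voltage at electrical speed w_e (rad/s), neglecting resistance."""
    return (w_e * LQ * i_q) ** 2 + (w_e * (LD * i_d + LAMBDA)) ** 2


def torque(i_d, i_q):
    """Stand-in for the measured torque lookup: PM torque plus reluctance torque."""
    return 1.5 * POLE_PAIRS * (LAMBDA * i_q + (LD - LQ) * i_d * i_q)


def best_operating_point(w_e, steps=100):
    """Scan the whole (Id <= 0, Iq >= 0) grid; keep the highest-torque feasible point."""
    best = (0.0, 0.0, 0.0)  # (torque, i_d, i_q)
    for a in range(steps + 1):
        i_d = -I_MAX * a / steps
        for b in range(steps + 1):
            i_q = I_MAX * b / steps
            if i_d * i_d + i_q * i_q > I_MAX ** 2:
                continue  # outside the current limit circle
            if vs2(i_d, i_q, w_e) > V_MAX ** 2:
                continue  # outside the voltage limit
            t = torque(i_d, i_q)
            if t > best[0]:
                best = (t, i_d, i_q)
    return best


for w_e in (500.0, 2000.0, 5000.0, 10000.0):
    t, i_d, i_q = best_operating_point(w_e)
    w_mech = w_e / POLE_PAIRS
    print(f"{w_mech * 60 / (2 * math.pi):7.0f} rpm: {t:6.1f} Nm, "
          f"{t * w_mech / 1000:6.1f} kW at Id={i_d:.0f} A, Iq={i_q:.0f} A")
```

Swapping in a saturated model only means replacing `vs2()` and `torque()` with lookups; the search loop does not care how complicated the voltage surface is, which is the whole appeal of the brute-force approach.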

The program generates some very reasonable output; for example, the following plot of power and torque versus speed for the HSG at 160V:

The flat part of the torque-speed curve extends up to what would traditionally be called "base speed" [1]. A surface PM machine spends most of its time operating in this regime, as operating over base speed results in reduced power output and efficiency. In contrast, an IPM is a constant-power device past base speed; this has several implications for system design:

Hybrid vehicles: Field weakening is very important for hybrid vehicles. Consumer hybrids have electric subsystems optimized for city driving. In order to optimize efficiency in this scenario, it is beneficial to have a high reduction between the motor and the wheels, to reduce the motor current required to accelerate the car. This typically means putting base speed somewhere around 40 mph, which means at highway speeds, the motor is operating well beyond base speed. Being able to produce power at these speeds is important for consistent performance.

There is also a class of emerging high-performance hybrids. Typically, these use a combination of one or two motors, a medium sized (around 5KWh) battery pack, and a very high power forced-induction internal combustion engine. The electric subsystem is used to compensate for the narrow power band of the ICE by adding additional low-speed torque. It also usually provides power to all four wheels, improving handling and launch performance. Finally, it improves the regulatory status of such cars by at least nominally increasing the fuel economy. Once again, we find it beneficial to place base speed at a relatively low speed in order to maximize the launch torque delivered to the wheels (and reduce the weight of motor required to deliver that torque to the wheels); consequently, field weakening is needed to prevent the top speed of the car from being voltage-limited.

Pure electric vehicles: It is widely known that most EV's have a single-speed gearbox. This is entirely due to the power-speed profile of an IPM [2]; as the motor can reach peak power at very low speeds, a variable-speed transmission is not necessary to maximize power output across the entire operating range of the vehicle.

In fact, we can simulate the broad power band of an IPM with a surface PM machine and a continuously-variable transmission. It is usually not desirable to do so [3]; multi-speed transmissions incur additional complexity, weight, cost, and losses, usually negating the improved torque density of the surface PM motor. The only cars that use surface PM motors (Honda, Hyundai) are hybrids which are strongly derived from existing gas-only cars and already have manual transmissions.

Combat robots [4]: Spinner weapons are very similar to cars - both are inertial loads that have highly variable speed profiles. Interior PM machines have obvious mechanical benefits, as the rotors are much more robust. In addition, having a virtually unlimited top speed makes match-ups more consistent. Having moderate weapon speeds is usually beneficial, as it improves energy transfer and tooth engagement. However, in the vertical-on-vertical matchup (which is becoming much more common), the robot with the higher blade speed hits first. In this case, being able to achieve very high speeds can greatly improve chances of victory.

And of course, higher-speed weapons hit harder if they do engage, so having the option to spin up to very high energies can be beneficial in certain situations.

Notes

[1] Technically, base speed also depends on stator current, so the correct terminology would be 'the base speed of the motor is 2000 rpm at 180A'.

[2] Induction machines (Tesla) and synchronous reluctance motors (no one yet) have similar characteristics, and trade off torque density for reduced cost.

[3] There are some designs which use a 2-speed transmission to further improve efficiency below base speed.

[4] No one has done this yet, but someone should!

IPM's: an overview

2018-07-28T23:46:00.000-07:00

The brushless motors we typically see on the mass market are "surface PM" machines. In this configuration, the permanent magnets (PM's) are glued to the surface of a steel rotor. Torque is generated by rotating the magnetic field in the stator electronically, which in effect continuously "pulls" the PM's on the rotor towards the coils on the stator.

In contrast, all automotive PM motors are "interior PM" machines. This means the magnets are buried inside a steel rotor. While this seems counter-intuitive at first (doing this moves the magnets further from the stator and makes the rotor heavier), putting the magnets inside a chunk of steel gives the motor several features which are highly beneficial for traction applications.

Greatly increased inductance: The surface PM motor has low inductance. This is because the PM's have a much lower permeability than steel, effectively putting a huge air gap in the flux path. In contrast, the interior PM machine places the rotor steel very close to the stator teeth; the magnetic air gap is only the size of the physical air gap, and this greatly increases the inductance, often by a factor of 10 over a similarly-sized surface PM machine.

Having high inductance is important, because for traction applications, the switching frequency is primarily determined by the allowable current ripple (excessive current ripple increases the resistive losses in the copper and conduction and switching losses in the inverter). Being able to reduce the switching frequency can drastically reduce inverter losses. Conversely, for some types of very low inductance and resistance motors (Emrax, Yasa), system efficiency is much lower than what the motor specifications alone would indicate, as Si IGBT inverters have a hard time efficiently driving these types of motor.
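As a rough illustration of why inductance sets the switching frequency, here is a sketch using the single-leg (buck-converter style) ripple formula, ΔI = V·D·(1−D)/(L·f). This ignores back EMF and the interaction between the three phases, so treat it as an order-of-magnitude estimate; the bus voltage, inductances, and ripple limit are hypothetical.

```python
def ripple_pp_a(v_bus, inductance_h, f_sw_hz, duty=0.5):
    """Peak-to-peak current ripple of one switching leg driving an inductor."""
    return v_bus * duty * (1.0 - duty) / (inductance_h * f_sw_hz)


def min_switching_frequency_hz(v_bus, inductance_h, max_ripple_a, duty=0.5):
    """Lowest switching frequency that keeps ripple under max_ripple_a (worst case D=0.5)."""
    return v_bus * duty * (1.0 - duty) / (inductance_h * max_ripple_a)


# Hypothetical: 300 V bus, 20 A allowable ripple, two motors 10x apart in inductance.
for name, inductance in (("low-inductance motor", 50e-6), ("IPM", 500e-6)):
    f_sw = min_switching_frequency_hz(300.0, inductance, 20.0)
    print(f"{name}: {f_sw / 1000:.1f} kHz")
```

With everything else fixed, ten times the inductance allows one tenth the switching frequency for the same ripple.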

Position varying inductance: Why does this matter? Recall that inductance stores energy, and torque is the angle derivative of the co-energy of a system (or, roughly speaking, the system will try to settle to its lowest energy state). This means that by properly manipulating the stator currents, we can use this varying inductance to generate torque: the so-called reluctance torque. Reluctance torque is beneficial because it behaves very differently from the torque generated by the attraction of the magnets to the stator (the PM torque); it grows with both d and q-axis current, and doesn't necessarily generate additional back EMF.

We typically assume that the inductances vary sinusoidally; the typical model therefore has two inductances, $L_d < L_q$, the "d-axis" and "q-axis" inductances.

Field weakening: Field weakening uses the stator inductance to generate a voltage that counters the back EMF produced by the permanent magnets. This is typically done by injecting current on the d-axis (on a surface PM motor, $I_d$ is normally close to zero). Field weakening is typically presented as an atypical operating regime, a way to get a little extra speed out of your motor after you've run out of volts. This is because surface PM motors have very low inductance and relatively high flux linkage, necessitating a large amount of d-axis current to cancel out the PM flux. Furthermore, $I_d$ only serves to generate heat on surface PM motors, and produces no additional torque.

In contrast, IPM's have a much higher ratio of inductance to flux linkage, which means the d-axis current needed to cancel the PM flux is much lower. Furthermore, because of reluctance torque, the d-axis current generates some torque, so it is not entirely wasted. In fact, well-designed IPM's have virtually no top speed; the top speed is not limited by available voltage, but rather by rotor mechanical integrity and hysteresis losses.
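The "ratio of inductance to flux linkage" argument can be made concrete: in the linear model, the d-axis current that fully cancels the PM flux is the flux linkage divided by $L_d$. The sketch below uses hypothetical parameters (same flux linkage, inductances a factor of 10 apart) purely to show the scaling.

```python
def cancel_current_a(flux_linkage_wb, l_d_h):
    """Magnitude of d-axis current that zeroes the d-axis flux: Ld * Id + lambda = 0."""
    return flux_linkage_wb / l_d_h


# Hypothetical motors with the same PM flux linkage.
surface_pm = cancel_current_a(0.05, 50e-6)    # low inductance
interior_pm = cancel_current_a(0.05, 500e-6)  # 10x the inductance

print(f"surface PM: {surface_pm:.0f} A, interior PM: {interior_pm:.0f} A")
```

If that current is below what the inverter and windings can carry, the back EMF can be cancelled at any speed, which is where the "virtually no top speed" claim comes from.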

Higher speed operation: The rotor iron has an obvious benefit: it mechanically constrains the PM's and prevents them from flying off at high speeds. Running the motor faster makes the motor better. Being able to run a motor twice as fast means it can make twice the power, so despite their slightly lower torque density, IPM's can have higher power density than their surface PM counterparts.

High Speed VCR 2018

2018-07-28T17:26:00.000-07:00

I'm a huge enthusiast of thin desktops. I have no idea why - normally such systems are used for HTPC duties or in very space constrained labs and offices, but my desk is not particularly small and I don't even own a TV. The low-profile cases are about as small as cases get (they have a smaller interior volume and footprint than the cubes), and fitting everything into <85mm z-height makes for an interesting challenge.

Core Component Thoughts

Most HTPC-type systems are built around the "small" platform - currently, Z370 on the Intel side, X470 on the AMD side. These platforms offer low latencies, high clock speeds, and tons of integrated connectivity, but don't offer many cores compared to the state of the art. In contrast, the "high-end" desktop platforms are derived from server hardware - the boards have loads of PCIe lanes but very little integrated functionality, and the CPU's have many cores lashed together in weird and wonderful ways (rings, grids, clusters, and in the AMD case, multiple dies).

There are currently two possible routes for a USFF high core count system - the current-generation X299e-ITX/ac, or the now-discontinued X99e-ITX/ac. The X299 offers access to the latest platform features and CPU architecture, but as LGA2066 is not shared with any Xeons, the CPU's are quite expensive - the 10c part costs $899 and prices only go up from there. X99, in comparison, is kind of long in the tooth by now, but the CPU's are more accessible; an 18c 2.3/3.6 part used to be about $500 on the used market, and will likely be again once major datacenter upgrades flood eBay with used CPU's. With current pricing, X299 is certainly the correct choice; the 2699 V3 will perform similarly to a 14-core i9, costs about the same right now, and the i9 offers a full generation of platform and core improvements.

There is also no reason to go with anything under 12 cores. Ryzen will get you to 8 cores on a very power efficient platform (trust me, you are not overclocking anything on a computer this dense), and the 10-core i9 costs much more than any of the 8-core processors since Intel charges a "PCIe tax".

Since I had a 2699 V3 available from the $500 days, I went with a X99 build (I had also hoarded an X99e-ITX from when they were $120 on eBay; prices have since jumped up to $200-300). The final selection was:

Motherboard/CPU: ASRock X99e-ITX + E5-2699 V3 - really no other choices here.
RAM: Crucial Ballistix Sport LT DDR4-2400: I really like the Ballistix Sport LT series; the gray heatspreaders are inoffensive and functional, and the DIMMS are pretty low profile - there are no useless protrusions on the heatspreaders to run into the CPU cooler.
Storage: Inland Professional 256GB NVMe - these are just reference Phison PS5008-E8 + Toshiba BiCS drives. They are incredibly cheap and offer better-than-SATA performance. Being M.2 also means one less cable to route in a case that is incredibly cramped with wires. My usual choice would be a Samsung 970 PRO, but at 3.6 GHz you can't really feel the difference between a fast drive and a slow one, especially when you take into account the Windows scheduler adding extra latency by moving threads between the many cores.
Graphics: ...I should really get a real GPU for this thing, but based on previous experiences, anything but the really big cards (Asus STRIX line, I'm looking at you) will fit.

Everything Else

Building these things is really an arts-and-crafts project, especially when you have as many computers as I do. As such, picking the not-computer parts of the computer is much harder than selecting the parts that do the computing.

Case

My usual case for this type of nonsense is the Silverstone ML08, which is nicely priced and is as thin as possible (the minimum allowable clearance for an ATX case is 58mm). Unfortunately, the extra tight cooler clearance makes fastening a cooler to the board nearly impossible, since 2011/2066 heatsink mounting screws have to go in from the top. I was also interested in trying the latest crop of Silverstone cases, which add an extra inch or so of clearance in order to fit an ATX power supply. All the 83mm-clearance Silverstones are based on the same chassis, just with different trims. I went with the RAVEN RVZ03, since I am a fan of RGB lighting.

Power Supply

The RVZ03 somewhat misleadingly supports ATX power supplies. While it is true that the mounting holes are for an ATX supply, most supplies flat-out don't fit; the case really requires a 140mm or shallower power supply to leave cable clearance. Furthermore, like the ML08, the RVZ03 uses an internal right angle IEC extender to place the power jack on the case somewhere reasonable. This caused a ton of problems - the CX550M I bought had a power jack too close to the left side of the power supply, which caused the extender to collide with the side of the case, and the Seasonic Focus+ 550W had a power switch which collided with the molding on the right-angle connector, causing the switch to get stuck in the "off" position.

I eventually gave up and bought Silverstone's own 500W SFX-L supply. The power supply fit great, but as the X99e-ITX has its power connectors rotated 90 degrees from most ITX boards (the 24-pin is in the upper left corner), the stock 24-pin cable wasn't long enough. Thankfully, Silverstone makes a long cable set for this exact purpose; the kit is amazing for small builds since the 24-pin cable is only 550mm long, which is ~100mm shorter than usual.

Cooler

This whole project was made possible by an obscure-and-discontinued Cooler Master GeminII S heatsink. Low-profile LGA20xx coolers are hard to find - the reference socket backplate uses studs that are tightened from the top, meaning the cooler has to leave sufficient clearance to allow the studs to be tightened. My original plan was to use a Hydro H55 with a slim fan; measurements showed that the clearance would be sufficient. Unfortunately, packing the tubing into the case was pretty much impossible - it could be made to fit, but there was no way to gauge if excessive force was being applied to vertical components on the motherboard. Silverstone claims that a slim fan + slim radiator AIO will fit in this case, but even that seems doubtful...

The stock GeminII S doesn't quite fit - the 25mm fan is about 3mm too tall. I started out by mounting a 15mm fan from a GeminII M4, but that wasn't quite enough, so some more work was required...

Stuffing It All In

This was definitely the hardest computer I've ever assembled. The 58mm Silverstone cases are pretty easy to work on - the top and the bottom both come out, the GPU mounts from the back, and there is an access hole behind the socket to install the CPU cooler. In contrast, the 83mm cases only have one removable side, and the GPU is mounted on a plastic subframe that installs from the top; this makes cable routing far less pleasant. Without the 550mm long 24-pin this would probably have been impossible - I don't think another 100mm of cable would have physically fit in the case.

Performance Tuning

The 2699 v3 has a 80C temperature limit - once it hits 80C, it slowly drops out of turbo to stabilize temperatures. It's a graceful falloff - rather than dithering between 800MHz and 2.8GHz like some processors would, it decreases the multiplier a bin at a time until it achieves thermal equilibrium.

Initial performance was poor; the processor would hit 80C and drop to about 2.2GHz, which is below even the base speed of the 2699 v3. More concerningly, Intel's throttling algorithm seems to favor the core over the uncore - uncore speeds were dropping by as much as 50%, which was sure to affect performance in some applications.

Fortunately, upon further investigation it appeared I had plugged the CPU fan in the 'SYS_FAN' header on the board, which caused the CPU fan to get stuck at its lowest speed (SYS_FAN tracks the chipset temperature, not the CPU temperature). Swapping headers greatly improved performance; the CPU now stabilized at 2.5GHz, and the uncore throttling was gone.

But we can do better! Most 25mm fans have a few mm of superfluous plastic on top - by milling that plastic off I was able to get a 25mm thick Corsair fan to barely fit in the case. Installing the thicker fan bumped clock speeds up another 200 MHz, and dropping Vcore by 50 mV in XTU allowed the processor to maintain 2.8GHz steady state under full load.

LinuxCNC on Laptops

2018-07-18T16:15:00.000-07:00

Most people say LinuxCNC can't be run on laptops. This is false; for low-end applications like those Chinese '3020' routers, software stepping via the parallel port on an old laptop works fine.

Some tweaking is required on almost all laptops - specifically, the system management interrupt (SMI) needs to be disabled. Fortunately, from a fresh install of LinuxCNC this is quite easy.

First, connect to the internet. Then, install the prerequisite packages:

sudo apt-get install libpci-dev vim

Next, grab the smictrl sources from Github. smictrl is a user-space tool to read and write the SMI status register.

git clone https://github.com/zultron/smictrl.git

Build the tool:

cd smictrl
make

Copy it:

sudo cp smictrl /usr/local/bin

Make it start at startup:

sudo vim /etc/rc.local

and add

/usr/local/bin/smictrl -s 0
/usr/local/bin/smictrl -c 0x01

before the 'exit 0' line.

Reboot, and go into the BIOS and disable unnecessary peripherals (I've found that disabling everything networking related improves real-time performance) and you should be good to go.

Cineforming!

2018-03-29T04:37:00.001-07:00

Intro

Last November, the excellent Cineform codec went open-source. Cineform is a high-quality intermediate codec in the same spirit as DNxHR and Prores, with the notable distinction that it is based on wavelet, as opposed to DCT, compression.

Wavelets are great for editing; because the underlying transforms operate on the entire frame, wavelet codecs are free of the banding and blocking artifacts that other codecs suffer from when heavily manipulated. The best-known wavelet codec is probably RED's .R3D format, which holds up in post-production almost as well as uncompressed RAW.

Cineform has a few other cool tricks up its sleeve. Firstly, it is fast; the whole program is written using hand-tuned SSE2 intrinsics. It also supports RAW, which is convenient; encoded RAW files can be debayered during decoding into a large variety of RGB or YUV output formats, which helps in maintaining a simple workflow - any editor which supports Cineform can transparently load compressed RAW files.

Benchmarks

I wanted to do some basic benchmarking on 12-bit 4K RAW files to get an idea of what kind of performance the encoder is capable of. All tests were done on a single core of an i7-4712HQ, which for all intents and purposes is a 3GHz Haswell core. As encoding is trivially parallelized (each core encodes one frame), the results should scale almost perfectly to many-core systems.

The test image chosen was the famous 'Bliss' background from Windows XP:

As Bliss is only available in an 8-bit format, for 12-bit encoding tests, the bottom 4 bits were populated with noise (a worst case scenario for the encoder). Frame rates were calculated by measuring the time it took to encode the same frame 10 times with a high resolution timer. As the frames do not fit in L3, discrepancies caused by cached data should not be an issue.
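The test procedure is simple enough to sketch. This is not the benchmark code used for the results here, just an illustration of the two steps described: padding 8-bit samples to 12 bits with noise in the low nibble, and timing repeated encodes with a high resolution timer. The `encode` argument is a placeholder; a real run would call the Cineform encoder instead.

```python
import random
import time


def pad_to_12_bit(pixels_8bit, rng):
    """Shift 8-bit samples into the top 8 of 12 bits and fill the low 4 bits with noise."""
    return [(p << 4) | rng.getrandbits(4) for p in pixels_8bit]


def frames_per_second(encode, frame, repeats=10):
    """Encode the same frame `repeats` times and return the average frame rate."""
    start = time.perf_counter()
    for _ in range(repeats):
        encode(frame)
    elapsed = time.perf_counter() - start
    return repeats / elapsed


rng = random.Random(0)
source = [rng.randrange(256) for _ in range(100_000)]  # stand-in for 8-bit image data
frame = pad_to_12_bit(source, rng)

# Placeholder "encoder" so the sketch runs; swap in the real encode call.
print(f"{frames_per_second(sum, frame):.1f} fps (placeholder workload)")
```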

Analysis

All four quality settings can fit a 4K 120 fps 12-bit stream under the bandwidth of a SATA 3.0 link. Furthermore, the data rates are under 350 MB/s, so there exist production SSD's that can sustain the requisite write speeds. Unfortunately, FILMSCAN1 and HIGH require pretty beefy processors (8+ cores) to sustain 120 fps; a 6c/12t 65W Coffee Lake is borderline even with HT (you don't get much headroom for running a preview, rearranging data, etc.). An 8700K (6c/12t, 95W) can handle it with room to spare, but at the expense of power consumption - 8700K's are actually more than 95W under heavy load. MEDIUM and LOW easily fit on a 65W processor. The upcoming Ice Lake (8c/16t, 10nm) processors should improve the situation, allowing for 4K 120 fps to be compressed on a 65W processor at the highest quality setting.

Going beyond, 4K 240 fps seems within reach. Using existing (Q2 '18) hardware, LOW and MEDIUM are borderline for a hotly clocked 8700K, with likelihood of consistent performance increasing if data reordering and preview generation are offloaded. Moving to more exotic hardware, the largest Skylake Xeon-D processors (D-2183IT, D-2187NT, and D-2191) should be capable of compressing HIGH in real time, if not at 240 fps then almost certainly at 200 (a lot will depend on thermals, implementation, HT efficiency, and scaling, especially since Xeon-D is very much a constant current, not constant performance, processor).

Anything faster than 4K 240 fps (e.g. a full implementation of the CMV12000, which can do 4K 420 fps) will require some kind of tethered server with at least a 24c Epyc or 18c Xeon-SP processor (and the obvious winner here is Epyc, which is much cheaper than the Xeon).

Quick Update: a Faster Processor

Running a simple test on an aggressively tuned processor (8700K@4.9GHz) we get FILMSCAN1 25.5 fps, HIGH 28.9 fps, MEDIUM 39.6 fps, LOW 50.6 fps. 4.9 GHz is a little beyond the guaranteed frequency range of an 8700K (they can all do 4.7GHz, which is the max single core turbo applied to all cores), but practically all samples can do it anyway. This suggests a neat rule of thumb: LOW is good for twice the frame rate of FILMSCAN1, both in data rate and compression speed.
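These single-core numbers make the core-count estimates easy to reproduce. The sketch below assumes perfect scaling across cores (one frame per core, as described earlier) and, for the optional clock adjustment, that frame rate scales linearly with frequency; both are simplifications, and the second also ignores any per-clock difference between Haswell and Coffee Lake.

```python
import math

# Single-core frame rates measured on the 8700K at 4.9 GHz (from the text above).
FPS_PER_CORE = {"FILMSCAN1": 25.5, "HIGH": 28.9, "MEDIUM": 39.6, "LOW": 50.6}
MEASURED_CLOCK_GHZ = 4.9


def cores_needed(target_fps, fps_per_core, clock_ghz=MEASURED_CLOCK_GHZ):
    """Whole cores required to sustain target_fps, assuming ideal multi-core scaling."""
    scaled = fps_per_core * clock_ghz / MEASURED_CLOCK_GHZ
    return math.ceil(target_fps / scaled)


for quality, fps in FPS_PER_CORE.items():
    print(f"{quality:10s} 120 fps: {cores_needed(120, fps)} cores, "
          f"240 fps: {cores_needed(240, fps)} cores, "
          f"120 fps at 3 GHz: {cores_needed(120, fps, clock_ghz=3.0)} cores")
```

These are lower bounds: they leave no headroom for previews, data reordering, or thermal throttling.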

Addendum: Cineform's packed 12-bit RAW format

I have never seen such an esoteric way to pack 12-bit pixels (and after spending many hours trying to figure it out, I now understand why the poor guy who had to crack the ADFGVX cipher became physically ill while doing it).

The data is packed in rows of most significant bytes interleaved with rows of least significant nibbles (two to a byte). Furthermore, two rows of MSB's (each IMAGE_WIDTH bytes long) are packed, followed by one full-width row (also IMAGE_WIDTH bytes long) containing the least-significant nibbles of the previous two image rows.

To add to the confusion, the rows are packed as R R R ... R G G G ... G or G G G ... G B B B ... B (depending on which row of the bayer filter the data is from); in other words, the even-column data is packed in a half row, followed by the odd-column data. This results in a final format like so:

R R R ... R G G G ... G
G G G ... G B B B ... B
LSN LSN LSN ... LSN

I am not sure why the data is packed like this (for all I know it's not, and there is a bug in my packing code...) but I suspect it is for some kind of SSE2 efficiency reasons. I also haven't deciphered how the least significant nibbles are packed (there is no easy way to inspect 12-bit image data), but hopefully it is similar to the most significant bytes...
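Here is a sketch of the layout as described above, packing one pair of image rows. The MSB rows follow the description directly; the nibble row is a guess, since the real ordering has not been deciphered. I assume the nibbles use the same even-then-odd column order, first row before second row, two to a byte with the earlier nibble in the high half.

```python
def deinterleave(values):
    """Even-column samples first, then odd-column samples (R R ... G G ...)."""
    return values[0::2] + values[1::2]


def pack_row_pair(row0, row1):
    """Pack two rows of 12-bit samples into 3 * IMAGE_WIDTH bytes.

    Layout: MSB row for row0, MSB row for row1, then one row of least
    significant nibbles for both. The nibble ordering is an ASSUMPTION.
    """
    if len(row0) != len(row1) or len(row0) % 2:
        raise ValueError("rows must have the same, even length")
    out = bytearray()
    for row in (row0, row1):
        out += bytes(v >> 4 for v in deinterleave(row))       # top 8 bits
    nibbles = [v & 0xF for row in (row0, row1) for v in deinterleave(row)]
    out += bytes((nibbles[i] << 4) | nibbles[i + 1]           # two nibbles per byte
                 for i in range(0, len(nibbles), 2))
    return bytes(out)


packed = pack_row_pair([0x123, 0x456, 0x789, 0xABC],
                       [0xDEF, 0x012, 0x345, 0x678])
print(packed.hex())
```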

'plotter' and 'logger': Cycle by Cycle Data Logging for Motor Controllers

2018-01-29T08:48:00.000-08:00

Ever since I started writing motor control firmware I've been pursuing higher and higher data logging rates. Back in the days of printing floats over 115200 baud serial on an ATMega328 performance was pretty poor, but now that high-performance ARM devices are available with much better SPI and serial speeds and an order of magnitude more RAM, we can do some pretty cool things. The holy grail is to log relevant data at the switching frequency to some sort of large persistent storage; this gives us the maximum amount of information the controller can see for future analysis.

'logger'

logger is less of a program and more of a set of tricks to maximize transfer and write performance. These tricks include:

Packed byte representation: this one should be pretty obvious; rather than sending floats we can send approximate values with 8 bits of resolution. While we no longer need commas or spaces between data points, it is important to send some sort of unique header byte at the start of each packet; without it, a dropped byte will shift the reading frame during unpacking and cause all subsequent data to become unusable. I use 0xFF (and clip all data values to 0xFE); if more metadata is required, setting the MSB's of the data values to zero gives us 127 different possible header packets at the expense of 1 bit of resolution. The latter method also gives us easy checksumming (the lower 7 bits of the header byte can be the packet bytes XOR'ed together); however, in practice single flipped bits are rare and not that significant during analysis; as it is usually obvious when a bit has been flipped - conversely, if your data is so noisy that you don't notice an MSB being flipped, you probably have other problems on your hands...
Writing entire flash pages at once: this is incredibly important. SD cards (and more fundamentally, all NAND flash) can only be written to in pages, even if there is no filesystem. Writing a byte and writing a page take the same amount of time; on typical SD card, a page is 512 bytes, so buffering incoming data until a full page is received results in a 1-2 order of magnitude improvement in performance.
Dealing with power losses: the above point about how important writing full pages is is actually somewhat facetious. Normally, filesystem drivers and drive controllers will intelligently buffer data to maximize performance, but this is contingent on calling fclose() before the program exits - not calling fclose() or fflush() will possibly result in no data being written to the disk. Having some kind of "logging finished, call fclose() and exit" button is not ideal; if an 'interesting' event happens we usually want to capture it, but in the event of a fault the the user is probably being distracted by other things (battery fire, rampaging robot, imminent risk of death) and is probably not thinking too hard about loss of data. The compromise is to manually call fflush() once every few pages to write save the log to disk without losing too much performance. Depending on the filesystem implementation you are using, data may be flushed automatically at reasonable intervals.
Drive write latency and garbage collection: this is a problem that nearly sank the SSD industry back in its infancy. Drives which are optimized for sequential transfer (early SSD's and all SD cards) typically have firmware with very poor worst case latencies. Having the card pause for half a second every few tens of megabytes is hardly a problem when the workload is a few 100MB+ sequential writes (photos, videos), but is a huge problem when the workload is many small 4K writes, as some of those writes will take orders of magnitude longer than the others to complete. The solution is to keep a long (100 pages) circular buffer with the receiving thread adding to the head and the writing thread clearing page-sized chunks off of the tail. The long buffer amortizes any pauses during writing; as long as the average write speed over the entire buffer is high enough no data will be lost.
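
The page buffering, periodic flushing, and circular buffer described above can be sketched together in a few lines. This is only an illustration in Python, not the logger linked in this post: the PageLogger class, its method names, and the FLUSH_EVERY value are all made up for the example, and a real implementation would run the two halves on separate threads with proper locking.

```python
import io
from collections import deque

PAGE_SIZE = 512       # bytes per write, the page size quoted above
BUFFER_PAGES = 100    # circular buffer length, as suggested above
FLUSH_EVERY = 8       # pages between flush() calls (assumed value)


class PageLogger:
    """Hypothetical helper: queues incoming bytes and writes them one full page at a time."""

    def __init__(self, sink):
        self.sink = sink            # any file-like object opened in binary mode
        self.pending = bytearray()  # bytes that do not fill a page yet
        self.pages = deque()        # full pages waiting to be written
        self.dropped = 0            # pages lost because the buffer was full
        self.since_flush = 0

    def receive(self, data):
        """Receiving side: cut incoming bytes into full pages and queue them."""
        self.pending += data
        while len(self.pending) >= PAGE_SIZE:
            page = bytes(self.pending[:PAGE_SIZE])
            del self.pending[:PAGE_SIZE]
            if len(self.pages) >= BUFFER_PAGES:
                self.dropped += 1   # writer fell too far behind
            else:
                self.pages.append(page)

    def write_one(self):
        """Writing side: write the oldest page; returns False if nothing is queued."""
        if not self.pages:
            return False
        self.sink.write(self.pages.popleft())
        self.since_flush += 1
        if self.since_flush >= FLUSH_EVERY:
            self.sink.flush()       # bound how much data a power loss can cost
            self.since_flush = 0
        return True
```

With these numbers the buffer rides out a stall of up to 100 pages (50 KB) of incoming data; the dropped counter makes it obvious after the fact if that was not enough.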

Delta compression: I have not tried this, but in theory sending or writing (or both) packed differences between consecutive packets should yield a significant boost in performance by reducing the average amount of data sent. This should be true especially if the sample rate is high (so the difference between consecutive data points is small).
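
Since I have not tried this, the following is only a guess at what a packed delta format could look like: the first sample is sent raw, and every following sample is sent as a 4-bit signed difference, two per byte. The function names and the 4-bit width are assumptions. Differences outside -8..7 are clipped, so fast edges get slew-limited instead of being reproduced exactly; the encoder tracks the decoder's reconstructed value so the error does not accumulate.

```python
def delta_encode(samples):
    """Pack 8-bit samples as one raw byte followed by 4-bit signed deltas, two per byte."""
    if not samples:
        return b""
    out = bytearray([samples[0]])
    prev = samples[0]
    nibbles = []
    for s in samples[1:]:
        d = max(-8, min(7, s - prev))  # clip to the 4-bit signed range
        prev += d                      # track what the decoder will reconstruct
        nibbles.append(d & 0x0F)
    if len(nibbles) % 2:
        nibbles.append(0)              # pad with a zero delta
    for hi, lo in zip(nibbles[0::2], nibbles[1::2]):
        out.append((hi << 4) | lo)
    return bytes(out)


def delta_decode(data, count):
    """Inverse of delta_encode; count is the number of samples that were encoded."""
    if count == 0:
        return []
    values = [data[0]]
    for byte in data[1:]:
        for nib in (byte >> 4, byte & 0x0F):
            if len(values) == count:
                break                  # ignore the padding nibble
            d = nib - 16 if nib >= 8 else nib
            values.append(values[-1] + d)
    return values
```

For slowly varying data this roughly halves the payload; a stream like this would still need the header byte and resynchronization described earlier, since a single dropped byte corrupts everything after it until the next raw sample.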

Here is a sample sending program which sends data in 9-byte packets (including header) from a 5 kHz interrupt over serial, and here is the matching receiver which writes the binary logs to an SD card with some metadata acquired from an external RTC and IMU module.
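
For readers who just want the gist of the framing, here is a minimal Python sketch of packing and unpacking 9-byte packets with a 0xFF header. It is not the linked code; the function names are invented, and the layout (one header byte followed by eight data bytes) is an assumption based on the description above.

```python
HEADER = 0xFF
PACKET_LEN = 9  # header byte plus eight data bytes (assumed layout)


def pack(values):
    """Build one packet from eight readings, clipping each to 0..0xFE."""
    if len(values) != PACKET_LEN - 1:
        raise ValueError("expected 8 values")
    return bytes([HEADER] + [max(0, min(0xFE, int(v))) for v in values])


def unpack_stream(stream):
    """Return the payload of every intact packet in stream, resynchronizing on the header."""
    packets = []
    i = 0
    while i + PACKET_LEN <= len(stream):
        if stream[i] != HEADER:
            i += 1  # not at a packet boundary; hunt for the next header
            continue
        payload = stream[i + 1:i + PACKET_LEN]
        if HEADER in payload:
            i += 1  # a byte was dropped; the next header is inside this window
            continue
        packets.append(list(payload))
        i += PACKET_LEN
    return packets
```

Because data values never reach 0xFF, a header found inside a payload can only mean a byte went missing, so the damaged packet is discarded and decoding picks up again at the next header.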

'plotter'

I wrote plotter after failing to find a data plotting application capable of dealing with very large data sets. Mathematica lacks basic features such as zooming and panning (excuse me? not acceptable in 2018!), Matlab becomes very slow after a few million points, and Plotly and Excel do all sorts of horrible things after a couple hundred thousand points.

plotter uses a screen-space approach to drawing its graphs in order to scale to arbitrarily large data sets. Traces are first re-sampled along an evenly spaced grid (a sort of rudimentary acceleration structure). Then, at each column of the screen, y-coordinates are interpolated from the grid based on the trace-space x-coordinate of the column. Finally, lines (actually, rectangles) are drawn between the appropriate points in adjacent columns.

The screen-space approach allows performance to be independent of the number of data points; instead, it scales as O(w*n), where w is the screen width and n is the number of traces. It also guarantees that any lines drawn are at most two pixels wide, which allows for fast rectangle-based drawing routines instead of costly generalized line drawing routines (on consumer integrated graphics, the rectangles are several times faster than the corresponding lines). As a result, plotter is capable of plotting hundreds of millions of data points at 4K resolutions on modest hardware.
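
The per-column step is easy to show in isolation. The sketch below is Python rather than plotter's actual source, and the function names, the linear interpolation, and the clamping at the ends of the data are all assumptions; it only illustrates why the work per frame depends on the screen width and not on the number of samples.

```python
def sample_columns(grid, x0, dx, x_min, x_max, width):
    """Interpolate one y value per screen column from an evenly spaced grid.

    grid holds the resampled trace, starting at x0 with spacing dx.
    Requires width >= 2 and at least two grid points.
    """
    ys = []
    for col in range(width):
        x = x_min + (x_max - x_min) * col / (width - 1)  # trace-space x of this column
        t = (x - x0) / dx                                # fractional grid index
        t = max(0.0, min(len(grid) - 1.0, t))            # clamp to the data range
        i = min(int(t), len(grid) - 2)
        frac = t - i
        ys.append(grid[i] * (1.0 - frac) + grid[i + 1] * frac)
    return ys


def column_rects(ys):
    """One vertical span (column, y_low, y_high) joining each column to its right neighbour."""
    return [(c, min(a, b), max(a, b)) for c, (a, b) in enumerate(zip(ys, ys[1:]))]
```

Each span is one column wide, so it can be filled as a rectangle; the loop runs width times per trace no matter how long grid is.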

For the sake of generality, the current implementation loads CSV files and internally operates on floating-point numbers. There's a ton of performance to be gained by loading binary files and keeping a 32-bit x-coordinate and an 8-bit y-coordinate (which would lower memory usage to 5 bytes per point), but that comes at the expense of interoperability with other programs. The basic controls are:

Cursors:

Clicking places the current active cursor. Clicking on a trace toggles selection on that trace and puts the current active cursor there.
[S] switches active cursor (and allows you to place the second cursor on a freshly opened instance of the program). If visible, clicking on a cursor switches to it.
[C] clears all cursors.

Traces:

Clicking a trace toggles selection.
Clicking on the trace's name in the legend toggles selection. This is useful and necessary when multiple traces are on top of each other.
[H] hides selected traces, [G] hides all but the selected traces, and [F] shows all traces.

Navigation:

The usual actions: click and drag to zoom in on the selected box, scroll to zoom in centered around the cursor, middle click and drag to pan.
Ctrl-scroll and Shift-scroll zoom in on the x and y-axes only, centered around the cursor.
Placing the mouse over the x or y-axis labels and scrolling will zoom in on that axis only, centered around the center of the screen.

File loading:

plotter loads CSV's with floating point entries.
The number of entries in the first row of the input file is used to determine the number of channels. From there on, extra values in rows are ignored, and missing values at the end get copied from the previous row.
In Windows, drag a CSV onto the executable to open it. Note that this will cause the program to silently exit with no error information if the file is invalid.
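
The row-handling rules above amount to the following, written here as a short Python sketch rather than the program's real loader. The function name is made up, and skipping blank lines is an assumption; unlike plotter, this version raises a ValueError on a non-numeric entry instead of exiting silently.

```python
def load_rows(lines):
    """Parse CSV lines into rows of floats using the rules listed above."""
    rows = []
    channels = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are skipped
        values = [float(v) for v in line.split(",")]
        if channels is None:
            channels = len(values)  # the first row fixes the channel count
        values = values[:channels]  # extra values are ignored
        if len(values) < channels:
            # missing trailing values are copied from the previous row
            values += rows[-1][len(values):]
        rows.append(values)
    return rows
```

For example, the rows "1,2,3", "4,5,6,7" and "8" load as three channels, with the 7 discarded and the last row filled out from the row before it.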

Configuration:

plotter.txt contains the sample spacing (used to calculate derivatives and generate x-labels), the channel colors, and the channel names. If the config file is missing, all the traces will be black and the channel names will all be 'Test Trace'.
The program will crash if arial.ttf is not in the program directory.

You can get a Windows binary here; source code will be uploaded once it is tweaked to work on Linux.

Fun (?) with an AJA Cion

2018-01-14T04:11:00.001-08:00

Long camera is long

It was Black Friday 2017 and I hadn't bought anything. Thankfully, the fine folks at LensAuthority were running a special on the AJA Cion, an oft-maligned CMV12000-based camera. It was real cheap, cheaper than a CMV12000 machine vision camera, and probably more ergonomic as it had the capability to record ProRes internally to proprietary or CFast 2.0 media at 60p.

General Impressions

You probably don't want this camera for general cinematography; the sensor is noisy enough that you will be spending a ton of time fighting the camera. For example, a simple interior shot of MITERS proved to be too much for the sensor, and MITERS is not exactly a high dynamic range scene. This is further compounded by the fact that the noise is heavily patterned; some rows are noisier than the others (not to be confused with FPN!), which is a lot more distracting than having white noise distributed over the image. And forget about available light shooting, the only ISO you get is 320 (500 and 800 are a joke, the sensor has little enough DR even with no gain applied). Folks can go on about 'great color science' and 'ready to edit codec' all they want, but it is hard to justify a $5K camera with barely 10 stops of dynamic range when the Ursa Mini 4.6K falls in the same price class and is so much better at everything.

The handling...is adequate. The fact that the menus don't show up on the monitoring outputs really puts a damper on operation, as the operator cannot see the settings while the camera is rigged up and shoulder mounted. Even on a tripod, using the click wheel to scroll through 30 menu entries is unpleasant, especially when you can only see one row of the menu at a time (come on AJA, give us a firmware update that fixes this!). Thankfully, operation is very slick when the camera is tethered through an Ethernet cable and operated through the browser interface - the embedded website is intuitive and, more importantly, doesn't seem to exhibit the inconsistent hangups and crashes that plague a ton of my other Ethernet controlled gadgets.

The real strength of this camera, in my opinion, is as a specialized, tethered camera. 4K120 raw is rather state of the art; no other "consumer" camera on the market can do this (RED and Kinefinity can do high framerate wavelet-compressed recording though). As for capturing the HFR output...

Raw Recording

...I'm not sure how I feel about quadruple SDI based output. On the one hand, SDI capture cards are readily available and well-standardized, and I would certainly take four BNC's over one CameraLink cable any day. On the other hand, multi-tap SDI capture is a mess right now (not all cards support combining their inputs out of the box), and RAW transport over SDI is basically a scam, with recorder vendors charging hundreds of dollars for the software licenses to enable RAW recording for each supported camera model.

Image from AJA's site

The only officially supported ways to record the 120p output are via a device called a 'Corvid Ultra' (some sort of $20K box that plugs in via Tesla-style PCIe HIC's), or using an AJA Kona4, a $1995 quad 3G-SDI capture card. The Kona software (AJA Capture Room) has a preset for CION RAW (confusingly enough, the button in the software is not where the manual says it would be). This seems to set each tap as 2K60p, so presumably each frame from each tap encapsulates two consecutive quarter-frames of the full image. It should be possible to record the output as four uncompressed 2K60p Quicktime files on several third-party capture devices, then merge the frames in post; unfortunately, as of this writing there are no small 60p-capable SDI capture devices - all models available have an integrated monitor.

The officially recommended hardware to capture 120 FPS raw is absurd: start with an HP Z820 and a pair of LSI 9721, each equipped with four Intel S3700 or six (!) Intel S3500 drives in RAID 0, then stripe the two RAID 0 volumes together (!!) in Windows to create one large virtual drive. Come on guys, even when the release notes were last updated (2015), NVMe drives and 3D NAND were a thing. I also don't understand the suggestion to use enterprise drives; clearly, if you are running octuple RAID 0 across two RAID cards, you've given up any hope for reliability. A single OCZ RD400 or 960 Pro (512GB or higher) can handle the throughput (even when the drive is nearly full), and a pair in RAID 0 should more than do it. Or, if you feel bleeding edge, a single Optane drive should be able to do it with unbelievable consistency.

For the sake of size, my recording box uses an i3-7100 and a single 512GB RD400 drive; the rather low-performance CPU seems to be OK for Capture Room (which uses a whole core to debayer the preview but otherwise doesn't consume much CPU power). 512GB was chosen as the minimum size needed to achieve the requisite worst-case write speeds, but offers an incredibly mediocre six minutes of recording at 120 FPS. It is important to note that Skylake/Z170 is the oldest "small" platform where the chipset PCIe ports are PCIe 3.0 - anything older and you risk degrading the recording drive's performance.
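
The six-minute figure can be sanity-checked from the link rates alone. The sketch below assumes all four 3G-SDI links run at their nominal 2.97 Gbit/s and that everything on the wire ends up on disk; the real payload is somewhat smaller once blanking is removed, so this is an upper bound on the data rate and a lower bound on the recording time. It also assumes decimal gigabytes and a completely empty drive.

```python
SDI_3G_BITS_PER_SECOND = 2.97e9  # nominal 3G-SDI line rate
LINKS = 4                        # quad-link output
DRIVE_BYTES = 512e9              # assumes decimal gigabytes and a fully usable drive


def max_data_rate_bytes():
    """Upper bound on the capture data rate in bytes per second."""
    return LINKS * SDI_3G_BITS_PER_SECOND / 8


def min_recording_minutes(drive_bytes=DRIVE_BYTES):
    """Lower bound on recording time in minutes at the maximum data rate."""
    return drive_bytes / max_data_rate_bytes() / 60


print(f"{max_data_rate_bytes() / 1e9:.3f} GB/s, {min_recording_minutes():.1f} minutes")
```

That works out to just under 1.5 GB/s sustained and a little under six minutes, which is also why the worst-case write speed of the drive matters so much here.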

Capture Room

AJA Capture Room is...very good. This was unexpected, as I am used to very expensive scientific hardware shipping with LabView or Java based garbage that makes your computer feel like it's from 1999. At least on Windows, the UI doesn't feel as native as I'd like it to be, and sorting out the dozens of configuration options for the Kona4 requires reading a PDF manual, but I haven't had a crash, and more importantly, it can extract the full performance of the SSD. A lesser program would require 2x overhead to be able to run properly, but clearly someone at AJA actually cared about performance.

Image Quality

(or lack thereof)

Cinema DNG processed to taste in Capture One

The above test scene was shot on a 24mm Art at f/1.4 and processed as a still in Capture One. Exposure was 360 degrees (1/120) and 120 FPS. The primary defect that stands out is the noise in the background; the scene was processed with a 'flat' tone curve that boosted the shadows a substantial amount. The banding (which is caused by the read noise being spatially correlated, not "fixed pattern noise") is incredibly distracting; it is visible in the resulting video as bright lines scrolling vertically in the shadows. That being said, the colors look beautiful, and you could easily shoot this scene with some fill light and avoid the noise problem altogether.

The other problem is Resolve doesn't process the RAW's nearly as well as C1 does, as it uses fast GPU implementations of pretty basic algorithms. For example, the only sharpening available is a simple unsharp mask; there is no attempt to intelligently detect structure in the image, resulting in sharpening being unusable in the presence of noise. Is the solution to output JPG's from Capture One into Premiere Pro? Probably not; having a program which natively handles raw video is amazing, and so much less clunky than a cobbled together stack of software.

Addendum: a bug appears!

Upon further inspection of the footage, it appears that 120p footage is only recorded as 60p by Capture Room. AJA insists it is because the computer isn't fast enough, but I think it is due to a bug in Capture Room - among other things, setting the buffer to 4K60p in AJA Control Panel instead of 2K60p results in the correct data rate, but Capture Room segfaults at exit and the resulting files are not usable.