Wednesday, February 21, 2024

About that Groq demo

Recently the good folks at Groq released a formidable demo showing a 70 billion parameter language model inferencing at 300 tokens per second at batch size 1. This immediately elicited two responses from the community:

  1. Wow, this is so fast and the best thing ever
  2. This sucks because it has to run on an entire cluster
In fact, the truth is somewhere in between. Let's take a look.

Configuration. Groq's marketing is somewhat facetious here. The 'LPU' is just their old 14 nm SRAM-based systolic array processor, which is in turn somewhere in between the Graphcore SRAM-based processor array and the big systolic arrays found in Gaudi and TPU. The LPU has 230 Mbytes of SRAM per chip, and some software tricks are used to shard the model across many, many LPUs for inference. If we assume 10 Mbytes activation memory per token, 4K context on a 70B parameter model in int8 comes out to about 110 Gbytes of memory, which requires eight racks (512 devices) to hold. Needless to say, that is a rather voluminous configuration.

Power. At first glance Groq is absolutely boned here, requiring 512 300W devices (150KW) to do their inference. Fortunately, batch size 1 inference doesn't really stress the ALUs even with all of Groq's bandwidth (a forward pass is 140 Gflops so 300 tokens/sec is a tiny fraction of the total throughput of the cluster) so the actual power per chip will be quite low.

Economics. This is where it gets really spicy. Deriders will remark that the Groq device is $20K, but that's in quantity 1 from Mouser for a card built by Bittware, a company notorious for high markups. Groq's chip is fabbed on a '14 nm process' - give or take, the die probably costs $60, 2x that for packaging and testing. Now, here's the magic - because it is SRAM based, the boards are very simple; all in, I would estimate that a Groq board costs under $500 to build.

Suddenly, we're looking cost competitive. The boards for 512 devices come out to $250,000, figure double that once we account for the host servers. Half a million for 10x the performance of an octal A100 ($180,000) is suddenly pretty good. Of course, we're being facetious here, because an octal A100 costs about $50,000 to build, but Groq gets to pay wholesale prices and you don't.

Conclusions. What are our real conclusions here? I'd say there are two, maybe three:

  1. You have to cut the margins out. Nvidia systems are competitive at or near cost, but at their current ludicrous 10x margins you could design and tape out a custom chip for less than the cost of a medium sized cluster. Nvidia is responding to this - their 'cloud and custom' group will allow the sale of semi-custom SKUs to hyperscalers without compromising their annual reports too much, by breaking out the margin reduced parts into a different business unit.
  2. Old nodes are OK. Because inference is so bandwidth bound and memory controllers don't scale well any more, you can squeeze a lot of life out of old nodes. That's a big deal because that lets you second source your design, which puts pressure on the foundry to keep prices down. In contrast, pricing on leading edge TSMC nodes is utter chaos right now because capacity is completely booked.
  3. Engineering is silly. Groq took an ancient accelerator designed for high speed inference of ResNets, and through a monumental feat of system design and clever presentation was able to adapt it to achieve a world record in the hottest field in venture capital right now. In my eyes, that's pretty darn cool.

Saturday, January 6, 2024

How to buy an AI server on eBay

A Quick Primer

A neural network is, in essence, a stack of linear projections with some nonlinear functions interspersed in the stack. Merely stacking linear projections is boring, because the composition of two linear functions is another linear function, but it turns out that inserting even the simplest nonlinearity into the stack works some magic - the composite function "learns" locally linear regions of a complex nonlinear function because the interspersed nonlinearity changes the subspace it projects into. The important takeaway is that computationally, neural networks are a sequence of matrix-vector multiplies.

To train a neural network means to fit it to some set of known input value-function pairs. This is typically done through some variant of first order gradient descent on a loss function L which describes how well we've done the fitting. L and grad L are evaluated with respect to the neural network parameter weights; we typically approximate grad L by the gradient on some subset of the inputs (stochastic gradient descent). At each step the weights W are replaced by W' = W + gamma * grad L; the step size gamma is a key factor in convergence speed.

The widely accepted fastest-converging choice of gamma uses a heuristic called "AdamW", which looks at the first (time average) and second (time average of squares) moments of the gradient to compute gamma. AdamW stores the gradient, first moment, second moment, weight, and "fp16 copy of the weights" (matrix multiplier units typically work on fp16/bf16 inputs), a total of 18 bytes per parameter. This results in a large memory footprint during training - a 7B parameter network requires 126GB, a 34B parameter model 612GB, and 175B parameter model requires a whopping 3150GB. In practice this number is even higher - intermediate activations take up space, especially at large batch size.

Now obviously 3150GB of optimizer states aren't going to fit in a single node, and even 612GB is troublesome. In order to evade having to build ever higher capacity nodes (which is electrically hard), we instead partition the parameters, states, and inputs across nodes, do our computations as much as possible on each node, then synchronize the states across nodes by passing messages. This is nothing new - it's exactly how large scientific simulations on supercomputers work. The ML people have a name for this - sharding - and the scheme used to do the computations is called 'fully sharded data parallel'.

In order to reassemble the gradient, inter-GPU communication is required. Between nodes, this is done through a network which is assumed to be "slow", and painstaking steps are taken to mask the slowness. Internal to a node, the communications are assumed to be "fast", which is where the confusion begins: Nvidia enterprise GPUs can directly pass messages to each other, as well as to Nvidia-branded network controllers, if and only if all devices exist on the same PCI-e controller. If not, the sender has to first write to a CPU buffer, which is then read by the receiver.

Now, there's technically nothing wrong with this: PCI-e 3.0 is good for 128 gbits/sec, the intersocket connection is good for 300+ gbits/sec, and memory is good for many tbits/sec. The issue is all of this back and forth adds latency, and current frameworks generally make no effort to hide intra-node latency (in fact, they may exploit low intra-node latencies to optimize overall performance). This means we need to carefully arrange the GPUs and NICs to be on the same PCI-e controller in order to guarantee expected performance with existing libraries.

With that being said, let's take a look at some good and bad servers on eBay that are marketed as "good for machine learning":

Gigabyte G482

At first glance, this is looking pretty good. For about half of retail, you get a current-generation octal GPU system in 4U. It looks pretty well-engineered, and the vendor's website even says you can use it for AI!

However, this is not a good AI server. In this case, the vendor is as much to blame as the listing. Let's take a look at the block diagram:

As you can see, groups of four GPUs are connected to each socket. An unfortunate quirk of Epyc is each socket contains multiple PCI-e controllers, meaning at best, pairs of GPUs can directly communicate with each other. GPUs from the two hives of four have to cross the intersocket connection to send and receive data. Finally, it's also one slot short - the left CPU doesn't have a slot for a NIC.

This is a fine server for inference if the model fits on one or two GPUs, but then what are you doing running your high capacity inference at home? 🙃

Asus ESC4000's

These are old 2U HPC nodes which make no pretense to being good at AI (they're also not good miners, since you're spending hundreds of extra watts per node running the fans, chipset, and CPUs). Despite their humble origins and low price, these make solid inference servers for latency-sensitive applications - load them up with 4x A5000 and you'll be happily running two instances of llama-2-70B at 20 tokens/sec each. And trust me, quad GPU nodes are way easier to deal with than octal GPU nodes, which tend to be mechanically fragile and difficult to ship and power.

Supermicro SYS-4028

SYS-4028 was an extremely popular AI system in the era of smaller convolutional networks. However, the most common listings come with X9DRG-O-PCIE, which is an older-generation board designed for conventional HPC:
From the 40-lane-per-socket allocation, X9DRG-O-PCIE assigns two groups of 16 lanes each to two PLX switches, which each host two GPUs. So far so good, we get four GPUs able to communicate with each other via P2P. The remaining 8 lanes go to an x8 slot on the same root complex, which is enough for a FDR Infiniband card. That's a pretty robust setup for HPC, where some of the computations might need to happen on the host - hives of four GPUs can communicate over RDMA, and get 32 GB/sec back to the socket. Unfortunately, it's suboptimal for large scale machine learning, since the hives need to communicate over shared memory (and over QPI between the sockets, too!). Instead, what you want is this:

A rather insane layout, to be sure. All eight GPUs (128 lanes) oversubscribe 32 lanes back to one socket, with the remaining 8 lanes from that socket available for connectivity. You wouldn't dream of running your physics simulations on such a system, but the layout is optimal for machine learning: hives of 4 can DMA each other through the 96-lane switch, and the two hives can DMA each other through the root complex.

Happily, X10DRG-O-PCIE is available used for about $700. When all is said and done $1600 for what is the pinnacle of PCI-e based machine learning systems is not bad.

Tuesday, December 19, 2023

Sapphire Rapids for AI: a mini-review

 Intel often touts the performance of their 4th-generation Xeon Scalable "Sapphire Rapids" for generative AI applications, but there are surprisingly few meaningful benchmarks, even from Intel itself. The official Intel slides are as such:

Oof. Three ResNets, two DLRM's, and BERT-large. Come on guys, this is 2023 and no one is buying hardware to run ResNet50. Let's try benchmarking some real workloads instead.

Benchmarked Hardware

Xeon Platinum 8461V ($4,491) on Supermicro X13SEI, default power limits, 256 GB JEDEC DDR5-4800

Stable Diffusion 2.1

Everyone's favorite image generator. SD-2.1 runs on a $200 GPU, so perhaps it doesn't make sense to test performance on a $4,500 Xeon, but as Intel says, every server needs a CPU so the Xeon is in some sense, "free". We use an OpenVINO build of SD-2.1, which contains Intel-specific optimizations, but critically, is not quantized, distilled, or otherwise compressed - it should have comparable FLOPS to the vanilla SD models. Generation is at 512x512 for 20 steps.

Not too shabby. 6.27 it/s puts us somewhere around the ballpark of an RTX3060. On the other hand, you could run your inference on an A4000 (which is licensed for datacenter use) and get higher performance for just $1100, so the Xeon isn't exactly winning on price here.

llama.cpp inference

Inferencing your LLM on a CPU was a bad idea until last week. Microsoft's investments into OpenAI make it almost impossible to compete, price-wise, with gpt-3.5-turbo: since Microsoft is a cloud provider, OpenAI gets to pay deeply discounted rates over the 3-10x markup you would have to pay for cloud infrastructure. You can't escape by building your own datacenter either: without the high occupancy of a cloud datacenter, you still end up paying overhead for idle servers. This left 7B-sized models as the only meaningful ones to self-host, but 7B models are so small that you can run them on a $200 GPU, obviating the need for a huge CPU. (Obviously, there are security-related reasons to self-host, but by and large the bulk of LLM applications are not security-sensitive).

Fortunately for Intel, a medium-sized MoE model, Mixtral-8x7B, with good performance appeared last week. MoE's are unique in that they have the memory footprint of a large model, but the compute (really, bandwidth) requirements of a small model. That sounds like a perfect fit for CPUs, which have tons of memory but limited bandwidth.

It ends up being that llama.cpp, an open-source hobbyist implementation of LLMs on CPUs, is the fastest CPU implementation. At batch size 1 we get about 18 tokens per second in Q4, which is a perfectly usable result (prompt eval time is poor, but I think that's an MoE limitation in llama.cpp which should be fixed shortly?).

Falcon-7B LLM fine-tuning with TRL

This is probably the most interesting benchmark. Full fine-tuning of a 7B parameter LLM needs over 128 GB of memory, requiring the use of unobtanium 4x A100 or 4x A6000 cloud instances or putting $15K into specialized on-premises hardware that will be severely underutilized (given that you are unlikely to be fine-tuning all the time). The big Xeon is able to achieve over 200 tokens per second on this benchmark (with numactl -C 0-47 I was able to achieve about 240 tokens per second on this particular dataset at batch size 16).

It's worth noting that other 7B LLM's perform slower. I don't think this is because Falcon is architecturally different; rather, Falcon was a somewhat off-brand implementation using a bespoke implementation of FlashAttention.

200 tokens per second is pretty decent, allowing 3 epochs of fine tuning on a 10M token dataset in about a day and a half (for reference, openassistant-guanaco, a high quality subset of the guanaco dataset, is about 5M tokens, and a page of text is about 450 tokens).

Conclusions: usable vs useful?

First, without a doubt the three results presented above are usable. A single-socket Sapphire Rapids machine is able to generate images, run inference on a state of the art LLM, and fine-tune a 7B-parameter LLM on millions of tokens, all at speeds which are unlikely to have you tear your hair out. 

On the other hand, is it useful

In the datacenter, we could see a case for image generation, which runs briskly on the Xeon and has a light memory footprint. The problem is, the image generation workload uses all 48 cores for three seconds, which is a poor fit for oversubscribed virtualized environments. On a workstation, the thought of having 48 cores but no GPU is ludicrous, and even AMD GPUs are going to tie the Xeon in Stable Diffusion.

The LLM inferencing use case is a bit different, being primarily bound by bandwidth and memory capacity. The dynamics here are entirely enforced by market conditions, not raw performance or technology supremacy. For example, Sapphire Rapids CPUs are available as spot instances from hyperscalers; GPUs are not. Nvidia also chooses to charge $100 per GB of VRAM on its datacenter parts, but conversely, Intel seems to think its cores are worth $100 each even in bulk. Software support for the Xeon is poor - while llama.cpp is fast, it is primarily a single-user library and has minimal (no?) support for batched serving.

The training benchmark is the most interesting one of all, because it is an example of a workload that will not run at all on most GPU instances, and in fact, there are technology limitations as to why GPUs have trouble reaching 100s of gigabytes of memory. Once again, the Xeon is held back here by Intel's high pricing - a 1S Xeon system costs about $6,000, compared to about $15,000 for a 3x RTX A6000 machine which is significantly faster.

Finally, let's take a look at Intel's most important claim: "the CPU is free because you aren't allowed to not have one". Disregarding hyperscalers which work on a different cost model, a company upgrading to Sapphire Rapids in 2023 is likely still on Skylake. Going from a 20-core Skylake part to a 32-core Sapphire Rapids part represents a 2x boost in general application performance and a ~6x boost in AI performance, except for bandwidth-limited LLM inference where the gains are closer to 2x. Getting 2x more work done with your datacenter and a competitive option to run AI-based analytics or serve AI applications on top of that is a pretty compelling reason to buy a new server, and for a lot of IT departments that's all you need to make the sale.

Tuesday, April 18, 2023

Assorted video decoding tidbits

Some form of hardware acceleration is all-but-mandatory for playing 4K video, especially high bitrate H.265. Nowadays H.265 decoding is commonplace (you need to go all the way back to 2015 to find a CPU or GPU that can't decode HEVC Main10), but presumably as the world transitions to AV1 the same will apply.

Hardware accelerated decode in web browsers on Linux

It works! Well, sort of. On my test machine with an i3-12100F and an ancient Polaris 12 (AMD) GPU running the open source drivers, 1080p H.264 content is properly decoded by UVD but 4K60 VP9 (the famous '4K Costa Rica' demo clip on Youtube) is not. CPU usage seems a bit high in either case, about half a core in the former and 1-2 cores in the latter.

Decode on integrated graphics, display connected to discrete graphics

An esoteric use case. I ran into the H.265 version (!); I had a M4000 I wanted to use for Solidworks and the M4000 does not support HEVC decode, but the iGPU on the i5-12600 it was paired with does.

Unfortunately, this doesn't seem to work. iGPU usage remained zero and the M4000 ran in some sort of weird hybrid decoding mode. Performance, however, was acceptable.

Decode in integrated graphics, multiple GPUs and displays

Can we fix the above by plugging the monitor into the iGPU? The behavior is strange:

Both GPUs are now at 30% usage, but the CPU usage has gone through the roof. Very bad indeed, but possibly fixable with enough effort.

Remarkably, even in this bugged state the Costa Rica clip runs at 60 fps.

VLC + H.265, but the decoding is supposed to be done on the integrated graphics

Unfortunately, the iGPU chooses not to participate, but the hybrid decoding seems to work fine, playing back 4K24 high bitrate video with about 12% CPU usage. It's worth noting however that this is 12% of a 4.8GHz hex core Alder Lake, which is entire laptop CPU from not that long ago or two whole 3GHz Skylake cores.

Heavy decode on integrated graphics

Surprisingly good. On my Kaby Lake laptop, the CPU is able to keep up with about 25% usage while remaining throttled to ~1.6 GHz on battery - the fixed function hardware really does the heavy lifting here and keeps the power consumption down.

Hardware accelerated decode, but there are many cores

The test system was a Epyc 7702 with an RTX 3060, by all means close to the state of the art. I didn't expect problems here, and didn't find any; the 3060 ran at heavy usage on the Costa Rica clip and the CPU was basically idle.

It's unclear what the actual CPU usage was; Task Manager lacks the granularity to deal with so many cores since even 1% is almost an entire core.

Tuesday, February 21, 2023

"Phones are getting better": buying a camera in 2023

It's a tough time to be a camera manufacturer. The mighty ISOCELL HP2 now rules the mobile space, sporting 200 million (!) 0.6um pixels binnable as 12.5M 2.4um pixels. Subelectron read noise, backside illumination, very deep wells, and sophisticated readout schemes allow virtually unlimited dynamic range while not compromising light sensitivity. Practically speaking, the out-of-the-box performance of a state-of-the-art mobile camera greatly exceeds that of any ILC in challenging-but-well-lit conditions: the phone has access to live gyro data for stack alignment, more processing than any ILC could dream of, and is backed by hundreds of millions of dollars of software R&D. It also has access to color science that, no doubt, has been statistically developed to be perceived as "good looking" across a wide demographic of viewers - I'm a firm believer that the best photos are the ones that make other people happy.

We can do some math to see just how screwed ILC's are. The Galaxy S23 Ultra ships with a 23mm f1.7 equivalent lens and a sensor measuring 9.83mm x 7.37mm, for a total sensor area of 72mm2. A full frame sensor measures 864mm2. Light gathering goes as the square of the f-number, so we have the following equivalency:
  • ISOCELL HP2 (1/1.3"): f1.7
  • 4/3": f2.9
  • APS-C: f3.9
  • Full frame: f5.9
At the wide end, ILC's are looking pretty dead: 24mm f5.6 is a reasonable aperture and focal length to shoot at on full frame, and the same performance can be achieved with with a phone. There's some argument that the FF sensor has higher native DR, but the phone has what is more or less a hardware implementation of multi-shot HDR which makes up for the difference. Plus the phone is, you know, a phone, and fits in your pocket.

Astute readers will note that 23mm is awful wide, and it's true - the effective sensor area of the phone decreases if you want a tighter focal length. Taking a look at a 2x crop (46mm equivalent), the sensor area of the phone drops to a rather shabby 18mm2, so the equivalency is now:
  • 4/3": 5.9
  • APS-C: f7.8
  • Full frame: f11.7
The ILC suddenly looks much more compelling - if you forced me to shoot at 45mm f11 all day I'd abandon photography and take up basket weaving.

This makes shopping a whole lot easier: phones obsolescing the moderate-wide-end means that zooms which include 50mm are are pretty useless. Suddenly, 50mm primes look real interesting again, especially since we are now spoiled for choice in the 50mm space. 4/3" cameras, which looked pretty dead for a while, also suddenly look viable: subjects you would shoot with a 50mm prime are often DOF-limited which means the larger sensors can't take advantage of faster apertures.

Things get trickier at longer focal lengths, because you are less likely to be DOF-limited. Wait, what?! Don't telephotos have a shallower depth of field? Well, it turns out for moderate focus distances, the DOF of a lens is proportional to the f-number, and inversely proportional to the square of the magnification. More likely than not, telephoto subjects are large, and since the sensor area is constant in a given camera the DOF actually increases if you stand far away enough to fit the subject on the sensor.

Telephotos provide a compelling argument to buy a full-frame body. Regardless of the sensor format telephotos are going to stay a constant size because they are dominated by their large front elements, and in the types of lighting you might want to use a telephoto lens you are often struggling for light. That premium 35-100/2.8 for 4/3" looks real nice until you remember it is the optical equivalent of a 70-200/5.6 on full frame, a lens so sad that they don't even make one.

Finally, there's image stabilization. I would argue that IS is mandatory for a good experience, especially for new users: for stationary subjects IS gives you something like three stops of improvement, allowing stabilized cameras and lenses to beat un-stabilized cameras with sensors ten times the size. The importance of that cannot be emphasized enough: that 18mm2 of cropped phone sensor can gather as much light as a 180mm2 (nearly 4/3 sized) sensor with an f1.7 lens on it factoring in IS. This unfortunately throws a wrench in many otherwise-sound budget combos: short primes didn't ship with IS until quite recently, and many budget bodies are un-stabilized.

With all that said, here are some buying suggestions:

The $500 "I'm poor" combo

The situation is dire. Long ago, I would have recommended an old 17-50mm f2.8 stabilized zoom from a third party and a used entry-level DSLR. Unfortunately, you'd be insane to tell a new user to buy an entry-level DSLR in 2023 (people expect features like "working live view" and "4K video") and the third-party zooms don't work with most mirrorless cameras. What we really want is a stabilized 40-50mm f2-2.8 equivalent (that's 40-50mm equivalent focal length, f2-2.8 real aperture) on a body that supports 4K video and PDAF, but inexplicably, that combination does not exist, even in the micro-4/3 world (which has had IBIS for a long time).

Consolation prize: any of the 24MP Nikon DSLR's, plus a Sigma 17-50/2.8 OS, used, but I wouldn't recommend it.

The $1000 combo

This used to be a downright reasonable price point, but inflation and feature creep have somewhat diluted it. Fortunately, the long-lived life cycles of Sony cameras help you here: a6500's are regularly available, used, for $600-700 leaving you with $300 for a lens. By some crazy miracle of third-party lenses you can fit two autofocus lenses into $300 - the Rokinon 35/2.8 is cheap and small, and the other can be 'to taste'.

The drawback here is that as a new system, long lenses for Sony tend to be rather inaccessible, but the same can be said of any other mirrorless-only system and the starter offerings for the Canon/Nikon ecosystem are very poor in comparison. It's also worth noting that there's nothing good at this price point new.

Recommendation: the most beat-up a6500 you can find, a used Rokinon 35/2.8, and one other lens, or save up and buy a second lens which costs more than $150 :)

The photographer's special: a D800, 50mm 1.8G, and 70-200 VR1. You lose a lot of features (stabilization, eye AF, 4K video, touchscreen) but optically the D800 is as good as they get, and the two lenses will let you take pictures none of your friends can. Highly recommended if you've spent some time behind a camera before - otherwise, the transition from phone to optical viewfinder may be a bit jarring.

The dubious alternative: an a7R ii and Rokinon 45/1.8. The a7R ii checks every box: stabilization, full frame, BSI, 4K video, but still manages to be a poor user experience nonetheless thanks to its ill-thought-out controls and menus. If you're fine with that, you get unsurpassable (as in, the sensor is limited by the laws of physics) optical performance for $1000.

The $1500 combo

Things start getting a little weird here. The a6500 is a really good camera, and its nice and compact too. The E-mount ecosystem matured quickly, with a ton of off-brand companies making decent prime lenses at ludicrously good prices. I would argue if you're content shooting short-to-moderate focal lengths, you are better off staying in the E-mount ecosystem - you can buy an a7iii and a nice starter prime for $1500, then (quickly) build out the system from there.

If you want to shoot long lenses, Sony no longer looks so sweet. The 70-200 options from the big three are comparable: the Sony GM Mark I is a native lens which is about $1500, the same price as the adaptable-without-penalties Nikon -FL but optically inferior. The EF-mount Mark III is the same price but worse than the FL (and better than the GM); the EF-mount Mark II is $500 less and probably superior to the Nikon VR2. The Nikon VR1 is incredibly cheap for a modern pro lens, but the corners are dubious which is a disaster for some people and a non-issue for others.

Above 200mm, Sony is out - the 200-600 is a very good budget 600mm option but pretty pathetic at 300 and 400mm. It's also very expensive: if you accept the extending barrel, the 150-600 options from third parties are $700 cheaper. Among Canon and Nikon, Nikon wins on the budget end (the Z6 was a very usable camera, the original EOS R was not), but Canon just announced some new releases so we should expect prices to move down across the stack.

Recommendation: a7iii plus your favorite primes (or a7R ii and your favorite primes if you don't shoot video at all)

Recommendation: Z6 Mark I, FTZ, 40mm f2, and 70-200 VR1. This comes out to $1800, and the VR1 recommendation is going to make a lot of people angry, but you'll be getting beautiful images for years to come (or until you drop the extra $1000 on the FL).

"I have money, help me spend it"

I'm a Nikon shooter so I'll just provide a Nikon kit. This assumes you have plenty of cash, but you aren't interesting in wasting it.

Body: Z6 Mark I. The Mark II is not worth the extra cash (unless you really need the second slot). Also, an FTZ to go with it.
Short lens: 40mm f2. There are so many options here and they're all good, so really, pick your poison.
Telephoto: Unfortunately, save up for the FL. It's not that much better than the VR2, but it fixes the focus breathing on the VR2 that was problematic at portrait focal lengths. The FL also performs exceptionally with teleconverters, so pick up a TC14 and TC20 III and get your 280mm f4 and 400mm f5.6 for "free".
Long lens: A sane man would recommend the 200-500 f5.6. I would also put a vote in for the 300/2.8 AF-S, which is only about $700 used and works well with teleconverters to get to 600mm. A madman would go all in and buy a 600/4 but seriously, don't do it - shooting with that lens is a serious commitment.
Ultrawide: Get fucked. You absolutely want a native ultrawide since the short flange distance on the Z bodies makes them much better, but Nikon refuses to provide a 16-35/2.8 which doesn't suck. I guess the 17-28/2.8 will have to do? The runner up would be the 14-30/4, but f4 isn't f2.8.

Sony will give you the same thing, trading the a better ultrawide for a worse long lens. Canon is...honestly superior, I think, but I am not familiar enough with Canon to make a $5000 recommendation.

Friday, September 16, 2022

The best little telescope that'll never get built

 I really love the IMX183 from Sony. For a very good price, you get 20 million BSI pixels with photon-counting levels of read noise and negligible dark current - truly a miracle of the economies of scale. The sensor is also large enough to have good light gathering capabilities, yet small enough to work with compact optics.

The problem is taking advantage of all 20 million of those pixels. 20 MP isn't that much - 5400-pixel horizontal resolution leaves you with precious little room to crop to 4K so actually resolving 20MP is important. You can't really buy a telescope that resolves 2.4um pixels - in the center anything diffraction limited will work, but if you want a fast, wide system getting good performance in the corners isn't happening. Obviously, the answer is to design your own telescope.

Now, I have zero interest in making telescopes - grinding your own lenses is ass and generally a poor use of time. If I wanted to spend time doing repetitive tasks I'd pick up embroidering or something. As such the system is designed to be reasonably manufacturable by an overseas vendor.

The telescope is a Houghton-type design with integrated field flattener, a focal length of 150mm, an entrance pupil of 76mm (3"), and a 6-degree field of view covering a 16mm sensor. The optical performance is pristine:

Normally, I would not vouch for a form like this - the telescope is much longer than its focal length, the image plane is inside the tube, and two full-aperture refractive elements are needed. However, in this case, we are building a small, tightly-integrated system: polishing a 3" lens is easy, the difference between a 6" and 10" OTA is negligible from a practical standpoint, and the IMX183 can easily be fit into the tube.

The glasses aren't great - the automatic glass substitution tool came up with BSM28 and BSM9 which are moderately costly, infrequently-melted glasses. The chemical properties are poor, basically akin to those of ED glass, but 77mm clear filters are cheap and readily available. The elements don't have any serious manufacturing red flags, though the curvatures are a bit steep compared to your usual refractor doublet.

Sunday, March 20, 2022

Image a DLP down through a microscope objective and look at it with a camera through the same objective


For some reason it's taken me a while to actually do this, but its easy to expose high resolution film or resist with a DLP. All you need is a camera, a beam splitter, a DLP, and some microscope parts.

The DLP is an ordinary DLP projector with the lens removed - I used a TI dev kit because the firmware offers nice features like "disable the LEDs" which can be used to, for example, replace the blue LED with a violet one for exposing resist. Unless you have a really exotic objective it is probably best to use a g-line (430nm) LED rather than a 405nm LED - the 430nm parts are hard to find but objectives will be quite dispersive in the far violet unless they are specifically designed to be apochromatic from violet to red. Your best bet is probably to leave the green LED in place for focusing and replace the blue LED with a violet one. A regular pico-projector will work just fine, but is less convenient to work with. You do save about $800 at current street prices though.

The light from the DLP reflects off a beam splitter and passes through the tube lens, which is just a 200mm a(po)chromat. A 200mm doublet would probably work fine here, but you can get better corner performance from a triplet or a quadruplet - the DLP is not very large though, so it might not matter.  The tube lens collimates the light and sends it down the microscope objective, which focuses it on the target.

The camera looks down the other beam splitter path and sees the light scattered off the target. Unfortunately, this technique only works with optically smooth targets - otherwise, the camera sees the surface texture of the target and not the imaged patterns displayed on the DLP.

Parfocalizing the camera and the DLP is easier than it seems - the object side of the tube lens is something like f/50 so the depth of field is very large. Roughly speaking, there is only a small range of distances where the the objective comes into focus at all. Either by reading the datasheet or by trial and error it is possible to roughly set the backfocus, then the camera and target position are adjusted for best focus. Once a starting position is found, the backfocus can be adjusted to minimize spherical aberration (microscope objectives gain some spherical aberration if the conjugates aren't the right ones).

The pixel scale in the capture is 560nm/pixel, so the Windows clock is only a few microns long. Performance is as expected but it is always entertaining to use the Windows desktop on a 1mm wide screen :)