Thursday, March 29, 2018



Last November, the excellent Cineform codec went open-source. Cineform is a high-quality intermediate codec in the same spirit as DNxHR and Prores, with the notable distinction that it is based on wavelet, as opposed to DCT, compression.

Wavelets are great for editing; because the underlying transforms operate on the entire frame, wavelet codecs are free of the banding and blocking artifacts that other codecs suffer from when heavily manipulated. The best-known wavelet codec is probably RED's .R3D format, which holds up in post-production almost as well as uncompressed RAW.

Cineform has a few other cool tricks up its sleeve. Firstly, it is fast; the whole program is written using hand-tuned SSE2 intrinsics. It also supports RAW, which is convenient; encoded RAW files can be debayered during decoding into a large variety of RGB or YUV output formats, which helps in maintaining a simple workflow - any editor which supports Cineform can transparently load compressed RAW files.


I wanted to do some basic benchmarking on 12-bit 4K RAW files to get an idea of what kind of performance the encoder is capable of. All tests were done on a single core of an i7-4712HQ, which for all intents and purposes is a 3GHz Haswell core. As encoding is trivially parallelized (each core encodes one frame), the results should scale almost perfectly to many-core systems.

The test image chosen was the famous 'Bliss' background from Windows XP:

As Bliss is only available in an 8-bit format, for the 12-bit encoding tests the bottom 4 bits were populated with noise (a worst-case scenario for the encoder). Frame rates were calculated by measuring the time it took to encode the same frame 10 times with a high-resolution timer. As the frames do not fit in L3, discrepancies caused by cached data should not be an issue.
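The bit-depth expansion used to build the test frames can be sketched as follows (an illustrative Python sketch, not the actual benchmark harness; the noise source here is hypothetical):

```python
import random

def expand_8_to_12(pixels_8bit, rng=random.Random(0)):
    """Shift 8-bit samples into the top 8 bits of a 12-bit word and
    fill the bottom 4 bits with noise - a worst case for the encoder,
    since the low bits are then incompressible."""
    return [(p << 4) | rng.getrandbits(4) for p in pixels_8bit]
```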


All four quality settings can fit a 4K 120 fps 12-bit stream under the bandwidth of a SATA 3.0 link. Furthermore, the data rates are under 350 MB/s, so there exist production SSDs that can sustain the requisite write speeds. Unfortunately, FILMSCAN1 and HIGH require pretty beefy processors (8+ cores) to sustain 120 fps; a 6c/12t 65W Coffee Lake is borderline even with HT (you don't get much headroom for running a preview, rearranging data, etc.). An 8700K (6c/12t, 95W) can handle it with room to spare, but at the expense of power consumption - 8700Ks actually draw more than 95W under heavy load. MEDIUM and LOW easily fit on a 65W processor. The upcoming Ice Lake (8c/16t, 10nm) processors should improve the situation, allowing 4K 120 fps to be compressed on a 65W processor at the highest quality setting.
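As a sanity check on those bandwidth claims, the uncompressed data rate works out as follows (assuming UHD 3840x2160 frames with one 12-bit sample per photosite; the exact frame dimensions are an assumption):

```python
def raw_data_rate_mb_s(width, height, bits, fps):
    """Uncompressed RAW data rate in MB/s (one sample per photosite)."""
    return width * height * bits / 8 * fps / 1e6

# Assumed UHD frame size: the uncompressed stream is ~1493 MB/s, so
# fitting under SATA 3.0 (~600 MB/s) requires at least ~2.5:1
# compression, and fitting under 350 MB/s requires at least ~4.3:1.
uncompressed = raw_data_rate_mb_s(3840, 2160, 12, 120)
```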

Going beyond that, 4K 240 fps seems within reach. Using existing (Q2 '18) hardware, LOW and MEDIUM are borderline for a hotly clocked 8700K, with the likelihood of consistent performance increasing if data reordering and preview generation are offloaded. Moving to more exotic hardware, the largest Skylake Xeon-D processors (D-2183IT, D-2187NT, and D-2191) should be capable of compressing HIGH in real time, if not at 240 fps then almost certainly at 200 (a lot will depend on thermals, implementation, HT efficiency, and scaling, especially since Xeon-D is very much a constant-current, not constant-performance, processor).

Anything faster than 4K 240 fps (e.g. a full implementation of the CMV12000, which can do 4K 420 fps) will require some kind of tethered server with at least a 24c Epyc or 18c Xeon-SP processor (and the obvious winner here is Epyc, which is much cheaper than the Xeon).

Quick Update: a Faster Processor

Running a simple test on an aggressively tuned processor (8700K@4.9GHz), we get FILMSCAN1 at 25.5 fps, HIGH at 28.9 fps, MEDIUM at 39.6 fps, and LOW at 50.6 fps. 4.9 GHz is a little beyond the guaranteed frequency range of an 8700K (they can all do 4.7GHz, which is the max single-core turbo applied to all cores), but practically all samples can do it anyway. This suggests a neat rule of thumb: LOW is good for twice the frame rate of FILMSCAN1, both in data rate and compression speed.
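Since encoding parallelizes one frame per core, the measured single-core numbers translate directly into core counts (assuming ideal scaling, which real chips won't quite hit at all-core clocks):

```python
import math

def cores_needed(target_fps, measured_fps_per_core):
    """Each core encodes one frame, so throughput scales ~linearly
    with core count (ignoring clock droop at all-core load)."""
    return math.ceil(target_fps / measured_fps_per_core)

# With the 4.9GHz single-core numbers above:
#   FILMSCAN1: 120 / 25.5 -> 5 cores
#   LOW:       120 / 50.6 -> 3 cores
```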

Addendum: Cineform's packed 12-bit RAW format

I have never seen such an esoteric way to pack 12-bit pixels (and after spending many hours trying to figure it out, I now understand why the poor guy who had to crack the ADFGVX cipher became physically ill while doing it).

The data is packed in rows of most significant bytes interleaved with rows of least significant nibbles (two to a byte). More precisely, two rows of MSBs (each IMAGE_WIDTH bytes long) are packed, followed by one full-width row (also IMAGE_WIDTH bytes long) containing the least significant nibbles of the previous two image rows.

To add to the confusion, the rows are packed as R R R ... R G G G ... G or G G G ... G B B B ... B (depending on which row of the Bayer filter the data is from); in other words, the even-column data is packed in a half row, followed by the odd-column data. This results in a final format like so:

R R R ... R G G G ... G
G G G ... G B B B ... B

I am not sure why the data is packed like this (for all I know it's not, and there is a bug in my packing code...) but I suspect it is for some kind of SSE2 efficiency reasons. I also haven't deciphered how the least significant nibbles are packed (there is no easy way to inspect 12-bit image data), but hopefully it is similar to the most significant bytes...
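For what it's worth, here is a sketch of the layout as I currently understand it (Python for clarity; the nibble ordering within each byte is pure guesswork, as noted above, and the even/odd column shuffle is assumed to have already been applied to the input rows):

```python
def pack_two_rows(row0, row1):
    """Pack two rows of 12-bit samples into two MSB rows followed by
    one row of least-significant nibbles.  Putting row0's nibble in
    the high half of each byte is an assumption."""
    assert len(row0) == len(row1)
    msb0 = bytes((p >> 4) & 0xFF for p in row0)
    msb1 = bytes((p >> 4) & 0xFF for p in row1)
    lsn = bytes(((a & 0xF) << 4) | (b & 0xF) for a, b in zip(row0, row1))
    return msb0 + msb1 + lsn

def unpack_two_rows(buf, width):
    """Inverse of pack_two_rows, for round-trip checking."""
    msb0, msb1, lsn = buf[:width], buf[width:2*width], buf[2*width:]
    row0 = [(m << 4) | (n >> 4) for m, n in zip(msb0, lsn)]
    row1 = [(m << 4) | (n & 0xF) for m, n in zip(msb1, lsn)]
    return row0, row1
```

At minimum, the round trip is self-consistent, which makes it easy to test against real encoder output.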

Monday, January 29, 2018

'plotter' and 'logger': Cycle by Cycle Data Logging for Motor Controllers

Ever since I started writing motor control firmware I've been pursuing higher and higher data logging rates. Back in the days of printing floats over 115200-baud serial on an ATmega328, performance was pretty poor, but now that high-performance ARM devices are available with much better SPI and serial speeds and an order of magnitude more RAM, we can do some pretty cool things. The holy grail is to log relevant data at the switching frequency to some sort of large persistent storage; this gives us the maximum amount of information the controller can see for future analysis.


logger is less of a program and more of a set of tricks to maximize transfer and write performance. These tricks include:
  • Packed byte representation: this one should be pretty obvious; rather than sending floats we can send approximate values with 8 bits of resolution. While we no longer need commas or spaces between data points, it is important to send some sort of unique header byte at the start of each packet; without it, a dropped byte will shift the reading frame during unpacking and render all subsequent data unusable. I use 0xFF (and clip all data values to 0xFE); if more metadata is required, setting the MSBs of the data values to zero gives us 127 different possible header bytes at the expense of 1 bit of resolution. The latter method also gives us easy checksumming (the lower 7 bits of the header byte can be the packet bytes XOR'ed together); however, in practice single flipped bits are rare and not that significant during analysis, as it is usually obvious when a bit has been flipped - conversely, if your data is so noisy that you don't notice an MSB being flipped, you probably have other problems on your hands...
  • Writing entire flash pages at once: this is incredibly important. SD cards (and more fundamentally, all NAND flash) can only be written to in pages, even if there is no filesystem. Writing a byte and writing a page take the same amount of time; on a typical SD card, a page is 512 bytes, so buffering incoming data until a full page is received results in a 1-2 order of magnitude improvement in performance.
  • Dealing with power losses: the above point about the importance of writing full pages is actually somewhat facetious. Normally, filesystem drivers and drive controllers will intelligently buffer data to maximize performance, but this is contingent on calling fclose() before the program exits - not calling fclose() or fflush() may result in no data being written to the disk at all. Having some kind of "logging finished, call fclose() and exit" button is not ideal; if an 'interesting' event happens we usually want to capture it, but in the event of a fault the user is probably distracted by other things (battery fire, rampaging robot, imminent risk of death) and is probably not thinking too hard about loss of data. The compromise is to manually call fflush() once every few pages to save the log to disk without losing too much performance. Depending on the filesystem implementation you are using, data may be flushed automatically at reasonable intervals.
  • Drive write latency and garbage collection: this is a problem that nearly sank the SSD industry back in its infancy. Drives which are optimized for sequential transfer (early SSDs and all SD cards) typically have firmware with very poor worst-case latencies. Having the card pause for half a second every few tens of megabytes is hardly a problem when the workload is a few 100MB+ sequential writes (photos, videos), but is a huge problem when the workload is many small 4K writes, as some of those writes will take orders of magnitude longer than the others. The solution is to keep a long (100-page) circular buffer, with the receiving thread adding to the head and the writing thread clearing page-sized chunks off of the tail. The long buffer amortizes any pauses during writing; as long as the average write speed over the entire buffer is high enough, no data will be lost.
  • Delta compression: I have not tried this, but in theory sending or writing (or both) packed differences between consecutive packets should yield a significant boost in performance by reducing the average amount of data sent. This should be especially true if the sample rate is high (so the difference between consecutive data points is small).
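The header-byte framing from the first bullet can be sketched like this (illustrative Python, not the actual firmware; because payloads are clipped to 0xFE, the header is unambiguous and a dropped byte costs at most one packet instead of the rest of the log):

```python
HEADER = 0xFF

def frame_packet(values):
    """Clip samples to 0xFE so the 0xFF header byte stays unique,
    then prepend the header."""
    return bytes([HEADER] + [min(v, 0xFE) for v in values])

def parse_stream(data, payload_len):
    """Split on the header byte; runs of the wrong length (caused by
    dropped bytes) are discarded rather than shifting the reading
    frame for everything that follows."""
    chunks = data.split(bytes([HEADER]))
    return [list(c) for c in chunks if len(c) == payload_len]
```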
Here is a sample sending program which sends data in 9-byte packets (including header) from a 5KHz interrupt over serial, and here is the matching receiver which writes the binary logs to an SD card with some metadata acquired from an external RTC and IMU module.
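The page-buffering and circular-buffer tricks above reduce to something like this (an illustrative sketch; the class shape and PAGE_SIZE constant are mine, not the firmware's - the writing thread would feed each popped page to the SD driver):

```python
from collections import deque

PAGE_SIZE = 512  # typical SD card page

class PageBuffer:
    """Circular buffer between the receiving and writing threads:
    bytes go in at the head, full pages come off the tail.  A deep
    buffer (~100 pages) rides out the card's occasional long
    garbage-collection pauses."""
    def __init__(self):
        self.buf = deque()

    def push(self, data):
        self.buf.extend(data)

    def pop_page(self):
        """Return one full page for writing, or None if a full page
        has not accumulated yet."""
        if len(self.buf) < PAGE_SIZE:
            return None
        return bytes(self.buf.popleft() for _ in range(PAGE_SIZE))
```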


I wrote plotter after failing to find a data plotting application capable of dealing with very large data sets. Mathematica lacks basic features such as zooming and panning (excuse me? not acceptable in 2018!), Matlab becomes very slow after a few million points, and Plotly and Excel do all sorts of horrible things after a couple hundred thousand points.

plotter uses a screen-space approach to drawing its graphs in order to scale to arbitrarily large data sets. Traces are first re-sampled along an evenly spaced grid (a sort of rudimentary acceleration structure). Then, at each column of the screen, y-coordinates are interpolated from the grid based on the trace-space x-coordinate of the column. Finally, lines (actually, rectangles) are drawn between the appropriate points in adjacent columns.

The screen-space approach allows performance to be independent of the number of data points; instead, it scales as O(w*n), where w is the screen width and n is the number of traces. It also guarantees that any lines drawn are at most two pixels wide, which allows for fast rectangle-based drawing routines instead of costly generalized line-drawing routines (on consumer integrated graphics, the rectangles are several times faster than the corresponding lines). As a result, plotter is capable of plotting hundreds of millions of data points at 4K resolutions on modest hardware.
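The per-column pass can be sketched as follows (a simplified Python sketch that folds the grid re-sampling and per-column interpolation stages into one function; the real renderer also draws the connecting rectangles between adjacent columns):

```python
import bisect

def resample_columns(xs, ys, x_left, x_right, width):
    """For each of `width` screen columns, compute its trace-space x
    and linearly interpolate a y value from the x-sorted trace.  Per
    frame this is O(width) columns, each with a logarithmic search -
    effectively independent of the point count."""
    out = []
    for col in range(width):
        x = x_left + (x_right - x_left) * col / (width - 1)
        i = bisect.bisect_left(xs, x)
        if i == 0:
            out.append(ys[0])            # clamp left of the data
        elif i == len(xs):
            out.append(ys[-1])           # clamp right of the data
        else:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            out.append(ys[i - 1] + t * (ys[i] - ys[i - 1]))
    return out
```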

For the sake of generality, the current implementation loads CSV files and internally operates on floating-point numbers. There's a ton of performance to be gained by loading binary files and keeping a 32-bit x-coordinate and an 8-bit y-coordinate (which would lower memory usage to 5 bytes per point), but that comes at the expense of interoperability with other programs. The basic controls are:
  • Cursors:
    • Clicking places the current active cursor. Clicking on a trace toggles selection on that trace and puts the current active cursor there.
    • [S] switches active cursor (and allows you to place the second cursor on a freshly opened instance of the program). If visible, clicking on a cursor switches to it.
    • [C] clears all cursors.
  • Traces:
    • Clicking a trace toggles selection.
    • Clicking on the trace's name in the legend toggles selection. This is useful and necessary when multiple traces are on top of each other.
    • [H] hides selected traces, [G] hides all but the selected traces, and [F] shows all traces.
  • Navigation:
    • The usual actions: click and drag to zoom in on the selected box, scroll to zoom in centered around the cursor, middle click and drag to pan.
    • Ctrl-scroll and Shift-scroll zoom in on the x and y-axes only, centered around the cursor.
    • Placing the mouse over the x or y-axis labels and scrolling will zoom in on that axis only, centered around the center of the screen.
  • File loading:
    • plotter loads CSV's with floating point entries.
    • The number of entries in the first row of the input file is used to determine the number of channels. From there on, extra values in rows are ignored, and missing values at the end get copied from the previous row.
    • In Windows, drag a CSV onto the executable to open it. Note that this will cause the program to silently exit with no error information if the file is invalid.
  • Configuration:
    • plotter.txt contains the sample spacing (used to calculate derivatives and generate x-labels), the channel colors, and the channel names. If the config file is missing, all the traces will be black and the channel names will all be 'Test Trace'.
    • The program will crash if arial.ttf is not in the program directory.
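The row-handling rules from the file-loading section reduce to something like this (illustrative Python; the real loader is C++ and handles the full parsing path):

```python
def load_rows(lines):
    """Parse CSV lines: the first row fixes the channel count n,
    extra values in later rows are ignored, and missing trailing
    values are copied from the previous row."""
    rows = []
    n = 0
    for line in lines:
        vals = [float(v) for v in line.split(',')]
        if not rows:
            n = len(vals)       # first row sets the channel count
        vals = vals[:n]         # drop extra values
        if rows and len(vals) < n:
            vals += rows[-1][len(vals):]  # fill from previous row
        rows.append(vals)
    return rows
```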
You can get a Windows binary here; source code will be uploaded once it is tweaked to work on Linux.

Monday, October 23, 2017

Naked Go-Kart

The big go-kart came apart for some cleaning, re-painting, and general beautification recently; this was also a good opportunity to get some shots of a relatively uncluttered frame for documentation purposes.

What a contraption. Towards the back you can see the Polychain GT Carbon drive, which has replaced the old multi-V belts.

Such Art (24mm, to be precise)

Sunday, October 1, 2017

2016 (Third Generation) Smart ForTwo Electric Drive Drive Unit Teardown

In addition to the Leaf drive unit, we recently acquired a Smart ForTwo drive unit. This one is somewhat of a unicorn - there are no images available, and little information other than that it uses a Bosch SMG 180/120 motor and is rated at 55KW. It is also of particular interest because the SMG 180/120 is an off-the-shelf motor and therefore should be easy to integrate into other vehicles; in addition, it has a distinguished heritage, being used in such applications as the Fiat 500e and the front axle of the Porsche 918.

Pre-shucking drive unit shot:

Integration is...poor. This makes sense since the Electric Drive was never a high volume vehicle - it was probably a compliance car designed to boost Mercedes' fleet mileage, and relied heavily on subsidies in EV-friendly states to break even.

Donor vehicle's tag, for future reference:

First, a diversion - the air conditioner compressor:

The compressor is fully integrated - DC and CAN in. Internally, it is a scroll compressor:

and an interior PM motor:

with a small inverter mounted to the back. Useful? Unlikely, unless you need a 370VDC air conditioner.

Time to move on to the meatier components. The integrated charger is a 6.6KW unit made by Lear:

Once again, severe coolant draining was required before we could proceed further:

Once the coolant was drained, the charger came off with a few bolts, revealing a nice self-contained unit:

We didn't look further into the charger, as we had minimal interest in turning it on, but it shouldn't be too hard to get running. Of note - the charger doesn't manage balancing, only AC-DC conversion.

Let's look at the inverter next. Aha! A part number! The inverter is an EFP 2-3, made by Continental and sold by Zytek.

A little bit of digging shows that it is a 235A continuous, 355A peak unit. A handful of bolts releases the inverter from the rest of the unit:

Very small and cute. The HV cables probably weigh more than the power electronics.

At this point we had released the motor-gearbox unit from the rest of the drive unit. Already looks promising - nice round motor, practically movable by a single person.

Hi Charles!
After splitting the gearbox from the motor...

...we are left with the most adorable little traction motor:

The mounting pattern is very convenient:

Though dangling the 32kg motor off of those tabs is probably less than wise.

The gearbox is almost identical to the Leaf's, just smaller:

Internally, it is very similar:

We attempted to look inside the inverter next. It started out promising:

Oh hey, pin-style channels, presumably to reduce pressure drop.

The HV cable harness is...

...plugged in via giant blade terminals?!

Unfortunately, at this point we were defeated - the inverter housing had some concerning ribbon cables running to the internal boards, and it was very unclear how to split the housing without tearing up the cables.

Overall, this drive unit is very promising for small vehicle conversions. The poor integration is a blessing, as everything has sensible mounting holes, and there's a slim chance that it is possible to acquire the datasheet for the inverter and motor from their respective OEMs. Unfortunately 55/80KW (the latter is Bosch's peak rating for the motor) and 200Nm is not enough for a full car conversion - you would need a pair of them to get good power.

2013 (Second Generation) Nissan Leaf Drive Unit Teardown

We recently came across a Nissan Leaf drive unit:

Not shown: CHAdeMO charger blob
and of course had to look inside.

First, some outside dimensions for reference:

The drive unit is highly integrated; for example, the motor phase connections are short busbars:

Removing the dozen-odd screws and the three terminal screws allows us to separate the inverter from the motor...

...resulting in a very strange-looking U-shaped motor.

Pulling the next dozen bolts holding the differential and gearbox to the motor separates the gearbox from the motor:

A few more shots of the motor:

The natural next step was to look inside the gearbox...

...which unfortunately meant draining a liter of suspicious red Nissan Leaf fluid out first.

After that, it was simple enough to remove yet another dozen screws and split the gearbox housing:

Nothing much to see here, standard single stage helical gear going into an open differential. More pictures:

Overall reduction is a little over 8:1.

Next we look inside the inverter. First, a quick look at the waterblock channels:

Nothing to see here, standard cast channels.

Cracking the power electronics enclosure open required breaking through a lot of RTV sealant...

...revealing the gorgeous (and never-before-seen!) innards:

The inverter is much less dense than we had anticipated; the IGBT's have their own module (rather than being brazed straight to the waterblock), and there is a lot of empty space over the PM.

Controller is unfortunately based around a datasheet-less Renesas microcontroller, as are all Japanese automotive electronics.

Capacitor is 1088uF, 600V, SH film:

And a lot smaller than the one inside the 2nd-gen Prius.

Few more shots, including gate drive power supplies:

Overall, very well integrated with few surprises. I would not even dream of reprogramming this inverter, as dialing in the motor tuning for something this large would be very involved. My one comment is that this motor is probably good for much more than 80KW peak - judging by its size I would venture to say it is a 200KW-class motor.