Monday, January 29, 2018

'plotter' and 'logger': Cycle by Cycle Data Logging for Motor Controllers

Ever since I started writing motor control firmware I've been pursuing higher and higher data logging rates. Back in the days of printing floats over 115200 baud serial on an ATMega328, performance was pretty poor, but now that high-performance ARM devices are available with much better SPI and serial speeds and an order of magnitude more RAM, we can do some pretty cool things. The holy grail is to log relevant data at the switching frequency to some sort of large persistent storage; this gives us the maximum amount of information the controller can see for future analysis.

'logger'

logger is less of a program and more of a set of tricks to maximize transfer and write performance. These tricks include:
  • Packed byte representation: this one should be pretty obvious; rather than sending floats we can send approximate values with 8 bits of resolution. While we no longer need commas or spaces between data points, it is important to send some sort of unique header byte at the start of each packet; without it, a dropped byte will shift the reading frame during unpacking and make all subsequent data unusable. I use 0xFF (and clip all data values to 0xFE); if more metadata is required, setting the MSBs of the data values to zero gives us 127 different possible header values at the expense of 1 bit of resolution. The latter method also gives us easy checksumming (the lower 7 bits of the header byte can be the packet bytes XOR'ed together); in practice, though, single flipped bits are rare and not that significant during analysis, as it is usually obvious when a bit has been flipped - conversely, if your data is so noisy that you don't notice an MSB being flipped, you probably have other problems on your hands. A minimal packing sketch appears after this list.
  • Writing entire flash pages at once: this is incredibly important. SD cards (and more fundamentally, all NAND flash) can only be written to in pages, even if there is no filesystem. Writing a byte and writing a page take the same amount of time; on a typical SD card, a page is 512 bytes, so buffering incoming data until a full page is received results in a 1-2 order of magnitude improvement in performance.
  • Dealing with power losses: the above point about how important writing full pages is turns out to be somewhat facetious. Normally, filesystem drivers and drive controllers will intelligently buffer data to maximize performance, but this is contingent on calling fclose() before the program exits - not calling fclose() or fflush() may result in no data being written to the disk at all. Having some kind of "logging finished, call fclose() and exit" button is not ideal; if an 'interesting' event happens we usually want to capture it, but in the event of a fault the user is probably being distracted by other things (battery fire, rampaging robot, imminent risk of death) and is probably not thinking too hard about loss of data. The compromise is to manually call fflush() once every few pages to save the log to disk without losing too much performance. Depending on the filesystem implementation you are using, data may be flushed automatically at reasonable intervals.
  • Drive write latency and garbage collection: this is a problem that nearly sank the SSD industry back in its infancy. Drives which are optimized for sequential transfer (early SSDs and all SD cards) typically have firmware with very poor worst-case latencies. Having the card pause for half a second every few tens of megabytes is hardly a problem when the workload is a few 100MB+ sequential writes (photos, videos), but is a huge problem when the workload is many small 4K writes, as some of those writes will take orders of magnitude longer than the others. The solution is to keep a long (~100 page) circular buffer with the receiving thread adding to the head and the writing thread clearing page-sized chunks off of the tail; a sketch of this buffering scheme appears below. The long buffer amortizes any pauses during writing; as long as the average write speed over the entire buffer is high enough, no data will be lost.
  • Delta compression: I have not tried this, but in theory sending or writing (or both) packed differences between consecutive packets should yield a significant boost in performance by reducing the average amount of data sent. This should be especially true if the sample rate is high (so the difference between consecutive data points is small).
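
To make the packing concrete, here is a minimal sketch in C of both framing schemes described above (the 9-byte packet size, 8-channel layout, and function names are made up for this example; the linked firmware below is the real thing):

    /* Packing sketch (illustrative names): one 9-byte packet per control
     * cycle, a 0xFF header followed by eight 8-bit samples clipped to 0xFE
     * so a data byte can never be mistaken for the header. */
    #include <stdint.h>

    #define PACKET_CHANNELS 8
    #define HEADER_BYTE     0xFF

    static void pack_packet(const uint8_t samples[PACKET_CHANNELS],
                            uint8_t out[PACKET_CHANNELS + 1])
    {
        out[0] = HEADER_BYTE;
        for (int i = 0; i < PACKET_CHANNELS; i++)
            out[i + 1] = (samples[i] == 0xFF) ? 0xFE : samples[i];  /* clip */
    }

    /* 7-bit variant: data MSBs cleared, header MSB set, lower 7 bits of the
     * header carry the XOR of the data bytes as a cheap checksum. */
    static void pack_packet_7bit(const uint8_t samples[PACKET_CHANNELS],
                                 uint8_t out[PACKET_CHANNELS + 1])
    {
        uint8_t checksum = 0;
        for (int i = 0; i < PACKET_CHANNELS; i++) {
            out[i + 1] = samples[i] >> 1;  /* drop 1 bit so the MSB stays clear */
            checksum ^= out[i + 1];
        }
        out[0] = 0x80 | checksum;          /* MSB marks the header byte */
    }
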
Here is a sample sending program which sends data in 9-byte packets (including header) from a 5KHz interrupt over serial, and here is the matching receiver which writes the binary logs to an SD card with some metadata acquired from an external RTC and IMU module.
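
On the receiving side, the buffering strategy from the list above looks roughly like this (a sketch, not the actual receiver; the buffer sizes and names are placeholders):

    /* Receiver-side sketch: a long circular buffer absorbs latency spikes
     * from the card's garbage collection; the writer drains it one 512-byte
     * page at a time and calls fflush() every few pages so a power loss
     * costs at most a few pages of data. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE    512
    #define BUFFER_PAGES 100                  /* long buffer amortizes GC pauses */
    #define BUFFER_SIZE  (PAGE_SIZE * BUFFER_PAGES)
    #define FLUSH_EVERY  8                    /* pages between fflush() calls */

    static uint8_t ring[BUFFER_SIZE];
    static volatile size_t head, tail;        /* head: receiver, tail: writer */

    /* Called from the receive path for every incoming byte. */
    static void ring_push(uint8_t b)
    {
        ring[head] = b;
        head = (head + 1) % BUFFER_SIZE;      /* overrun check omitted */
    }

    /* Called in the writer loop; writes one page whenever one is available. */
    static void ring_drain(FILE *log)
    {
        static unsigned pages_since_flush;
        size_t available = (head + BUFFER_SIZE - tail) % BUFFER_SIZE;
        if (available < PAGE_SIZE)
            return;
        fwrite(&ring[tail], 1, PAGE_SIZE, log);   /* tail is always page-aligned */
        tail = (tail + PAGE_SIZE) % BUFFER_SIZE;
        if (++pages_since_flush >= FLUSH_EVERY) {
            fflush(log);                      /* bound the loss on power failure */
            pages_since_flush = 0;
        }
    }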

'plotter'

I wrote plotter after failing to find a data plotting application capable of dealing with very large data sets. Mathematica lacks basic features such as zooming and panning (excuse me? not acceptable in 2018!), Matlab becomes very slow after a few million points, and Plotly and Excel do all sorts of horrible things after a couple hundred thousand points.


plotter uses a screen-space approach to drawing its graphs in order to scale to arbitrarily large data sets. Traces are first re-sampled along an evenly spaced grid (a sort of rudimentary acceleration structure). Then, at each column of the screen, y-coordinates are interpolated from the grid based on the trace-space x-coordinate of the column. Finally, lines (actually, rectangles) are drawn between the appropriate points in adjacent columns.

The screen-space approach allows performance to be independent of the number of data points; instead, it scales as O(w*n), where w is the screen width and n is the number of traces. It also guarantees that any lines drawn are at most two pixels wide, which allows for fast rectangle-based drawing routines instead of costly generalized line drawing routines (on consumer integrated graphics, the rectangles are several times faster than the corresponding lines). As a result, plotter is capable of plotting hundreds of millions of data points at 4K resolutions on modest hardware.
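
For the curious, the per-column loop amounts to something like the following (an illustrative sketch, not the actual plotter source; the framebuffer, struct layout, and names are placeholders):

    /* Screen-space drawing sketch: the per-frame cost is O(width * traces),
     * independent of how many samples each trace holds. */
    #include <stddef.h>
    #include <stdint.h>

    enum { SCREEN_W = 1920, SCREEN_H = 1080 };
    static uint8_t framebuffer[SCREEN_H][SCREEN_W];

    /* A trace re-sampled onto an evenly spaced grid: y[i] is the value at
     * x0 + i * dx (the "rudimentary acceleration structure"). */
    typedef struct { float x0, dx; const float *y; size_t n; } Trace;

    static float sample_trace(const Trace *t, float x)
    {
        float u = (x - t->x0) / t->dx;                /* fractional grid index */
        if (u <= 0.0f) return t->y[0];
        if (u >= (float)(t->n - 1)) return t->y[t->n - 1];
        size_t i = (size_t)u;
        float frac = u - (float)i;
        return t->y[i] * (1.0f - frac) + t->y[i + 1] * frac;
    }

    /* Fill a 1-pixel-wide column between two screen-space y values; this
     * stands in for a general (and slower) line-drawing routine. */
    static void fill_column(int col, int y_a, int y_b)
    {
        int lo = y_a < y_b ? y_a : y_b, hi = y_a < y_b ? y_b : y_a;
        if (lo < 0) lo = 0;
        if (hi >= SCREEN_H) hi = SCREEN_H - 1;
        for (int y = lo; y <= hi; y++) framebuffer[y][col] = 255;
    }

    static void draw_trace(const Trace *t, float vx0, float vx1,
                           float vy0, float vy1)
    {
        float prev = sample_trace(t, vx0);
        for (int col = 1; col < SCREEN_W; col++) {
            float x = vx0 + (vx1 - vx0) * col / (SCREEN_W - 1);
            float cur = sample_trace(t, x);
            int yp = (int)((prev - vy0) / (vy1 - vy0) * (SCREEN_H - 1));
            int yc = (int)((cur  - vy0) / (vy1 - vy0) * (SCREEN_H - 1));
            fill_column(col - 1, yp, yc);
            prev = cur;
        }
    }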

For the sake of generality, the current implementation loads CSV files and internally operates on floating-point numbers. There's a ton of performance to be gained by loading binary files and keeping a 32-bit x-coordinate and an 8-bit y-coordinate (which would lower memory usage to 5 bytes per point), but that comes at the expense of interoperability with other programs. The basic controls are:
  • Cursors:
    • Clicking places the current active cursor. Clicking on a trace toggles selection on that trace and puts the current active cursor there.
    • [S] switches active cursor (and allows you to place the second cursor on a freshly opened instance of the program). If visible, clicking on a cursor switches to it.
    • [C] clears all cursors.
  • Traces:
    • Clicking a trace toggles selection.
    • Clicking on the trace's name in the legend toggles selection. This is useful and necessary when multiple traces are on top of each other.
    • [H] hides selected traces, [G] hides all but the selected traces, and [F] shows all traces.
  • Navigation:
    • The usual actions: click and drag to zoom in on the selected box, scroll to zoom in centered around the cursor, middle click and drag to pan.
    • Ctrl-scroll and Shift-scroll zoom in on the x and y-axes only, centered around the cursor.
    • Placing the mouse over the x or y-axis labels and scrolling will zoom in on that axis only, centered around the center of the screen.
  • File loading:
    • plotter loads CSV's with floating point entries.
    • The number of entries in the first row of the input file is used to determine the number of channels. From there on, extra values in rows are ignored, and missing values at the end get copied from the previous row (a short loading sketch follows this list).
    • In Windows, drag a CSV onto the executable to open it. Note that this will cause the program to silently exit with no error information if the file is invalid.
  • Configuration:
    • plotter.txt contains the sample spacing (used to calculate derivatives and generate x-labels), the channel colors, and the channel names. If the config file is missing, all the traces will be black and the channel names will all be 'Test Trace'.
    • The program will crash if arial.ttf is not in the program directory.
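
For reference, the row-handling rules above amount to something like this (an illustrative sketch, not the actual loader; the channel limits and names are placeholders):

    /* CSV loading sketch: the first row fixes the channel count; later rows
     * are truncated if they have extra values and padded with the previous
     * row's values if they come up short. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_CHANNELS 64
    #define MAX_LINE     4096

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        FILE *f = fopen(argv[1], "r");
        if (!f) return 1;

        char line[MAX_LINE];
        float prev[MAX_CHANNELS] = {0};
        int channels = 0;

        while (fgets(line, sizeof line, f)) {
            float row[MAX_CHANNELS];
            int count = 0;
            for (char *tok = strtok(line, ",\n"); tok && count < MAX_CHANNELS;
                 tok = strtok(NULL, ",\n"))
                row[count++] = strtof(tok, NULL);

            if (channels == 0)
                channels = count;          /* first row sets the channel count */
            for (int i = count; i < channels; i++)
                row[i] = prev[i];          /* missing trailing values: copy previous row */
            /* values beyond 'channels' are simply ignored */

            memcpy(prev, row, channels * sizeof(float));
            /* ...append row[0..channels-1] to the per-trace arrays here... */
        }
        fclose(f);
        return 0;
    }
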
You can get a Windows binary here; source code will be uploaded once it is tweaked to work on Linux.

Sunday, January 14, 2018

Fun (?) with an AJA Cion

Long camera is long
It was Black Friday 2017 and I hadn't bought anything. Thankfully, the fine folks at LensAuthority were running a special on the AJA Cion, an oft-maligned CMV12000-based camera. It was real cheap, cheaper than a CMV12000 machine vision camera, and probably more ergonomic as it had the capability to record ProRes internally to proprietary or CFast 2.0 media at 60p.

General Impressions

You probably don't want this camera for general cinematography; the sensor is noisy enough that you will be spending a ton of time fighting the camera. For example, a simple interior shot of MITERS proved to be too much for the sensor, and MITERS is not exactly a high dynamic range scene. This is further compounded by the fact that the noise is heavily patterned; some rows are noisier than the others (not to be confused with FPN!), which is a lot more distracting than having white noise distributed over the image. And forget about available light shooting; the only ISO you get is 320 (500 and 800 are a joke, the sensor has little enough DR even with no gain applied). Folks can go on about 'great color science' and 'ready to edit codec' all they want, but it is hard to justify a $5K camera with barely 10 stops of dynamic range when the Ursa Mini 4.6K falls in the same price class and is so much better at everything.

The handling...is adequate. The fact that the menus don't show up on the monitoring outputs really puts a damper on operation, as the operator cannot see the settings while the camera is rigged up and shoulder mounted. Even on a tripod, using the click wheel to scroll through 30 menu entries is unpleasant, especially when you can only see one row of the menu at a time (come on AJA, give us a firmware update that fixes this!). Thankfully, operation is very slick when the camera is tethered through an Ethernet cable and operated through the browser interface - the embedded website is intuitive and, more importantly, doesn't seem to exhibit the inconsistent hangups and crashes that plague a ton of my other Ethernet controlled gadgets.

The real strength of this camera, in my opinion, is as a specialized, tethered camera. 4K120 raw is rather state of the art; no other "consumer" camera on the market can do this (RED and Kinefinity can do high framerate wavelet-compressed recording though). As for capturing the HFR output...

Raw Recording

...I'm not sure how I feel about the quad SDI-based output. On the one hand, SDI capture cards are readily available and well-standardized, and I would certainly take four BNCs over one CameraLink cable any day. On the other hand, multi-tap SDI capture is a mess right now (not all cards support combining their inputs out of the box), and RAW transport over SDI is basically a scam, with recorder vendors charging hundreds of dollars for the software licenses to enable RAW recording for each supported camera model.

Image from AJA's site
The only officially supported ways to record the 120p output are via a device called a 'Corvid Ultra' (some sort of $20K box that plugs in via Tesla-style PCIe HICs), or using an AJA Kona4, a $1995 quad 3G-SDI capture card. The Kona software (AJA Capture Room) has a preset for CION RAW (confusingly enough, the button in the software is not where the manual says it should be). This seems to set each tap to 2K60p, so presumably each frame from each tap encapsulates two consecutive quarter-frames of the full image. It should be possible to record the output as four uncompressed 2K60p Quicktime files on several third-party capture devices, then merge the frames in post; unfortunately, as of this writing there are no small 60p-capable SDI capture devices - all models available have an integrated monitor.

The officially recommended hardware to capture 120 FPS raw is absurd: start with an HP Z820 and a pair of LSI 9721s, each equipped with four Intel S3700 or six (!) Intel S3500 drives in RAID 0, then stripe the two RAID 0 volumes together (!!) in Windows to create one large virtual drive. Come on guys, even when the release notes were last updated (2015), NVMe drives and 3D NAND were a thing. I also don't understand the suggestion to use enterprise drives; clearly, if you are running octuple RAID 0 across two RAID cards, you've given up any hope for reliability. A single OCZ RD400 or 960 Pro (512GB or higher) can handle the throughput (even when the drive is nearly full), and a pair in RAID 0 should more than do it. Or, if you're feeling bleeding edge, a single Optane drive should be able to do it with unbelievable consistency.

For the sake of size, my recording box uses an i3-7100 and a single 512GB RD400 drive; the rather low-performance CPU seems to be OK for Capture Room (which uses a whole core to debayer the preview but otherwise doesn't consume much CPU power). 512GB was chosen as the minimum size needed to achieve the requisite worst-case write speeds, but offers an incredibly mediocre six minutes of recording at 120 FPS. It is important to note that Skylake/Z170 is the oldest "small" platform where the chipset PCIe ports are PCIe 3.0 - anything older and you risk degrading the recording drive's performance.

Capture Room

AJA Capture Room is...very good. This was unexpected, as I am used to very expensive scientific hardware shipping with LabVIEW or Java-based garbage that makes your computer feel like it's from 1999. At least on Windows, the UI doesn't feel as native as I'd like it to be, and sorting out the dozens of configuration options for the Kona4 requires reading a PDF manual, but I haven't had a crash, and more importantly, it can extract the full performance of the SSD. A lesser program would require 2x overhead to be able to run properly, but clearly someone at AJA actually cared about performance.

Image Quality

(or lack thereof)

Cinema DNG processed to taste in Capture One
The above test scene was shot on a 24mm Art at f/1.4 and processed as a still in Capture One. Exposure was 360 degrees (1/120) at 120 FPS. The primary defect that stands out is the noise in the background; the scene was processed with a 'flat' tone curve that boosted the shadows a substantial amount. The banding (which is caused by the read noise being spatially correlated, not "fixed pattern noise") is incredibly distracting; it is visible in the resulting video as bright lines scrolling vertically in the shadows. That being said, the colors look beautiful, and you could easily shoot this scene with some fill light and avoid the noise problem altogether.

The other problem is that Resolve doesn't process the RAWs nearly as well as C1 does, as it uses fast GPU implementations of pretty basic algorithms. For example, the only sharpening available is a simple unsharp mask; there is no attempt to intelligently detect structure in the image, so sharpening is unusable in the presence of noise. Is the solution to output JPGs from Capture One into Premiere Pro? Probably not; having a program which natively handles raw video is amazing, and so much less clunky than a cobbled-together stack of software.

Addendum: a bug appears!

Upon further inspection of the footage, it appears that 120p footage is only recorded as 60p by Capture Room. AJA insists it is because the computer isn't fast enough, but I think it is due to a bug in Capture Room - among other things, setting the buffer to 4K60p in AJA Control Panel instead of 2K60p results in the correct data rate, but Capture Room segfaults at exit and the resulting files are not usable.