Thursday, March 29, 2018

Cineforming!

Intro

Last November, the excellent Cineform codec went open-source. Cineform is a high-quality intermediate codec in the same spirit as DNxHR and ProRes, with the notable distinction that it is based on wavelet, as opposed to DCT, compression.

Wavelets are great for editing: because the underlying transforms operate on the entire frame, wavelet codecs are free of the banding and blocking artifacts that DCT-based codecs suffer from when heavily manipulated. The best-known wavelet codec is probably RED's .R3D format, which holds up in post-production almost as well as uncompressed RAW.
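
To make that concrete, here is one level of a 1D Haar decomposition in C. This is purely an illustration of the principle (Cineform's actual filter bank is different): the pass runs across the whole signal, so a 2D version applied to full rows and columns never introduces block boundaries.

#include <stdint.h>

/* One level of a 1D Haar decomposition: the signal splits into
 * half-resolution averages (lowpass) and differences (highpass).
 * Applied across entire rows and columns of a frame, passes like
 * this never create the 8x8 block seams a DCT codec has. */
void haar_1d(const int16_t *in, int16_t *low, int16_t *high, int n)
{
    for (int i = 0; i < n / 2; i++) {
        low[i]  = (int16_t)((in[2*i] + in[2*i + 1]) >> 1);  /* average */
        high[i] = (int16_t)( in[2*i] - in[2*i + 1]);        /* difference */
    }
}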

Cineform has a few other cool tricks up its sleeve. Firstly, it is fast; the whole program is written using hand-tuned SSE2 intrinsics. It also supports RAW, which is convenient; encoded RAW files can be debayered during decoding into a large variety of RGB or YUV output formats, which helps maintain a simple workflow - any editor that supports Cineform can transparently load compressed RAW files.
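
For a taste of what the hand-tuned style looks like, here is a trivial SSE2 kernel of my own (not code from the Cineform source, whose kernels are considerably more involved) that averages eight 16-bit pixels per instruction:

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Average two rows of 16-bit pixels, eight at a time, using the
 * SSE2 PAVGW instruction. Illustrative only. */
void average_rows_sse2(const uint16_t *a, const uint16_t *b,
                       uint16_t *out, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_avg_epu16(va, vb));
    }
    for (; i < n; i++)  /* scalar tail, same rounding as PAVGW */
        out[i] = (uint16_t)((a[i] + b[i] + 1) >> 1);
}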

Benchmarks

I wanted to do some basic benchmarking on 12-bit 4K RAW files to get an idea of what kind of performance the encoder is capable of. All tests were done on a single core of an i7-4712HQ, which for all intents and purposes is a 3GHz Haswell core. As encoding is trivially parallelized (each core encodes one frame), the results should scale almost perfectly to many-core systems.
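
In other words, the multi-core strategy is hypothetically no more complicated than the sketch below, where frame_t and encode_frame are stand-ins of my own, not the SDK's actual API:

#include <stdint.h>

typedef struct { uint16_t *pixels; int width, height; } frame_t;  /* stand-in */
void encode_frame(const frame_t *f);  /* stand-in for the Cineform encode call */

/* Each frame is an independent unit of work, so frames can simply
 * be dealt out across cores (compile with -fopenmp or similar). */
void encode_clip(const frame_t *frames, int count)
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < count; i++)
        encode_frame(&frames[i]);
}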

The test image chosen was the famous 'Bliss' background from Windows XP:


As Bliss is only available in an 8-bit format, for the 12-bit encoding tests the bottom 4 bits were populated with noise (a worst-case scenario for the encoder). Frame rates were calculated by measuring the time it took to encode the same frame 10 times with a high-resolution timer. As the frames do not fit in L3, discrepancies caused by cached data should not be an issue.
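
Concretely, the methodology amounts to something like the following sketch (encode_frame again standing in for the real encode call):

#include <stdint.h>
#include <stdlib.h>
#include <time.h>

void encode_frame(const uint16_t *frame);  /* stand-in for the real call */

/* Promote an 8-bit sample to 12 bits, filling the bottom 4 bits with
 * noise: incompressible data, i.e. a worst case for the encoder. */
uint16_t promote_to_12bit(uint8_t v)
{
    return (uint16_t)((v << 4) | (rand() & 0xF));
}

/* Encode the same frame 10 times against a monotonic high-resolution
 * clock and report the effective single-core frame rate. */
double measure_fps(const uint16_t *frame)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 10; i++)
        encode_frame(frame);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return 10.0 / secs;
}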


Analysis

All four quality settings can fit a 4K 120 fps 12-bit stream under the bandwidth of a SATA 3.0 link. Furthermore, the data rates are under 350 MB/s, so there exist production SSDs that can sustain the requisite write speeds. Unfortunately, FILMSCAN1 and HIGH require pretty beefy processors (8+ cores) to sustain 120 fps; a 6c/12t 65W Coffee Lake is borderline even with HT (you don't get much headroom for running a preview, rearranging data, etc.). An 8700K (6c/12t, 95W) can handle it with room to spare, but at the expense of power consumption - 8700Ks actually draw more than 95W under heavy load. MEDIUM and LOW easily fit on a 65W processor. The upcoming Ice Lake (8c/16t, 10nm) processors should improve the situation, allowing 4K 120 fps to be compressed on a 65W processor at the highest quality setting.
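
For reference, the arithmetic behind the bandwidth claim (assuming a 4096 x 2160 frame):

4096 x 2160 px x 1.5 bytes/px (12-bit)  ~ 13.3 MB per uncompressed frame
13.3 MB/frame x 120 fps                 ~ 1.6 GB/s uncompressed
SATA 3.0 (6 Gb/s, 8b/10b encoded)       ~ 600 MB/s usable

So a stream that stays under 350 MB/s fits the link comfortably, and even FILMSCAN1 is getting better than a 4:1 reduction over the raw sensor data.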

Going beyond, 4K 240 fps seems within reach. Using existing (Q2 '18) hardware, LOW and MEDIUM are borderline for a hotly clocked 8700K, with the likelihood of consistent performance increasing if data reordering and preview generation are offloaded. Moving to more exotic hardware, the largest Skylake Xeon-D processors (D-2183IT, D-2187NT, and D-2191) should be capable of compressing HIGH in real time, if not at 240 fps then almost certainly at 200 (a lot will depend on thermals, implementation, HT efficiency, and scaling, especially since Xeon-D is very much a constant-current, not constant-performance, processor).

Anything faster than 4K 240 fps (e.g. a full implementation of the CMV12000, which can do 4K 420 fps) will require some kind of tethered server with at least a 24c Epyc or 18c Xeon-SP processor (and the obvious winner here is Epyc, which is much cheaper than the Xeon).

Quick Update: A Faster Processor

Running a simple test on an aggressively tuned processor (8700K @ 4.9 GHz), we get FILMSCAN1 at 25.5 fps, HIGH at 28.9 fps, MEDIUM at 39.6 fps, and LOW at 50.6 fps. 4.9 GHz is a little beyond the guaranteed frequency range of an 8700K (they can all do 4.7 GHz, which is the max single-core turbo applied to all cores), but practically all samples can do it anyway. This suggests a neat rule of thumb: LOW is good for twice the frame rate of FILMSCAN1 (50.6 / 25.5 ≈ 2), both in data rate and compression speed.

Addendum: Cineform's Packed 12-bit RAW Format

I have never seen such an esoteric way to pack 12-bit pixels (and after spending many hours trying to figure it out, I now understand why the poor guy who had to crack the ADFGVX cipher became physically ill while doing it).

The data is packed in rows of most-significant bytes interleaved with rows of least-significant nibbles (two to a byte). Specifically, two rows of MSBs (each IMAGE_WIDTH bytes long) are packed, followed by one full-width row (also IMAGE_WIDTH bytes long) containing the least-significant nibbles of the previous two image rows.

To add to the confusion, the rows are packed as R R R ... R G G G ... G or G G G ... G B B B ... B (depending on which row of the Bayer filter the data is from); in other words, the even-column data is packed in a half row, followed by the odd-column data. This results in a final format like so:

R R R ... R G G G ... G     (MSBs of image row N)
G G G ... G B B B ... B     (MSBs of image row N+1)
LSN LSN LSN ... LSN         (least-significant nibbles of both rows)

I am not sure why the data is packed like this (for all I know it's not, and there is a bug in my packing code...), but I suspect it is for some kind of SSE2 efficiency reason. I also haven't deciphered how the least-significant nibbles are packed (there is no easy way to inspect 12-bit image data), but hopefully it is similar to the most-significant bytes... A sketch of my current understanding is below.
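
For what it's worth, here is the layout as a C sketch. The MSB half follows the description above; the LSN half is an outright guess (mirroring the even/odd split and row order of the MSB rows), per the caveat above.

#include <stdint.h>

#define IMAGE_WIDTH 4096  /* pixels per row; also bytes per packed row */

/* Pack two rows of 12-bit Bayer samples (row0 = R/G, row1 = G/B) into
 * three IMAGE_WIDTH-byte packed rows: MSBs of row0, MSBs of row1, then
 * the least-significant nibbles of both. Each MSB row holds the
 * even-column samples followed by the odd-column samples. The LSN
 * ordering is an assumption, not verified against the codec. */
void pack_row_pair(const uint16_t *row0, const uint16_t *row1, uint8_t *out)
{
    const int half = IMAGE_WIDTH / 2;
    uint8_t *msb0 = out;                    /* MSBs, image row N   */
    uint8_t *msb1 = out + IMAGE_WIDTH;      /* MSBs, image row N+1 */
    uint8_t *lsn  = out + 2 * IMAGE_WIDTH;  /* nibbles, both rows  */

    for (int i = 0; i < half; i++) {
        /* Most-significant bytes: even columns, then odd columns. */
        msb0[i]        = (uint8_t)(row0[2*i]     >> 4);
        msb0[half + i] = (uint8_t)(row0[2*i + 1] >> 4);
        msb1[i]        = (uint8_t)(row1[2*i]     >> 4);
        msb1[half + i] = (uint8_t)(row1[2*i + 1] >> 4);
    }

    /* Least-significant nibbles, two to a byte; this ordering (row N
     * first, adjacent even/odd pairs sharing a byte) is a guess. */
    for (int i = 0; i < half; i++) {
        lsn[i]        = (uint8_t)(((row0[2*i] & 0xF) << 4) | (row0[2*i + 1] & 0xF));
        lsn[half + i] = (uint8_t)(((row1[2*i] & 0xF) << 4) | (row1[2*i + 1] & 0xF));
    }
}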