Monday, February 25, 2013

h.264... @#$%!#$

I've been playing around with video encoding lately, focusing on the algorithmic design that goes into something so complex. If you're not familiar with video encoding, consider the following scenario: a 5-second video clip at 30 frames per second, where each frame is 640x480 and every pixel takes 24 bits. 5 seconds of uncompressed video? You're looking at 138,240,000 bytes (5 * 30 * 640 * 480 * 3 bytes), or roughly 138 MB. Despite the size of hard drives these days, that's not sustainable.
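The arithmetic above can be checked with a few lines of Python. This is only a sketch; the function name raw_video_bytes and its parameter names are made up for illustration.

```python
def raw_video_bytes(seconds, fps, width, height, bits_per_pixel):
    """Size in bytes of uncompressed video with the given parameters."""
    frames = seconds * fps
    bytes_per_frame = width * height * bits_per_pixel // 8
    return frames * bytes_per_frame


# The 5-second, 30 fps, 640x480, 24-bit clip from the scenario above.
size = raw_video_bytes(seconds=5, fps=30, width=640, height=480, bits_per_pixel=24)
print(f"{size:,} bytes")  # 138,240,000 bytes
```

Scale the clip up to a full minute and the same formula gives about 1.66 GB, which is why nobody stores video this way.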

There are a number of video file formats out there - .avi, .wmv, .mpg, .mov (QuickTime), .mp4. These are mostly containers; h.264 is the encoding format for the video stream inside, and can fit in a .mov, .mp4, .m4v or other MPEG-4 file. And for better or worse, this is the encoding that I'm currently focusing on.

If you want to write your own h.264 encoder, you need to read the spec. However, the spec is written in spec language, a really obscure dialect of English where practically nothing makes sense. So, if you're writing an encoder, here is the spec, made free to download.

There are a number of web sites and pages you can scour for information about h.264. A few keywords that might be useful: CAVLC, CABAC, intra-prediction, inter-prediction, quantization, DCT, I-frames, B-frames, P-frames.

One interesting website with a very basic look at h.264: a blog entry at cardinalpeak.com that illustrates a basic encoder for 128x96 video using all I-frames, with no compression or prediction. It's a good start towards understanding h.264, but while working through it I found it lacking descriptions of the following:
-h.264 is a bitstream format. If you do a lot of work on computer architecture, you end up thinking in terms of endianness, bytes, words, and double words. In a bitstream format, every table and field sequence in the spec must be read and written as a sequence of bits, with only a little padding to round the sequence out to a whole byte. When encoding headers, it may help to think of each header as a stream of 0's and 1's running from left to right, most significant bit first. If the next byte to be read is 0x80 (binary 10000000), the next bit read from the bitstream is 1, not 0.

-Exponential Golomb codes are used to encode variable-length fields, and you must encode them correctly for your stream to be decodable. There really isn't a workaround for this; it just has to be done. I've not compared a Golomb lookup table against an algorithmic transformation, but there is value in understanding algorithmically how an exponential Golomb calculator works. The Wikipedia entry is a pretty good start. If you're looking for code, something like this should work (in pseudo-code):

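function eGolombCalc(num) {
   // num is the non-negative value to encode
   finalBits = -1;          // total length of the code, in bits
   returnValue = num + 1;   // the non-zero portion of the code
   num++;
   // every bit of num + 1 adds two bits to the total; starting at -1
   // accounts for the top bit, which has no matching leading zero
   while (num) {
     finalBits += 2;
     num = num >> 1;
   }

   // write returnValue as a field finalBits wide; the upper bits are the leading zeroes
   return finalBits, returnValue;
}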

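You'll always end up with an odd number of total bits in an exponential Golomb code: if the original value + 1 takes k bits, the code is k - 1 leading zeroes followed by those k bits, for 2k - 1 bits in all. Since nearly half of the bits are leading zeroes, it helps to keep track of the total number of bits in the resulting code. The non-zero portion of the sequence is simply the original value + 1. For example, 3 becomes 00100: two leading zeroes, then 100 (binary for 4).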
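To see the bit ordering and the Golomb codes working together, here is a rough Python sketch of a bit writer. BitWriter, write_bits, write_ue and to_bytes are names I made up for this example, not anything from the spec or the cardinalpeak.com code, and the zero padding in to_bytes is a simplification; the spec defines its own trailing-bit rules, which this does not implement.

```python
class BitWriter:
    """Collects bits most-significant-bit first, as a bitstream format expects."""

    def __init__(self):
        self.bits = []  # one int (0 or 1) per bit, in stream order

    def write_bits(self, value, count):
        """Append the low `count` bits of `value`, highest bit first."""
        for shift in range(count - 1, -1, -1):
            self.bits.append((value >> shift) & 1)

    def write_ue(self, num):
        """Append the unsigned exponential Golomb code for a non-negative int."""
        code = num + 1
        length = code.bit_length()      # bits in the non-zero portion
        self.write_bits(0, length - 1)  # leading zeroes
        self.write_bits(code, length)   # original value + 1

    def to_bytes(self):
        """Pack the bits into bytes, zero-padding the last byte (simplified)."""
        padded = self.bits + [0] * (-len(self.bits) % 8)
        out = bytearray()
        for i in range(0, len(padded), 8):
            byte = 0
            for bit in padded[i:i + 8]:
                byte = (byte << 1) | bit
            out.append(byte)
        return bytes(out)


writer = BitWriter()
for n in (0, 1, 2, 3):
    writer.write_ue(n)

print("".join(map(str, writer.bits)))  # 101001100100 (1, 010, 011, 00100)
print(writer.to_bytes().hex())         # a640
```

Notice that the four codes do not line up with byte boundaries at all: the first byte, 0xa6, holds all of the first three codes plus the start of the fourth. That is the part that trips you up if you're used to thinking in bytes and words.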

So, there's a start towards a full h.264 encoder. Building on the website links and other information here, my best advice for going further is to look at the following (in order): sequence parameter set (SPS) header, picture parameter set (PPS) header, slice header, macroblock header, intra-prediction, CAVLC, inter-prediction, CABAC, quantization, and other topics. I may touch upon these topics in the future.
