Awesome job! This topic reminded me of a video player I was writing back in 2009 but never finished. My video codec only supports monochrome videos (only dithering, no grayscale), though, so it's not really the same as yours. With the way it works, it would probably be difficult to adapt to grayscale too.
I wrote an encoder and decoder that run on my computer, but I designed the compression format to make it fast and easy to decode on a 6MHz Z80 (I was targeting my trusty ol' TI-86). The GIF animation I attached shows the output of the decoder (after scaling and stretching it to the same aspect ratio as the TI-86 screen and converting to GIF). The video is 4:36 long and runs at about 8 fps. The compressed video is 577 KB long, so it's about 260 bytes per frame, or 2KB/s. The source is Red Hot Chili Pepper's "Can't Stop" music video.
I think I used FFMPEG to dump the video to individual images (I can't remember what format it dumps to, probably JPEG), eventually converted them to Portable Pixmap (PPM) format, scaled and cropped them, dithered them, and then concatenated them into one file, for my encoder to read. These steps are all fairly easy to do in batch with the NetPBM utilities.
Here's how the decoder works: the input file contains a list of 8x8 pixel tiles. The decoder saves these for later. Each frame consists of 16x8 tiles (128x64 screen / 8x8) running from top to bottom, left to right (column-major order). For each tile, the decoder reads a byte. If the byte is less than 128, then it is an index into the list of tiles. If the byte is 255, then the decoder reads in 8 more bytes and uses those as a "literal" tile. Whichever tile it uses (either from the list or a "literal" tile), the decoder XOR's the tile onto the previous frame at the current tile location. I chose to use XOR so the video can be played backwards (eg, rewinding) easily. To play a video backwards, the decoder XOR's the tiles exactly the same way.
Besides 0-127 and 255, there are several other byte values that are special too. The values between 128 and 143 mean "rotate the tile in the frame". By rotate I mean rotate each byte left or right, and rotate the bytes up or down. This is kind of like shifting the tile horizontally and vertically, but since I'm rotating the tiles, it's reversible (again, so the video can be played backwards). Bits 0 and 1 indicate how many pixels to rotate horizontally (-2, -1, 1, or 2), and bits 2 and 3 are for the vertical shift amount (same as the horizontal). Like the tile XOR operations, tile rotations operate on the tile in the previous frame. To play a video backward, the decoder has to rotate each tile in the opposite direction.
This leaves byte values 144 through 254 for other reversible operations. These are currently unused in my decoder but might be used in the future.
Here's the overall byte format:
video header:
1 byte=n (number of tiles)
8*n bytes=tile data
frame:
1 byte=frame type (most are 2 for "cumulative", but other values are possible, eg, 0 for end of stream)
128 tiles
tile:
1 byte=transformation type
8 bytes=literal tile (if type=255)
I haven't done this yet, but at the end of each frame I should probably write a 2-byte value that says how large that frame was, so the decoder can skip to the beginning of each frame when playing a video backwards.
One other idea I was playing with is to losslessly compress each lossy-compressed frame, but initial results show larger file sizes when this is done. It could be that the RLE compression method that I used on the frame data just isn't suitable for that type of data (I think it was basically the RLE used by the PCX format). This is an area I'll investigate further.
Based on this description of the decoder, it should be obvious how the encoder works, right?
It basically just has to come up with a good set of tiles to pass to the decoder, and then for each tile within each frame, decide which tile adjustment (using a tile index, literal tile, or other transformation) results in the "best-looking" tile in the output. "Best looking" is subjective, so the encoder really determines the quality/size tradeoff (up to the hard limits of the format, which is currently 129 bytes/frame minimum).
Edit: tiles are actually stored in column-major order, not row-major order. That should make decoding slightly faster than row-major order.