Author Topic: A "new" compression format [subject to changes] (Read 11406 times)

harold · « **on:** November 09, 2013, 03:17:32 pm »

I was trying to write a Deflate decompressor, and that's perfectly doable, but occasionally annoying. The "new" format I'm suggesting is basically Deflate with some changes. Some to make it less annoying, others because "why not". The goal is mostly "improved decompression speed".

The main structure is the same. Matches in a sliding window, with literals and match-lengths (and End of Block marker) in the same alphabet, and Huffman coding.

But there are changes.
The main change is that the Huffman codes are packed in dwords (specifically to help decoding with a 32bit "window"), starting at the most significant bit (helps with decoding). Those dwords are then saved with little-endian byte-order (to help x86 - ARM can work this more easily than x86 could in the reverse situation). They're aligned to their natural boundary, so there may be a little padding after the header (the point is to read in dwords, aligning them makes that more efficient on some architectures and only costs 3 bytes at worst).
The Huffman codes are, of course, canonical, but a different kind of canonical than in Deflate: long codes must begin with zeroes. IE, using the rule "shorter codes have a numerically (if filled with zeros to the right) higher value than longer codes". This ensures that no more than 9 rightmost bits can ever differ from zero, which keeps your decoding tables small without trickery.
The maximum length for a Huffman code is 31, up from 15. (that usually won't help, I'm throwing that in "because why not")
Furthermore, the match-length symbols start at 256, with EndOfBlock assigned the code 288. This makes it slightly more convenient to index a table directly with the match-symbol.
All 32 match-length symbols are used, following the same pattern of "extra bits" as in Deflate, but extended. The last 4 match-length-symbols have 6 "extra bits".
The distance codes work like in Deflate64 (which is like Deflate, but with codes 30 and 31 being valid and getting 14 "extra bits").

The suggested way to decode this is to take two dwords, shift them left (with the bits of the second dword appearing in the lsb of the first dword) by some amount, count the leading zeroes (you can OR with 1 and use BSR since the maximum length of one symbol is 31 bits anyway), shift right by some amount depending on the number of leading zeroes, add some offset depending on the number of leading zeroes, then index a table with that.
Alternatively, take a (possibly unaligned) qword, rotate it by 32 bits, then shift left (normally), count leading zeroes, etc..

The header is changed to this:
if the first byte is 0, end of stream
otherwise, the first dword (little-endian) has some bitfields:
bit 0(lsb): 1 (constant) (to ensure the first byte is not zero)
bit 1: 0 if the block is compressed, 1 is the block is stored uncompressed
bit 2-31: length of this block
if block is stored: just the bytes, raw.
if block is compressed:
dword (little endian): length of this block when decompressed
uint16 (little endian): is_present, bit[n] (numbered from lsb up) indicates that the range of symbols [n * 16 .. n * 16 + 15] is part of the alphabet (1) or not (0)
uint16 (little endian): is_default, bit[n] (see above) indicates that the range (see above) has its own 16bit mask (0) or is completely present (1)
For every range that needs a mask (ie is present and not default), an uint16 (little endian) that indicates for every symbol whether it's part of the alphabet (1) or not(0)
Two more masks for the match-lengths (these two masks are always present)
Then, some bytes to store the code lengths:
The lowest 5 bits indicate the code length (a length of 0 may be coded - enables you to drop a subrange mask and save a byte sometimes), the upper 3 bits are a repeat-count. If the repeat-count is 0, the repeat count is 8 + next_byte.
[up to 3 bytes padding to align the next part]

None of those changes are really spectacular, but IMO they all contribute a little bit to make the format less annoying.

The part that I'm least sure about is the header. I'm also not sure whether allowing longer match distances is a mistake or not. Actually I'm not sure whether the whole deal for encoding matches is the right way to go or not. I couldn't come up with anything better, but it's a pain in the behind to parse. For most lengths `n`, parsing the match takes so much time that decoding `n` literals would have been faster.

As always, suggestions are welcome (that's why I'm posting this in the first place). Especially if you've written a Deflate decompressor before (otherwise the changes might seem random).

SpiroH · « **Reply #1 on:** November 09, 2013, 04:14:13 pm »

This sounds interesting although a bit too technical for many of the users. As you seem to be mostly interested in performance, i would suggest you carry out some benchmarking and then try to 'sell it' afterwards. Good luck with it.

harold · « **Reply #2 on:** November 09, 2013, 04:23:29 pm »

I'll benchmark it of course, but it'll be hard to filter out the contribution of the format. I already have an implementation mostly done.

harold · « **Reply #3 on:** November 12, 2013, 06:20:58 am »

Ok, here are the first benchmark results.
For the file alice29.txt from the Canterbury Corpus, using only Huffman compression and no matches (those don't quite work yet), the decompression time on my machine is (approximately and rounded upwards) 1.2 miliseconds.
That's purely the decompression time, ie not counting file IO and such.
The uncompressed file size is 152,089 bytes, the compressed file size is 87,784 bytes (it would be a lot smaller with matches, about 55kb, but they don't really work yet).
That gives a output-throughput of 120MB/s. For reference, 7zip decompression speed (as measured by its built-in benchmarker) is about 50MB/s for a single core on my machine (limiting to a single core because my code is currently single-threaded, but that's not a limitation due to the format).
I expect it to slow down when I enable matches, though. Long matches will be fast (as fast as memcpy, essentially), but the more common short matches have a lot of overhead.

It isn't the most relevant comparison of course, LZMA doesn't even use Huffman compression at all. But it was the easiest comparison to make, and it's something.

DJ Omnimaga · « **Reply #4 on:** November 12, 2013, 11:54:07 am »

I wonder if such compression would be fast enough for calcs?

harold · « **Reply #5 on:** November 12, 2013, 12:03:06 pm »

Depends. On ARM calcs, no problem. But decoding the Huffman codes will probably suck on the z80 calcs.

Streetwalrus · « **Reply #6 on:** November 12, 2013, 03:37:12 pm »

It sounds really cool. I hope you can finish it and it becomes widely used.

thepenguin77 · « **Reply #7 on:** November 12, 2013, 03:45:21 pm »

My fourVid used deflate to store the videos. That's how I was getting like 30% compressed video sizes. Some quick math tells me that it was decompressing at least 30 KB/s on the z80s. (Though it might be much higher, that's just the lower bound.)

What I find interesting about your method is that you are kind of abusing Huffman codes in a way that makes them no longer Huffman codes. I thought the idea was the huffman codes are shorter for the more common values, but yours all end up taking the same amount of space. This would of course greatly speed up decompression, but, it's not really huffman anymore

Wouldn't you be able to get better ratios if instead of using huffman, you simply made your lookup tables bigger and used all of the bits in your huffman fields?

harold · « **Reply #8 on:** November 12, 2013, 03:48:17 pm »

No they're Huffman codes, with longer ones and shorter ones.. the thing about the "9 rightmost bits" doesn't change that. That's just a property that those Huffman codes automatically have due to their construction and the size of the alphabet.

thepenguin77 · « **Reply #9 on:** November 12, 2013, 03:54:01 pm »

Oh, alright. I guess I misunderstood. Your algorithm is a lot more complex than what I had originally realized.

harold · « **Reply #10 on:** November 12, 2013, 03:56:02 pm »

I guess I'll do a quick example: suppose you have the data 4, 5, 5, 4, 0, 1, 2, 3, 4, 5, 5, EOS
The codelengths and codes assigned to those symbols will be
0: 3 001
1: 3 010
2: 4 0000
3: 4 0001
4: 2 10
5: 2 11
EOS: 3 011
So the message would be 10 11 11 10 001 010 0000 0001 10 11 11 011
That would be packed as 0xBE2806F6 or, as bytes, F6 06 28 BE (and a header of course but that one's simple)

harold · « **Reply #11 on:** November 23, 2013, 04:54:38 am »

Alright, new benchmark results.
With sliding-window back-matches enabled, the compressed file (still with the same input, alice29.txt from the Canterbury Corpus, which is 152,089 bytes of ascii text) is now 56,052 bytes (this is not optimal, I haven't implemented things like lazy matches (yet?) in the compressor). That's about 55kb, with a high quality Deflate compressor you can get 51kb - but like I said, no lazy matches.
More importantly, it takes about 0.5 miliseconds to decompress it (it varies a bit between 0.450 and 0.520, more often on the low side, but let's just round it up to 0.500).
That's 290MB/s of output-throughput, or if you'd rather measure input-throughput, 106MB/s (that's not a common measure afaik, but there you go). I really expected it to get slower, since the histogram of match lengths showed the vast majority of them to be tiny, and tiny matches have a lot of overhead per byte.

Will edit with code soon (it is at this point just test code though, not a usable product. Various paths are hardcoded. And the decompressor will only run on Haswell CPUs)

harold · « **Reply #12 on:** November 25, 2013, 01:26:56 pm »

Here's a version that works on older CPUs as well, probably a bit more useful
It's slightly slower, about 4 to 5%, which means I can still round it off to 0.5 miliseconds on the test file that I've been using so far. The difference is real and detectable though, not just noise.
Various paths and buffer sizes are still hardcoded, but it shouldn't be hard to get it to work.
The only changes are in the files decoder.asm and altdeflate_infl_ref.h

Eiyeron · « **Reply #13 on:** November 25, 2013, 01:54:21 pm »

Now make it work on calc! :p

Is the decompress or light enough for calc? If yes, will it give better results than PuCrunch?

harold · « **Reply #14 on:** November 25, 2013, 02:08:14 pm »

It would be significantly heavier than PuCrunch, and it wouldn't help for small data at all, due to the header.

Author Topic: A "new" compression format [subject to changes] (Read 11406 times)

harold

A "new" compression format [subject to changes]

SpiroH

Re: A "new" compression format [subject to changes]

harold

Re: A "new" compression format [subject to changes]

harold

Re: A "new" compression format [subject to changes]

DJ Omnimaga

Re: A "new" compression format [subject to changes]

harold

Re: A "new" compression format [subject to changes]

Streetwalrus

Re: A "new" compression format [subject to changes]

thepenguin77

Re: A "new" compression format [subject to changes]

harold

Re: A "new" compression format [subject to changes]

thepenguin77

Re: A "new" compression format [subject to changes]

harold

Re: A "new" compression format [subject to changes]

harold

Re: A "new" compression format [subject to changes]

harold

Re: A "new" compression format [subject to changes]

Eiyeron

Re: A "new" compression format [subject to changes]

harold

Re: A "new" compression format [subject to changes]