Update time! I added a very simple trick. It now takes about 0.4 milliseconds on the same test data (it varies between 350 and 400 microseconds, most commonly 390).
That works out to between 360 and 410 MB/s, usually 370 MB/s.
There's also a very easy way to make it faster: use a simpler way to encode length/distance pairs. But that would change the format a lot, and I intended to stay close to Deflate.
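To give an idea of what a "simpler" encoding could look like, here is a minimal sketch in Python. It is not the format used here, only an illustration: it assumes each length/distance pair is stored in three whole bytes instead of Deflate's Huffman codes plus extra bits. The names `encode_pair` and `decode_pair` are hypothetical. The ranges are Deflate's own (lengths 3 to 258, distances 1 to 32768), which happen to fit exactly in one byte and two bytes after subtracting the minimum.

```python
import struct

MIN_MATCH = 3          # shortest match length in Deflate
MAX_MATCH = 258        # longest match length in Deflate
MAX_DISTANCE = 32768   # Deflate's window size

def encode_pair(length, distance):
    """Pack one length/distance pair into 3 bytes (hypothetical layout)."""
    if not MIN_MATCH <= length <= MAX_MATCH:
        raise ValueError("length out of range")
    if not 1 <= distance <= MAX_DISTANCE:
        raise ValueError("distance out of range")
    # 1 byte: length - 3 (0..255); 2 bytes little-endian: distance - 1 (0..32767)
    return struct.pack("<BH", length - MIN_MATCH, distance - 1)

def decode_pair(buf, pos=0):
    """Read one pair at pos; return (length, distance, next position)."""
    length_code, distance_code = struct.unpack_from("<BH", buf, pos)
    return length_code + MIN_MATCH, distance_code + 1, pos + 3
```

Decoding a pair like this needs no bit shifting or table lookups, which is where the speed would come from. The trade-off is that every pair costs a fixed 24 bits, so the compression ratio would likely be worse than with Deflate's variable-length codes.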