DEFLATE is a lossless data compression algorithm, it’s a combination of the LZ77 algorithm and Huffman coding.

deflate blocks

deflate compressed data format consists a series of blocks.
Each block has a 3 bit header, they are independent with each other:

bit 0 BFINAL
bit 1-2 BTYPE

Note that the header bits do not necessarily begin on a byte boundary,
since a block does not necessarily occupy an integral number of bytes.

BFINAL is set if and only if this is the last block of the data set.

BTYPE specifies the block type:

b’00 no compression
b’01 compressed with fixed Huffman codes
b’10 compressed with dynamic Huffman codes
b’11 reserved (error)

no compression block (aka stored block)

no compression block contains the uncompressed original literal data.

type size description
BFINAL 1 bit b’* (0 or 1)
BTYPE 2 bits b’00
Reserved bits < 8 bits make sure the offset of LEN is byte aligned
LEN 2 bytes number of data bytes in the block
NLEN 2 bytes NLEN = 65535 - LEN, used for the number validation
Literal data LEN in bytes uncompressed data

no compression block has the limited block sise, due to the maxium value of LEN is 65535.

dynamic Huffman block

Dynamic Huffman block is compressed using a combination of the LZ77 algorithm and Huffman coding.
And the Huffman coding tree is dynamically generated at the compression stage, it’s stored within the block.

type size description
BFINAL 1 bit b’* (0 or 1)
BTYPE 2 bits b’10
HLIT 5 bits number of Literal/Length codes - 257 (257 - 286)
HDIST 5 bits number of Distance codes - 1 (1 - 32)
HCLEN 4 bits number of Code Length codes - 4 (4 - 19)
Code Length codes 3 bits * (HCLEN + 4) code lengths for the code length table
literal/length distance table * bits (related to the HLIT, HDIST and the code length table) 2 huffman tree stored
huffman encoded bit stream * bits end with EOB symbol (256)

Note: The huffman encoded bit stream has different bit order compared to other fields. (In the literal/length distance table, it is also the huffman encoded bit stream but for the literal/length distance code length info)

In the huffman encoded bit stream, it consist of sequences of symbols drawn from three conceptually distinct alphabets: either literal bytes, from the alphabet of byte values (0..255), or (length, backward distance) pairs.

The representation used in this bit stream limits distances to 32K bytes and lengths to 258 bytes, but does not limit the size of the block,
thus, the length is drawn from (3..258) and the distance is drawn from (1..32,768).

In fact, the literal and length alphabets are merged into a single alphabet (0..285),
where values 0..255 represent literal bytes, the value 256 indicates end-of-block.
And values 257..285 represent length codes (possibly in conjunction with extra bits following the symbol code) as follows:

code extra bits length code extra bits length code extra bits length
257 0 3 267 1 15,16 277 4 67-82
258 0 4 268 1 17,18 278 4 83-98
259 0 5 269 2 19-22 279 4 99-114
260 0 6 270 2 23-26 280 4 115-130
261 0 7 271 2 27-30 281 5 131-162
262 0 8 272 2 31-34 282 5 163-194
263 0 9 273 3 35-42 283 5 195-226
264 0 10 274 3 43-50 284 5 227-257
265 1 11,12 275 3 51-58 285 0 258
266 1 13,14 276 3 59-66

The distance code has another alphabet, the value is from 0 to 29:

code extra bits distance code extra bits distance code extra bits distance
0 0 1 10 4 33-48 20 9 1025-1536
1 0 2 11 4 49-64 21 9 1537-2048
2 0 3 12 5 65-96 22 10 2049-3072
3 0 4 13 5 97-128 23 10 3073-4096
4 1 5,6 14 6 129-192 24 11 4097-6144
5 1 7,8 15 6 193-256 25 11 6145-8192
6 2 9-12 16 7 257-384 26 12 8193-12288
7 2 13-16 17 7 385-512 27 12 12289-16384
8 3 17-24 18 8 513-768 28 13 16385-24576
9 3 25-32 19 8 769-1024 29 13 24577-32768

Generally, the data in the huffman encoded bit stream has the below format:

1
2
3
4
5
literal huffman code,
.. ,
length huffman code, (extra bits), distance huffman code, (extra bits),
.. ,
end of block huffman code.

And after the huffman decode, it translate to:

1
2
3
4
5
literal symbol (0-255),
.. ,
length symbol (257-285), (extra bits), distance symbol (0-29), (extra bits),
.. ,
end of block symbol (256).

Due to there has 2 alphabets: literal and length shared the same alphabet, distance used another alphabet.
There has 2 huffman coding trees, one is for the literal and length code, another is for the distance code. But they don’t get stored into the block directly.

Below steps show how to get these 2 huffman tables from the info stored within the block:

First thing is deflate also pre-defined a code length table:

index code length description
0 16 repeat last length for (3 + extra: 3-6) times, follow with extra info (2 bits)
1 17 repeat zero for (3 + extra: 3-10) times, follow with extra info (3 bits)
2 18 repeat zero for (11 + extra: 11-138) times, follow with extra info (7 bits)
3 0 literal/length & distance code length encoded to 0 bit
4 8 literal/length & distance code length encoded to 8 bits
5 7 literal/length & distance code length encoded to 7 bits
6 9 literal/length & distance code length encoded to 9 bits
7 6 literal/length & distance code length encoded to 6 bits
8 10 literal/length & distance code length encoded to 10 bits
9 5 literal/length & distance code length encoded to 5 bits
10 11 literal/length & distance code length encoded to 11 bits
11 4 literal/length & distance code length encoded to 4 bits
12 12 literal/length & distance code length encoded to 12 bits
13 3 literal/length & distance code length encoded to 3 bits
14 13 literal/length & distance code length encoded to 13 bits
15 2 literal/length & distance code length encoded to 2 bits
16 14 literal/length & distance code length encoded to 14 bits
17 1 literal/length & distance code length encoded to 1 bit
18 15 literal/length & distance code length encoded to 15 bits

4 bits HCLEN shows that the size of the stored code length info, the value is from 0 to 15,
and it indicates the number of the stored code length info, from 4 to 19.

For example, if HCLEN=12, then the first 16 code length info get stored with the block with 3-bit size each.
And from the index 16 to index 18, these code length info are just not stored, it means they are not used.
Also, if there has the 0 value with the 3-bits, that means the correspond code length is not used.

Let’s assume, these 16 code length info has the listed 3-bits value: [1,0,5,5,3,0,5,0,4,0,0,4,0,5,0,3]
Then, it tells us:

index code length description
0 16 code length code 16 huffman encoded length: 1 bit
1 17 code length code 17 not used
2 18 code length code 18 huffman encoded length: 5 bits
3 0 code length code 0 huffman encoded length: 5 bits
4 8 code length code 8 huffman encoded length: 3 bits
5 7 code length code 7 not used
6 9 code length code 9 huffman encoded length: 5 bits
7 6 code length code 6 not used
8 10 code length code 10 huffman encoded length: 4 bits
9 5 code length code 5 not used
10 11 code length code 11 not used
11 4 code length code 4 huffman encoded length: 4 bits
12 12 code length code 12 not used
13 3 code length code 3 huffman encoded length: 5 bits
14 13 code length code 13 not used
15 2 code length code 2 huffman encoded length: 3 bits
16 14 code length code 14 not used, value not stored
17 1 code length code 1 not used, value not stored
18 15 code length code 15 not used, value not stored

we can define the Huffman code for an alphabet just by giving the bit lengths of the codes for each symbol of the alphabet in order; this is sufficient to determine the actual codes.

For this example, let’s sort it as:

code length symbol encoded length
16 1 bit
2 3 bit
8 3 bit
4 4 bit
10 4 bit
0 5 bit
3 5 bit
9 5 bit
18 5 bit

Step 1: Count the number of codes for each code length.
Let bl_count[N] be the number of codes of length N, N >= 1.

code length symbol encoded length number of code of length
16 1 bit 1 symbol encoded to 1 bit
2 3 bit 2 symbol encoded to 3 bits
8 3 bit -
4 4 bit 2 symbol encoded to 4 bits
10 4 bit -
0 5 bit 4 symbol encoded to 5 bits
3 5 bit -
9 5 bit -
18 5 bit -

So, the bl_count has value [0, 1, 0, 2, 2, 4]

Step 2: Find the numerical value of the smallest code for each code length:

1
2
3
4
5
6
code = 0;
bl_count[0] = 0;
for (bits = 1; bits <= MAX_BITS; bits++) {
code = (code + bl_count[bits-1]) << 1;
next_code[bits] = code;
}
code length symbol encoded length number of code of length encoded code
16 1 bit bl_count[1] = 1 0 (b’0)
2 3 bit bl_count[3] = 2 4 (b’100)
8 3 bit -
4 4 bit bl_count[4] = 2 12 (b’1100)
10 4 bit -
0 5 bit bl_count[5] = 4 28 (b’11100)
3 5 bit -
9 5 bit -
18 5 bit -

Step 3: Assign numerical values to all codes, using consecutive values for all codes of the same length with the base values determined at step 2.

code length symbol encoded length number of code of length encoded code
16 1 bit bl_count[1] = 1 0 (b’0)
2 3 bit bl_count[3] = 2 4 (b’100)
8 3 bit - 5 (b’101)
4 4 bit bl_count[4] = 2 12 (b’1100)
10 4 bit - 13 (b’1101)
0 5 bit bl_count[5] = 4 28 (b’11100)
3 5 bit - 29 (b’11101)
9 5 bit - 30 (b’11110)
18 5 bit - 31 (b’11111)

for now, we have constructed the code length huffman tree as shown in below:

index code length description
0 16 code length code 16 encoded to 0 (b’0)
1 17 -
2 18 code length code 18 encoded to 31 (b’11111)
3 0 code length code 0 encoded to 28 (b’11100)
4 8 code length code 8 encoded to 5 (b’101)
5 7 -
6 9 code length code 9 encoded to 30 (b’11110)
7 6 -
8 10 code length code 10 encoded to 13 (b’1101)
9 5 -
10 11 -
11 4 code length code 4 encoded to 12 (b’1100)
12 12 -
13 3 code length code 3 encoded to 29 (b’11101)
14 13 -
15 2 code length code 2 encoded to 4 (b’100)
16 14 -
17 1 -
18 15 -

Note: Codes that are never used (which have a bit length of zero) must not be assigned a value.

This code length huffman tree is used to decode the following literal/length distance table.
And from this table, we will construct the literal/length huffman tree and the distance huffman tree.

We know that there has 0-285 value in the literal/length alphabet, and 0-29 value in the distance alphabet.
But not all of them may get used in the current dynamic huffman block, thus, we have the stored value 5 bits HLIT and the 5 bits HDIST to show the symbol usage info.

For example, if the HLIT value is 1, that means the first (1 + 257) value in the literal/length alphabet are get stored in the literal/length distance table, and the value 258 to 285 are not in the literal/length distance table. (for sure, they are not get used within the current block.)

If the HDIST value is 23, that means the first (23 + 1) value in the distance alphabet are get stored in the literal/length distance table. (after the stored literal/length symbol value), and the value 24 to 29 are not in the literal/length distance table. (for sure, they are not get used within the current block.)

Previously, we have construct the code length huffman tree, and it can be used to decode the literal/length distance table. Assume, below value is huffman decoded from the literal/length distance table:

1
2
3
4
5
5, 
0, (3), 0, (3), 0, (3), 0, (3), 0, (3), 0, (3), 0, (3), 0, (3), 0, (3), 0, (3), 0, (3), 0, (3),
0, (3), 0, (3), 0, (3), 0, (3), 0, (3), 0, (3), 0, (3), 0, (3), 0, (3), 0, (3), 0, (3), 0, (3),
5, 30, 5, 0, (3), 0, (3), 0, (3), 0, (3), 0, (3), 0, (3), 0, (3), 0, (3),
13, 13, 31, (6), 12, 12, 28, 4, 4, 4, 29

According this decoded value, we can get the below table:

index value from literal/length distance table symbol code length
0 5 8 bits
1-6 0, (3) 16, repeat previous length code 8 for 6 times
7-12 0, (3) 16, repeat previous length code 8 for 6 times
13-18 0, (3) 16, repeat previous length code 8 for 6 times
19-25 0, (3) 16, repeat previous length code 8 for 6 times
31-37 0, (3) 16, repeat previous length code 8 for 6 times
43-48 0, (3) 16, repeat previous length code 8 for 6 times
49-54 0, (3) 16, repeat previous length code 8 for 6 times
55-60 0, (3) 16, repeat previous length code 8 for 6 times
61-66 0, (3) 16, repeat previous length code 8 for 6 times
67-72 0, (3) 16, repeat previous length code 8 for 6 times
73-78 0, (3) 16, repeat previous length code 8 for 6 times
79-84 0, (3) 16, repeat previous length code 8 for 6 times
85-90 0, (3) 16, repeat previous length code 8 for 6 times
91-96 0, (3) 16, repeat previous length code 8 for 6 times
97-102 0, (3) 16, repeat previous length code 8 for 6 times
103-108 0, (3) 16, repeat previous length code 8 for 6 times
109-114 0, (3) 16, repeat previous length code 8 for 6 times
115-120 0, (3) 16, repeat previous length code 8 for 6 times
121-126 0, (3) 16, repeat previous length code 8 for 6 times
127-132 0, (3) 16, repeat previous length code 8 for 6 times
133-138 0, (3) 16, repeat previous length code 8 for 6 times
139-144 0, (3) 16, repeat previous length code 8 for 6 times
145-150 0, (3) 16, repeat previous length code 8 for 6 times
151-156 0, (3) 16, repeat previous length code 8 for 6 times
157-162 0, (3) 16, repeat previous length code 8 for 6 times
163-167 0, (3) 16, repeat previous length code 8 for 6 times
169-174 0, (3) 16, repeat previous length code 8 for 6 times
175-180 0, (3) 16, repeat previous length code 8 for 6 times
181-186 0, (3) 16, repeat previous length code 8 for 6 times
187-182 0, (3) 16, repeat previous length code 8 for 6 times
193-198 0, (3) 16, repeat previous length code 8 for 6 times
199-204 0, (3) 16, repeat previous length code 8 for 6 times
205 5 8 bits
206 30 9 bits
207 5 8 bits
208-213 0, (3) 16, repeat previous length code 8 for 6 times
214-219 0, (3) 16, repeat previous length code 8 for 6 times
220-225 0, (3) 16, repeat previous length code 8 for 6 times
226-231 0, (3) 16, repeat previous length code 8 for 6 times
232-237 0, (3) 16, repeat previous length code 8 for 6 times
238-243 0, (3) 16, repeat previous length code 8 for 6 times
244-249 0, (3) 16, repeat previous length code 8 for 6 times
250-255 0, (3) 16, repeat previous length code 8 for 6 times
256 13 10 bits
257 13 10 bits
index value from literal/length distance table symbol code length
0-16 31, (6) 18, repeat length code 0 for 17 times
17 12 4 bits
18 12 4 bits
19 28 0 bits
20 4 2 bits
21 4 2 bits
22 4 2 bits
23 29 3 bits

With the same method of the construct of the code length huffman tree, we can construct the literal/length huffman tree and distance huffman tree.

fixed Huffman block

For the fixed Huffman block, the Huffman codes for the two alphabets are fixed, and are not represented explicitly in the data:

type size description
BFINAL 1 bit b’* (0 or 1)
BTYPE 2 bits b’01
huffman encoded bit stream * bits end with EOB symbol (256)

The Huffman code lengths for the literal/length alphabet are:

index code length
0-143 8 bits
144-255 9 bits
256-279 7 bits
280-285 8 bits

The Huffman code lengths for the distance alphabet are:

index code length
0-29 5 bits

With these info, we can easily construct the 2 huffman trees.

reference

https://datatracker.ietf.org/doc/html/rfc1951