deflate format
DEFLATE is a lossless data compression algorithm, it’s a combination of the LZ77 algorithm and Huffman coding.
deflate blocks
deflate compressed data format consists a series of blocks.
Each block has a 3 bit header, they are independent with each other:
bit 0 | BFINAL |
bit 1-2 | BTYPE |
Note that the header bits do not necessarily begin on a byte boundary,
since a block does not necessarily occupy an integral number of bytes.
BFINAL is set if and only if this is the last block of the data set.
BTYPE specifies the block type:
b’00 | no compression |
b’01 | compressed with fixed Huffman codes |
b’10 | compressed with dynamic Huffman codes |
b’11 | reserved (error) |
no compression block (aka stored block)
no compression block contains the uncompressed original literal data.
type | size | description |
---|---|---|
BFINAL | 1 bit | b’* (0 or 1) |
BTYPE | 2 bits | b’00 |
Reserved bits | < 8 bits | make sure the offset of LEN is byte aligned |
LEN | 2 bytes | number of data bytes in the block |
NLEN | 2 bytes | NLEN = 65535 - LEN, used for the number validation |
Literal data | LEN in bytes | uncompressed data |
no compression block has the limited block sise, due to the maxium value of LEN is 65535.
dynamic Huffman block
Dynamic Huffman block is compressed using a combination of the LZ77 algorithm and Huffman coding.
And the Huffman coding tree is dynamically generated at the compression stage, it’s stored within the block.
type | size | description |
---|---|---|
BFINAL | 1 bit | b’* (0 or 1) |
BTYPE | 2 bits | b’10 |
HLIT | 5 bits | number of Literal/Length codes - 257 (257 - 286) |
HDIST | 5 bits | number of Distance codes - 1 (1 - 32) |
HCLEN | 4 bits | number of Code Length codes - 4 (4 - 19) |
Code Length codes | 3 bits * (HCLEN + 4) | code lengths for the code length table |
literal/length distance table | * bits (related to the HLIT, HDIST and the code length table) | 2 huffman tree stored |
huffman encoded bit stream | * bits | end with EOB symbol (256) |
Note: The huffman encoded bit stream has different bit order compared to other fields. (In the literal/length distance table, it is also the huffman encoded bit stream but for the literal/length distance code length info)
In the huffman encoded bit stream, it consist of sequences of symbols drawn from three conceptually distinct alphabets: either literal bytes, from the alphabet of byte values (0..255), or (length, backward distance) pairs.
The representation used in this bit stream limits distances to 32K bytes and lengths to 258 bytes, but does not limit the size of the block,
thus, the length is drawn from (3..258) and the distance is drawn from (1..32,768).
In fact, the literal and length alphabets are merged into a single alphabet (0..285),
where values 0..255 represent literal bytes, the value 256 indicates end-of-block.
And values 257..285 represent length codes (possibly in conjunction with extra bits following the symbol code) as follows:
code | extra bits | length | code | extra bits | length | code | extra bits | length |
---|---|---|---|---|---|---|---|---|
257 | 0 | 3 | 267 | 1 | 15,16 | 277 | 4 | 67-82 |
258 | 0 | 4 | 268 | 1 | 17,18 | 278 | 4 | 83-98 |
259 | 0 | 5 | 269 | 2 | 19-22 | 279 | 4 | 99-114 |
260 | 0 | 6 | 270 | 2 | 23-26 | 280 | 4 | 115-130 |
261 | 0 | 7 | 271 | 2 | 27-30 | 281 | 5 | 131-162 |
262 | 0 | 8 | 272 | 2 | 31-34 | 282 | 5 | 163-194 |
263 | 0 | 9 | 273 | 3 | 35-42 | 283 | 5 | 195-226 |
264 | 0 | 10 | 274 | 3 | 43-50 | 284 | 5 | 227-257 |
265 | 1 | 11,12 | 275 | 3 | 51-58 | 285 | 0 | 258 |
266 | 1 | 13,14 | 276 | 3 | 59-66 |
The distance code has another alphabet, the value is from 0 to 29:
code | extra bits | distance | code | extra bits | distance | code | extra bits | distance |
---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 10 | 4 | 33-48 | 20 | 9 | 1025-1536 |
1 | 0 | 2 | 11 | 4 | 49-64 | 21 | 9 | 1537-2048 |
2 | 0 | 3 | 12 | 5 | 65-96 | 22 | 10 | 2049-3072 |
3 | 0 | 4 | 13 | 5 | 97-128 | 23 | 10 | 3073-4096 |
4 | 1 | 5,6 | 14 | 6 | 129-192 | 24 | 11 | 4097-6144 |
5 | 1 | 7,8 | 15 | 6 | 193-256 | 25 | 11 | 6145-8192 |
6 | 2 | 9-12 | 16 | 7 | 257-384 | 26 | 12 | 8193-12288 |
7 | 2 | 13-16 | 17 | 7 | 385-512 | 27 | 12 | 12289-16384 |
8 | 3 | 17-24 | 18 | 8 | 513-768 | 28 | 13 | 16385-24576 |
9 | 3 | 25-32 | 19 | 8 | 769-1024 | 29 | 13 | 24577-32768 |
Generally, the data in the huffman encoded bit stream has the below format:
1 | literal huffman code, |
And after the huffman decode, it translate to:
1 | literal symbol (0-255), |
Due to there has 2 alphabets: literal and length shared the same alphabet, distance used another alphabet.
There has 2 huffman coding trees, one is for the literal and length code, another is for the distance code. But they don’t get stored into the block directly.
Below steps show how to get these 2 huffman tables from the info stored within the block:
First thing is deflate also pre-defined a code length table:
index | code length | description |
---|---|---|
0 | 16 | repeat last length for (3 + extra: 3-6) times, follow with extra info (2 bits) |
1 | 17 | repeat zero for (3 + extra: 3-10) times, follow with extra info (3 bits) |
2 | 18 | repeat zero for (11 + extra: 11-138) times, follow with extra info (7 bits) |
3 | 0 | literal/length & distance code length encoded to 0 bit |
4 | 8 | literal/length & distance code length encoded to 8 bits |
5 | 7 | literal/length & distance code length encoded to 7 bits |
6 | 9 | literal/length & distance code length encoded to 9 bits |
7 | 6 | literal/length & distance code length encoded to 6 bits |
8 | 10 | literal/length & distance code length encoded to 10 bits |
9 | 5 | literal/length & distance code length encoded to 5 bits |
10 | 11 | literal/length & distance code length encoded to 11 bits |
11 | 4 | literal/length & distance code length encoded to 4 bits |
12 | 12 | literal/length & distance code length encoded to 12 bits |
13 | 3 | literal/length & distance code length encoded to 3 bits |
14 | 13 | literal/length & distance code length encoded to 13 bits |
15 | 2 | literal/length & distance code length encoded to 2 bits |
16 | 14 | literal/length & distance code length encoded to 14 bits |
17 | 1 | literal/length & distance code length encoded to 1 bit |
18 | 15 | literal/length & distance code length encoded to 15 bits |
4 bits HCLEN shows that the size of the stored code length info, the value is from 0 to 15,
and it indicates the number of the stored code length info, from 4 to 19.
For example, if HCLEN=12, then the first 16 code length info get stored with the block with 3-bit size each.
And from the index 16 to index 18, these code length info are just not stored, it means they are not used.
Also, if there has the 0 value with the 3-bits, that means the correspond code length is not used.
Let’s assume, these 16 code length info has the listed 3-bits value: [1,0,5,5,3,0,5,0,4,0,0,4,0,5,0,3]
Then, it tells us:
index | code length | description |
---|---|---|
0 | 16 | code length code 16 huffman encoded length: 1 bit |
1 | 17 | code length code 17 not used |
2 | 18 | code length code 18 huffman encoded length: 5 bits |
3 | 0 | code length code 0 huffman encoded length: 5 bits |
4 | 8 | code length code 8 huffman encoded length: 3 bits |
5 | 7 | code length code 7 not used |
6 | 9 | code length code 9 huffman encoded length: 5 bits |
7 | 6 | code length code 6 not used |
8 | 10 | code length code 10 huffman encoded length: 4 bits |
9 | 5 | code length code 5 not used |
10 | 11 | code length code 11 not used |
11 | 4 | code length code 4 huffman encoded length: 4 bits |
12 | 12 | code length code 12 not used |
13 | 3 | code length code 3 huffman encoded length: 5 bits |
14 | 13 | code length code 13 not used |
15 | 2 | code length code 2 huffman encoded length: 3 bits |
16 | 14 | code length code 14 not used, value not stored |
17 | 1 | code length code 1 not used, value not stored |
18 | 15 | code length code 15 not used, value not stored |
we can define the Huffman code for an alphabet just by giving the bit lengths of the codes for each symbol of the alphabet in order; this is sufficient to determine the actual codes.
For this example, let’s sort it as:
code length symbol | encoded length |
---|---|
16 | 1 bit |
2 | 3 bit |
8 | 3 bit |
4 | 4 bit |
10 | 4 bit |
0 | 5 bit |
3 | 5 bit |
9 | 5 bit |
18 | 5 bit |
Step 1: Count the number of codes for each code length.
Let bl_count[N] be the number of codes of length N, N >= 1.
code length symbol | encoded length | number of code of length |
---|---|---|
16 | 1 bit | 1 symbol encoded to 1 bit |
2 | 3 bit | 2 symbol encoded to 3 bits |
8 | 3 bit | - |
4 | 4 bit | 2 symbol encoded to 4 bits |
10 | 4 bit | - |
0 | 5 bit | 4 symbol encoded to 5 bits |
3 | 5 bit | - |
9 | 5 bit | - |
18 | 5 bit | - |
So, the bl_count has value [0, 1, 0, 2, 2, 4]
Step 2: Find the numerical value of the smallest code for each code length:
1 | code = 0; |
code length symbol | encoded length | number of code of length | encoded code |
---|---|---|---|
16 | 1 bit | bl_count[1] = 1 | 0 (b’0) |
2 | 3 bit | bl_count[3] = 2 | 4 (b’100) |
8 | 3 bit | - | |
4 | 4 bit | bl_count[4] = 2 | 12 (b’1100) |
10 | 4 bit | - | |
0 | 5 bit | bl_count[5] = 4 | 28 (b’11100) |
3 | 5 bit | - | |
9 | 5 bit | - | |
18 | 5 bit | - |
Step 3: Assign numerical values to all codes, using consecutive values for all codes of the same length with the base values determined at step 2.
code length symbol | encoded length | number of code of length | encoded code |
---|---|---|---|
16 | 1 bit | bl_count[1] = 1 | 0 (b’0) |
2 | 3 bit | bl_count[3] = 2 | 4 (b’100) |
8 | 3 bit | - | 5 (b’101) |
4 | 4 bit | bl_count[4] = 2 | 12 (b’1100) |
10 | 4 bit | - | 13 (b’1101) |
0 | 5 bit | bl_count[5] = 4 | 28 (b’11100) |
3 | 5 bit | - | 29 (b’11101) |
9 | 5 bit | - | 30 (b’11110) |
18 | 5 bit | - | 31 (b’11111) |
for now, we have constructed the code length huffman tree as shown in below:
index | code length | description |
---|---|---|
0 | 16 | code length code 16 encoded to 0 (b’0) |
1 | 17 | - |
2 | 18 | code length code 18 encoded to 31 (b’11111) |
3 | 0 | code length code 0 encoded to 28 (b’11100) |
4 | 8 | code length code 8 encoded to 5 (b’101) |
5 | 7 | - |
6 | 9 | code length code 9 encoded to 30 (b’11110) |
7 | 6 | - |
8 | 10 | code length code 10 encoded to 13 (b’1101) |
9 | 5 | - |
10 | 11 | - |
11 | 4 | code length code 4 encoded to 12 (b’1100) |
12 | 12 | - |
13 | 3 | code length code 3 encoded to 29 (b’11101) |
14 | 13 | - |
15 | 2 | code length code 2 encoded to 4 (b’100) |
16 | 14 | - |
17 | 1 | - |
18 | 15 | - |
Note: Codes that are never used (which have a bit length of zero) must not be assigned a value.
This code length huffman tree is used to decode the following literal/length distance table.
And from this table, we will construct the literal/length huffman tree and the distance huffman tree.
We know that there has 0-285 value in the literal/length alphabet, and 0-29 value in the distance alphabet.
But not all of them may get used in the current dynamic huffman block, thus, we have the stored value 5 bits HLIT and the 5 bits HDIST to show the symbol usage info.
For example, if the HLIT value is 1, that means the first (1 + 257) value in the literal/length alphabet are get stored in the literal/length distance table, and the value 258 to 285 are not in the literal/length distance table. (for sure, they are not get used within the current block.)
If the HDIST value is 23, that means the first (23 + 1) value in the distance alphabet are get stored in the literal/length distance table. (after the stored literal/length symbol value), and the value 24 to 29 are not in the literal/length distance table. (for sure, they are not get used within the current block.)
Previously, we have construct the code length huffman tree, and it can be used to decode the literal/length distance table. Assume, below value is huffman decoded from the literal/length distance table:
1 | 5, |
According this decoded value, we can get the below table:
index | value from literal/length distance table | symbol code length |
---|---|---|
0 | 5 | 8 bits |
1-6 | 0, (3) | 16, repeat previous length code 8 for 6 times |
7-12 | 0, (3) | 16, repeat previous length code 8 for 6 times |
13-18 | 0, (3) | 16, repeat previous length code 8 for 6 times |
19-25 | 0, (3) | 16, repeat previous length code 8 for 6 times |
31-37 | 0, (3) | 16, repeat previous length code 8 for 6 times |
43-48 | 0, (3) | 16, repeat previous length code 8 for 6 times |
49-54 | 0, (3) | 16, repeat previous length code 8 for 6 times |
55-60 | 0, (3) | 16, repeat previous length code 8 for 6 times |
61-66 | 0, (3) | 16, repeat previous length code 8 for 6 times |
67-72 | 0, (3) | 16, repeat previous length code 8 for 6 times |
73-78 | 0, (3) | 16, repeat previous length code 8 for 6 times |
79-84 | 0, (3) | 16, repeat previous length code 8 for 6 times |
85-90 | 0, (3) | 16, repeat previous length code 8 for 6 times |
91-96 | 0, (3) | 16, repeat previous length code 8 for 6 times |
97-102 | 0, (3) | 16, repeat previous length code 8 for 6 times |
103-108 | 0, (3) | 16, repeat previous length code 8 for 6 times |
109-114 | 0, (3) | 16, repeat previous length code 8 for 6 times |
115-120 | 0, (3) | 16, repeat previous length code 8 for 6 times |
121-126 | 0, (3) | 16, repeat previous length code 8 for 6 times |
127-132 | 0, (3) | 16, repeat previous length code 8 for 6 times |
133-138 | 0, (3) | 16, repeat previous length code 8 for 6 times |
139-144 | 0, (3) | 16, repeat previous length code 8 for 6 times |
145-150 | 0, (3) | 16, repeat previous length code 8 for 6 times |
151-156 | 0, (3) | 16, repeat previous length code 8 for 6 times |
157-162 | 0, (3) | 16, repeat previous length code 8 for 6 times |
163-167 | 0, (3) | 16, repeat previous length code 8 for 6 times |
169-174 | 0, (3) | 16, repeat previous length code 8 for 6 times |
175-180 | 0, (3) | 16, repeat previous length code 8 for 6 times |
181-186 | 0, (3) | 16, repeat previous length code 8 for 6 times |
187-182 | 0, (3) | 16, repeat previous length code 8 for 6 times |
193-198 | 0, (3) | 16, repeat previous length code 8 for 6 times |
199-204 | 0, (3) | 16, repeat previous length code 8 for 6 times |
205 | 5 | 8 bits |
206 | 30 | 9 bits |
207 | 5 | 8 bits |
208-213 | 0, (3) | 16, repeat previous length code 8 for 6 times |
214-219 | 0, (3) | 16, repeat previous length code 8 for 6 times |
220-225 | 0, (3) | 16, repeat previous length code 8 for 6 times |
226-231 | 0, (3) | 16, repeat previous length code 8 for 6 times |
232-237 | 0, (3) | 16, repeat previous length code 8 for 6 times |
238-243 | 0, (3) | 16, repeat previous length code 8 for 6 times |
244-249 | 0, (3) | 16, repeat previous length code 8 for 6 times |
250-255 | 0, (3) | 16, repeat previous length code 8 for 6 times |
256 | 13 | 10 bits |
257 | 13 | 10 bits |
index | value from literal/length distance table | symbol code length |
---|---|---|
0-16 | 31, (6) | 18, repeat length code 0 for 17 times |
17 | 12 | 4 bits |
18 | 12 | 4 bits |
19 | 28 | 0 bits |
20 | 4 | 2 bits |
21 | 4 | 2 bits |
22 | 4 | 2 bits |
23 | 29 | 3 bits |
With the same method of the construct of the code length huffman tree, we can construct the literal/length huffman tree and distance huffman tree.
fixed Huffman block
For the fixed Huffman block, the Huffman codes for the two alphabets are fixed, and are not represented explicitly in the data:
type | size | description |
---|---|---|
BFINAL | 1 bit | b’* (0 or 1) |
BTYPE | 2 bits | b’01 |
huffman encoded bit stream | * bits | end with EOB symbol (256) |
The Huffman code lengths for the literal/length alphabet are:
index | code length |
---|---|
0-143 | 8 bits |
144-255 | 9 bits |
256-279 | 7 bits |
280-285 | 8 bits |
The Huffman code lengths for the distance alphabet are:
index | code length |
---|---|
0-29 | 5 bits |
With these info, we can easily construct the 2 huffman trees.
reference
本文标题:deflate format
文章作者:Mr Bluyee
发布时间:2024-03-23
最后更新:2024-03-28
原始链接:https://www.mrbluyee.com/2024/03/23/deflate%20format/
版权声明:The author owns the copyright, please indicate the source reproduced.