huffman coding in deflate | Mr Bluyee's Blog

In deflate dynamic huffman blocks, the huffman tree is stored with the canonical type.

Canonical Huffman code

a canonical Huffman code is a particular type of Huffman code with unique properties which allow it to be described in a very compact manner. Rather than storing the structure of the code tree explicitly, canonical Huffman codes are ordered in such a way that it suffices to only store the lengths of the codewords, which reduces the overhead of the codebook.

Translate a normal huffman coding to the canonical huffman code

Suppose we have the following non-canonical codebook:

A = 11
B = 0
C = 101
D = 100

Here the letter A has been assigned 2 bits, B has 1 bit, and C and D both have 3 bits. To make the code a canonical Huffman code, the codes are renumbered. The bit lengths stay the same with the code book being sorted first by codeword length and secondly by alphabetical value of the letter:

B = 0
A = 11
C = 101
D = 100

Each of the existing codes are replaced with a new one of the same length, using the following algorithm:

The first symbol in the list gets assigned a codeword which is the same length as the symbol’s original codeword but all zeros. This will often be a single zero (‘0’).
Each subsequent symbol is assigned the next binary number in sequence, ensuring that following codes are always higher in value.
When you reach a longer codeword, then after incrementing, append zeros until the length of the new codeword is equal to the length of the old codeword. This can be thought of as a left shift.

By following these three rules, the canonical version of the code book produced will be:

B = 0
A = 10
C = 110
D = 111

store canonical huffman code efficiently

A = 10    (code value: 2 decimal, bits: 2)
B = 0     (code value: 0 decimal, bits: 1)
C = 110   (code value: 6 decimal, bits: 3)
D = 111   (code value: 7 decimal, bits: 3)

Assume the alphabets are start from A,B,C,D, …
This alphabets should be the predefined, and be known from both compression and decompression side. Thus, only the below info need to be stored:

1 2	Used symbol length： 4 Number of bits for each symbol: 2, 1, 3, 3

get the canonical huffman code from the stored info

Used symbol length： 4
Number of bits for each symbol: 2, 1, 3, 3

Get the used symbol info and the correspond bit length:
('A',2), ('B',1), ('C',3), ('D',3)

Sort it with bit length:

1 2	bit length: [1, 2, 3, 3] used symbol: [B, A, C, D]

Pseudocode to get the canonical huffman code:

code := 0
while more symbols do
    print symbol = code
    code := (code + 1) << ((bit length of the next symbol) − (current bit length))

output:
B = 0
A = 10
C = 110
D = 111

decompression with the canonical huffman code

For the decompression, we do not need to create the codes or the tree in order to decode canonical codes. All we need is the list of symbols in order and the count of symbols in each code length.

Since the canonical codes within a code length are sequential binary integers, you can simply do integer comparisons to see if the bits you have fall within that code range, and if it is, an integer subtraction to determine which symbol it is.

example code from https://github.com/madler/zlib/blob/master/contrib/puff/puff.c

#define MAXBITS 15              /* maximum bits in a code */

/*
 * Huffman code decoding tables.  count[1..MAXBITS] is the number of symbols of
 * each length, which for a canonical code are stepped through in order.
 * symbol[] are the symbol values in canonical order, where the number of
 * entries is the sum of the counts in count[].  The decoding process can be
 * seen in the function decode() below.
 */
struct huffman {
    short *count;       /* number of symbols of each length */
    short *symbol;      /* canonically ordered symbols */
};

/*
 * Decode a code from the stream s using huffman table h.  Return the symbol or
 * a negative value if there is an error.  If all of the lengths are zero, i.e.
 * an empty code, or if the code is incomplete and an invalid code is received,
 * then -10 is returned after reading MAXBITS bits.
 *
 * Format notes:
 *
 * - The codes as stored in the compressed data are bit-reversed relative to
 *   a simple integer ordering of codes of the same lengths.  Hence below the
 *   bits are pulled from the compressed data one at a time and used to
 *   build the code value reversed from what is in the stream in order to
 *   permit simple integer comparisons for decoding.  A table-based decoding
 *   scheme (as used in zlib) does not need to do this reversal.
 *
 * - The first code for the shortest length is all zeros.  Subsequent codes of
 *   the same length are simply integer increments of the previous code.  When
 *   moving up a length, a zero bit is appended to the code.  For a complete
 *   code, the last code of the longest length will be all ones.
 *
 * - Incomplete codes are handled by this decoder, since they are permitted
 *   in the deflate format.  See the format notes for fixed() and dynamic().
 */
local int decode(struct state *s, const struct huffman *h)
{
    int len;            /* current number of bits in code */
    int code;           /* len bits being decoded */
    int first;          /* first code of length len */
    int count;          /* number of codes of length len */
    int index;          /* index of first code of length len in symbol table */

    code = first = index = 0;
    for (len = 1; len <= MAXBITS; len++) {
        code |= bits(s, 1);             /* get next bit */
        count = h->count[len];
        if (code - count < first)       /* if length len, return symbol */
            return h->symbol[index + (code - first)];
        index += count;                 /* else update for next length */
        first += count;
        first <<= 1;
        code <<= 1;
    }
    return -10;                         /* ran out of codes */
}

Follow with the previous example, we will have:

B = 0
A = 10
C = 110
D = 111

bit length: [1, 2, 3, 3]
used symbol: [B, A, C, D]

MAXBITS = 3
h->count[MAXBITS + 1] = [0, 1, 1, 2]  (number of symbols of each length)
h->symbol = [B, A, C, D]

Assume the bit stream is:

A    B   C     D
b'10 b'0 b'110 b'111

code = first = index = 0
loop 1:
code = 0 | bits(s, 1) //code = 1
count = h->count[1] //count = 1
code - count = 0 // code - count == first
index += count //index = 1
first += count //first = 1
first <<= 1 //first = b'10
code <<= 1 //code = b'10

loop 2:
code = b'10 | bits(s, 1) //code = b'10
count = h->count[2] //count = 1
code - count = 1 //code - count < first
index + (code - first) //1 + 2 - 2 = 1
return h->symbol[1] //return A

generate the dynamic huffman code

analysis code from:
https://github.com/madler/zlib/blob/master/trees.c#L1093
https://github.com/madler/zlib/blob/develop/deflate.h#L333

There are 2 huffman table for the deflate dynamic huffman blocks,
one is the literal/length table, another is the distance table.

And we know that the LZ77 processed data will have the below format:

literal symbol (0-255),
.. , 
length (3-258), distance (1-32768),
.. ,
end of block symbol (256).

So, there has 2 kind of tally functions to calculate the symbol frequency when parsing each symbol within the block.

Related data structures and define:

#define LENGTH_CODES 29
/* number of length codes, not counting the special END_BLOCK code */

#define LITERALS  256
/* number of literal bytes 0..255 */

#define L_CODES (LITERALS+1+LENGTH_CODES)
/* number of Literal or Length codes, including the END_BLOCK code */

#define D_CODES   30
/* number of distance codes */

#define BL_CODES  19
/* number of codes used to transfer the bit lengths */

#define HEAP_SIZE (2*L_CODES+1)
/* maximum heap size */

/* Data structure describing a single value and its code string. */
typedef struct ct_data_s {
    union {
        ush  freq;       /* frequency count */
        ush  code;       /* bit string */
    } fc;
    union {
        ush  dad;        /* father node in Huffman tree */
        ush  len;        /* length of bit string */
    } dl;
} FAR ct_data;

#define Freq fc.freq
#define Code fc.code
#define Dad  dl.dad
#define Len  dl.len

struct static_tree_desc_s {
    const ct_data *static_tree;  /* static tree or NULL */
    const intf *extra_bits;      /* extra bits for each code or NULL */
    int     extra_base;          /* base index for extra_bits */
    int     elems;               /* max number of elements in the tree */
    int     max_length;          /* max bit length for the codes */
};

typedef struct tree_desc_s {
    ct_data *dyn_tree;           /* the dynamic tree */
    int     max_code;            /* largest code with non zero frequency */
    const static_tree_desc *stat_desc;  /* the corresponding static tree */
} FAR tree_desc;

typedef struct internal_state {
    ...
    struct ct_data_s dyn_ltree[HEAP_SIZE];   /* literal and length tree */
    struct ct_data_s dyn_dtree[2*D_CODES+1]; /* distance tree */
    struct ct_data_s bl_tree[2*BL_CODES+1];  /* Huffman tree for bit lengths */
    /* dyn_ltree is used for store the generated literal and length tree,
    * The elements in dyn_ltree[:L_CODES] are leaves, middle nodes get stored
    * after it. same as the dyn_dtree and bl_tree.
    */

    struct tree_desc_s l_desc;               /* desc. for literal tree */
    struct tree_desc_s d_desc;               /* desc. for distance tree */
    struct tree_desc_s bl_desc;              /* desc. for bit length tree */

    ush bl_count[MAX_BITS+1];
    /* number of codes at each bit length for an optimal tree */

    int heap[2*L_CODES+1];      /* heap used to build the Huffman trees */
    int heap_len;               /* number of elements in the heap */
    int heap_max;               /* element of largest frequency */
    /* The sons of heap[n] are heap[2*n] and heap[2*n+1]. heap[0] is not used.
     * The same heap array is used to build all trees.
     */

    uch depth[2*L_CODES+1];
    /* Depth of each subtree used as tie breaker for trees of equal frequency
     */
    ...
}

2 tally functions:

//if it is a literal symbol (0-255, 256):

_tr_tally_lit(s, c, flush)
     s->dyn_ltree[cc].Freq++;

//if it is a length & distance pair:

_tr_tally_dist(s, distance, length, flush)
    s->dyn_ltree[_length_code[len]+LITERALS+1].Freq++;
    s->dyn_dtree[d_code(dist)].Freq++;

Assume, we have A,B,C,D four symbols, and the frequency shown as below:

symbol index	symbol	frequency
0	A	4
1	B	5
2	C	1
3	D	2

The huffman table is created by the build_tree() function：

/* ===========================================================================
 * Construct one Huffman tree and assigns the code bit strings and lengths.
 * Update the total bit length for the current block.
 * IN assertion: the field freq is set for all tree elements.
 * OUT assertions: the fields len and code are set to the optimal bit length
 *     and corresponding code. The length opt_len is updated; static_len is
 *     also updated if stree is not null. The field max_code is set.
 */
local void build_tree(deflate_state *s, tree_desc *desc)

    ct_data *tree         = desc->dyn_tree;
    const ct_data *stree  = desc->stat_desc->static_tree;
    int elems             = desc->stat_desc->elems;
    int n, m;          /* iterate over heap elements */
    int max_code = -1; /* largest code with non zero frequency */
    int node;          /* new node being created */

    /* Construct the initial heap, with least frequent element in heap[SMALLEST].
     * The sons of heap[n] are heap[2*n] and heap[2*n + 1].
     * heap[0] is not used.
     * HEAP_SIZE = (2*L_CODES+1)
     */
    s->heap_len = 0, s->heap_max = HEAP_SIZE;

    /* Initial the heap, put all the elems which Freq is not 0 to the heap.
    * The elems with 0 Freq will be set the 0 to the Len.
    * The elements heap[heap_len/2 + 1 .. heap_len] are leaves of the tree.
    * for l_tree, elems = L_CODES = (256 + 1 + 29) = 286
    * for d_tree, elems = D_CODES = 30
    * for bl_tree, elems = BL_CODES = 19
    */

    for (n = 0; n < elems; n++) {
        if (tree[n].Freq != 0) {
            s->heap[++(s->heap_len)] = max_code = n;
            s->depth[n] = 0;
        } else {
            tree[n].Len = 0;
        }
    }

For the A, B, C, D four symbols, the heap size should be 2 * 4 + 1 = 9, elems = 4.
So, at the initial heap stage, we will get:

heap index	symbol	symbol frequency	symbol depth
0	not used	-	-
1	A	4	0
2	B	5	0
3	C	1	0
4	D	2	0

heap_len = 4
max_code = 3

/* The elements heap[heap_len/2 + 1 .. heap_len] are leaves of the tree,
 * establish sub-heaps of increasing lengths:
 */
for (n = s->heap_len/2; n >= 1; n--) pqdownheap(s, tree, n);

/* pqdownheap(deflate_state *s, ct_data *tree, int k) is used to 
 * restore the heap property by moving down the tree starting at node k,
 * exchanging a node with the smallest of its two sons if necessary, stopping
 * when the heap property is re-established (each father smaller than its
 * two sons).

We can build a tree with the initial heap:
          A(4)
      /         \
    B(5)      C(1)
  /
D(2)

So, the elements heap[3 (4 / 2 + 1)] = C, heap[4] = D are leaves of the tree.
And the for loop will do pqdownheap() for the heap[2] = B, heap[1] = A:

loop 1:
pqdownheap(s, tree, B):
          A(4)
      /         \
    D(2)      C(1)
  /
B(5)

loop 2:
pqdownheap(s, tree, A):
          C(1)
      /         \
    D(2)      A(4)
  /
B(5)

/* Construct the Huffman tree by repeatedly combining the least two
 * frequent nodes.
 */
node = elems;              /* next internal node of the tree */
do {
    pqremove(s, tree, n);  /* n = node of least frequency */
    /*
     * Remove the smallest element from the heap and recreate the heap with
     * one less element. Updates heap and heap_len.         
    */

    m = s->heap[SMALLEST]; /* m = node of next least frequency (SMALLEST = 1) */

    s->heap[--(s->heap_max)] = n; /* keep the nodes sorted by frequency */
    s->heap[--(s->heap_max)] = m;

    /* Create a new node father of n and m */
    tree[node].Freq = tree[n].Freq + tree[m].Freq;
    s->depth[node] = (uch)((s->depth[n] >= s->depth[m] ?
                            s->depth[n] : s->depth[m]) + 1);
    tree[n].Dad = tree[m].Dad = (ush)node;

    /* and insert the new node in the heap */
    s->heap[SMALLEST] = node++;
    pqdownheap(s, tree, SMALLEST);

} while (s->heap_len >= 2);

heap_len = 4
SMALLEST = 1
node = elems = 4

#define pqremove(s, tree, top) \
{\
    top = s->heap[SMALLEST]; \
    s->heap[SMALLEST] = s->heap[s->heap_len--]; \
    pqdownheap(s, tree, SMALLEST); \
}

loop 1:
pqremove() step 1:
n = C

pqremove() step 2:
s->heap_len = 3
          B(5)
      /         \
    D(2)      A(4)

pqremove() step 3:
          D(2)
      /         \
    B(5)      A(4)

m = D
heap[8] = C
heap[7] = D

Create a new node father of n and m:
tree[4] = C.freq + D.freq = 1 + 2 = 3

insert the new node in the heap:
          (3)
      /         \
    B(5)      A(4)

pqdownheap(s, tree, SMALLEST):
          (3)
      /         \
    B(5)      A(4)

heap index	symbol	symbol frequency	symbol dad
0	not used	-	-
1	(3)	3	-
2	B	5	-
3	A	4	-
4	-	-	-
5	-	-	-
6	-	-	-
7	D	2	(3)
8	C	1	(3)

loop 2:
pqremove() step 1:
n = (3)

pqremove() step 2:
s->heap_len = 2
          A(4)
      /
    B(5)

pqremove() step 3:
          A(4)
      /
    B(5)

m = A
heap[6] = (3)
heap[5] = A

Create a new node father of n and m:
tree[5] = (3).freq + A.freq = 3 + 4 = 7

insert the new node in the heap:
          (7)
      /
    B(5)

pqdownheap(s, tree, SMALLEST):
          B(5)
      /
    (7)

heap index	symbol	symbol frequency	symbol dad
0	not used	-	-
1	B	5	-
2	(7)	7	-
3	-	-	-
4	-	-	-
5	A	4	(7)
6	(3)	3	(7)
7	D	2	(3)
8	C	1	(3)

loop 3:
pqremove() step 1:
n = B

pqremove() step 2:
s->heap_len = 1
(7)

pqremove() step 3:
(7)

m = (7)
heap[4] = B
heap[3] = (7)

Create a new node father of n and m:
tree[6] = (7).freq + B.freq = 7 + 5 = 12

insert the new node in the heap:
(12)

pqdownheap(s, tree, SMALLEST):
(12)

heap index	symbol	symbol frequency	symbol dad
0	not used	-	-
1	(12)	12	-
2	-	-	-
3	(7)	7	(12)
4	B	5	(12)
5	A	4	(7)
6	(3)	3	(7)
7	D	2	(3)
8	C	1	(3)

1	s->heap[--(s->heap_max)] = s->heap[SMALLEST];

heap_max = 2

for now, we have the elements in heap:

heap index	symbol	symbol frequency	symbol dad
0	not used	-	-
1	-	-	-
2	(12)	12	-
3	(7)	7	(12)
4	B	5	(12)
5	A	4	(7)
6	(3)	3	(7)
7	D	2	(3)
8	C	1	(3)

for now, we have the elements in tree:

tree index	symbol	symbol frequency	symbol dad
0	A	4	(7)
1	B	5	(12)
2	C	1	(3)
3	D	2	(3)
4	(3)	3	(7)
5	(7)	7	(12)
6	(12)	12	-
7
8

/* At this point, the fields freq and dad are set. We can now
 * generate the bit lengths.
 */
gen_bitlen(s, (tree_desc *)desc);

The symbol bit length is generated by the gen_bitlen() function：
(below function not considered the overflow case, if overflow happened, adjust and recalculate)

/* In a first pass, compute the optimal bit lengths (which may
 * overflow in the case of the bit length tree).
 */
tree[s->heap[s->heap_max]].Len = 0; /* root of the heap */

for (h = s->heap_max + 1; h < HEAP_SIZE; h++) {
    n = s->heap[h];
    bits = tree[tree[n].Dad].Len + 1;
    if (bits > max_length) bits = max_length, overflow++;
    tree[n].Len = (ush)bits;
    /* We overwrite tree[n].Dad which is no longer needed */

    if (n > max_code) continue; /* not a leaf node */

    s->bl_count[bits]++;
    xbits = 0;
    if (n >= base) xbits = extra[n - base];
    f = tree[n].Freq;
    s->opt_len += (ulg)f * (unsigned)(bits + xbits);
    if (stree) s->static_len += (ulg)f * (unsigned)(stree[n].Len + xbits);
}

tree index	symbol	symbol frequency	symbol dad	symbol len
0	A	4	(7)	2
1	B	5	(12)	1
2	C	1	(3)	3
3	D	2	(3)	3
4	(3)	3	(7)	2
5	(7)	7	(12)	1
6	(12)	12	-	0
7
8

bl_count: (only calculate the leaves)

bit length	symbol count
1	1
2	1
3	2

/* The field len is now set, we can generate the bit codes */
gen_codes ((ct_data *)tree, max_code, s->bl_count):
    unsigned code = 0;  
    for (bits = 1; bits <= MAX_BITS; bits++) {
        code = (code + bl_count[bits - 1]) << 1;
        next_code[bits] = (ush)code;
    }

bit length	symbol count	code
1	1	b’0
2	1	b’10
3	2	b’110

for (n = 0;  n <= max_code; n++) {
    int len = tree[n].Len;
    if (len == 0) continue;
    /* Now reverse the bits */
    tree[n].Code = (ush)bi_reverse(next_code[len]++, len);

    Tracecv(tree != static_ltree, (stderr,"\nn %3d %c l %2d c %4x (%x) ",
        n, (isgraph(n) ? n : ' '), len, tree[n].Code, next_code[len] - 1));
}

Finally we get:

tree index	symbol	symbol frequency	symbol len	symbol code	reversed code
0	A	4	2	b’10	b’01
1	B	5	1	b’0	b’0
2	C	1	3	b’110	b’011
3	D	2	3	b’111	b’111

reference

https://en.wikipedia.org/wiki/Canonical_Huffman_code
https://github.com/madler/zlib/blob/master/contrib/puff/puff.c
https://github.com/madler/zlib/blob/develop/trees.c
https://stackoverflow.com/questions/29575309/decoding-huffman-file-from-canonical-form