Data Compression: From Huffman Codes to the Shannon Limit
Formula
H = -\sum_i p_i \log_2 p_i \quad \text{(entropy, bits per symbol)}
\text{Compressed size} \geq N \cdot H \text{ bits}
\text{Compression ratio} = \frac{\text{Original size}}{\text{Compressed size}}
H \leq \bar{L} < H + 1 \quad \text{(single-symbol Huffman)}
H \leq \bar{L}_k < H + \frac{1}{k} \quad \text{(per-symbol rate, block-}k\text{ Huffman)}
FAQ
How does Huffman coding work?
Huffman coding assigns shorter binary codes to more frequent symbols and longer codes to rare symbols. The algorithm builds a binary tree bottom-up: repeatedly merge the two lowest-frequency nodes until one tree remains. Symbols at shallow leaves get short codes; deep leaves get long codes. The result is a prefix-free code — no codeword is a prefix of another — that minimizes the expected code length for the given frequency distribution.
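The bottom-up merge described above can be sketched in a few lines of Python. This is an illustrative implementation, not a production codec: it tracks each subtree as a symbol-to-code dictionary and prefixes a bit on every merge.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table by repeatedly merging the
    two lowest-frequency subtrees until one tree remains."""
    freq = Counter(text)
    # Heap entries: (frequency, tie_breaker, {symbol: code}).
    # The tie_breaker keeps tuple comparisons well-defined.
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate one-symbol source
        return {sym: "0" for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # Prefix the two merged subtrees' codes with 0 and 1.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
# 'a' (5 of 11 symbols) gets a shorter code than rare 'd'.
assert len(codes["a"]) < len(codes["d"])
# Prefix-free: no codeword is a prefix of another.
for s in codes:
    for t in codes:
        if s != t:
            assert not codes[t].startswith(codes[s])
```

The assertions at the end check the two properties the paragraph claims: frequent symbols get shorter codes, and the result is prefix-free.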
What is the Shannon source coding theorem?
Shannon's source coding theorem (noiseless coding theorem) states that a source with entropy H bits per symbol cannot be losslessly compressed below H bits per symbol on average. Conversely, it can be compressed to a rate arbitrarily close to H using sufficiently long block codes. This establishes entropy as the fundamental limit of lossless compression.
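A small worked example makes the limit concrete. The distribution below is chosen (by me, for illustration) so that every probability is a power of two, which is exactly the case where a prefix code can hit the entropy bound H with no slack:

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum(p * log2(p)), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A biased three-symbol source: P(a)=0.5, P(b)=0.25, P(c)=0.25.
H = entropy([0.5, 0.25, 0.25])  # 1.5 bits/symbol

# The prefix code a->0, b->10, c->11 has average length
# 0.5*1 + 0.25*2 + 0.25*2 = 1.5 bits, meeting H exactly.
L_bar = 0.5 * 1 + 0.25 * 2 + 0.25 * 2
assert abs(H - L_bar) < 1e-9
```

For probabilities that are not powers of two, a single-symbol code can waste up to one bit per symbol (the H + 1 bound); coding blocks of k symbols shrinks that overhead to 1/k, which is how rates arbitrarily close to H are approached.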
Why can't you compress random data?
Truly random data has maximum entropy: every symbol is equally likely and there are no patterns to exploit. Since entropy equals log₂(N) bits for N equally likely symbols, the entropy bound already equals the original size, leaving no room for compression. By the pigeonhole principle, any compression scheme that shortens some inputs must lengthen others. For uniform random data, no input is more likely than another, so no systematic compression is possible.
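This is easy to observe with a real compressor. The sketch below (using Python's standard-library `zlib`, a DEFLATE implementation) compresses highly patterned data and uniform random bytes of the same length:

```python
import os
import zlib

# Highly patterned input: 30,000 bytes of a repeating 3-byte motif.
patterned = b"abc" * 10_000
# Uniform random bytes of the same length.
random_data = os.urandom(30_000)

small = zlib.compress(patterned, level=9)
big = zlib.compress(random_data, level=9)

# The patterned input shrinks by orders of magnitude...
assert len(small) < len(patterned) // 100
# ...while random data is essentially incompressible (DEFLATE's
# headers can even make the output slightly larger than the input).
assert len(big) > int(len(random_data) * 0.99)
```

The same asymmetry is why compressing an already-compressed file (a ZIP, a JPEG) gains nothing: its output is already close to maximum entropy.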
What is the difference between lossless and lossy compression?
Lossless compression (Huffman, LZ77, DEFLATE) preserves every bit of the original data — the decompressed output is identical to the input. Lossy compression (JPEG, MP3, H.264) discards information deemed perceptually unimportant, achieving much higher compression ratios at the cost of fidelity. Shannon's source coding theorem governs lossless compression; rate-distortion theory governs lossy compression.