Data Compression: From Huffman Codes to the Shannon Limit

simulation intermediate ~12 min

Formula

\text{Compressed size} \geq N \cdot H \text{ bits}
\text{Compression ratio} = \frac{\text{Original size}}{\text{Compressed size}}
H \leq \bar{L} < H + 1 \quad \text{(single-symbol Huffman)}
\frac{H}{k} \leq \bar{L}_k < \frac{H}{k} + \frac{1}{k} \quad \text{(block-}k\text{ Huffman)}
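The bounds above can be checked numerically. A minimal sketch, assuming a hypothetical four-symbol source with probabilities 1/2, 1/4, 1/8, 1/8 (chosen so the entropy comes out exact):

```python
import math

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical skewed source: four symbols with probabilities 1/2, 1/4, 1/8, 1/8.
probs = [0.5, 0.25, 0.125, 0.125]
H = entropy(probs)  # 1.75 bits/symbol
N = 1000            # message length in symbols

print(H)      # → 1.75
print(N * H)  # lower bound on compressed size: 1750.0 bits
```

For this distribution the probabilities are exact powers of two, so a Huffman code meets the lower bound H exactly; for general distributions the average length falls somewhere in [H, H + 1).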
Every time you send a photo, stream a video, or download a file, compression algorithms are silently at work — reducing data to its essential information content. The theoretical foundation for all of this is Shannon's source coding theorem. The theorem states a remarkable fact: for any source producing symbols with entropy H bits per symbol, it is impossible to losslessly compress the data below N·H total bits (where N is the message length), and it is possible to get arbitrarily close to this limit. Entropy is therefore the ultimate compression benchmark.

Huffman coding, invented by David Huffman in 1952 while a graduate student at MIT, provides a practical algorithm that approaches this limit. The key idea is variable-length coding: assign short binary codewords to common symbols and long codewords to rare symbols. The algorithm constructs an optimal prefix-free code by building a binary tree from the bottom up, merging the two least-probable symbols at each step.

This simulator generates messages from four different source types with vastly different entropy profiles. English-like text has skewed letter frequencies (E is 13%, Z is 0.07%), giving entropy around 4.0 bits — well below the 4.7-bit maximum for 26 letters. Random text is nearly incompressible. Repetitive text has very low entropy and compresses dramatically. DNA sequences, with four roughly equal bases, sit at about 2 bits per base.

Block coding extends the principle by treating groups of symbols as single units. Since real sources have inter-symbol dependencies (in English, 'q' is almost always followed by 'u'), coding blocks of symbols together captures these correlations and pushes compression closer to the true entropy rate.
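The block-coding effect can be seen empirically. A sketch, assuming a toy correlated source where 'b' almost always follows 'a': the single-symbol entropy stays near 1 bit, while coding non-overlapping pairs exposes the predictability.

```python
import math
from collections import Counter

def entropy_bits(counts):
    """Empirical entropy in bits, from a Counter of occurrences."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical correlated source: 'b' almost always follows 'a'.
text = "ab" * 1000 + "ba" * 10

H1 = entropy_bits(Counter(text))  # single-symbol entropy: 1.0 bit
H2 = entropy_bits(Counter(text[i:i + 2] for i in range(0, len(text), 2)))

print(H1, H2 / 2)  # per-symbol cost drops once pairs capture the correlation
```

Here H2/2 is far below H1: the pair alphabet is dominated by "ab", so a block-2 code can spend far less than one bit per symbol on average.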

FAQ

How does Huffman coding work?

Huffman coding assigns shorter binary codes to more frequent symbols and longer codes to rare symbols. The algorithm builds a binary tree bottom-up: repeatedly merge the two lowest-frequency nodes until one tree remains. Symbols at shallow leaves get short codes; deep leaves get long codes. The result is a prefix-free code — no codeword is a prefix of another — that minimizes the expected code length for the given frequency distribution.
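The merge loop described above can be sketched with a min-heap. This is an illustrative implementation, not the simulator's own code; the symbol frequencies at the bottom are hypothetical.

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a prefix-free code by repeatedly merging the two lightest nodes."""
    tiebreak = count()  # keeps heap comparisons well-defined for equal weights
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Prepend a bit: left subtree gets '0', right subtree gets '1'.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical frequencies: the most common symbol gets the shortest codeword.
code = huffman_code({"e": 0.40, "t": 0.25, "a": 0.20, "z": 0.15})
```

For these weights, 'e' receives a 1-bit codeword and 'z' a 3-bit one, and the codeword lengths satisfy the Kraft equality Σ 2^(−len) = 1, as they must for a full binary code tree.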

What is the Shannon source coding theorem?

Shannon's source coding theorem (noiseless coding theorem) states that a source with entropy H bits per symbol cannot be losslessly compressed below H bits per symbol on average. Conversely, it can be compressed to a rate arbitrarily close to H using sufficiently long block codes. This establishes entropy as the fundamental limit of lossless compression.

Why can't you compress random data?

Truly random data has maximum entropy — every symbol is equally likely and there are no patterns. Since entropy equals log₂(N) for N equally likely symbols, the data is already at its entropy limit, and the average code length cannot drop below log₂(N) bits per symbol. By the pigeonhole principle, any compression scheme that shortens some inputs must lengthen others. For uniform random data, no input is more likely than another, so no systematic compression is possible.
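The pigeonhole argument is just a counting fact, which a one-liner can verify: there are 2ⁿ bit strings of length n, but only 2ⁿ − 1 strings of any length strictly shorter, so no injective scheme can shorten every input.

```python
# 2**n strings of length n vs. all strings of length 0..n-1 combined.
n = 16
print(2**n, sum(2**k for k in range(n)))  # → 65536 65535
```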

What is the difference between lossless and lossy compression?

Lossless compression (Huffman, LZ77, DEFLATE) preserves every bit of the original data — the decompressed output is identical to the input. Lossy compression (JPEG, MP3, H.264) discards information deemed perceptually unimportant, achieving much higher compression ratios at the cost of fidelity. Shannon's source coding theorem governs lossless compression; rate-distortion theory governs lossy compression.

Sources

View source on GitHub