Decoding Text: A Huffman Codec in Haskell

In the autumn of 2015 I was taking a programming-languages course at Florida Tech, and the third project — “Decoding Text” — asked for something deceptively small: given a Huffman encoding tree and a stream of bits, reconstruct the original message. My lab partner Benjamin Yue and I wrote it in Haskell, then rounded the decoder out into a full codec that compresses too. (The repository is named DecodeLZW, which is a small historical lie — there is no LZW here at all. It is Huffman coding from top to bottom.)

Huffman coding answers a simple question: if some characters appear far more often than others, why spend the same number of bits on each? Give the frequent characters short codes and the rare ones longer codes, and the total shrinks. The one rule is that no code may be a prefix of another — otherwise the decoder can’t tell where one character ends and the next begins. A binary tree gives you that for free: every character lives at a leaf, and its code is the path from the root.

The tree is the whole program

Everything in the codec is organised around one tiny type:

data Tree = Leaf Char | Branch Tree Tree

A leaf carries a character; a branch carries a left sub-tree (reached by a 0 bit) and a right one (a 1). Here is the tree the program actually builds for the word banana. The letter a occurs three times, so it sits one step from the root, while the rarer b and n sit a level deeper:

The Huffman tree for banana: the frequent a gets a one-bit code, and the tree itself ships as **bna.

Read the path to each leaf and you have the code book: a is 1, n is 01, b is 00. So banana becomes 001011011 — nine bits, where a fixed byte-per-character layout would spend forty-eight.

Writing the tree down

To decode a message you need its tree, so the tree has to travel with the bits. The format is one line of prefix notation: a * marks a branch (followed immediately by its left then right sub-trees), and any other character is a leaf. The banana tree serialises to **bna. Two mutually-inverse functions do the work, and they read almost like the grammar they implement:

serializeTree :: Tree -> String
serializeTree (Leaf c)     = [c]
serializeTree (Branch l r) = '*' : serializeTree l ++ serializeTree r

parseTree :: String -> Tree
parseTree = snd . parse
  where
    parse ('*':rest) = let (rest1, left)  = parse rest
                           (rest2, right) = parse rest1
                       in (rest2, Branch left right)
    parse (c:rest)   = (rest, Leaf c)

The encoded file is then just that tree line, followed by one 0/1 bit string per message line — so an empty input line round-trips to an empty output line.

Decoding: walk the tree, one bit at a time

With a tree in hand, decoding is a walk. Start at the root; each 0 steps left and each 1 steps right; when you land on a leaf you’ve recovered a character, and the next bit starts the next one:

decodeChar :: Tree -> (String, String) -> (String, String)
decodeChar (Leaf c)        (decoded, rest)    = (decoded ++ [c], rest)
decodeChar (Branch left _) (decoded, '0':bs)  = decodeChar left  (decoded, bs)
decodeChar (Branch _ rght) (decoded, '1':bs)  = decodeChar rght  (decoded, bs)
decodeChar _ _ = error "Either the message or the encoding is corrupted."

Feed it 001011011 against the tree above and it traces 00 → b, 1 → a, 01 → n, and on — spelling banana straight back out.

Building the tree, and a sharp edge

Compression is the inverse. Count how often each character occurs, then repeatedly fuse the two lowest-weight sub-trees into one until a single tree remains — the classic greedy construction:

buildTree :: String -> Tree
buildTree text =
    case frequencies text of
        []        -> error "Huffman.buildTree: empty input"
        [(c, _)]  -> Branch (Leaf c) (Leaf c)
        freqs     -> combine (map (\(c, w) -> (w, Leaf c)) freqs)
  where
    combine [(_, tree)] = tree
    combine forest      =
        let (w1, t1) : (w2, t2) : rest = sortBy (comparing fst) forest
        in combine ((w1 + w2, Branch t1 t2) : rest)

The frequencies helper leans on the standard library — map (\cs -> (head cs, length cs)) . group . sort — sorting the characters so identical ones cluster, then measuring each run.

That middle case is the interesting one. What is the Huffman tree of a string with only one distinct character, like aaaa? There is nothing to branch on — yet the decoder is built entirely around consuming a bit at every branch. The fix is to hand it a deliberately degenerate tree, Branch (Leaf c) (Leaf c), so the lone character still gets a one-bit code and the stream stays non-empty. It is the kind of corner you only find by actually running the thing.

A front end, and a backward-compatible default

The command-line wrapper is four lines of dispatch over interact, which pipes stdin through a String -> String function to stdout:

main = do
    args <- getArgs
    case args of
        ["compress"]   -> interact compress
        ["decompress"] -> interact decompress
        []             -> interact decompress
        _              -> usage

With no argument it decodes — the original assignment only ever asked for a decoder, so the default stays faithful to it. The whole codec is under a hundred lines, and the round trip holds: compress any text, pipe it through decompress, and the bytes come back unchanged.

What I remember most is how little ceremony Haskell needed. The encoding tree is one algebraic data type; serialising and parsing it are a near-symmetric pair; decoding is structural recursion that mirrors the tree it walks. The program more or less is the mathematics — which, years on, still feels like the point.