This post is about the data compression method called arithmetic coding, by which a text is encoded as a subinterval of the unit interval, which is then represented as a bit sequence. It can often encode more effectively than Huffman coding, because it is not subject to Huffman's restriction that each symbol be encoded as a positive whole number of bits; moreover, it readily accommodates adaptive models of the text, which “learn” about the text being encoded while encoding it. The post is based on lecture notes that I wrote in 2002 with Richard Bird, although the presentation here is somewhat simplified; it is another application of streaming. There’s quite a lot to cover, so in this post I’ll just set up the problem by implementing a basic encoder and decoder. In the next post, I’ll show how they can both be streamed. (We won’t get into the intricacies of restricting to fixed-precision arithmetic; perhaps I can cover that in a later post.)
The basic idea behind arithmetic coding is to encode an input text as a subinterval of the unit interval, based on a model of the text symbols that assigns them the pieces of a partition of the unit interval into non-empty subintervals. For the purposes of this post, we will deal mostly with half-open intervals, writing [l, r) for the interval containing those values x such that l ≤ x < r, where l and r are rationals.
For example, with just two symbols “a” and “b”, and a static model partitioning the unit interval into [0, 1/2) for “a” and [1/2, 1) for “b”, the symbols in the input text “aba” successively narrow the unit interval to [0, 1/2), then [1/4, 1/2), then [1/4, 3/8), and the last of these intervals is the encoding of the whole input. And in fact, it suffices to pick any single value in this final interval, as long as there is some other way to determine the end of the encoded text (such as the length, or a special end-of-text symbol).
We introduce the following basic definitions for intervals:
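In Haskell (the natural choice here), these definitions might be sketched as follows; the names Interval, unit, and the membership test within are my own choices:

```haskell
-- A pair (l, r) represents the half-open interval [l, r), with rational endpoints.
type Interval = (Rational, Rational)

-- The whole unit interval.
unit :: Interval
unit = (0, 1)

-- Membership: x lies within [l, r) iff l <= x < r.
within :: Rational -> Interval -> Bool
within x (l, r) = l <= x && x < r
```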
We’ll write “x ∈ i” for membership of a value x in the interval i, and “i ⊆ j” for inclusion of the interval i in the interval j.
A crucial operation on intervals is narrowing of one interval i by another j, where the result is to i as j is to the unit interval:
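Concretely, narrowing can be defined via an auxiliary interpolation function; this sketch uses my own names weight and narrow, and repeats the interval type so that it stands alone:

```haskell
type Interval = (Rational, Rational)   -- (l, r) for [l, r), as before

-- weight i x is the value proportionately x of the way between the endpoints of i.
weight :: Interval -> Rational -> Rational
weight (l, r) x = l + (r - l) * x

-- narrow i j places j within i as j sits within the unit interval.
narrow :: Interval -> Interval -> Interval
narrow i (p, q) = (weight i p, weight i q)
```

Narrowing the unit interval successively by [0, 1/2), [1/2, 1) and [0, 1/2) reproduces the “aba” example, ending at [1/4, 3/8).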
We’ll write “i ▷ j” for narrow i j. Thus, the auxiliary weight i x used in its definition is “proportionately x of the way between the endpoints of i”, and we have unit ▷ j = j and i ▷ unit = i; moreover, whenever j ⊆ unit, also i ▷ j ⊆ i.
Conversely, we can widen one interval by another:
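A corresponding sketch for widening, with scale as the inverse interpolation (again, the names are mine):

```haskell
type Interval = (Rational, Rational)   -- (l, r) for [l, r), as before

-- scale i y is the proportion of the way that y lies between the endpoints of i.
scale :: Interval -> Rational -> Rational
scale (l, r) y = (y - l) / (r - l)

-- widen i j maps j, assumed to sit within i, back out to the unit interval.
widen :: Interval -> Interval -> Interval
widen i (p, q) = (scale i p, scale i q)
```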
We’ll write “i ◁ j” for widen i j. Note that scale i is inverse to weight i, in the sense that scale i (weight i x) = x and weight i (scale i y) = y, and consequently widening is inverse to narrowing: i ◁ (i ▷ j) = j.
We work with inputs consisting of sequences of symbols, which might be characters or some higher-level tokens.
The model type then must provide the following operations:
- a way to look up a symbol, obtaining the corresponding interval;
- conversely, a way to decode a value, retrieving a symbol;
- an initial model;
- and a means to adapt the model on seeing a new symbol.
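One way to render these operations in Haskell is as a record of functions (a type class would serve just as well); the names below are assumptions of mine, and the static two-symbol model from the earlier example is included for illustration:

```haskell
type Interval = (Rational, Rational)   -- (l, r) for [l, r)
type Symbol   = Char                   -- characters as symbols, for concreteness

data Model = Model
  { encodeSym :: Symbol -> Interval    -- look up a symbol's interval
  , decodeSym :: Rational -> Symbol    -- which symbol's interval contains the value?
  , adapt     :: Symbol -> Model       -- revise the model on seeing a symbol
  }

-- The static model of the earlier example: "a" gets [0, 1/2), "b" gets [1/2, 1).
staticAB :: Model
staticAB = m
  where
    m = Model enc dec (const m)        -- static: adapting changes nothing
    enc 'a' = (0, 1/2)
    enc _   = (1/2, 1)
    dec x   = if x < 1/2 then 'a' else 'b'
```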
The central property is that encoding and decoding a single symbol are inverses, in the following sense: decoding any value that lies within the interval assigned by the model to a symbol must retrieve that same symbol.
There are no requirements on the initial model or on adaptation, beyond the latter being a total function.
For example, we might support adaptive coding via a model that counts the occurrences seen so far of each of the symbols, represented as a histogram:
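Here is one possible such model, sketched as a record of operations with hypothetical names (counting, encodeSym, decodeSym, adapt); the histogram is an association list of symbol counts, and each symbol receives a share of the unit interval proportional to its count:

```haskell
import Data.Ratio ((%))

type Interval = (Rational, Rational)   -- (l, r) for [l, r)
type Symbol   = Char

data Model = Model { encodeSym :: Symbol -> Interval
                   , decodeSym :: Rational -> Symbol
                   , adapt     :: Symbol -> Model }

-- An adaptive model: a histogram of counts, one per symbol. Symbol number k
-- gets the interval [c_k / total, c_{k+1} / total), where c_k is the total
-- count of the symbols preceding it in the histogram.
-- (Partial: looked-up symbols must occur in the histogram.)
counting :: [(Symbol, Integer)] -> Model
counting h = Model enc dec (\s -> counting (bump s h))
  where
    total  = sum (map snd h)
    cuts   = scanl (+) 0 (map snd h)               -- cumulative counts
    ivs    = zip (map fst h) (zip cuts (tail cuts))
    enc s  = head [ (lo % total, hi % total) | (t, (lo, hi)) <- ivs, t == s ]
    dec x  = head [ t | (t, (lo, hi)) <- ivs, lo % total <= x, x < hi % total ]
    bump s = map (\(t, n) -> if t == s then (t, n + 1) else (t, n))

-- Starting with a count of 1 for each symbol of a two-letter alphabet.
initialAB :: Model
initialAB = counting [('a', 1), ('b', 1)]
```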
This naive implementation works well enough for small alphabets. One might maintain the histogram in decreasing order of counts, so that the most likely symbols are at the front and are therefore found quickest. For larger alphabets, it is better to maintain the histogram as a binary search tree, ordered alphabetically by symbol, with the total count of each subtree cached.
Now encoding is straightforward to define. The function takes an initial model and a list of symbols, and returns the list of intervals obtained by looking up each symbol in turn, adapting the model at each step:
We then narrow the unit interval by each of these subintervals, and pick a single value from the resulting interval:
All we require of the picking operation is that it returns some value within the given interval; then encoding a text yields a fraction in the unit interval. For example, we might pick the midpoint of the interval.
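Putting these pieces together, the encoder might be sketched as follows; intervals, encode, and pick are names of my choosing, and the block repeats earlier definitions so that it is self-contained:

```haskell
type Interval = (Rational, Rational)   -- (l, r) for [l, r)
type Symbol   = Char

unit :: Interval
unit = (0, 1)

narrow :: Interval -> Interval -> Interval
narrow (l, r) (p, q) = (l + (r - l) * p, l + (r - l) * q)

data Model = Model { encodeSym :: Symbol -> Interval
                   , decodeSym :: Rational -> Symbol
                   , adapt     :: Symbol -> Model }

-- The intervals obtained by looking up each symbol, adapting the model as we go.
intervals :: Model -> [Symbol] -> [Interval]
intervals _ []       = []
intervals m (s : ss) = encodeSym m s : intervals (adapt m s) ss

-- Narrow the unit interval by each subinterval, then pick a value from the result.
encode :: Model -> [Symbol] -> Rational
encode m = pick . foldl narrow unit . intervals m

-- Any value within the interval will do; the midpoint is a simple choice.
pick :: Interval -> Rational
pick (l, r) = (l + r) / 2

-- The static two-symbol model again, for trying the encoder out.
staticAB :: Model
staticAB = m
  where
    m = Model enc dec (const m)
    enc 'a' = (0, 1/2)
    enc _   = (1/2, 1)
    dec x   = if x < 1/2 then 'a' else 'b'
```

With this model, encode staticAB "aba" narrows the unit interval to [1/4, 3/8) and picks its midpoint, 5/16.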
So much for encoding; how do we retrieve the input text? In fact, we can retrieve the first symbol simply by decoding the encoded fraction with the initial model. Expanding the encoding of a non-empty text, we have:
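With hypothetical names (encode for the encoder, intervals for the symbol-by-symbol lookup, narrow for narrowing, pick for choosing a value), the expansion can be rendered as an equational calculation:

```
encode m (s : ss)
  = {- definition of encode -}
    pick (foldl narrow unit (intervals m (s : ss)))
  = {- definition of intervals, and of foldl -}
    pick (foldl narrow (narrow unit (encodeSym m s)) (intervals (adapt m s) ss))
  = {- unit is the neutral element of narrowing -}
    pick (foldl narrow (encodeSym m s) (intervals (adapt m s) ss))
```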
The proof obligation, left as an exercise, is to show that the value picked after all the remaining narrowings still lies within the first symbol’s interval, which holds when the list of remaining intervals is of the form delivered by the model for some text, since each such interval is then a subinterval of the unit interval, and narrowing by a subinterval of unit can only shrink.
and indeed, encoding yields a fraction in the unit interval, so this recovers the first symbol correctly. This is the foothold that allows the decoding process to make progress; having obtained the first symbol using the initial model, it can adapt the model in precisely the same way that the encoding process does, then retrieve the second symbol using that adapted model, and so on. The only slightly tricky part is that when decoding an initial value x, having obtained the first symbol s, decoding should continue on some modified value x′; what should the modification be? It turns out that the right thing to do is to scale x by the interval associated in the model with symbol s, since scaling is the inverse of the narrowings that take place during encoding. That is, we define:
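A sketch of the decoder, with names of my own choosing and the supporting definitions repeated so that the block stands alone:

```haskell
type Interval = (Rational, Rational)   -- (l, r) for [l, r)
type Symbol   = Char

-- scale i y is the proportion of the way that y lies between the endpoints of i.
scale :: Interval -> Rational -> Rational
scale (l, r) y = (y - l) / (r - l)

data Model = Model { encodeSym :: Symbol -> Interval
                   , decodeSym :: Rational -> Symbol
                   , adapt     :: Symbol -> Model }

-- Emit the symbol whose interval contains the value, then continue with the
-- adapted model on the value scaled by that interval.
decode :: Model -> Rational -> [Symbol]
decode m x = s : decode (adapt m s) (scale (encodeSym m s) x)
  where s = decodeSym m x

-- The static two-symbol model again, for trying the decoder out.
staticAB :: Model
staticAB = m
  where
    m = Model enc dec (const m)
    enc 'a' = (0, 1/2)
    enc _   = (1/2, 1)
    dec x   = if x < 1/2 then 'a' else 'b'
```

For instance, take 3 (decode staticAB (5/16)) yields "aba", since 5/16 encodes "aba" under this static model.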
(Of course, x lies within the interval associated with s, by the inverse requirement on models, and so the new scaled value is again within the unit interval.)
Note that decoding yields an infinite list of symbols; the decoder is always productive. Nevertheless, that infinite list starts with the encoded text, as we shall now verify. Define the round-trip function that encodes a text and immediately decodes the resulting fraction, starting from the same initial model:
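The whole pipeline, round-trip included, might be sketched self-contained as follows (all names are my own):

```haskell
type Interval = (Rational, Rational)   -- (l, r) for [l, r)
type Symbol   = Char

unit :: Interval
unit = (0, 1)

narrow :: Interval -> Interval -> Interval
narrow (l, r) (p, q) = (l + (r - l) * p, l + (r - l) * q)

scale :: Interval -> Rational -> Rational
scale (l, r) y = (y - l) / (r - l)

data Model = Model { encodeSym :: Symbol -> Interval
                   , decodeSym :: Rational -> Symbol
                   , adapt     :: Symbol -> Model }

intervals :: Model -> [Symbol] -> [Interval]
intervals _ []       = []
intervals m (s : ss) = encodeSym m s : intervals (adapt m s) ss

pick :: Interval -> Rational
pick (l, r) = (l + r) / 2               -- the midpoint will do

encode :: Model -> [Symbol] -> Rational
encode m = pick . foldl narrow unit . intervals m

decode :: Model -> Rational -> [Symbol]
decode m x = s : decode (adapt m s) (scale (encodeSym m s) x)
  where s = decodeSym m x

-- Encode a text, then decode the resulting fraction from the same initial model.
roundTrip :: Model -> [Symbol] -> [Symbol]
roundTrip m = decode m . encode m

staticAB :: Model                        -- the static two-symbol model again
staticAB = m
  where
    m = Model enc dec (const m)
    enc 'a' = (0, 1/2)
    enc _   = (1/2, 1)
    dec x   = if x < 1/2 then 'a' else 'b'
```

The round-trip returns the input as a prefix of an infinite list, so one takes only as many symbols as were put in.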
Then we have, unfolding one step: the round-trip of a non-empty text yields the first symbol, followed by the round-trip of the remaining text under the adapted model; this uses the first-symbol property above, together with the fact that scaling undoes weighting.
From this it follows that the round-trip indeed recovers the initial text, in the sense that it yields an infinite sequence that starts with that text; in fact, it yields the original input followed by some junk, the latter obtained by decoding the fraction that encodes the empty text, using the final model that results from adapting the initial model to each input symbol in turn. To actually retrieve the input text with no junk suffix, one could transmit the length separately (although that doesn’t sit well with streaming), or append a distinguished end-of-text symbol.
So far we have an encoder and a decoder, and a proof that the decoder successfully decodes the encoded text. In the next post, we’ll see how to reimplement both as streaming processes.