## Arithmetic Coding

This post is about the data compression method called arithmetic coding, by which a text is encoded as a subinterval of the unit interval, which is then represented as a bit sequence. It can often encode more effectively than Huffman encoding, because it doesn’t have the restriction of Huffman that each symbol be encoded as a positive whole number of bits; moreover, it readily accommodates adaptive models of the text, which “learn” about the text being encoded while encoding it. It is based on lecture notes that I wrote in 2002 with Richard Bird, although the presentation here is somewhat simplified; it is another application of streaming. There’s quite a lot to cover, so in this post I’ll just set up the problem by implementing a basic encoder and decoder. In the next post, I’ll show how they can both be streamed. (We won’t get into the intricacies of restricting to fixed-precision arithmetic—perhaps I can cover that in a later post.)

The basic idea behind arithmetic coding is essentially to encode an input text as a subinterval of the unit interval, based on a model of the text symbols that assigns them to a partition of the unit interval into non-empty subintervals. For the purposes of this post, we will deal mostly with half-open intervals, so that the interval ${[l,r)}$ contains values ${x}$ such that ${l \le x < r}$, where ${l,r,x}$ are rationals.

For example, with just two symbols “a” and “b”, and a static model partitioning the unit interval into ${[0, \frac 1 3)}$ for “a” and ${[\frac 1 3, 1)}$ for “b”, the symbols in the input text “aba” successively narrow the unit interval to ${[0,\frac 1 3), [\frac 1 9, \frac 1 3), [\frac 1 9, \frac 5 {27})}$, and the latter interval is the encoding of the whole input. And in fact, it suffices to pick any single value in this final interval, as long as there is some other way to determine the end of the encoded text (such as the length, or a special end-of-text symbol).

## Intervals

We introduce the following basic definitions for intervals:

$\displaystyle \begin{array}{@{}l} \mathbf{type}\;\mathit{Interval} = (\mathit{Rational}, \mathit{Rational}) \vrule width0pt depth2ex \\ \mathit{unit} :: \mathit{Interval} \\ \mathit{unit} = (0,1) \vrule width0pt depth2ex \\ \mathit{contains} :: \mathit{Interval} \rightarrow \mathit{Rational} \rightarrow \mathit{Bool} \\ \mathit{contains}\;(l,r)\;x = l \le x \land x < r \vrule width0pt depth2ex \\ \mathit{includes} :: \mathit{Interval} \rightarrow \mathit{Interval} \rightarrow \mathit{Bool} \\ \mathit{includes}\;(l,r)\;(p,q) = l \le p \land q \le r \end{array}$

We’ll write “${i \ni x}$” for ${\mathit{contains}\;i\;x}$, and “${i \supseteq j}$” for ${\mathit{includes}\;i\;j}$.

A crucial operation on intervals is narrowing of one interval by another, where ${\mathit{narrow}\;i\;j}$ is to ${i}$ as ${j}$ is to the unit interval:

$\displaystyle \begin{array}{@{}l} \mathit{narrow} :: \mathit{Interval} \rightarrow \mathit{Interval} \rightarrow \mathit{Interval} \\ \mathit{narrow}\;i\;(p,q) = (\mathit{weight}\;i\;p, \mathit{weight}\;i\;q) \vrule width0pt depth2ex \\ \mathit{weight} :: \mathit{Interval} \rightarrow \mathit{Rational} \rightarrow \mathit{Rational} \\ \mathit{weight}\;(l,r)\;x = l + (r-l) \times x \end{array}$

We’ll write “${i \mathbin{\triangleright} j}$” for ${\mathit{narrow}\;i\;j}$. Thus, ${\mathit{weight}\;(l,r)\;x}$ is “proportionately ${x}$ of the way between ${l}$ and ${r}$“, and we have

$\displaystyle \begin{array}{@{}lcl} i \ni \mathit{weight}\;i\;x & \Leftarrow& \mathit{unit} \ni x \\ i \supseteq i \mathbin{\triangleright} j &\Leftarrow& \mathit{unit} \supseteq j \end{array}$

Conversely, we can widen one interval by another:

$\displaystyle \begin{array}{@{}l} \mathit{widen} :: \mathit{Interval} \rightarrow \mathit{Interval} \rightarrow \mathit{Interval} \\ \mathit{widen}\;i\;(p,q) = (\mathit{scale}\;i\;p, \mathit{scale}\;i\;q) \vrule width0pt depth2ex \\ \mathit{scale} :: \mathit{Interval} \rightarrow \mathit{Rational} \rightarrow \mathit{Rational} \\ \mathit{scale}\;(l,r)\;x = (x-l)/(r-l) \end{array}$

We’ll write “${i \mathbin{\triangleleft} j}$” for ${\mathit{widen}\;i\;j}$. Note that ${\mathit{scale}}$ is inverse to ${\mathit{weight}}$, in the sense

$\displaystyle y = \mathit{weight}\;i\;x \Leftrightarrow \mathit{scale}\;i\;y = x$

and consequently widening is inverse to narrowing:

$\displaystyle i \mathbin{\triangleleft} (i \mathbin{\triangleright} j) = j$

## Models

We work with inputs consisting of sequences of symbols, which might be characters or some higher-level tokens:

$\displaystyle \mathbf{type}\;\mathit{Symbol} = \mathit{Char}$

The type ${\mathit{Model}}$ then must provide the following operations:

• a way to look up a symbol, obtaining the corresponding interval:

$\displaystyle \mathit{encodeSym} :: \mathit{Model} \rightarrow \mathit{Symbol} \rightarrow \mathit{Interval}$

• conversely, a way to decode a value, retrieving a symbol:

$\displaystyle \mathit{decodeSym} :: \mathit{Model} \rightarrow \mathit{Rational} \rightarrow \mathit{Symbol}$

• an initial model:

$\displaystyle \mathit{initial} :: \mathit{Model}$

• a means to adapt the model on seeing a new symbol:

$\displaystyle \mathit{newModel} :: \mathit{Model} \rightarrow \mathit{Symbol} \rightarrow \mathit{Model}$

The central property is that encoding and decoding are inverses, in the following sense:

$\displaystyle \mathit{decodeSym}\;m\;x = s \quad \Leftrightarrow \quad \mathit{encodeSym}\;m\;s \ni x$

There are no requirements on ${\mathit{initial}}$ and ${\mathit{newModel}}$, beyond the latter being a total function.

For example, we might support adaptive coding via a model that counts the occurrences seen so far of each of the symbols, represented as a histogram:

$\displaystyle \mathbf{type}\;\mathit{Model} = [(\mathit{Symbol},\mathit{Integer})]$

This naive implementation works well enough for small alphabets. One might maintain the histogram in decreasing order of counts, so that the most likely symbols are at the front and are therefore found quickest. For larger alphabets, it is better to maintain the histogram as a binary search tree, ordered alphabetically by symbol, and caching the total counts of every subtree.

## Encoding

Now encoding is straightforward to define. The function ${\mathit{encodeSyms}}$ takes an initial model and a list of symbols, and returns the list of intervals obtained by looking up each symbol in turn, adapting the model at each step:

$\displaystyle \begin{array}{@{}l} \mathit{encodeSyms} :: \mathit{Model} \rightarrow [\mathit{Symbol}] \rightarrow [\mathit{Interval}] \\ \mathit{encodeSyms}\; m = \mathit{map}\;\mathit{snd} \cdot \mathit{tail} \cdot \mathit{scanl}\;\mathit{next}\;(m,\mathit{unit}) \\ \quad \mathbf{where}\; \begin{array}[t]{@{}l} \mathit{next} :: (\mathit{Model},\mathit{Interval}) \rightarrow \mathit{Symbol} \rightarrow (\mathit{Model},\mathit{Interval}) \\ \mathit{next}\;(m,i)\;s = (\mathit{newModel}\;m\;s, \mathit{encodeSym}\;m\;s) \end{array} \end{array}$

That is,

$\displaystyle \begin{array}{@{}lcl} \mathit{encodeSyms}\;m\;[\,] &=& [\,] \\ \mathit{encodeSyms}\;m\;(s:ss) &=& \mathit{encodeSym}\;m\;s : \mathit{encodeSyms}\;(\mathit{newModel}\;m\;s)\;ss \end{array}$

We then narrow the unit interval by each of these subintervals, and pick a single value from the resulting interval:

$\displaystyle \begin{array}{@{}l} \mathit{encode}_0 :: \mathit{Model} \rightarrow [\mathit{Symbol}] \rightarrow \mathit{Rational} \\ \mathit{encode}_0\;m = \mathit{pick} \cdot \mathit{foldr}\;\mathit{narrow}\;\mathit{unit} \cdot \mathit{encodeSyms}\;m \end{array}$

All we require of ${\mathit{pick} :: \mathit{Interval} \rightarrow \mathit{Rational}}$ is that ${i \ni \mathit{pick}\;i}$; then ${\mathit{encode}_0}$ yields a fraction in the unit interval. For example, we might set ${\mathit{pick} = \mathit{midpoint}}$, where

$\displaystyle \textstyle \mathit{midpoint}\;i = \mathit{weight}\;i\;(\frac 1 2)$

## Decoding

So much for encoding; how do we retrieve the input text? In fact, we can retrieve the first symbol simply by using ${\mathit{decodeSym}}$. Expanding the encoding of a non-empty text, we have:

$\displaystyle \begin{array}{@{}cl} & \mathit{encode}_0\;m\;(s:ss) \\ = & \qquad \{ \mathit{encode}_0 \mbox{ and } \mathit{encodeSyms} \mbox{, as above; let } i = \mathit{encodeSym}\;m\;s \} \\ & \mathit{pick}\;(\mathit{foldr}\;\mathit{narrow}\;\mathit{unit}\;(i : \mathit{encodeSyms}\;(\mathit{newModel}\;m\;s)\;ss)) \\ = & \qquad \{ \mbox{fold} \} \\ & \mathit{pick}\;(i \mathbin{\triangleright} \mathit{foldr}\;\mathit{narrow}\;\mathit{unit}\;(\mathit{encodeSyms}\;(\mathit{newModel}\;m\;s)\;ss)) \\ = & \qquad \{ \mathit{pick}\;(i \mathbin{\triangleright} j) = \mathit{weight}\;i\;(\mathit{pick}\;j) \mbox{ (see below)} \} \\ & \mathit{weight}\;i\;(\mathit{pick}\;(\mathit{foldr}\;\mathit{narrow}\;\mathit{unit}\;(\mathit{encodeSyms}\;(\mathit{newModel}\;m\;s)\;ss))) \\ = & \qquad \{ \mathit{encode}_0 \mbox{ and } \mathit{encodeSyms} \mbox{ again} \} \\ & \mathit{weight}\;i\;(\mathit{encode}_0\;(\mathit{newModel}\;m\;s)\;ss) \end{array}$

The proof obligation, left as an exercise, is to show that

$\displaystyle \mathit{pick}\;(i \mathbin{\triangleright} j) = \mathit{weight}\;i\;(\mathit{pick}\;j)$

which holds when ${\mathit{pick}\;i}$ is of the form ${\mathit{weight}\;i\;x}$ for some ${x}$.

Now

$\displaystyle \begin{array}{@{}cl} & \mathit{decodeSym}\;m\;(\mathit{encode}_0\;m\;(s:ss)) = s \\ \Leftrightarrow & \qquad \{ \mbox{expansion of } \mathit{encode}_0 \mbox{, as above; let } i = \mathit{encodeSym}\;m\;s \} \\ & \mathit{decodeSym}\;m\;(\mathit{weight}\;i\;(\mathit{encode}_0\;(\mathit{newModel}\;m\;s)\;ss)) = s \\ \Leftrightarrow & \qquad \{ \mbox{requirement on models} \} \\ & i \ni \mathit{weight}\;i\;(\mathit{encode}_0\;(\mathit{newModel}\;m\;s)\;ss) \\ \Leftarrow & \qquad \{ \mathit{weight} \} \\ & \mathit{unit} \ni \mathit{encode}_0\;(\mathit{newModel}\;m\;s)\;ss \end{array}$

and indeed, encoding yields a fraction in the unit interval, so this recovers the first symbol correctly. This is the foothold that allows the decoding process to make progress; having obtained the first symbol using ${\mathit{decodeSym}}$, it can adapt the model in precisely the same way that the encoding process does, then retrieve the second symbol using that adapted model, and so on. The only slightly tricky part is that when decoding an initial value ${x}$, having obtained the first symbol ${s}$, decoding should continue on some modified value ${x'}$; what should the modification be? It turns out that the right thing to do is to scale ${x}$ by the interval associated in the model with symbol ${s}$, since scaling is the inverse operation to the ${\mathit{weight}}$s that take place during encoding. That is, we define:

$\displaystyle \begin{array}{@{}l} \mathit{decode}_0 :: \mathit{Model} \rightarrow \mathit{Rational} \rightarrow [\mathit{Symbol}] \\ \mathit{decode}_0\;m\;x = \mathit{unfoldr}\;\mathit{step}\;(m,x) \vrule width0pt depth2ex \\ \mathit{step} :: (\mathit{Model}, \mathit{Rational}) \rightarrow \mathsf{Maybe}\;(\mathit{Symbol}, (\mathit{Model},\mathit{Rational})) \\ \mathit{step}\;(m,x) = \mathit{Just}\;(s, (\mathit{newModel}\;m\;s, \mathit{scale}\;(\mathit{encodeSym}\;m\;s)\;x)) \\ \quad \mathbf{where}\;s = \mathit{decodeSym}\;m\;x \end{array}$

(Of course, ${\mathit{encodeSym}\;m\;s \ni x}$, by the inverse requirement on models, and so the new scaled value is again within the unit interval.)

Note that decoding yields an infinite list of symbols; the function ${\mathit{step}}$ is always productive. Nevertheless, that infinite list starts with the encoded text, as we shall now verify. Define the round-trip function

$\displaystyle \mathit{round}_0\;m = \mathit{decode}_0\;m \cdot \mathit{encode}_0\;m$

Then we have:

$\displaystyle \begin{array}{@{}cl} & \mathit{round}_0\;m\;(s:ss) \\ = & \qquad \{ \mbox{definition of } \mathit{round}_0 \} \\ & \mathit{decode}_0\;m\;(\mathit{encode}_0\;m\;(s:ss)) \\ = & \qquad \{ \mathit{encode}_0 \mbox{; let } i = \mathit{encodeSym}\;m\;s, m' = \mathit{newModel}\;m\;s \} \\ & \mathit{decode}_0\;m\;(\mathit{weight}\;i\;(\mathit{encode}_0\;m'\;ss)) \\ = & \qquad \{ \mathit{decode}_0 \mbox{; first decoded symbol is correct, as above} \} \\ & s : \mathit{decode}_0\;m'\;(\mathit{scale}\;i\;(\mathit{weight}\;i\;(\mathit{encode}_0\;m'\;ss))) \\ = & \qquad \{ \mathit{scale}\;i\;(\mathit{weight}\;i\;x) = x \} \\ & s : \mathit{decode}_0\;m'\;(\mathit{encode}_0\;m'\;ss) \\ = & \qquad \{ \mbox{definition of } \mathit{round}_0 \} \\ & s : \mathit{round}_0\;m'\;ss \end{array}$

From this it follows that indeed the round-trip recovers the initial text, in the sense that ${\mathit{round}_0\;m\;ss}$ yields an infinite sequence that starts with ${ss}$; in fact,

$\displaystyle \mathit{round}_0\;m\;ss = ss \mathbin{{+}\!\!\!{+}} \mathit{round}_0\;(\mathit{foldl}\;\mathit{newModel}\;m\;ss)\;[\,]$

yielding the original input followed by some junk, the latter obtained by decoding the fraction ${\frac 1 2}$ (the encoding of ${[\,]}$) from the final model ${\mathit{foldl}\;\mathit{newModel}\;m\;ss}$ that results from adapting the initial model to each symbol in ${ss}$ in turn. To actually retrieve the input text with no junk suffix, one could transmit the length separately (although that doesn’t sit well with streaming), or append a distinguished end-of-text symbol.

## What’s next

So far we have an encoder and a decoder, and a proof that the decoder successfully decodes the encoded text. In the next post, we’ll see how to reimplement both as streaming processes.