The previous two posts discussed arithmetic coding (AC). In this post I’ll discuss a related but more recent entropy encoding technique called asymmetric numeral systems (ANS), introduced by Jarek Duda and explored in some detail in a series of 12 blog posts by Charles Bloom. ANS combines the effectiveness of AC (achieving approximately the theoretical optimum Shannon entropy) with the efficiency of Huffman coding. On the other hand, whereas AC acts in a first-in first-out manner, ANS acts last-in first-out, in the sense that the decoded text comes out in reverse; this is not a problem in a batched context, but it makes ANS unsuitable for encoding a communications channel, and also makes it difficult to employ adaptive text models.
Arithmetic coding, revisited
Here is a simplified version of arithmetic coding; compared to the previous posts, we use a fixed rather than adaptive model, and rather than picking some arbitrary element of the final interval, we pick explicitly the lower bound of the interval (by fst, when it is represented as a pair). We suppose a function
that returns a positive count for every symbol, representing predicted frequencies of occurrence, from which it is straightforward to derive functions
that work on a fixed global model, satisfying the central property
For example, an alphabet of three symbols `a’, `b’, `c’, with counts 2, 3, and 5 respectively, could be encoded by the intervals (0, 2/10), (2/10, 5/10), and (5/10, 1). We have operations on intervals:
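For concreteness, here is a sketch of what such a setup might look like; the names interval, narrow, and unit are placeholders of my own, since the original code did not survive transcription:

```haskell
-- A sketch of the fixed model for the three-symbol example.
type Interval = (Rational, Rational)

unit :: Interval
unit = (0, 1)

-- narrow i j locates the subinterval j within i, by scaling and translating
narrow :: Interval -> Interval -> Interval
narrow (l, r) (p, q) = (l + (r - l) * p, l + (r - l) * q)

-- symbol intervals derived from the counts 2, 3, 5 out of a total of 10
interval :: Char -> Interval
interval 'a' = (0, 2/10)
interval 'b' = (2/10, 5/10)
interval 'c' = (5/10, 1)
interval _   = error "unknown symbol"
```

With these definitions, folding narrow over the intervals for “abc” yields the final interval (7/100, 1/10).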
Then we have:
For example, with the alphabet and intervals above, the input text “abc” gets encoded symbol by symbol, from left to right (because of the foldl), to the narrowing sequence of intervals (0, 1/5), (1/25, 1/10), (7/100, 1/10), from which we select the lower bound 7/100 of the final interval. (This version doesn’t stream well, because of the decision to pick the lower bound of the final interval as the result. But it will suffice for our purposes.)
Note now that the symbol encoding can be “fissioned” out of the fold, using map fusion backwards; moreover, interval narrowing is associative (with the unit interval as its neutral element), so we can replace the foldl with a foldr; and finally, now using map fusion forwards, we can fuse the map with the foldr. That is,
and we can fuse the back into the ; so we have , where
Now encoding is a simple foldr and decoding a simple unfoldr. This means that we can prove that decoding inverts encoding, using the “Unfoldr–Foldr Theorem” stated in the Haskell documentation for unfoldr (in fact, the only property of unfoldr stated there!). The theorem states that if
for all and , then
for all finite lists . Actually, that’s too strong for us here, because our decoding always generates an infinite list; a weaker variation (hereby called the “Unfoldr–Foldr Theorem With Junk”) states that if
for all and , then
for all finite lists—that is, the unfoldr inverts the foldr except for appending some junk to the output. It’s a nice exercise to prove this weaker theorem, by induction.
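To see the theorem in action on a small instance (g and f below are toy definitions of my own, nothing to do with coding):

```haskell
import Data.List (unfoldr)

-- g packs a decimal digit onto a number; f unpacks one digit.
-- Since f (g d n) = Just (d, n) for digits 0 <= d < 10, the theorem
-- with junk applies: unfoldr f recovers the input list, followed by
-- junk (an infinite tail of zeros, because f never returns Nothing).
g :: Integer -> Integer -> Integer
g d n = d + 10 * n

f :: Integer -> Maybe (Integer, Integer)
f n = Just (n `mod` 10, n `div` 10)
```

For instance, foldr g 0 [3,1,4] = 413, and unfoldr f 413 begins [3,1,4,0,0,…].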
We check the condition:
Now, we hope to recover the first symbol, i.e. we hope that:
Fortunately, it is an invariant of the encoding process that —the result of —is in the unit interval, so indeed . Continuing:
as required. Therefore decoding inverts encoding, apart from the fact that it continues to produce junk having decoded all the input; so
for all finite .
From fractions to integers
Whereas AC encodes longer and longer messages as more and more precise fractions, ANS encodes them as larger and larger integers. Given the function on symbols, we can easily derive definitions
such that gives the cumulative counts of all symbols preceding (say, in alphabetical order), the total of all symbol counts, and looks up an integer :
(for example, we might tabulate the function using a scan).
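For the running example, the derived functions might look like the following; the identifiers are placeholders of my own, and the tabulating scan is shown explicitly:

```haskell
-- The fixed model for the three-symbol alphabet, with cumulative
-- counts tabulated by a scan.
alphabet :: [Char]
alphabet = "abc"

count :: Char -> Integer
count 'a' = 2; count 'b' = 3; count 'c' = 5

-- running totals [0,2,5,10]: cumulative count before each symbol,
-- with the grand total at the end
cumuls :: [Integer]
cumuls = scanl (+) 0 (map count alphabet)

cumul :: Char -> Integer
cumul s = head [c | (s', c) <- zip alphabet cumuls, s' == s]

total :: Integer
total = last cumuls

-- find n yields the symbol s with cumul s <= n < cumul s + count s
find :: Integer -> Char
find n = last [s | (s, c) <- zip alphabet cumuls, c <= n]
```

So for example find 4 = 'b', since the cumulative counts of `a’, `b’, `c’ are 0, 2, and 5.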
We encode a text as an integer x, containing log x bits of information (all logs in this blog are base 2). The next symbol to encode, with count c out of the total t, has probability c/t, and so requires an additional log (t/c) bits; in total, that’s log x + log (t/c) = log (x t/c) bits. So entropy considerations tell us that, roughly speaking, to incorporate the symbol into state x we want to map x to approximately x × t/c. Of course, in order to decode, we need to be able to invert this transformation, to extract the symbol and x from the new state; this suggests that we should do the division by c first:
so that the multiplication by the known value can be undone first:
How do we reconstruct ? Well, there is enough headroom in to add any non-negative value less than without affecting the division; in particular, we can add to , and then we can use on the remainder:
But we have still lost some information from the lower end of through the division, namely ; so we can’t yet reconstruct . Happily, for any , so there is still precisely enough headroom in to add this lost information too without affecting the , allowing us also to reconstruct :
We therefore define
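The lost definitions might be reconstructed as follows; the names (count, cumul, total, find, econs, edest) are my own, and the model functions are repeated to keep the snippet self-contained:

```haskell
import Data.List (unfoldr)

count, cumul :: Char -> Integer
count 'a' = 2; count 'b' = 3; count 'c' = 5
cumul 'a' = 0; cumul 'b' = 2; cumul 'c' = 5

total :: Integer
total = 10

find :: Integer -> Char
find n | n < 2     = 'a'
       | n < 5     = 'b'
       | otherwise = 'c'

-- incorporate symbol s into state x, multiplying by roughly total / count s
econs :: Char -> Integer -> Integer
econs s x = (x `div` count s) * total + cumul s + x `mod` count s

-- extract the most recently incorporated symbol, undoing econs
edest :: Integer -> Maybe (Char, Integer)
edest x = Just (s, count s * (x `div` total) + r - cumul s)
  where r = x `mod` total
        s = find r
```

With these, foldr econs 0 "abc" = 70, and the first three symbols of unfoldr edest 70 are "abc" (followed by junk).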
Using similar reasoning as for AC, it is straightforward to show that a decoding step inverts an encoding step:
Therefore decoding inverts encoding, modulo final junk, by the Unfoldr–Foldr Theorem With Junk:
for all finite inputs. For example, with the same alphabet and symbol weights as before, encoding the text “abc” (now from right to left, because of the foldr) proceeds through the states 0, 5, 14, 70, and the last state, 70, is the encoding.
In fact, the inversion property holds whatever value we use to start encoding with (since it isn’t used in the proof); in the next section, we start encoding with a certain lower bound rather than with zero. Moreover, the encoding step is strictly increasing on states strictly greater than zero, and the decoding step strictly decreasing; which means that the decoding process can stop when it returns to the lower bound. That is, if we pick some lower bound and define
then the stronger version of the Unfoldr–Foldr Theorem holds, and we have
for all finite , without junk.
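Assembled into a sketch, with an arbitrary lower bound of 1 (all identifiers here are placeholders of my own):

```haskell
import Data.List (unfoldr)

count, cumul :: Char -> Integer
count 'a' = 2; count 'b' = 3; count 'c' = 5
cumul 'a' = 0; cumul 'b' = 2; cumul 'c' = 5

total, lb :: Integer
total = 10
lb = 1   -- an arbitrary positive lower bound to start encoding from

find :: Integer -> Char
find n | n < 2     = 'a'
       | n < 5     = 'b'
       | otherwise = 'c'

encode :: String -> Integer
encode = foldr econs lb
  where econs s x = (x `div` count s) * total + cumul s + x `mod` count s

-- stop once the state has decreased back to the lower bound
decode :: Integer -> String
decode = unfoldr edest
  where
    edest x | x == lb   = Nothing
            | otherwise = Just (s, count s * (x `div` total) + r - cumul s)
      where
        r = x `mod` total
        s = find r
```

Now decode (encode "abc") = "abc" exactly, with no junk.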
The previous versions all used arbitrary-precision arithmetic, which is expensive. We now change the approach slightly to use only bounded-precision arithmetic. As usual, there is a trade-off between effectiveness (a bigger bound means more accurate approximations to ideal entropies) and efficiency (a smaller bound generally means faster operations). Fortunately, the reasoning does not depend on much about the actual bounds. We will represent the integer accumulator as a pair which we call a :
such that is a list of digits in a given base , and is an integer with (where we define for the upper bound of the range), under the abstraction relation
We call the “window” and the “remainder”. For example, with and , the pair represents ; and indeed, for any there is a unique representation satisfying the invariants. According to Duda, should divide into evenly. In a real implementation, and should be powers of two, such that arithmetic on values up to fits within a single machine word.
The encoding step acts on the window in the accumulator using , which risks making it overflow the range; we therefore renormalize with by shifting digits from the window to the remainder until this overflow would no longer happen, before consuming the symbol.
Some calculation, using the fact that divides , allows us to rewrite the guard in :
For example, again encoding the text “abc”, again from right to left, with base 10 and lower bound 100 for the window, the accumulator starts with window 100 and an empty remainder, and goes through the states (205, []) and (683, []) on consuming the `c’ then the `b’; directly consuming the `a’ would make the window overflow, so we renormalize to (68, [3]); then it is safe to consume the `a’, leading to the final state (340, [3]).
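The renormalizing encoding step might be sketched as follows; the bounds (base 10, window lower bound 100) are inferred from the worked example, and all names are placeholders of my own:

```haskell
count, cumul :: Char -> Integer
count 'a' = 2; count 'b' = 3; count 'c' = 5
cumul 'a' = 0; cumul 'b' = 2; cumul 'c' = 5

total, b, l, u :: Integer
total = 10; b = 10; l = 100; u = b * l

econs :: Char -> Integer -> Integer
econs s w = (w `div` count s) * total + cumul s + w `mod` count s

type Accum = (Integer, [Integer])   -- (window, remainder)

-- shift digits from the window to the remainder until consuming s is safe
enorm :: Char -> Accum -> Accum
enorm s (w, ds) | econs s w < u = (w, ds)
                | otherwise     = enorm s (w `div` b, w `mod` b : ds)

-- renormalize first, then consume the symbol
econsA :: Char -> Accum -> Accum
econsA s a = let (w, ds) = enorm s a in (econs s w, ds)
```

Folding this over “abc” from the right, starting from (100, []), yields (340, [3]), as in the trace above.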
Decoding is an unfold using the accumulator as state. We repeatedly output a symbol from the window; this may make the window underflow the range, in which case we renormalize if possible by injecting digits from the remainder (and if this is not possible, because there are no more digits to inject, it means that we have decoded the entire text).
Note that decoding is of course symmetric to encoding; in particular, when encoding we renormalize before consuming a symbol; therefore when decoding we renormalize after emitting a symbol. For example, decoding the final encoding starts from (340, [3]) by extracting the `a’ and computing the new window 68; the window value 68 has underflowed, so renormalization consumes the remaining digit 3, leading to the accumulator (683, []); then decoding proceeds to extract the `b’ and `c’ in turn, returning the accumulator to (100, []).
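Correspondingly, the decoding step with renormalization might look like this (again with placeholder names, and with the model repeated for self-containedness):

```haskell
import Data.List (unfoldr)

count, cumul :: Char -> Integer
count 'a' = 2; count 'b' = 3; count 'c' = 5
cumul 'a' = 0; cumul 'b' = 2; cumul 'c' = 5

total, b, l :: Integer
total = 10; b = 10; l = 100

find :: Integer -> Char
find n | n < 2     = 'a'
       | n < 5     = 'b'
       | otherwise = 'c'

type Accum = (Integer, [Integer])   -- (window, remainder)

-- inject digits from the remainder while the window has underflowed
dnorm :: Accum -> Accum
dnorm (w, d : ds) | w < l = dnorm (w * b + d, ds)
dnorm a                   = a

-- emit one symbol, then renormalize; stop on returning to the start state
ddestA :: Accum -> Maybe (Char, Accum)
ddestA (w, ds)
  | w == l && null ds = Nothing
  | otherwise         = Just (s, dnorm (w', ds))
  where r  = w `mod` total
        s  = find r
        w' = count s * (w `div` total) + r - cumul s
```

Unfolding from (340, [3]) then yields "abc" and stops at (100, []).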
We can again prove that decoding inverts encoding, although we need to use yet another variation of the Unfoldr–Foldr Theorem, hereby called the “Unfoldr–Foldr Theorem With Invariant”. We say that predicate is an invariant of and if
The theorem is that if is such an invariant, and the conditions
hold for all and , then
for all finite lists, provided that the initial state satisfies the invariant. In our case, the invariant is that the window is in range (between the lower and upper bounds), which is indeed maintained by the encoding and decoding steps. Then it is straightforward to verify that
when is in range, making use of a lemma that
when is in range. Therefore decoding inverts encoding:
for all finite .
Trading in sequences
The version of encoding in the previous section yields a , that is, a pair consisting of an integer window and a digit sequence remainder. It would be more conventional for encoding to take a sequence of symbols to a sequence of digits alone, and decoding to take the sequence of digits back to a sequence of symbols. For encoding, we have to flush the remaining digits out of the window at the end of the process, reducing the window to zero:
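The flushing step might be sketched as follows, assuming (as my placeholder representation above does) that digits come out most significant first, with the remainder at the least significant end:

```haskell
b :: Integer
b = 10   -- the digit base, as before

-- flush empties the window into the digit sequence, most significant
-- digit first, reducing the window to zero
flush :: (Integer, [Integer]) -> [Integer]
flush (0, ds) = ds
flush (w, ds) = flush (w `div` b, w `mod` b : ds)
```

So flush (340, [3]) = [3,4,0,3], the digits of 3403.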
For example, . Then we can define
Correspondingly, decoding should start by populating an initially-zero window from the sequence of digits:
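The populating step might be sketched as follows (names again placeholders of my own):

```haskell
b, l :: Integer
b = 10; l = 100   -- base and window lower bound, as before

-- populate an initially-zero window from the front of the digit
-- sequence, leaving the remaining digits as the remainder
start :: [Integer] -> (Integer, [Integer])
start = go 0
  where go w ds | w >= l = (w, ds)
        go w (d : ds)    = go (w * b + d) ds
        go w []          = (w, [])
```

In particular, start [3,4,0,3] = (340, [3]), recovering the accumulator from which the digits were flushed.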
For example, . Then we can define
One can show that
provided that is initially in range, and therefore again decoding inverts encoding:
for all finite .
Note that this is not just a data refinement—it is not the case that yields the digits of the integer computed by . For example, , whereas . We have really changed the effectiveness of the encoding in return for the increased efficiency.
We would now like to stream encoding and decoding as metamorphisms: encoding should start delivering digits before consuming all the symbols, and conversely decoding should start delivering symbols before consuming all the digits.
For encoding, we have
with a fold; can we turn into an unfold? Yes, we can! The remainder in the accumulator should act as a queue of digits: digits get enqueued at the most significant end, so we need to dequeue them from the least significant end. So we define
to “deal out” the elements of a list one by one, starting with the last. Of course, this reverses the list; so we have the slightly awkward
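A sketch of such a dealing coalgebra (a placeholder name of my own, and quadratic as written, but only a stepping stone in the derivation):

```haskell
import Data.List (unfoldr)

-- deal out the elements of a list one by one, starting with the last;
-- as an unfold, this reverses the list
deal :: [a] -> Maybe (a, [a])
deal [] = Nothing
deal xs = Just (last xs, init xs)
```

For example, unfoldr deal "abc" = "cba".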
Now we can use unfold fusion to fuse and into a single unfold, deriving the coalgebra
by considering each of the three cases, getting where
As for the input part, we have a foldr where we need a foldl. Happily, that is easy to achieve, at the cost of an additional reverse, since:
on finite . So we have , where
It turns out that the streaming condition does not hold for and —although we can stream digits out of the remainder:
we cannot stream the last few digits out of the window:
We have to resort to flushing streaming, which starts from an apomorphism
rather than an unfold. Of course, any unfold is trivially an apo, with the trivial flusher that always yields the empty list; but more interestingly, any unfold can be factored into a “cautious” phase (delivering elements only while a predicate holds) followed by a “reckless” phase (discarding the predicate, and delivering the elements anyway)
In particular, the streaming condition may hold for the cautious coalgebra when it doesn’t hold for the reckless coalgebra itself. We’ll use the cautious coalgebra
which carefully avoids the problematic case when the remainder is empty. It is now straightforward to verify that the streaming condition holds for and , and therefore , where
and where streams its output. On the downside, this has to read the input text in reverse, and also write the output digit sequence in reverse.
Fortunately, decoding is rather easier to stream. We have
with an unfold; can we turn into a fold? Yes, we can! In fact, we have , where
That is, starting with zero for the accumulator, digits are injected one by one into the window until this is in range, and thereafter appended to the remainder.
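The fold body described here might be sketched as follows (placeholder names, with the bounds from the running example):

```haskell
b, l :: Integer
b = 10; l = 100   -- base and window lower bound, as before

-- inject a digit into the window if it is still below range,
-- otherwise append it to the remainder
op :: (Integer, [Integer]) -> Integer -> (Integer, [Integer])
op (w, ds) d | w < l     = (w * b + d, ds)
             | otherwise = (w, ds ++ [d])
```

Folding this over the digit sequence [3,4,0,3] from (0, []) rebuilds the accumulator (340, [3]).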
Now we have decoding as an after a , and it is straightforward to verify that the streaming condition holds for and . Therefore on finite inputs, where
and streams. It is perhaps a little more symmetric to move the reverse from the final step of encoding to the initial step of decoding—that is, to have the least significant digit first in the encoded text, rather than the most significant—and then to view both the encoding and the decoding process as reading their inputs from right to left.
Inlining definitions and simplifying, we have ended up with encoding as a flushing stream:
and decoding as an ordinary stream:
Note that the two occurrences of reverse can be taken to mean that encoding and decoding both process their input from last to first. The remainder acts as a queue, with digits being added at one end and removed from the other, so it should be represented in a way that makes that efficient. But in fact, the remainder can be eliminated altogether, yielding simply the window for the accumulator—for encoding:
and for decoding:
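These final window-only programs might be sketched as follows; every name here (and the bounds b = 10, l = 100) is a stand-in of my own for the post’s lost identifiers, with the encoded digit sequence carrying its least significant digit first, as discussed above:

```haskell
count, cumul :: Char -> Integer
count 'a' = 2; count 'b' = 3; count 'c' = 5
cumul 'a' = 0; cumul 'b' = 2; cumul 'c' = 5

total, b, l, u :: Integer
total = 10; b = 10; l = 100; u = b * l

find :: Integer -> Char
find n | n < 2     = 'a'
       | n < 5     = 'b'
       | otherwise = 'c'

econs :: Char -> Integer -> Integer
econs s w = (w `div` count s) * total + cumul s + w `mod` count s

-- encoding: window only, emitting digits as they are shifted out,
-- least significant digit first; the input is read from last to first
encode3 :: String -> [Integer]
encode3 = go l . reverse
  where
    go w []           = flush w
    go w ccs@(c : cs)
      | econs c w < u = go (econs c w) cs
      | otherwise     = w `mod` b : go (w `div` b) ccs
    flush 0 = []
    flush w = w `mod` b : flush (w `div` b)

-- decoding: repopulate the window from the high end of the digit
-- sequence, then emit symbols, injecting digits on underflow
decode3 :: [Integer] -> String
decode3 = fill 0 . reverse
  where
    fill w ds | w >= l = go w ds
    fill w (d : ds)    = fill (w * b + d) ds
    fill _ []          = []
    go w ds
      | w == l && null ds = []
      | otherwise         = s : inject w' ds
      where r  = w `mod` total
            s  = find r
            w' = count s * (w `div` total) + r - cumul s
    inject w (d : ds) | w < l = inject (w * b + d) ds
    inject w ds               = go w ds
```

On the running example, encode3 "abc" = [3,0,4,3], and decode3 inverts it.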
This gives rather simple programs, corresponding to those in Duda’s paper; however, the calculation of this last version from the previous steps currently eludes me.