Patterns in Functional Programming

Richard Bird, 1943-2022

jeremygibbons — Mon, 06 Jun 2022 10:30:18 +0000

My mentor, colleague, and friend Richard Bird died in April 2022 after a long battle with cancer. I wrote an obituary of him for The Guardian, his favoured newspaper; this post is a hybrid of that obituary and a eulogy I delivered at his funeral.

Richard was born in 1943 in London. His parents Kay and Jack were landlords of a series of pubs in South London. Richard attended St Olave’s Grammar School, then matriculated to read mathematics at Cambridge in 1960. After graduation, he had a brief spell working in sales for International Computers and Tabulators, the mainframe computer company that eventually became ICL. He was evidently effective in this role, being as personable as he was; but it involved flying around the country in small planes, which Richard did not enjoy at all. No amount of money or approval could persuade him to stay, so he left after a year to do postgraduate study at the University of London Institute of Computer Science.

Richard met Norma Lapworth at a friend’s 21st birthday party. Norma trained as an ecologist, later becoming a teacher, education advisor, and Ofsted schools inspector. Richard and Norma married in 1967.

After a year lecturing in Vancouver 1971-1972, Richard and Norma returned to the UK for Richard to take up a lectureship at the University of Reading. He was given the time finally to complete his PhD thesis, on Computational Complexity on Register Machines, in 1973. But Reading was a rather linguistically oriented department, which did not suit Richard’s mathematical temperament. So he leapt at the opportunity to join Tony Hoare’s expanding Programming Research Group at Oxford University Computing Laboratory in 1983. He was appointed as a University Lecturer in Computation, with a non-tutorial fellowship at St Cross College, assuming responsibility as Director of the MSc in Computation. He moved to a Tutorial Fellowship at Lincoln College in 1988. Richard stayed at the PRG for the remainder of his career, being promoted to Reader in Computation in 1990 and then to Professor in 1996, serving as Director of OUCL from 1998 to 2003, and finally retiring in 2008 after 25 years.

Richard’s research area was functional programming—an approach to computer programming centred around obeying the conventions of traditional mathematics. To Richard, it was self-evident that programs are mathematical entities, and that you can manipulate them in precisely the same way as you do quadratic equations. No immediate need for consideration of the capabilities of computer hardware; no entanglement in the vagaries of specific programming languages. Just algebra, pure and simple.

His PhD thesis in the early 1970s was already concerned with program transformation: the rules by which one program is shown to be equivalent to another. He wasn’t so interested in the program you actually end up with, so much as the process by which you got there, and how you know you ended up in the right place, and what alternative directions you didn’t follow en route. Specifically, he was interested in the calculus of equivalences with which you could conduct equational reasoning on programs, just as in high-school algebra.

In 1980 he was invited to participate in IFIP Working Group 2.1 on Algorithmic Languages and Calculi. This was the group that had designed the seminal programming language Algol68, and that continues to research into notations and techniques for designing and describing algorithms today. Richard had a very fruitful collaboration in WG2.1 with Lambert Meertens from the CWI at Amsterdam, developing what came to be known as the Bird-Meertens Formalism, or Squiggol to its friends. This was a concise and somewhat cryptic algebraic program notation with a powerful accompanying corpus of theorems, supporting the equational reasoning style that Richard sought.

Richard is known and respected around the world for his elegant writing. He published about 100 scientific papers in his life—not especially prolific for a scientist, but every one of those papers was very well polished. He also wrote or co-wrote seven books, starting with Programs and Machines (1976), which featured among other constructions a machine called NoRMa, the Number-Theoretic Register Machine.

I won’t go through all seven books, much as I would like to; I will just pick out a few highlights. Richard’s second book, Introduction to Functional Programming (1988, with Phil Wadler), set out his vision for teaching functional programming. Many people love this book. Graham Hutton wrote to me to say:

Richard set the standards for elegance and clarity, and the original Bird & Wadler is my all-time favourite book. It still stands up today as one of the best programming books ever written—every page is a gem.

Richard’s next book, Algebra of Programming (1996, with Oege de Moor), was much more research-oriented, extending his Squiggol agenda from functions to relations. Tim Sears tweeted:

This book blew my mind regarding what the relationship could be between math, programming and systems. I never met Richard, but his work has had a profound impact on me.

Richard’s fifth book Pearls of Functional Algorithm Design collected and repolished some of the Functional Pearls he had published over the years. These Pearls emulated Jon Bentley’s famous Programming Pearls column in CACM in the 1980s. Bentley explained:

Just as natural pearls grow from grains of sand that have irritated oysters, these programming pearls have grown from real problems that have irritated programmers. The programs are fun, and they teach important programming techniques and fundamental design principles.

Richard established and was the founding editor of the Functional Pearls column, from the very first issue in 1991 of the Journal of Functional Programming until he retired in 2008. He characterized Functional Pearls as “polished, elegant, instructive, entertaining”. He had high standards for them: he advised reviewers to

stop reading when you get bored, or the material gets too complicated, or too much specialist knowledge is needed, or the writing is bad.

He gave a keynote lecture about Pearls in 2006, where he observed that “most of them need more time in the oyster”.

Functional Pearls are the epitome of Richard’s writing style. Erik Meijer called him “the poet laureate of functional programming”. Greg Fitzgerald tweeted that

When life got hard, I’d read his pearls to settle my mind. His proofs read like poetry.

Ed Kmett commented that Richard’s pearls book

is the only real clear exposition of how to think like a functional programmer that I’ve ever seen

I was honoured to be Richard’s co-author on his final book, Algorithm Design with Haskell, written during his illness and published in 2020. Working with him was always a joy.

Despite the global influence of his publications, Richard always saw teaching as his main duty, and research as something he did for relaxation. Andrew Ker thanked Richard for the way he had taught him as a first-year undergraduate—a way which Andrew has passed on to his own students, and which he now sees them repeating in turn. Andrew wrote:

I’ve come to see that passing on our knowledge to the next generation is the most important thing we do, even more than generating new knowledge through research. And the ideas and attitudes persist through the generations. Your influence and example extends far beyond what you know, and countless students are grateful for it.

My own favourite Richard story is a teaching one. Richard was giving a tutorial about algorithm analysis: how many steps does this program take—for example, something about processing hierarchical tree-shaped data? You’d expect a question about counting steps to have an answer in whole numbers; but in fact, the answer often involves logarithms. One student asked why this was. Richard thought for a moment, then said “logs come from trees”. The students broke out in laughter, and it took Richard a while to work out why. In fact, I think this was Richard’s own favourite Richard story too, because I know that he continued to tell it in tutorials for years afterwards.

I want to close by mentioning Richard’s openness, considerateness, and egalitarianism. The departmental secretaries always looked forward to him coming in, because he treated them with the same friendliness that he treated the most eminent of his colleagues. Several once-overawed students have written to me to express gratitude for his welcome—“no Herr Professor Doktor Doktor here, we are all colleagues”—and for helping them to find their own way.

Richard’s collaborator Lambert Meertens says:

His lectures were a joy to attend, because of their clarity and the entertaining delivery. But I’ll remember him at least as much for his kindness, his generosity, his warm interest in people’s wellbeing—expressed not only in words, but also deeds.

Cezar Ionescu remembers being apprehensive about first meeting Richard at WG2.1—you should never meet your heroes—but says he

was exactly the way I had imagined him to be: very sharp, funny, wise. We talked about opera and Shakespeare, and fought (and won!) a hard battle for tap water with the hotel staff.

My own involvement with Richard started when I applied in 1987 to do a DPhil at Oxford, on a project that turned out to have finished by the time I arrived. I rattled around the Lab for a year, not even seeing my official supervisor. But Richard invited me to join his weekly Problem Solving Club. He took me under his wing, and I have never looked back. He became my DPhil supervisor, then some years later hired me into my current position, and I am privileged to have been able to call him my colleague and my friend. I basically owe him my whole career, and I have no words sufficient to convey my gratitude to him. I will miss him greatly.

The Genuine Sieve of Eratosthenes

jeremygibbons — Mon, 10 May 2021 10:55:17 +0000

The “Sieve of Eratosthenes” is often used as a nice example of the expressive power of streams (infinite lazy lists) and list comprehensions:

That is, takes a stream of candidate primes; the head of this stream is a prime, and the remaining primes are obtained by removing all multiples of from the candidates and sieving what’s left. (It’s also a nice unfold.)

Unfortunately, as Melissa O’Neill observes, this nifty program is not the true Sieve of Eratosthenes! The problem is that for each prime , every remaining candidate is tested for divisibility. O’Neill calls this bogus common version “trial division”, and argues that the Genuine Sieve of Eratosthenes should eliminate every multiple of without reconsidering all the candidates in between. That is, only of the natural numbers are touched when eliminating multiples of 2, less than of the remaining candidates for multiples of 3, and so on. As an additional optimization, it suffices to eliminate multiples of starting with , since by that point all composite numbers with a smaller nontrivial factor will already have been eliminated.

O’Neill’s paper presents a purely functional implementation of the Genuine Sieve of Eratosthenes. The tricky bit is keeping track of all the eliminations when generating an unbounded stream of primes, since obviously you can’t eliminate all the multiples of one prime before producing the next prime. Her solution is to maintain a priority queue of iterators; indeed, the main argument of her paper is that functional programmers are often too quick to use lists, when other data structures such as priority queues might be more appropriate.

O’Neill’s paper was published as a Functional Pearl in the Journal of Functional Programming, when Richard Bird was the handling editor for Pearls. The paper includes an epilogue that presents a purely list-based implementation of the Genuine Sieve, contributed by Bird during the editing process. And the same program pops up in Bird’s textbook Thinking Functionally with Haskell (TFWH). This post is about Bird’s program.

The Genuine Sieve, using lists

Bird’s program appears in §9.2 of TFWH. It deals with lists, but these will be infinite, sorted, duplicate-free streams, and you should think of these as representing infinite sets, in this case sets of natural numbers. In particular, there will be no empty or partial lists, only properly infinite ones.

Now, the prime numbers are what you get by eliminating the composite numbers from the “plural” naturals (those greater than one), and the composite numbers are the proper multiples of the primes—so the program is cleverly cyclic:

We’ll come back in a minute to , which unions a set of sets to a set; but is the obvious implementation of list difference of sorted streams (hence, representing set difference):

y & = (x:\mathit{xs}) \mathbin{\backslash\!\backslash} \mathit{ys} \end{array} " class="latex" />

and generates the multiples of starting with :

Thus, the composites are obtained by merging together the infinite stream of infinite streams . You might think that you could have defined instead , but this doesn’t work: this won’t compute the first prime without first computing some composites, and you can’t compute any composites without at least the first prime, so this definition is unproductive. Somewhat surprisingly, it suffices to “prime the pump” (so to speak) just with 2, and everything else flows freely from there.

Returning to , here is the obvious implementation of , which merges two sorted duplicate-free streams into one (hence, representing set union):

y & = y : \mathit{merge}\;(x:\mathit{xs})\;\mathit{ys} \end{array} " class="latex" />

Then is basically a stream fold with . You might think you could define this simply by , but again this is unproductive. After all, you can’t merge the infinite stream of sorted streams into a single sorted stream, because there is no least element. Instead, we have to exploit the fact that we have a sorted stream of sorted streams; then the binary merge can exploit the fact that the head of the left stream is the head of the result, without even examining the right stream. So, we define:

This program is now productive, and yields the infinite sequence of prime numbers, using the genuine algorithm of Eratosthenes.

Approximation Lemma

Bird uses the cyclic program as an illustration of the Approximation Lemma. This states that for infinite lists , ,

where

Note that is undefined; the function preserves the outermost constructors of a list, but then chops off anything deeper and replaces it with (undefined). So, the lemma states that to prove two infinite lists equal, it suffices to prove equal all their finite approximations.

Then to prove that does indeed produce the prime numbers, it suffices to prove that

for all , where is the th prime (so ). Bird therefore defines

and claims that

To prove the claim, he observes that it suffices for to be well defined at least up to the first composite number greater than , because then delivers enough composite numbers to supply , which will in turn supply , and so on. It is a “non-trivial result in Number Theory” (in fact, a consequence of Bertrand’s Postulate) that ; therefore it suffices that

where is the th composite number (so ) and . Completing the proof is set as Exercise 9.I of TFWH, and Answer 9.I gives a hint about using induction to show that is the result of merging with . (Incidentally, there is a typo in TFWH: both the body of the chapter and the exercise and its solution have “” instead of “”.)

Unfortunately, the hint in Answer 9.I is simply wrong. For example, it implies that (which equals ) could be constructed from (which equals ) and (which equals ); but this means that the 6 and 8 would have to come out of nowhere. Nevertheless, the claim in Exercise 9.I is valid. What should the hint for the proof have been?

Fixing the proof

The problem is made tricky by the cyclicity. Tom Schrijvers suggested to me to break the cycle by defining

so that ; this allows us to consider in isolation from . In fact, I claim (following a suggestion generated by Geraint Jones and Guillaume Boisseau at our Algebra of Programming research group) that if where , then has a well-defined prefix consisting of all the composite numbers such that and is a multiple of some with , in ascending order, then becomes undefined. That’s not so difficult to see from the definition: contains the multiples of starting from ; all these streams get merged; but the result gets undefined (in ) once we pass the head of the last stream. In particular, has a well-defined prefix consisting of all the composites up to , then becomes undefined, as required.

However, I must say that I still find this argument unsatisfactory: I have no equational proof about the behaviour of . In fact, I believe that the Approximation Lemma is unsufficient to provide such a proof: works along the grain of , but against the grain of . I believe we want something instead like an “Approx While Lemma”, where is to as is to :

Thus, retains elements of as long as they satisfy , but becomes undefined as soon as they don’t (or when the list runs out). Then for all sorted, duplicate-free streams and unbounded streams (that is, n}" class="latex" />),

The point is that works conveniently for and and hence for too. I am now working on a proof by equational reasoning of the correctness (especially the productivity) of the Genuine Sieve of Eratosthenes…

How to design co-programs

jeremygibbons — Wed, 21 Nov 2018 16:04:49 +0000

I recently attended the Matthias Felleisen Half-Time Show, a symposium held in Boston on 3rd November in celebration of Matthias’s 60th birthday. I was honoured to be able to give a talk there; this post is a record of what I (wished I had) said.

Matthias Felleisen

Matthias is known for many contributions to the field of Programming Languages. He received the SIGPLAN Programming Languages Achievement Award in 2012, the citation for which states:

He introduced evaluation contexts as a notation for specifying operational semantics, and progress-and-preservation proofs of type safety, both of which are used in scores of research papers each year, often without citation. His other contributions include small-step operational semantics for control and state, A-normal form, delimited continuations, mixin classes and mixin modules, a fully-abstract semantics for Sequential PCF, web programming techniques, higher-order contracts with blame, and static typing for dynamic languages.

Absent from this list, perhaps because it wasn’t brought back into the collective memory until 2013, is a very early presentation of the idea of using effects and handlers for extensible language design.

However, perhaps most prominent among Matthias’s contributions to our field is a long series of projects on teaching introductory programming, from TeachScheme! through Program By Design to a latest incarnation in the form of the How to Design Programs textbook (“HtDP”), co-authored with Robby Findler, Matthew Flatt, and Shriram Krishnamurthi, and now in its second edition. The HtDP book is my focus in this post; I have access only the First Edition, but the text of the Second Edition is online. I make no apologies for using Haskell syntax in my examples, but at least that gave Matthias something to shout at!

Design Recipes

One key aspect of HtDP is the emphasis on design recipes for solving programming tasks. A design recipe is a template for the solution to a problem: a contract for the function, analogous to a type signature (but HtDP takes an untyped approach, so this signature is informal); a statement of purpose; a function header; example inputs and outputs; and a skeleton of the function body. Following the design recipe entails completing the template—filling in a particular contract, etc—then fleshing out the function body from its skeleton, and finally testing the resulting program against the initial examples.

The primary strategy for problem solving in the book is via analysis of the structure of the input. When the input is composite, like a record, the skeleton should name the available fields as likely ingredients of the solution. When the input has “mixed data”, such as a union type, the skeleton should enumerate the alternatives, leading to a case analysis in the solution. When the input is of a recursive type, the skeleton encapsulates structural recursion—a case analysis between the base case and the inductive case, the latter case entailing recursive calls.

So, the design recipe for structural recursion looks like this:

Phase Goal Activity

Data Analysis and Design to formulate a data definition develop a data definition for mixed data with at least two alternatives; one alternative must not refer to the definition; explicitly identify all self-references in the data definition

Contract Purpose and Header to name the function; to specify its classes of input data and its class of output data; to describe its purpose; to formulate a header name the function, the classes of input data, the class of output data, and specify its purpose:

;; name : in1 in2 … –> out

;; to compute … from x1 …

(define (name x1 x2 …) …)

Examples to characterize the input-output relationship via examples create examples of the input-output relationship; make sure there is at least one example per subclass

Template to formulate an outline develop a cond-expression with one clause per alternative; add selector expressions to each clause; annotate the body with natural recursions; Test: the self-references in this template and the data definition match!

Body to define the function formulate a Scheme expression for each simple cond-line; explain for all other cond-clauses what each natural recursion computes according to the purpose statement

Test to discover mistakes (“typos” and logic) apply the function to the inputs of the examples; check that the outputs are as predicted

The motivating example for structural recursion is Insertion Sort: recursing on the tail of a non-empty list and inserting the head into the sorted subresult.

a &= a : \mathit{insert}\;b\;x \end{array} " class="latex" />

A secondary, more advanced, strategy is to use generative recursion, otherwise known as divide-and-conquer. The skeleton in this design recipe incorporates a test for triviality; in the non-trivial cases, it splits the problem into subproblems, recursively solves the subproblems, and assembles the subresults into an overall result. The motivating example for generative recursion is QuickSort (but not the fast in-place version): dividing a non-empty input list into two parts using the head as the pivot, recursively sorting both parts, and concatenating the results with the pivot in the middle.

a ] \end{array} \end{array} " class="latex" />

As far as I can see, no other program structures than structural recursion and generative recursion are considered in HtDP. (Other design recipes are considered, in particular accumulating parameters and imperative features. But these do not determine the gross structure of the resulting program. In fact, I believe that the imperative recipe has been dropped in the Second Edition.)

Co-programs

My thesis is that HtDP has missed an opportunity to reinforce its core message, that data structure determines program structure. Specifically, I believe that the next design recipe to consider after structural recursion, in which the shape of the program is determined by the shape of the input, should be structural corecursion, in which the shape of the program is determined instead by the shape of the output.

More concretely, a function that generates “mixed output”—whether that is a union type, or simply a boolean—might be defined by case analysis over the output. A function that generates a record might be composed of subprograms that generate each of the fields of that record. A function that generates a recursive data structure from some input data might be defined with a case analysis as to whether the result is trivial, and for non-trivial cases with recursive calls to generate substructures of the result. HtDP should present explicit design recipes to address these possibilities, as it does for program structures determined by the input data.

For an example of mixed output, consider a program that may fail, such as division, guarded so as to return an alternative value in that case:

The program performs a case analysis, and of course the analysis depends on the input data; but the analysis is not determined by the structure of the input, only its value. So a better explanation of the program structure is that it is determined by the structure of the output data.

For an example of generating composite output, consider the problem of extracting a date, represented as a record:

from a formatted string. The function is naturally structured to match the output type:

For an example of corecursion, consider the problem of “zipping” together two input lists to a list of pairs—taking and to , and for simplicity let’s say pruning the result to the length of the shorter input. One can again solve the problem by case analysis on the input, but the fact that there are two inputs makes that a bit awkward—whether to do case analysis on one list in favour of the other, or to analyse both in parallel. For example, here is the outcome of case analysis on the first input, followed if that is non-empty by a case analysis on the second input:

Case analysis on both inputs would lead to four cases rather than three, which would not be an improvement. One can instead solve the problem by case analysis on the output—and it is arguably more natural to do so, because there is only one output rather than two. When is the output empty? (When either input is empty.) If it isn’t empty, what is the head of the output? (The pair of input heads.) And from what data is the tail of the output recursively constructed? (The pair of input tails.)

And whereas Insertion Sort is a structural recursion over the input list, inserting elements one by one into a sorted intermediate result, Selection Sort is a structural corecursion towards the output list, repeatedly extracting the minimum remaining element as the next element of the output: When is the output empty? (When the input is empty.) If the output isn’t empty, what is its head? (The minimum of the input.) And from what data is the tail recursively generated? (The input without this minimum element.)

(here, denotes list with the elements of list removed).

I’m not alone in making this assertion. Norman Ramsey wrote a nice paper On Teaching HtDP; his experience led him to the lesson that

Last, and rarely, you could design a function’s template around the introduction form for the result type. When I teach [HtDP] again, I will make my students aware of this decision point in the construction of a function’s template: should they use elimination forms, function composition, or an introduction form? They should use elimination forms usually, function composition sometimes, and an introduction form rarely.

He also elaborates on test coverage:

Check functional examples to be sure every choice of input is represented. Check functional examples to be sure every choice of output is represented. This activity is especially valuable for functions returning Booleans.

(his emphasis). We should pay attention to output data structure as well as to input data structure.

Generative recursion, aka divide-and-conquer

Only once this dual form of program structure has been explored should students be encouraged to move on to generative recursion, because this exploits both structural recursion and structural corecursion. For example, the QuickSort algorithm that is used as the main motivating example is really structured as a corecursion to construct an intermediate tree, followed by structural recursion over that tree to produce the resulting list:

a ] \end{array} \medskip \\ \multicolumn{2}{@{}l}{\mathit{flatten} :: \mathit{Tree} \rightarrow [\mathit{Integer}]} \\ \mathit{flatten}\;\mathit{Empty} & = [\,] \\ \mathit{flatten}\;(\mathit{Node}\;t\;a\;u) &= \mathit{flatten}\;t \mathbin{{+}\!\!\!{+}} [a] \mathbin{{+}\!\!\!{+}} \mathit{flatten}\;u \end{array} " class="latex" />

The structure of both functions and is determined by the structure of the intermediate datatype, the first as structural corecursion and the second as structural recursion. A similar explanation applies to any divide-and-conquer algorithm; for example, MergeSort is another divide-and-conquer sorting algorithm, with the same intermediate tree shape, but this time with a simple splitting phase and all the comparisons in the recombining phase:

(Choosing this particular tree type is a bit clunky for Merge Sort, because of the two calls to required in . It would be neater to use non-empty externally labelled binary trees, with elements at the leaves and none at the branches:

Then you define the main function to work only for non-empty lists, and provide a separate case for sorting the empty list. This clunkiness is a learning opportunity: to realise the problem, come up with a fix (no node labels in the tree), rearrange the furniture accordingly, then replay the development and compare the results.)

Having identified the two parts, structural recursion and structural corecursion, they may be studied separately; separation of concerns is a crucial lesson in introductory programming. Moreover, the parts may be put together in different ways. The divide-and-conquer pattern is known in the MPC community as a hylomorphism, an unfold to generate a call tree followed by a fold to consume that tree. As the QuickSort example suggests, the tree can always be deforested—it is a virtual data structure. But the converse pattern, of a fold from some structured input to some intermediate value, followed by an unfold to a different structured output, is also interesting—you can see this as a change of structured representation, so I called it a metamorphism. One simple application is to convert a number from an input base (a sequence of digits in that base), via an intermediate representation (the represented number), to an output base (a different sequence of digits). More interesting applications include encoding and data compression algorithms, such as arithmetic coding.

Laziness

Although I have used Haskell as a notation, nothing above depends on laziness; it would all work as well in ML or Scheme. It is true that the mathematical structures underlying structural recursion and structural corecursion are prettier when you admit infinite data structures—the final coalgebra of the base functor for lists is the datatype of finite and infinite lists, and without admitting the infinite ones some recursive definitions have no solution. But that sophistication is beyond the scope of introductory programming, and it suffices to restrict attention to finite data structures.

HtDP already stipulates a termination argument in the design recipe for generative recursion; the same kind of argument should be required for structural corecursion (and is easy to make for the sorting examples given above). Of course, structural recursion over finite data structures is necessarily terminating. Laziness is unnecessary for co-programming.

Structured programming

HtDP actually draws on a long tradition on relating data structure and program structure, which is a theme dear to my own heart (and indeed, the motto of this blog). Tony Hoare wrote in his Notes on Data Structuring in 1972:

There are certain close analogies between the methods used for structuring data and the methods for structuring a program which processes that data.

Fred Brooks put it pithily in The Mythical Man-Month in 1975:

Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.

which was modernized by Eric Raymond in his 1997 essay The Cathedral and the Bazaar to:

Show me your code and conceal your data structures, and I shall continue to be mystified. Show me your data structures, and I won’t usually need your code; it’ll be obvious.

HtDP credits Jackson Structured Programming as partial inspiration for the design recipe approach. As Jackson wrote, also in 1975:

The central theme of this book has been the relationship between data and program structures. The data provides a model of the problem environment, and by basing our program structures on data structures we ensure that our programs will be intelligible and easy to maintain.

and

The structure of a program must be based on the structures of all of the data it processes.

(my emphasis). In a retrospective lecture in 2001, he clarified:

program structure should be dictated by the structure of its input and output data streams

—the JSP approach was designed for processing sequential streams of input records to similar streams of output records, and the essence of the approach is to identify the structures of each of the data files (input and output) in terms of sequences, selections, and iterations (that is, as regular languages), to refine them all to a common structure that matches all of them simultaneously, and to use that common data structure as the program structure. So even way back in 1975 it was clear that we need to pay attention to the structure of output data as well as to that of input data.

Ad-hockery

HtDP apologizes that generative recursion (which it identifies with the field of algorithm design)

is much more of an ad hoc activity than the data-driven design of structurally recursive functions. Indeed, it is almost better to call it inventing an algorithm than designing one. Inventing an algorithm requires a new insight—a “eureka”.

It goes on to suggest that mere programmers cannot generally be expected to have such algorithmic insights:

In practice, new complex algorithms are often developed by mathematicians and mathematical computer scientists; programmers, though, must th[o]roughly understand the underlying ideas so that they can invent the simple algorithms on their own and communicate with scientists about the others.

I say that this defeatism is a consequence of not following through on the core message, that data structure determines program structure. QuickSort and MergeSort are not ad hoc. Admittedly, they do require some insight in order to identify structure that is not present in either the input or the output. But having identified that structure, there is no further mystery, and no ad-hockery required.

Folds on lists

jeremygibbons — Tue, 14 Aug 2018 15:05:44 +0000

The previous post turned out to be rather complicated. In this one, I return to much simpler matters: and on lists, and list homomorphisms with an associative binary operator. For simplicity, I’m only going to be discussing finite lists.

Fold right

Recall that is defined as follows:

It satisfies the following universal property, which gives necessary and sufficient conditions for a function to be expressible as a :

As one consequence of the universal property, we get a fusion theorem, which states sufficient conditions for fusing a following function into a :

(on finite lists—infinite lists require an additional strictness condition). Fusion is an equivalence if is surjective. If is not surjective, it’s an equivalence on the range of that fold: the left-hand side implies

only for of the form for some , not for all .

As a second consequence of the universal property (or alternatively, as a consequence of the free theorem of the type of ), we have map–fold fusion:

The code golf “” is a concise if opaque way of writing the binary operator pointfree: .

Fold left

Similarly, is defined as follows:

( is tail recursive, so it is unproductive on an infinite list.)

also enjoys a universal property, although it’s not so well known as that for . Because of the varying accumulating parameter , the universal property entails abstracting that argument on the left in favour of a universal quantification on the right:

(For the proof, see Exercise 16 in my paper with Richard Bird on arithmetic coding.)

From the universal property, it is straightforward to prove a map– fusion theorem:

Note that . There is also a fusion theorem:

This is easy to prove by induction. (I would expect it also to be a consequence of the universal property, but I don’t see how to make that go through.)

Duality theorems

Of course, and are closely related. §3.5.1 of Bird and Wadler’s classic text presents a First Duality Theorem:

when is associative with neutral element ; a more general Second Duality Theorem:

when associates with (that is, ) and (for all , but fixed ); and a Third Duality Theorem:

(again, all three only for finite ).

The First Duality Theorem is a specialization of the Second, when (but curiously, also a slight strengthening: apparently all we require is that , not that both equal ; for example, we still have , even though is not the neutral element for ).

Characterization

“When is a function a fold?” Evidently, if function on lists is injective, then it can be written as a : in that case (at least, classically speaking), has a post-inverse—a function such that —so:

and so, letting , we have

But also evidently, injectivity is not necessary for a function to be a fold; after all, is a fold, but not the least bit injective. Is there a simple condition that is both sufficient and necessary for a function to be a fold? There is! The condition is that lists equivalent under remain equivalent when extended by an element:

We say that is leftwards. (The kernel of a function is a relation on its domain, namely the set of pairs such that ; this condition is equivalent to for every .) Clearly, leftwardsness is necessary: if and , then

Moreover, leftwardsness is sufficient. For, suppose that is leftwards; then pick a function such that, when is in the range of , we get for some such that (and it doesn’t matter what value returns for outside the range of ), and then define

This is a proper definition of , on account of leftwardsness, in the sense that it doesn’t matter which value we pick for , as long as indeed : any other value that also satisfies entails the same outcome for . Intuitively, it is not necessary to completely invert (as we did in the injective case), provided that preserves enough distinctions. For example, for (for which indeed implies ), we could pick . In particular, is obviously in the range of ; then is chosen to be some such that , and so by construction —in other words, acts as a kind of partial inverse of . So we get:

and therefore where .

Monoids and list homomorphisms

A list homomorphism is a fold after a map in a monoid. That is, define

provided that and form a monoid. (One may verify that this condition is sufficient for the equations to completely define the function; moreover, they are almost necessary— and should form a monoid on the range of .) The Haskell library defines an analogous method, whose list instance is

—the binary operator and initial value are determined implicitly by the instance rather than being passed explicitly.

Richard Bird’s Introduction to the Theory of Lists states implicitly what might be called the First Homomorphism Theorem, that any homomorphism consists of a reduction after a map (in fact, a consequence of the free theorem of the type of ):

Explicitly as Lemma 4 (“Specialization”) the same paper states a Second Homomorphism Theorem, that any homomorphism can be evaluated right-to-left or left-to-right, among other orders:

I wrote up the Third Homomorphism Theorem, which is the converse of the Second Homomorphism Theorem: any function that can be written both as a and as a is a homomorphism. I learned this theorem from David Skillicorn, but it was conjectured earlier by Richard Bird and proved by Lambert Meertens during a train journey in the Netherlands—as the story goes, Lambert returned from a bathroom break with the idea for the proof. Formally, for given , if there exists such that

then there also exists and associative such that

Recall that is “leftwards” if

and that the leftwards functions are precisely the s. Dually, is rightwards if

and (as one can show) the rightwards functions are precisely the s. So the Specialization Theorem states that every homomorphism is both leftwards and rightwards, and the Third Homomorphism Theorem states that every function that is both leftwards and rightwards is necessarily a homomorphism. To prove the latter, suppose that is both leftwards and rightwards; then pick a function such that, for in the range of , we get for some such that , and then define

As before, this is a proper definition of : assuming leftwardsness and rightwardsness, the result does not depend on the representatives chosen for . By construction, again satisfies for any , and so we have:

Moreover, one can show that is associative, and its neutral element (at least on the range of ).

For example, sorting a list can obviously be done as a , using insertion sort; by the Third Duality Theorem, it can therefore also be done as a on the reverse of the list; and because the order of the input is irrelevant, the reverse can be omitted. The Third Homomorphism Theorem then implies that there exists an associative binary operator such that . It also gives a specification of , given a suitable partial inverse of —in this case, suffices, because is idempotent. The only characterization of arising directly from the proof involves repeatedly inserting the elements of one argument into the other, which does not exploit sortedness of the first argument. But from this inefficient characterization, one may derive the more efficient implementation that merges two sorted sequences, and hence obtain merge sort overall.

The Trick

Here is another relationship between the directed folds ( and ) and list homomorphisms—any directed fold can be implemented as a homomorphism, followed by a final function application to a starting value:

Roughly speaking, this is because application and composition are related:

(here, is right-associating, loose-binding function application). To be more precise:

This result has many applications. For example, it’s the essence of parallel recognizers for regular languages. Language recognition looks at first like an inherently sequential process; a recognizer over an alphabet can be represented as a finite state machine over states , with a state transition function of type , and such a function is clearly not associative. But by mapping this function over the sequence of symbols, we get a sequence of functions, which should be subject to composition. Composition is of course associative, so can be computed in parallel, in steps on processors. Better yet, each such function can be represented as an array (since is finite), and the composition of any number of such functions takes a fixed amount of space to represent, and a fixed amount of time to apply, so the steps take time.

Similarly for carry-lookahead addition circuits. Binary addition of two -bit numbers with an initial carry-in proceeds from right to left, adding bit by bit and taking care of carries, producing an -bit result and a carry-out; this too appears inherently sequential. But there aren’t many options for the pair of input bits at a particular position: two s will always generate an outgoing carry, two s will always kill an incoming carry, and a and in either order will propagate an incoming carry to an outgoing one. Whether there is a carry-in at a particular bit position is computed by a of the bits to the right of that position, zipped together, starting from the initial carry-in, using the binary operator defined by

Again, applying to a bit-pair makes a bit-to-bit function; these functions are to be composed, and function composition is associative. Better, such functions have small finite representations as arrays (indeed, we need a domain of only three elements, ; and are both left zeroes of composition, and is neutral). Better still, we can compute the carry-in at all positions using a , which for an associative binary operator can also be performed in parallel in steps on processors.

The Trick seems to be related to Cayley’s Theorem, on monoids rather than groups: any monoid is equivalent to a monoid of endomorphisms. That is, corresponding to every monoid , with a set, and an associative binary operator with neutral element , there is another monoid on the carrier of endomorphisms of the form ; the mappings and are both monoid homomorphisms, and are each other’s inverse. (Cayley’s Theorem is what happens to the Yoneda Embedding when specialized to the one-object category representing a monoid.) So any list homomorphism corresponds to a list homomorphism with endomorphisms as the carrier, followed by a single final function application:

Note that takes a list element the endomorphism . The change of representation from to is the same as in The Trick, and is also what underlies Hughes’s novel representation of lists; but I can’t quite put my finger on what these both have to do with Cayley and Yoneda.

Asymmetric Numeral Systems

jeremygibbons — Sat, 04 Aug 2018 14:12:56 +0000

The previous two posts discussed arithmetic coding (AC). In this post I’ll discuss a related but more recent entropy encoding technique called asymmetric numeral systems (ANS), introduced by Jarek Duda and explored in some detail in a series of 12 blog posts by Charles Bloom. ANS combines the effectiveness of AC (achieving approximately the theoretical optimum Shannon entropy) with the efficiency of Huffman coding. On the other hand, whereas AC acts in a first-in first-out manner, ANS acts last-in first-out, in the sense that the decoded text comes out in reverse; this is not a problem in a batched context, but it makes ANS unsuitable for encoding a communications channel, and also makes it difficult to employ adaptive text models.

Arithmetic coding, revisited

Here is a simplified version of arithmetic coding; compared to the previous posts, we use a fixed rather than adaptive model, and rather than ing some arbitrary element of the final interval, we pick explicitly the lower bound of the interval (by , when it is represented as a pair). We suppose a function

that returns a positive count for every symbol, representing predicted frequencies of occurrence, from which it is straightforward to derive functions

that work on a fixed global model, satisfying the central property

For example, an alphabet of three symbols `a’, `b’, `c’, with counts 2, 3, and 5 respectively, could be encoded by the intervals . We have operations on intervals:

that satisfy

Then we have:

For example, with the alphabet and intervals above, the input text “abc” gets encoded symbol by symbol, from left to right (because of the ), to the narrowing sequence of intervals from which we select the lower bound of the final interval. (This version doesn’t stream well, because of the decision to pick the lower bound of the final interval as the result. But it will suffice for our purposes.)

Note now that the symbol encoding can be “fissioned” out of , using map fusion for backwards; moreover, is associative (and its neutral element), so we can replace the with a ; and finally, now using fusion forwards, we can fuse the with the . That is,

and we can fuse the back into the ; so we have , where

Now encoding is a simple and decoding a simple . This means that we can prove that decoding inverts encoding, using the “Unfoldr–Foldr Theorem” stated in the Haskell documentation for (in fact, the only property of stated there!). The theorem states that if

for all and , then

for all finite lists . Actually, that’s too strong for us here, because our decoding always generates an infinite list; a weaker variation (hereby called the “Unfoldr–Foldr Theorem With Junk”) states that if

for all and , then

for all finite lists —that is, the inverts the except for appending some junk to the output. It’s a nice exercise to prove this weaker theorem, by induction.

We check the condition:

Now, we hope to recover the first symbol, ie we hope that :

Fortunately, it is an invariant of the encoding process that —the result of —is in the unit interval, so indeed . Continuing:

as required. Therefore decoding inverts encoding, apart from the fact that it continues to produce junk having decoded all the input; so

for all finite .

From fractions to integers

Whereas AC encodes longer and longer messages as more and more precise fractions, ANS encodes them as larger and larger integers. Given the function on symbols, we can easily derive definitions

such that gives the cumulative counts of all symbols preceding (say, in alphabetical order), the total of all symbol counts, and looks up an integer :

(for example, we might tabulate the function using a scan).

We encode a text as an integer , containing bits of information (all logs in this blog are base 2). The next symbol to encode has probability , and so requires an additional bits; in total, that’s bits. So entropy considerations tell us that, roughly speaking, to incorporate symbol into state we want to map to . Of course, in order to decode, we need to be able to invert this transformation, to extract and from ; this suggests that we should do the division by first:

so that the multiplication by the known value can be undone first:

How do we reconstruct ? Well, there is enough headroom in to add any non-negative value less than without affecting the division; in particular, we can add to , and then we can use on the remainder:

But we have still lost some information from the lower end of through the division, namely ; so we can’t yet reconstruct . Happily, for any , so there is still precisely enough headroom in to add this lost information too without affecting the , allowing us also to reconstruct :

We therefore define

Using similar reasoning as for AC, it is straightforward to show that a decoding step inverts an encoding step:

Therefore decoding inverts encoding, modulo final junk, by the Unfoldr–Foldr Theorem With Junk:

for all finite . For example, with the same alphabet and symbol weights as before, encoding the text “abc” (now from right to left, because of the ) proceeds through the steps , and the last element is the encoding.

In fact, the inversion property holds whatever value we use to start encoding with (since it isn’t used in the proof); in the next section, we start encoding with a certain lower bound rather than . Moreover, is strictly increasing on states strictly greater than zero, and strictly decreasing; which means that the decoding process can stop when it returns to the lower bound. That is, if we pick some 0}" class="latex" /> and define

then the stronger version of the Unfoldr–Foldr Theorem holds, and we have

for all finite , without junk.

Bounded precision

The previous versions all used arbitrary-precision arithmetic, which is expensive. We now change the approach slightly to use only bounded-precision arithmetic. As usual, there is a trade-off between effectiveness (a bigger bound means more accurate approximations to ideal entropies) and efficiency (a smaller bound generally means faster operations). Fortunately, the reasoning does not depend on much about the actual bounds. We will represent the integer accumulator as a pair which we call a :

such that is a list of digits in a given base , and is an integer with (where we define for the upper bound of the range), under the abstraction relation

where

We call the “window” and the “remainder”. For example, with and , the pair represents ; and indeed, for any there is a unique representation satisfying the invariants. According to Duda, should divide into evenly. On a real implementation, and should be powers of two, such that arithmetic on values up to fits within a single machine word.

The encoding step acts on the window in the accumulator using , which risks making it overflow the range; we therefore renormalize with by shifting digits from the window to the remainder until this overflow would no longer happen, before consuming the symbol.

Some calculation, using the fact that divides , allows us to rewrite the guard in :

For example, again encoding the text “abc”, again from right to left, with and , the accumulator starts with , and goes through the states and on consuming the `c’ then the `b’; directly consuming the `a’ would make the window overflow, so we renormalize to ; then it is safe to consume the `a’, leading to the final state .

Decoding is an unfold using the accumulator as state. We repeatedly output a symbol from the window; this may make the window underflow the range, in which case we renormalize if possible by injecting digits from the remainder (and if this is not possible, because there are no more digits to inject, it means that we have decoded the entire text).

Note that decoding is of course symmetric to encoding; in particular, when encoding we renormalize before consuming a symbol; therefore when decoding we renormalize after emitting a symbol. For example, decoding the final encoding starts by computing ; the window value 68 has underflowed, so renormalization consumes the remaining digit 3, leading to the accumulator ; then decoding proceeds to extract the `b’ and `c’ in turn, returning the accumulator to .

We can again prove that decoding inverts encoding, although we need to use yet another variation of the Unfoldr–Foldr Theorem, hereby called the “Unfoldr–Foldr Theorem With Invariant”. We say that predicate is an invariant of and if

The theorem is that if is such an invariant, and the conditions

hold for all and , then

for all finite lists , provided that the initial state satisfies the invariant . In our case, the invariant is that the window is in range (), which is indeed maintained by and . Then it is straightforward to verify that

and

when is in range, making use of a lemma that

when is in range. Therefore decoding inverts encoding:

for all finite .

Trading in sequences

The version of encoding in the previous section yields a , that is, a pair consisting of an integer window and a digit sequence remainder. It would be more conventional for encoding to take a sequence of symbols to a sequence of digits alone, and decoding to take the sequence of digits back to a sequence of symbols. For encoding, we have to flush the remaining digits out of the window at the end of the process, reducing the window to zero:

For example, . Then we can define

Correspondingly, decoding should start by populating an initially-zero window from the sequence of digits:

For example, . Then we can define

One can show that

provided that is initially in range, and therefore again decoding inverts encoding:

for all finite .

Note that this is not just a data refinement—it is not the case that yields the digits of the integer computed by . For example, , whereas . We have really changed the effectiveness of the encoding in return for the increased efficiency.

Streaming encoding

We would now like to stream encoding and decoding as metamorphisms: encoding should start delivering digits before consuming all the symbols, and conversely decoding should start delivering symbols before consuming all the digits.

For encoding, we have

with a fold; can we turn into an unfold? Yes, we can! The remainder in the accumulator should act as a queue of digits: digits get enqueued at the most significant end, so we need to dequeue them from the least significant end. So we define

to “deal out” the elements of a list one by one, starting with the last. Of course, this reverses the list; so we have the slightly awkward

Now we can use unfold fusion to fuse and into a single unfold, deriving the coalgebra

by considering each of the three cases, getting where

As for the input part, we have a where we need a . Happily, that is easy to achieve, at the cost of an additional , since:

on finite . So we have , where

It turns out that the streaming condition does not hold for and —although we can stream digits out of the remainder:

we cannot stream the last few digits out of the window:

We have to resort to flushing streaming, which starts from an apomorphism

rather than an unfold. Of course, any unfold is trivially an apo, with the trivial flusher that always yields the empty list; but more interestingly, any unfold can be factored into a “cautious” phase (delivering elements only while a predicate holds) followed by a “reckless” phase (discarding the predicate, and delivering the elements anyway)

where

In particular, the streaming condition may hold for the cautious coalgebra when it doesn’t hold for the reckless coalgebra itself. We’ll use the cautious coalgebra

which carefully avoids the problematic case when the remainder is empty. It is now straightforward to verify that the streaming condition holds for and , and therefore , where

and where streams its output. On the downside, this has to read the input text in reverse, and also write the output digit sequence in reverse.

Streaming decoding

Fortunately, decoding is rather easier to stream. We have

with an unfold; can we turn into a fold? Yes, we can! In fact, we have , where

That is, starting with for the accumulator, digits are injected one by one into the window until this is in range, and thereafter appended to the remainder.

Now we have decoding as an after a , and it is straightforward to verify that the streaming condition holds for and . Therefore on finite inputs, where

and streams. It is perhaps a little more symmetric to move the from the final step of encoding to the initial step of decoding—that is, to have the least significant digit first in the encoded text, rather than the most significant—and then to view both the encoding and the decoding process as reading their inputs from right to left.

Summing up

Inlining definitions and simplifying, we have ended up with encoding as a flushing stream:

and decoding as an ordinary stream:

Note that the two occurrences of can be taken to mean that encoding and decoding both process their input from last to first. The remainder acts as a queue, with digits being added at one end and removed from the other, so should be represented to make that efficient. But in fact, the remainder can be eliminated altogether, yielding simply the window for the accumulator—for encoding:

and for decoding:

This gives rather simple programs, corresponding to those in Duda’s paper; however, the calculation of this last version from the previous steps currently eludes me.

Streaming Arithmetic Coding

jeremygibbons — Mon, 11 Dec 2017 13:32:35 +0000

In the previous post we saw the basic definitions of arithmetic encoding and decoding, and a proof that decoding does indeed successfully retrieve the input. In this post we go on to show how both encoding and decoding can be turned into streaming processes.

Producing bits

Recall that

Encoding and decoding work together. But they work only in batch mode: encoding computes a fraction, and yields nothing until the last step, and so decoding cannot start until encoding has finished. We really want encoding to yield as the encoded text a list of bits representing the fraction, rather than the fraction itself, so that we can stream the encoded text and the decoding process. To this end, we replace by , where

The obvious definitions have yield the shortest binary expansion of any fraction within , and evaluate this binary expansion. However, we don’t do quite this—it turns out to prevent the streaming condition from holding—and instead arrange for to yield the bit sequence that when extended with a 1 yields the shortest expansion of any fraction within (and indeed, the shortest binary expansion necessarily ends with a 1), and compute the value with this 1 appended.

Thus, if then the binary expansion of any fraction within starts with 0; and similarly, if , the binary expansion starts with 1. Otherwise, the interval straddles ; the shortest binary expansion within is it the expansion of , so we yield the empty bit sequence.

Note that is a hylomorphism, so we have

Moreover, it is clear that yields a finite bit sequence for any non-empty interval (since the interval doubles in width at each step, and the process stops when it includes ); so this equation serves to uniquely define . In other words, is a recursive coalgebra. Then it is a straightforward exercise to prove that ; so although and differ, they are sufficiently similar for our purposes.

Now we redefine encoding to yield a bit sequence rather than a fraction, and decoding correspondingly to consume that bit sequence:

That is, we move the part of from the encoding stage to the decoding stage.

Streaming encoding

Just like , the new version of encoding consumes all of its input before producing any output, so does not work for encoding infinite inputs, nor for streaming execution even on finite inputs. However, it is nearly in the right form to be a metamorphism—a change of representation from lists of symbols to lists of bits. In particular, is associative, and is its unit, so we can replace the with a :

Now that is in the right form, we must check the streaming condition for and . We consider one of the two cases in which is productive, and leave the other as an exercise. When , and assuming , we have:

as required. The last step is a kind of associativity property:

whose proof is left as another exercise. Therefore the streaming condition holds, and we may fuse the with the , defining

which streams the encoding process: the initial bits are output as soon as they are fully determined, even before all the input has been read. Note that and differ, in particular on infinite inputs (the former diverges, whereas the latter does not); but they coincide on finite symbol sequences.

Streaming decoding

Similarly, we want to be able to stream decoding, so that we don’t have to wait for the entire encoded text to arrive before starting decoding. Recall that we have so far

where is an and a . The first obstacle to streaming is that , which we need to be a instead. We have

Of course, is not associative—it doesn’t even have the right type for that. But we can view each bit in the input as a function on the unit interval: bit~0 is represented by the function that focusses into the lower half of the unit interval, and bit~1 by the function that focusses into the upper half. The fold itself composes a sequence of such functions; and since function composition is associative, this can be written equally well as a or a . Having assembled the individual focussers into one composite function, we finally apply it to . (This is in fact an instance of a general trick for turning a into a , or vice versa.) Thus, we have:

where yields either the lower or the upper half of the unit interval:

In fact, not only may the individual bits be seen as focussing functions and on the unit interval, so too may compositions of such functions:

So any such composition is of the form for some interval , and we may represent it concretely by itself, and retrieve the function via :

So we now have

This is almost in the form of a metamorphism, except for the occurrence of the adapter in between the unfold and the fold. It is not straightforward to fuse that adapter with either the fold or the unfold; fortunately, however, we can split it into the composition

of two parts, where

in such a way that the first part fuses with the fold and the second part fuses with the unfold. For fusing the first half of the adapter with the fold, we just need to carry around the additional value with the interval being focussed:

where

For fusing the second half of the adapter with the unfold, let us check the fusion condition. We have (exercise!):

where the is the functorial action for the base functor of the datatype, applying just to the second component of the optional pair. We therefore define

and have

and therefore

Note that the right-hand side will eventually lead to intervals that exceed the unit interval. When , it follows that ; but the unfolding process keeps widening the interval without bound, so it will necessarily eventually exceed the unit bounds. We return to this point shortly.

We have therefore concluded that

Now we need to check the streaming condition for and . Unfortunately, this is never going to hold: is always productive, so will only take production steps and never consume any input. The problem is that is too aggressive, and we need to use the more cautious flushing version of streaming instead. Informally, the streaming process should be productive from a given state only when the whole of interval maps to the same symbol in model , so that however is focussed by subsequent inputs, that symbol cannot be invalidated.

More formally, note that

where

and

That is, the interval is “safe” for model if it is fully included in the encoding of some symbol ; then all elements of decode to . Then, and only then, we may commit to outputting , because no further input bits could lead to a different first output symbol.

Note now that the interval remains bounded by unit interval during the streaming phase, because of the safety check in , although it will still exceed the unit interval during the flushing phase. However, at this point we can undo the fusion we performed earlier, “fissioning” into again: this manipulates rationals rather than intervals, so there is no problem with intervals getting too wide. We therefore have:

Now let us check the streaming condition for and the more cautious . Suppose that is a productive state, so that holds, that is, all of interval is mapped to the same symbol in , and let

so that . Consuming the next input leads to state . This too is a productive state, because for any , and so the whole of the focussed interval is also mapped to the same symbol in the model. In particular, the midpoint of is within interval , and so the first symbol produced from the state after consumption coincides with the symbol produced from the state before consumption. That is,

as required. We can therefore rewrite decoding as a flushing stream computation:

That is, initial symbols are output as soon as they are completely determined, even before all the input bits have been read. This agrees with on finite bit sequences.

Fixed-precision arithmetic

We will leave arithmetic coding at this point. There is actually still quite a bit more arithmetic required—in particular, for competitive performance it is important to use only fixed-precision arithmetic, restricting attention to rationals within the unit interval with denominator for some fixed~. In order to be able to multiply two numerators using 32-bit integer arithmetic without the risk of overflow, we can have at most . Interval narrowing now needs to be approximate, rounding down both endpoints to integer multiples of . Care needs to be taken so that this rounding never makes the two endpoints of an interval coincide. Still, encoding can be written as an instance of . Decoding appears to be more difficult: the approximate arithmetic means that we no longer have interval widening as an exact inverse of narrowing, so the approach above no longer works. Instead, our 2002 lecture notes introduce a “destreaming” operator that simulates and inverts streaming: the decoder works in sympathy with the encoder, performing essentially the same interval arithmetic but doing the opposite conversions. Perhaps I will return to complete that story some time…

Arithmetic Coding

jeremygibbons — Tue, 05 Dec 2017 15:57:52 +0000

This post is about the data compression method called arithmetic coding, by which a text is encoded as a subinterval of the unit interval, which is then represented as a bit sequence. It can often encode more effectively than Huffman encoding, because it doesn’t have the restriction of Huffman that each symbol be encoded as a positive whole number of bits; moreover, it readily accommodates adaptive models of the text, which “learn” about the text being encoded while encoding it. It is based on lecture notes that I wrote in 2002 with Richard Bird, although the presentation here is somewhat simplified; it is another application of streaming. There’s quite a lot to cover, so in this post I’ll just set up the problem by implementing a basic encoder and decoder. In the next post, I’ll show how they can both be streamed. (We won’t get into the intricacies of restricting to fixed-precision arithmetic—perhaps I can cover that in a later post.)

The basic idea behind arithmetic coding is essentially to encode an input text as a subinterval of the unit interval, based on a model of the text symbols that assigns them to a partition of the unit interval into non-empty subintervals. For the purposes of this post, we will deal mostly with half-open intervals, so that the interval contains values such that , where are rationals.

For example, with just two symbols “a” and “b”, and a static model partitioning the unit interval into for “a” and for “b”, the symbols in the input text “aba” successively narrow the unit interval to , and the latter interval is the encoding of the whole input. And in fact, it suffices to pick any single value in this final interval, as long as there is some other way to determine the end of the encoded text (such as the length, or a special end-of-text symbol).

Intervals

We introduce the following basic definitions for intervals:

We’ll write “” for , and “” for .

A crucial operation on intervals is narrowing of one interval by another, where is to as is to the unit interval:

We’ll write “” for . Thus, is “proportionately of the way between and “, and we have

Conversely, we can widen one interval by another:

We’ll write “” for . Note that is inverse to , in the sense

and consequently widening is inverse to narrowing:

Models

We work with inputs consisting of sequences of symbols, which might be characters or some higher-level tokens:

The type then must provide the following operations:

a way to look up a symbol, obtaining the corresponding interval:
conversely, a way to decode a value, retrieving a symbol:
an initial model:
a means to adapt the model on seeing a new symbol:

The central property is that encoding and decoding are inverses, in the following sense:

There are no requirements on and , beyond the latter being a total function.

For example, we might support adaptive coding via a model that counts the occurrences seen so far of each of the symbols, represented as a histogram:

This naive implementation works well enough for small alphabets. One might maintain the histogram in decreasing order of counts, so that the most likely symbols are at the front and are therefore found quickest. For larger alphabets, it is better to maintain the histogram as a binary search tree, ordered alphabetically by symbol, and caching the total counts of every subtree.

Encoding

Now encoding is straightforward to define. The function takes an initial model and a list of symbols, and returns the list of intervals obtained by looking up each symbol in turn, adapting the model at each step:

That is,

We then narrow the unit interval by each of these subintervals, and pick a single value from the resulting interval:

All we require of is that ; then yields a fraction in the unit interval. For example, we might set , where

Decoding

So much for encoding; how do we retrieve the input text? In fact, we can retrieve the first symbol simply by using . Expanding the encoding of a non-empty text, we have:

The proof obligation, left as an exercise, is to show that

which holds when is of the form for some .

Now

and indeed, encoding yields a fraction in the unit interval, so this recovers the first symbol correctly. This is the foothold that allows the decoding process to make progress; having obtained the first symbol using , it can adapt the model in precisely the same way that the encoding process does, then retrieve the second symbol using that adapted model, and so on. The only slightly tricky part is that when decoding an initial value , having obtained the first symbol , decoding should continue on some modified value ; what should the modification be? It turns out that the right thing to do is to scale by the interval associated in the model with symbol , since scaling is the inverse operation to the s that take place during encoding. That is, we define:

(Of course, , by the inverse requirement on models, and so the new scaled value is again within the unit interval.)

Note that decoding yields an infinite list of symbols; the function is always productive. Nevertheless, that infinite list starts with the encoded text, as we shall now verify. Define the round-trip function

Then we have:

From this it follows that indeed the round-trip recovers the initial text, in the sense that yields an infinite sequence that starts with ; in fact,

yielding the original input followed by some junk, the latter obtained by decoding the fraction (the encoding of ) from the final model that results from adapting the initial model to each symbol in in turn. To actually retrieve the input text with no junk suffix, one could transmit the length separately (although that doesn’t sit well with streaming), or append a distinguished end-of-text symbol.

What’s next

So far we have an encoder and a decoder, and a proof that the decoder successfully decodes the encoded text. In the next post, we’ll see how to reimplement both as streaming processes.

The Digits of Pi

jeremygibbons — Thu, 09 Nov 2017 14:55:09 +0000

In the previous post we were introduced to metamorphisms, which consist of an unfold after a fold—typically on lists, and the fold part typically a . A canonical example is the conversion of a fraction from one base to another. For simplicity, let’s consider here only infinite fractions, so we don’t have to deal with the end of the input and flushing the state:

So for example, we can convert an infinite fraction in base 3 to one in base 7 with

where

In this post, we’ll see another number conversion problem, which will deliver the digits of . For more details, see my paper—although the presentation here is simpler now.

Series for pi

Leibniz showed that

From this, using Euler’s convergence-accelerating transformation, one may derive

or equivalently

This can be seen as the number in a funny mixed-radix base , just as the usual decimal expansion

is represented by the number in the fixed-radix base . Computing the decimal digits of is then a matter of conversion from the mixed-radix base to the fixed-radix base.

Conversion from a fixed base

Let’s remind ourselves of how it should work, using a simpler example: conversion from one fixed base to another. We are given an infinite-precision fraction in the unit interval

in base , in which for each digit . We are to convert it to a similar representation

in base , in which for each output digit . The streaming process maintains a state , a pair of rationals; the invariant is that after consuming input digits and producing output digits, we have

so that represents a linear function that should be applied to the value represented by the remaining input.

We can initialize the process with . At each step, we first try to produce another output digit. The remaining input digits represent a value in the unit interval; so if and have the same integer part, then that must be the next output digit, whatever the remaining input digits are. Let be that integer. Now we need to find such that

for any remainder ; then we can increment and set to and the invariant is maintained. A little algebra shows that we should take and .

If and have different integer parts, we cannot yet tell what the next output digit should be, so we must consume the next input digit instead. Now we need to find such that

for any remainder ; then we can increment and set to and the invariant is again maintained. Again, algebraic manipulation leads us to and .

For example, , and the conversion starts as follows:

That is, the initial state is . This state does not yet determine the first output digit, so we consume the first input digit 0 to yield the next state . This state still does not determine the first output, and nor will the next; so we consume the next two input digits 2 and 0, yielding state . This state does determine the next digit: and both start with a 1 in base 7. So we can produce a 1 as the first output digit, yielding state . And so on.

The process tends to converge. Each production step widens the non-empty window by a factor of , so it will eventually contain multiple integers; therefore we cannot produce indefinitely. Each consumption step narrows the window by a factor of , so it will tend towards eventually producing the next output digit. However, this doesn’t always work. For example, consider converting to base 3:

The first output digit is never determined: if the first non-3 in the input is less than 3, the value is less than a third, and the first output digit should be a 0; if the first non-3 is greater than 3, then the value is definitely greater than a third, and it is safe to produce a 1 as the first output digit; but because the input is all 3s, we never get to make this decision. This problem will happen whenever the value being represented has a finite representation in the output base.

Conversion from a mixed base

Let’s return now to computing the digits of . We have the input

which we want to convert to decimal. The streaming process maintains a pair of rationals—but this time representing the linear function , since this time our expression starts with a sum rather than a product. The invariant is similar: after consuming input digits and producing output digits, we have

Note that the output base is fixed at 10; but more importantly, the input digits are all fixed at 2, and it is the input base that varies from digit to digit.

We can initialize the process with . At each step, we first try to produce an output digit. What value might the remaining input

represent? Each of the bases is at least , so it is clear that , where

which has unique solution . Similarly, each of the bases is less than , so it is clear that , where

which has unique solution . So we consider the bounds and ; if these have the same integer part , then that is the next output digit. Now we need to find such that

for any remainder , so we pick and . Then we can increment and set to , and the invariant is maintained.

If the two bounds have different integer parts, we must consume the next input digit instead. Now we need to find such that

for all , so we pick and . Then we can increment and set to , and again the invariant is maintained.

The conversion starts as follows:

Happily, non-termination ceases to be a problem: the value being represented does not have a finite representation in the output base, being irrational.

Code

We can plug these definitions straight into the function above:

where

and

(The s make rational numbers in Haskell, and force the ambiguous fractional type to be rather than .)

In fact, this program can be considerably simplified, by inlining the definitions. In particular, the input digits are all 2, so we need not supply them. Moreover, the component of the state is never used, because we treat each output digit in the same way (in contrast to the input digits); so that may be eliminated. Finally, we can eliminate some of the numeric coercions if we represent the component as a rational in the first place:

Then we have

Metamorphisms

jeremygibbons — Wed, 04 Oct 2017 09:45:14 +0000

It appears that I have insufficient time, or at least insufficient discipline, to contribute to this blog, except when I am on sabbatical. Which I now am… so let’s see if I can do better.

Hylomorphisms

I don’t think I’ve written about them yet in this series—another story, for another day—but hylomorphisms consist of a fold after an unfold. One very simple example is the factorial function: is the product of the predecessors of . The predecessors can be computed with an unfold:

and the product as a fold:

and then factorial is their composition:

Another example is a tree-based sorting algorithm that resembles Hoare’s quicksort: from the input list, grow a binary search tree, as an unfold, and then flatten that tree back to a sorted list, as a fold. This is a divide-and-conquer algorithm; in general, these can be modelled as unfolding a tree of subproblems by repeatedly dividing the problem, then collecting the solution to the original problem by folding together the solutions to subproblems.

An unfold after a fold

This post is about the opposite composition, an unfold after a fold. Some examples:

to reformat a list of lists to a given length;
to sort a list;
to convert a fraction from base to base ;
to encode a text in binary by “arithmetic coding”.

In each of these cases, the first phase is a fold, which consumes some structured representation of a value into an intermediate unstructured format, and the second phase is an unfold, which generates a new structured representation. Their composition effects a change of representation, so we call them metamorphisms.

Hylomorphisms always fuse, and one can deforest the intermediate virtual data structure. For example, one need not construct the intermediate list in the factorial function; since each cell gets constructed in the unfold only to be immediately deconstructed in the fold, one can cut to the chase and go straight to the familiar recursive definition. For the base case, we have:

and for non-zero argument , we have:

In contrast, metamorphisms only fuse under certain conditions. However, when they do fuse, they also allow infinite representations to be processed, as we shall see.

Fusion seems to depend on the fold being tail-recursive; that is, a :

For the unfold phase, we will use the usual list unfold:

We define a metamorphism as their composition:

This transforms input of type to output of type : in the first phase, , it consumes all the input into an intermediate value of type ; in the second phase, , it produces all the output.

Streaming

Under certain conditions, it is possible to fuse these two phases—this time, not in order to eliminate an intermediate data structure (after all, the intermediate type need not be structured), but rather in order to allow some production steps to happen before all the consumption steps are complete.

To that end, we define the function as follows:

This takes the same arguments as . It maintains a current state , and produces an output element when it can; and when it can’t produce, it consumes an input element instead. In more detail, it examines the current state using function , which is like the body of an unfold; this may produce a first element of the result and a new state ; when it yields no element, the next element of the input is consumed using function , which is like the body of a ; and when no input remains either, we are done.

The streaming condition for and is that

Consider a state from which the body of the unfold is productive, yielding some . From here we have two choices: we can either produce the output , move to intermediate state , then consume the next input to yield a final state ; or we can consume first to get the intermediate state , and again try to produce. The streaming condition says that this intermediate state will again be productive, and will yield the same output and the same final state . That is, instead of consuming all the inputs first, and then producing all the outputs, it is possible to produce some of the outputs early, without jeopardizing the overall result. Provided that the streaming condition holds for and , then

for all finite lists .

As a simple example, consider the `buffering’ process , where

Note that , so is just a complicated way of writing as a . But the streaming condition holds for and (as you may check), so may be streamed. Operationally, the streaming version of consumes one list from the input list of lists, then peels off and produces its elements one by one; when they have all been delivered, it consumes the next input list, and so on.

Flushing

The streaming version of is actually rather special, because the production steps can always completely exhaust the intermediate state. In contrast, consider the `regrouping’ example where

from the introduction (here, yields where , with when and otherwise). This transforms an input list of lists into an output list of lists, where each output `chunk’ except perhaps the last has length —if the content doesn’t divide up evenly, then the last chunk is short. One might hope to be able to stream , but it doesn’t quite work with the formulation so far. The problem is that is too aggressive, and will produce short chunks when there is still some input to consume. (Indeed, the streaming condition does not hold for and —why not?) One might try the more cautious producer :

But this never produces a short chunk, and so if the content doesn’t divide up evenly then the last few elements will not be extracted from the intermediate state and will be lost.

We need to combine these two producers somehow: the streaming process should behave cautiously while there is still remaining input, which might influence the next output; but it should then switch to a more aggressive strategy once the input is finished, in order to flush out the contents of the intermediate state. To achieve this, we define a more general flushing stream operator:

This takes an additional argument ; when the cautious producer is unproductive, and there is no remaining input to consume, it uses to flush out the remaining output elements from the state. Clearly, specializing to retrieves the original operator.

The corresponding metamorphism uses an apomorphism in place of the unfold. Define

Then behaves like , except that if and when stops being productive it finishes up by applying to the final state. Similarly, define flushing metamorphisms:

Then we have

for all finite lists if the streaming condition holds for and . In particular,

on finite inputs : the streaming condition does hold for and the more cautious , and once the input has been exhausted, the process can switch to the more aggressive .

Infinite input

The main advantage of streaming is that it can allow the change-of-representation process also to work on infinite inputs. With the plain metamorphism, this is not possible: the will yield no result on an infinite input, and so the will never get started, but the may be able to produce some outputs before having consumed all the inputs. For example, the streaming version of also works for infinite lists, providing that the input does not end with an infinite tail of empty lists. And of course, if the input never runs out, then there is no need ever to switch to the more aggressive flushing phase.

As a more interesting example, consider converting a fraction from base 3 to base 7:

We assume that the input digits are all either 0, 1 or 2, so that the number being represented is in the unit interval.

The fold in is of the wrong kind; but we have also

Here, the intermediate state can be seen as a defunctionalized representation of the function , and applies this function to :

Now there is an extra function between the and the ; but that’s no obstacle, because it fuses with the :

However, the streaming condition does not hold for and . For example,

That is, , but , so it is premature to produce the first digit 2 in base 7 having consumed only the first digit 1 in base 3. The producer is too aggressive; it should be more cautious while input remains that might invalidate a produced digit.

Fortunately, on the assumption that the input digits are all 0, 1, or 2, the unconsumed input—a tail of the original input—again represents a number in the unit interval; so from the state the range of possible unproduced outputs represents a number between and . If these both start with the same digit in base 7, then (and only then) is it safe to produce that digit. So we define

and we have

Now, the streaming condition holds for and (as you may check), and therefore

on finite digit sequences in base 3. Moreover, the streaming program works also on infinite digit sequences, where the original does not.

(Actually, the only way this could possibly produce a finite output in base 7 would be for the input to be all zeroes. Why? If we are happy to rule out this case, we could consider only the case of taking infinite input to infinite output, and not have to worry about reaching the end of the input or flushing the state.)

Breadth-First Traversal

jeremygibbons — Thu, 05 Mar 2015 17:16:26 +0000

Recently Eitan Chatav asked in the Programming Haskell group on Facebook

What is the correct way to write breadth first traversal of a ?

He’s thinking of “traversal” in the sense of the class, and gave a concrete declaration of rose trees:

It’s an excellent question.

Breadth-first enumeration

First, let’s think about breadth-first enumeration of the elements of a tree. This isn’t compositional (a fold); but the related “level-order enumeration”, which gives a list of lists of elements, one list per level, is compositional:

Here, is “long zip with”; it’s similar to , but returns a list as long as its longer argument:

(It’s a nice exercise to define a notion of folds for , and to write as a fold.)

Given , breadth-first enumeration is obtained by concatenation:

Incidentally, this allows trees to be foldable, breadth-first:

Relabelling

Level-order enumeration is invertible, in the sense that you can reconstruct the tree given its shape and its level-order enumeration.

One way to define this is to pass the level-order enumeration around the tree, snipping bits off it as you go. Here is a mutually recursive pair of functions to relabel a tree with a given list of lists, returning also the unused bits of the lists of lists.

Assuming that the given list of lists is “big enough”—ie each list has enough elements for that level of the tree—then the result is well-defined. Then is determined by the equivalence

Here, the of a tree is obtained by discarding its elements:

In particular, if the given list of lists is the level-order of the tree, and so is exactly the right size, then will have no remaining elements, consisting entirely of empty levels:

So we can take a tree apart into its shape and contents, and reconstruct the tree from such data.

Breadth-first traversal

This lets us traverse a tree in breadth-first order, by performing the traversal just on the contents. We separate the tree into shape and contents, perform a list-based traversal, and reconstruct the tree.

This trick of traversal by factoring into shape and contents is explored in my paper Understanding Idiomatic Traversals Backwards and Forwards from Haskell 2013.

Inverting breadth-first enumeration

We’ve seen that level-order enumeration is invertible in a certain sense, and that this means that we perform traversal by factoring into shape and contents then traversing the contents independently of the shape. But the contents we’ve chosen is the level-order enumeration, which is a list of lists. Normally, one thinks of the contents of a data structure as simply being a list, ie obtained by breadth-first enumeration rather than by level-order enumeration. Can we do relabelling from the breadth-first enumeration too? Yes, we can!

There’s a very clever cyclic program for breadth-first relabelling of a tree given only a list, not a list of lists; in particular, breadth-first relabelling a tree with its own breadth-first enumeration gives back the tree you first thought of. In fact, the relabelling function is precisely the same as before! The trick comes in constructing the necessary list of lists:

Note that variable is defined cyclically; informally, the output leftovers on one level also form the input elements to be used for relabelling all the lower levels. Given this definition, we have

for any . This program is essentially due to Geraint Jones, and is derived in an unpublished paper Linear-Time Breadth-First Tree Algorithms: An Exercise in the Arithmetic of Folds and Zips that we wrote together in 1993.

We can use this instead in the definition of breadth-first traversal:

Phase	Goal	Activity
Data Analysis and Design	to formulate a data definition	develop a data definition for mixed data with at least two alternatives; one alternative must not refer to the definition; explicitly identify all self-references in the data definition
Contract Purpose and Header	to name the function; to specify its classes of input data and its class of output data; to describe its purpose; to formulate a header	name the function, the classes of input data, the class of output data, and specify its purpose: ;; name : in1 in2 … –> out ;; to compute … from x1 … (define (name x1 x2 …) …)
Examples	to characterize the input-output relationship via examples	create examples of the input-output relationship; make sure there is at least one example per subclass
Template	to formulate an outline	develop a cond-expression with one clause per alternative; add selector expressions to each clause; annotate the body with natural recursions; Test: the self-references in this template and the data definition match!
Body	to define the function	formulate a Scheme expression for each simple cond-line; explain for all other cond-clauses what each natural recursion computes according to the purpose statement
Test	to discover mistakes (“typos” and logic)	apply the function to the inputs of the examples; check that the outputs are as predicted