This post is about my all-time favourite calculation, of a linear-time algorithm for the maximum segment sum problem, based on Horner’s Rule. The problem was popularized in Jon Bentley’s Programming Pearls series in CACM (and in the subsequent book), but I learnt about it from Richard Bird’s lecture notes on The Theory of Lists and Constructive Functional Programming and his paper Algebraic Identities for Program Calculation, which he was working on around the time I started my DPhil. It seems like I’m not the only one for whom the problem is a favourite, because it has since become a bit of a cliché among program calculators; but that won’t stop me writing about it again.
Maximum segment sum
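The original problem is as follows. Given a list of numbers (say, a possibly empty list of integers), find the largest of the sums of the contiguous segments of that list. In Haskell, this specification could be written like so:
$$\mathit{mss} = \mathit{maximum} \cdot \mathit{map}\;\mathit{sum} \cdot \mathit{segs}$$
where $\mathit{segs}$ computes the contiguous segments of a list:
$$\mathit{segs} = \mathit{concat} \cdot \mathit{map}\;\mathit{inits} \cdot \mathit{tails}$$
and $\mathit{sum}$ computes the sum of a list, and $\mathit{maximum}$ the maximum of a nonempty list:
$$\mathit{sum} = \mathit{foldr}\;(+)\;0 \qquad\qquad \mathit{maximum} = \mathit{foldr1}\;\mathit{max}$$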
This specification is executable, but takes cubic time; the problem is to do better.
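As a concrete rendering of the specification just described, here is a minimal runnable sketch. The names `mss` and `segs` are labels chosen for this sketch, and `sum` and `maximum` are taken from the Prelude rather than redefined.

```haskell
import Data.List (inits, tails)

-- All contiguous segments: every initial segment of every tail segment.
-- The empty segment appears more than once, which is harmless for a maximum.
segs :: [a] -> [[a]]
segs = concat . map inits . tails

-- Executable specification: the largest of the sums of all segments.
-- Cubic time: quadratically many segments, each summed in linear time.
mss :: [Integer] -> Integer
mss = maximum . map sum . segs
```

Because the empty segment is always present, `mss` returns 0 for the empty list and for a list of negative numbers.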
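We can get quite a long way just with standard properties of $\mathit{map}$, $\mathit{concat}$, etc:
$$\begin{array}{ll}
 & \mathit{mss} \\
= & \qquad \{ \mbox{definition of } \mathit{mss} \} \\
 & \mathit{maximum} \cdot \mathit{map}\;\mathit{sum} \cdot \mathit{segs} \\
= & \qquad \{ \mbox{definition of } \mathit{segs} \} \\
 & \mathit{maximum} \cdot \mathit{map}\;\mathit{sum} \cdot \mathit{concat} \cdot \mathit{map}\;\mathit{inits} \cdot \mathit{tails} \\
= & \qquad \{ \mbox{naturality of } \mathit{concat} \} \\
 & \mathit{maximum} \cdot \mathit{concat} \cdot \mathit{map}\;(\mathit{map}\;\mathit{sum}) \cdot \mathit{map}\;\mathit{inits} \cdot \mathit{tails} \\
= & \qquad \{ \mathit{maximum} \cdot \mathit{concat} = \mathit{maximum} \cdot \mathit{map}\;\mathit{maximum} \} \\
 & \mathit{maximum} \cdot \mathit{map}\;\mathit{maximum} \cdot \mathit{map}\;(\mathit{map}\;\mathit{sum}) \cdot \mathit{map}\;\mathit{inits} \cdot \mathit{tails} \\
= & \qquad \{ \mathit{map} \mbox{ preserves composition} \} \\
 & \mathit{maximum} \cdot \mathit{map}\;(\mathit{maximum} \cdot \mathit{map}\;\mathit{sum} \cdot \mathit{inits}) \cdot \mathit{tails}
\end{array}$$
For the final step, if we can write $\mathit{maximum} \cdot \mathit{map}\;\mathit{sum} \cdot \mathit{inits}$ in the form $\mathit{foldr}\;h\;e$, then the $\mathit{map}$ of this can be fused with the $\mathit{tails}$ to yield $\mathit{scanr}\;h\;e$; this observation is known as the Scan Lemma. Moreover, if $h$ takes constant time, then this gives a linear-time algorithm for $\mathit{mss}$.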
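The Scan Lemma mentioned above can be checked on small inputs. The following sketch compares the two sides directly; `foldTails` and `scanTails` are names invented for this illustration.

```haskell
import Data.List (tails)

-- Left-hand side of the Scan Lemma: fold each tail segment separately.
-- This takes quadratic time even when h is a constant-time operation.
foldTails :: (a -> b -> b) -> b -> [a] -> [b]
foldTails h e = map (foldr h e) . tails

-- Right-hand side: one right-to-left scan, reusing each partial result.
-- This takes linear time when h is a constant-time operation.
scanTails :: (a -> b -> b) -> b -> [a] -> [b]
scanTails h e = scanr h e
```

Both functions return one result per tail segment, so the output is one element longer than the input.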
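The crucial observation is based on Horner’s Rule for evaluation of polynomials, which is the first important thing you learn in numerical computing. (I was literally taught it in secondary school, in my sixth-year classes in mathematics.) Here is its familiar form:
$$\sum_{i=0}^{n-1} a_i x^i \;=\; a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1} \;=\; a_0 + x (a_1 + x (a_2 + \cdots + x\,a_{n-1}))$$
but the essence of the rule is about sums of products:
$$\sum_{i=0}^{n} \prod_{j=0}^{i-1} u_j \;=\; 1 + u_0 + u_0 u_1 + \cdots + u_0 u_1 \cdots u_{n-1} \;=\; 1 + u_0 (1 + u_1 (1 + \cdots (1 + u_{n-1}) \cdots))$$
Expressed in Haskell, this is captured by the equation
$$\mathit{sum} \cdot \mathit{map}\;\mathit{product} \cdot \mathit{inits} = \mathit{foldr}\;(\oplus)\;e \quad\mbox{where}\quad e = 1 \;;\; u \oplus z = e + u \times z$$
(where $\mathit{product}$ computes the product of a list of integers).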
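The sum-of-products form of Horner's Rule can be tested directly. In this sketch, `sumOfProducts` is the quadratic specification and `horner` is the single fold; both names are invented for the illustration.

```haskell
import Data.List (inits)

-- Specification: the sum of the products of all initial segments.
sumOfProducts :: [Integer] -> Integer
sumOfProducts = sum . map product . inits

-- Horner's Rule: the same value as a single right fold,
-- with unit 1 and step u (+) z = 1 + u * z.
horner :: [Integer] -> Integer
horner = foldr (\u z -> 1 + u * z) 1
```

For the list `[2, 3, 4]` the initial segments have products 1, 2, 6 and 24, and both functions return 33.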
But Horner’s Rule is not restricted to sums and products; the essential properties are that addition and multiplication are associative, that multiplication has a unit, and that multiplication distributes over addition. This is the algebraic structure of a semiring (but without needing commutativity and a unit of addition, or that that unit is a zero of multiplication). In particular, the so-called tropical semiring on the integers, in which “addition” is binary $\mathit{max}$ and “multiplication” is integer addition, satisfies the requirements. So for the maximum segment sum problem, we get
$$\mathit{maximum} \cdot \mathit{map}\;\mathit{sum} \cdot \mathit{inits} = \mathit{foldr}\;(\oplus)\;e \quad\mbox{where}\quad e = 0 \;;\; u \oplus z = e \mathbin{\mathit{max}} (u + z)$$
Moreover, $\oplus$ takes constant time, so this gives a linear-time algorithm for $\mathit{mss}$.
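Putting the Scan Lemma and the tropical instance of Horner's Rule together gives the familiar linear-time program. This is a sketch under the assumptions used so far (integer inputs, and the empty segment counting as a segment with sum 0); `mssLinear` and `step` are names chosen here.

```haskell
-- Linear-time maximum segment sum: one right-to-left scan, then a maximum.
mssLinear :: [Integer] -> Integer
mssLinear = maximum . scanr step 0
  where
    -- Best sum of an initial segment of the tail starting at u,
    -- given the best sum z for the tail that follows it.
    step :: Integer -> Integer -> Integer
    step u z = max 0 (u + z)
```

The scan never produces an empty list, so `maximum` is safe even when the input is empty.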
Tail segments, datatype-generically
About a decade after the initial “theory of lists” work on the maximum segment sum problem, Richard Bird (with Oege de Moor and Paul Hoogendijk) came up with a datatype-generic version of the problem in the paper Generic functional programming with types and relations. It’s clear what “maximum” and “sum” mean generically, but not so clear what “segment” means for nonlinear datatypes; the point of their paper is basically to resolve that issue.
Recalling the definition of $\mathit{segs}$ in terms of $\mathit{inits}$ and $\mathit{tails}$, we see that it would suffice to develop datatype-generic notions of “initial segment” and “tail segment”. One fruitful perspective is given in Bird & co’s paper: a “tail segment” of a cons list is just a subterm of that list, and an “initial segment” is the list but with some tail (that is, some subterm) replaced with the empty structure.
So, representing a generic “tail” of a data structure is easy: it’s a data structure of the same type, and a subterm of the term denoting the original structure. A datatype-generic definition of $\mathit{tails}$ is a little trickier, though. For lists, you can see it as follows: every node of the original list is labelled with the subterm of the original list rooted at that node. I find this a helpful observation, because it explains why the $\mathit{tails}$ of a list is one element longer than the list itself: a list with $n$ elements has $n+1$ nodes ($n$ conses and a nil), and each of those nodes gets labelled with one of the $n+1$ subterms of the list. Indeed, $\mathit{tails}$ ought morally to take a possibly empty list and return a non-empty list of possibly empty lists: there are two different datatypes involved. Similarly, if one wants the “tails” of a data structure of a type in which some nodes have no labels (such as leaf-labelled trees, or indeed such as the “nil” constructor of lists), one needs a variant of the datatype providing labels at those positions. Also, for a data structure in which some nodes have multiple labels, or in which there are different types of labels, one needs a variant for which every node has precisely one label.
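Bird & co call this the labelled variant of the original datatype; if the original is a polymorphic datatype $\mathsf{T}\,\alpha = \mu(\mathsf{F}\,\alpha)$ for some binary shape functor $\mathsf{F}$, then the labelled variant is $\mathsf{L}\,\alpha = \mu(\mathsf{G}\,\alpha)$ where $\mathsf{G}\,\alpha\,\beta = \alpha \times \mathsf{F}\,1\,\beta$; whatever $\alpha$-labels $\mathsf{F}$ may or may not have specified are ignored, and precisely one $\alpha$-label per node is provided. Given this insight, it is straightforward to define a datatype-generic variant $\mathit{subterms}$ of the $\mathit{tails}$ function:
$$\mathit{subterms}_{\mathsf{F}} = \mathit{fold}_{\mathsf{F}}(\mathit{in}_{\mathsf{G}} \cdot \mathit{fork}(\mathit{in}_{\mathsf{F}} \cdot \mathsf{F}\,\mathit{id}\,\mathit{root}, \mathsf{F}\,!\,\mathit{id})) :: \mathsf{T}\,\alpha \to \mathsf{L}(\mathsf{T}\,\alpha)$$
where $\mathit{root} :: \mathsf{L}\,\alpha \to \alpha$ returns the root label of a labelled data structure, and $! :: \alpha \to 1$ is the unique arrow to the unit type. (Informally, having computed the tree of subterms for each child of a node, we make the tree of subterms for this node by assembling all the child trees with the label for this node; the label should be the whole structure rooted at this node, which can be reconstructed from the roots of the child trees.) What’s more, there’s a datatype-generic scan lemma too:
$$\begin{array}{lcl}
\mathit{scan}_{\mathsf{F}} & :: & (\mathsf{F}\,\alpha\,\beta \to \beta) \to \mathsf{T}\,\alpha \to \mathsf{L}\,\beta \\
\mathit{scan}_{\mathsf{F}}\,\phi & = & \mathsf{L}(\mathit{fold}_{\mathsf{F}}\,\phi) \cdot \mathit{subterms}_{\mathsf{F}} \\
 & = & \mathit{fold}_{\mathsf{F}}(\mathit{in}_{\mathsf{G}} \cdot \mathit{fork}(\phi \cdot \mathsf{F}\,\mathit{id}\,\mathit{root}, \mathsf{F}\,!\,\mathit{id}))
\end{array}$$
(Again, the label for each node can be constructed from the root labels of each of the child trees.) In fact, $\mathit{subterms}$ and $\mathit{scan}$ are paramorphisms, and can also be nicely written coinductively as well as inductively. I’ll return to this in a future post.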
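To make the labelled variant concrete, here is a sketch specialised to one datatype rather than written generically. It assumes node-labelled binary trees as the original datatype; `Tree`, `LTree`, `foldTree`, `scanTree` and `example` are all names invented for this illustration.

```haskell
-- Original datatype: binary trees with labels at internal nodes only.
data Tree a = Leaf | Node a (Tree a) (Tree a)
  deriving (Eq, Show)

-- Labelled variant: every node, leaves included, has exactly one label.
data LTree b = LLeaf b | LNode b (LTree b) (LTree b)
  deriving (Eq, Show)

instance Functor LTree where
  fmap g (LLeaf b)     = LLeaf (g b)
  fmap g (LNode b l r) = LNode (g b) (fmap g l) (fmap g r)

-- The root label of a labelled tree.
root :: LTree b -> b
root (LLeaf b)     = b
root (LNode b _ _) = b

-- Label every node with the subterm rooted at that node; the label is
-- rebuilt from the root labels of the already-computed child trees.
subterms :: Tree a -> LTree (Tree a)
subterms Leaf = LLeaf Leaf
subterms (Node a l r) = LNode (Node a (root l') (root r')) l' r'
  where l' = subterms l
        r' = subterms r

-- The fold for Tree.
foldTree :: b -> (a -> b -> b -> b) -> Tree a -> b
foldTree e _ Leaf         = e
foldTree e f (Node a l r) = f a (foldTree e f l) (foldTree e f r)

-- Scan: label every node with the fold of its subterm, in a single pass.
scanTree :: b -> (a -> b -> b -> b) -> Tree a -> LTree b
scanTree e _ Leaf = LLeaf e
scanTree e f (Node a l r) = LNode (f a (root l') (root r')) l' r'
  where l' = scanTree e f l
        r' = scanTree e f r

example :: Tree Integer
example = Node 1 (Node 2 Leaf Leaf) Leaf
```

The scan lemma for this datatype says that `scanTree e f` agrees with `fmap (foldTree e f) . subterms`, but without recomputing the fold at every node.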
Initial segments, datatype-generically
What about a datatype-generic “initial segment”? As suggested above, that’s obtained from the original data structure by replacing some subterms with the empty structure. Here I think Bird & co sell themselves a little short, because they insist that the datatype $\mathsf{T}$ supports empty structures, which is to say, that its shape functor $\mathsf{F}$ is of the form $\mathsf{F}\,\alpha\,\beta = 1 + \mathsf{F}'\,\alpha\,\beta$ for some $\mathsf{F}'$. This isn’t necessary: for an arbitrary $\mathsf{F}$, we can easily manufacture the appropriate datatype $\mathsf{U}$ of “data structures in which some subterms may be replaced by empty”, by defining $\mathsf{U}\,\alpha = \mu(\mathsf{H}\,\alpha)$ and $\mathsf{H}\,\alpha\,\beta = 1 + \mathsf{F}\,\alpha\,\beta$.
As with $\mathit{tails}$, the datatype-generic version of $\mathit{inits}$ is a bit trickier, and this time the special case of lists is misleading. You might think that because a list has just as many initial segments as it does tail segments, the labelled variant ought to suffice just as well here too. But this doesn’t work for non-linear data structures such as trees: in general, there are many more “initial” segments than “tail” segments (because one can make independent choices about replacing subterms with the empty structure in each child), and they don’t align themselves conveniently with the nodes of the original structure.
The approach I prefer here is just to use an unstructured collection type to hold the “initial segments”; that is, a monad. This could be the monad of finite lists, or of finite sets, or of finite bags; we will defer until later the discussion about precisely which, and write simply $\mathsf{M}$. We require only that it provide a $\mathit{MonadPlus}$-like interface, in the sense of an operator $\uplus :: \mathsf{M}\,\alpha \times \mathsf{M}\,\alpha \to \mathsf{M}\,\alpha$; however, for reasons that will become clear, we will expect that it does not provide an $\mathit{mzero}$ operator yielding empty collections.
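Now we can think of the datatype-generic version of $\mathit{inits}$ as nondeterministically pruning a data structure by arbitrarily replacing some subterms with the empty structure; or equivalently, as generating the collection of all such prunings.
$$\mathit{prune} = \mathit{fold}_{\mathsf{F}}(\mathsf{M}\,\mathit{in}_{\mathsf{H}} \cdot \mathit{opt}\,\mathit{Nothing} \cdot \mathsf{M}\,\mathit{Just} \cdot \delta) :: \mu(\mathsf{F}\,\alpha) \to \mathsf{M}(\mu(\mathsf{H}\,\alpha))$$
Here, $\mathsf{H}\,\alpha\,\beta = 1 + \mathsf{F}\,\alpha\,\beta$ is the shape of structures with possibly missing subterms, and $\mathit{opt}$ supplies a new alternative for a nondeterministic computation:
$$\mathit{opt}\,a\,x = \mathit{return}\,a \uplus x$$
and $\delta :: \mathsf{F}\,\alpha\,(\mathsf{M}\,\beta) \to \mathsf{M}(\mathsf{F}\,\alpha\,\beta)$ distributes the shape functor $\mathsf{F}$ over the monad $\mathsf{M}$ (which can be defined for all traversable functors $\mathsf{F}\,\alpha$). Informally, once you have computed all possible ways of pruning each of the children of a node, a pruning of the node itself is formed either as some node assembled from arbitrarily pruned children, or for the empty structure.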
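The pruning function can be sketched for one concrete datatype. This version assumes node-labelled binary trees and uses ordinary Haskell lists (always non-empty here) as the collection monad; `Tree`, `Pruned`, `Cut`, `opt` and `example` are names invented for the sketch.

```haskell
-- Original datatype: binary trees with labels at internal nodes.
data Tree a = Leaf | Node a (Tree a) (Tree a)
  deriving (Eq, Show)

-- The same shape with one extra constructor, Cut, standing for
-- "a subterm was replaced by the empty structure here".
data Pruned a = Cut | PLeaf | PNode a (Pruned a) (Pruned a)
  deriving (Eq, Show)

-- Add one more alternative to a collection of alternatives.
opt :: a -> [a] -> [a]
opt a x = [a] ++ x

-- All prunings of a tree. The list comprehension plays the role of the
-- distributor: it combines every pruning of the left child with every
-- pruning of the right child.
prune :: Tree a -> [Pruned a]
prune Leaf         = opt Cut [PLeaf]
prune (Node a l r) = opt Cut [PNode a l' r' | l' <- prune l, r' <- prune r]

example :: Tree Integer
example = Node 1 (Node 2 Leaf Leaf) Leaf
```

A leaf has 2 prunings, a node with two leaf children has 1 + 2 × 2 = 5, and `example` has 1 + 5 × 2 = 11, which illustrates how quickly prunings outnumber subterms.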
Horner’s Rule, datatype-generically
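As we’ve seen, the essential property behind Horner’s Rule is one of distributivity. In the datatype-generic case, we will model this as follows. We are given an $(\mathsf{F}\,\alpha)$-algebra $f :: \mathsf{F}\,\alpha\,\beta \to \beta$, and an $\mathsf{M}$-algebra $k :: \mathsf{M}\,\beta \to \beta$; you might think of these as “datatype-generic product” and “collection sum”, respectively. Then there are two different methods of computing a $\beta$ result from an $\mathsf{F}\,\alpha\,(\mathsf{M}\,\beta)$ structure: we can either distribute the $\mathsf{F}\,\alpha$ structure over the collection(s) of $\beta$s, compute the “product” $f$ of each structure, and then compute the “sum” $k$ of the resulting products; or we can “sum” each collection, then compute the “product” of the resulting structure. Distributivity of “product” over “sum” is the property that these two different methods agree, that is, $k \cdot \mathsf{M}\,f \cdot \delta = f \cdot \mathsf{F}\,\mathit{id}\,k$, where $\delta :: \mathsf{F}\,\alpha\,(\mathsf{M}\,\beta) \to \mathsf{M}(\mathsf{F}\,\alpha\,\beta)$ is the distributor, as illustrated in the following diagram.
For example, with $f :: \mathsf{F}\,\mathbb{Z}\,\mathbb{Z} \to \mathbb{Z}$ adding all the integers in an $\mathsf{F}$-structure, and $k :: \mathsf{M}\,\mathbb{Z} \to \mathbb{Z}$ finding the maximum of a (non-empty) collection, the diagram commutes. (To match up with the rest of the story, we have presented distributivity in terms of a bifunctor $\mathsf{F}$, although the first parameter $\alpha$ plays no role. We could just as well have used a unary functor, dropping the $\alpha$, and changing the distributor to $\delta :: \mathsf{F}\,\mathsf{M} \to \mathsf{M}\,\mathsf{F}$.)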
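The distributivity condition can be spelled out for a single node of a binary tree. This is a sketch in which the “product” adds a node label to the results for its two children and the “sum” is the maximum of a non-empty list; `addNode`, `sumThenProduct` and `productThenSum` are hypothetical names.

```haskell
-- "Product" for one binary node: add the label to both child results.
addNode :: Integer -> Integer -> Integer -> Integer
addNode a l r = a + l + r

-- Route 1: "sum" each child collection first, then take the "product".
sumThenProduct :: Integer -> [Integer] -> [Integer] -> Integer
sumThenProduct a xs ys = addNode a (maximum xs) (maximum ys)

-- Route 2: distribute the node over both collections, take the "product"
-- of every combination, then "sum" the results.
productThenSum :: Integer -> [Integer] -> [Integer] -> Integer
productThenSum a xs ys = maximum [addNode a x y | x <- xs, y <- ys]
```

The two routes agree for non-empty `xs` and `ys` because addition is monotonic in each argument, which is exactly the statement that addition distributes over maximum.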
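Note that the “sum” $k$ is required to be an algebra for the monad $\mathsf{M}$. This means that it is not only an algebra for $\mathsf{M}$ as a functor (namely, of type $\mathsf{M}\,\beta \to \beta$), but also it should respect the extra structure of the monad: $k \cdot \mathit{return} = \mathit{id}$ and $k \cdot \mathit{join} = k \cdot \mathsf{M}\,k$. For the special case of monads for associative collections (such as lists, bags, and sets), and in homage to the old Squiggol papers, we will stick to reductions, that is, $k$s of the form $\oplus/$ for an associative binary operator $\oplus :: \beta \times \beta \to \beta$; then we also have distribution over choice: $\oplus/(x \uplus y) = (\oplus/x) \oplus (\oplus/y)$. Note also that we prohibited empty collections in $\mathsf{M}$, so we do not need a unit for $\oplus$.
Recall that we modelled an “initial segment” of a structure of type $\mu(\mathsf{F}\,\alpha)$ as being of type $\mu(\mathsf{H}\,\alpha)$, where $\mathsf{H}\,\alpha\,\beta = 1 + \mathsf{F}\,\alpha\,\beta$. We need to generalize the “product” $f :: \mathsf{F}\,\alpha\,\beta \to \beta$ to work on this extended structure, which is to say, we need to specify the value $b$ of the “product” of the empty structure too. Then we let $g = \mathit{maybe}\;b\;f :: \mathsf{H}\,\alpha\,\beta \to \beta$, so that $\mathit{fold}_{\mathsf{H}}(g) :: \mu(\mathsf{H}\,\alpha) \to \beta$.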
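The datatype-generic version of Horner’s Rule is then about computing the “sum” of the “products” of each of the “initial segments” of a data structure:
$$\oplus/ \cdot \mathsf{M}(\mathit{fold}_{\mathsf{H}}(g)) \cdot \mathit{prune}$$
Here $g$ is the “product” extended to return the chosen value $b$ on the empty structure. We will use fold fusion to show that this can be computed as a single fold, namely $\mathit{fold}_{\mathsf{F}}((b\,\oplus) \cdot f)$, given the necessary distributivity property.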
(Sadly, I have to break this calculation in two to get it through WordPress’s somewhat fragile LaTeX processor… where were we? Ah, yes:)
(Curiously, it doesn’t seem to matter what value is chosen for $b$, the “product” of the empty structure.)
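The generic rule can be tested on a concrete instance. The sketch below assumes node-labelled binary trees, lists as the collection type, addition as the “product” and maximum as the “sum”; the parameter `b` is the value assigned to a cut-off subterm, and all names (`Tree`, `Pruned`, `total`, `spec`, `hornerTree`, `example`) are invented for the illustration.

```haskell
data Tree a = Leaf | Node a (Tree a) (Tree a)

-- Trees in which some subterms have been cut off.
data Pruned a = Cut | PLeaf | PNode a (Pruned a) (Pruned a)

-- All prunings of a tree (always a non-empty list).
prune :: Tree a -> [Pruned a]
prune Leaf         = [Cut, PLeaf]
prune (Node a l r) = Cut : [PNode a l' r' | l' <- prune l, r' <- prune r]

-- "Product" of a pruned tree: the sum of its labels, with b for each cut.
total :: Integer -> Pruned Integer -> Integer
total b Cut           = b
total _ PLeaf         = 0
total b (PNode a l r) = a + total b l + total b r

-- Specification: the "sum" (maximum) of the "products" of all prunings.
-- The number of prunings can grow exponentially with the size of the tree.
spec :: Integer -> Tree Integer -> Integer
spec b = maximum . map (total b) . prune

-- Horner's Rule: the same value as a single fold over the original tree.
hornerTree :: Integer -> Tree Integer -> Integer
hornerTree b Leaf         = max b 0
hornerTree b (Node a l r) = max b (a + hornerTree b l + hornerTree b r)

example :: Tree Integer
example = Node 1 (Node (-2) Leaf Leaf) (Node 3 Leaf Leaf)
```

On `example` the two functions agree both for `b = 0` and for `b = -5`, which is consistent with the remark that the choice of `b` does not affect the validity of the rule.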
Maximum segment sum, datatype-generically
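We’re nearly there. We start with the traversable shape bifunctor $\mathsf{F}$, a collection monad $\mathsf{M}$, and a distributive law $\delta :: \mathsf{F}\,\alpha\,(\mathsf{M}\,\beta) \to \mathsf{M}(\mathsf{F}\,\alpha\,\beta)$. We are given an $(\mathsf{F}\,\alpha)$-algebra $f$, an additional element $b$, and an $\mathsf{M}$-algebra $\oplus/$, such that $f$ and $\oplus$ take constant time and $f$ distributes over $\oplus/$ in the sense above. Then
$$\mathit{mss} = \oplus/ \cdot \mathsf{M}(\mathit{fold}_{\mathsf{H}}(g)) \cdot \mathit{segs}$$
can be computed in linear time, where $g$ extends $f$ by returning $b$ on the empty structure,
$$\mathit{segs} = \mathit{join} \cdot \mathsf{M}\,\mathit{prune} \cdot \mathit{contents}_{\mathsf{L}} \cdot \mathit{subterms}$$
and $\mathit{contents}_{\mathsf{L}} :: \mathsf{L}\,\alpha \to \mathsf{M}\,\alpha$ computes the contents of an $\mathsf{L}$-structure (which, like $\delta$, can be defined using the traversability of $\mathsf{F}$). Here’s the calculation:
The scan can be computed in linear time, because its body takes constant time; moreover, the “sum” $\oplus/$ and $\mathit{contents}_{\mathsf{L}}$ can also be computed in linear time (and what’s more, they can be fused into a single pass).
For example, with $f :: \mathsf{F}\,\mathbb{Z}\,\mathbb{Z} \to \mathbb{Z}$ adding all the integers in an $\mathsf{F}$-structure, $b = 0$, and $\oplus$ returning the greater of two integers, we get a datatype-generic version of the linear-time maximum segment sum algorithm.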
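For one concrete shape, the resulting linear-time algorithm might look as follows. This sketch assumes node-labelled binary trees, takes a “segment” to be a subterm with some of its own subterms cut off, and fuses the scan with the final maximum; `mssTree` and `go` are names chosen here.

```haskell
data Tree a = Leaf | Node a (Tree a) (Tree a)

-- Maximum segment sum for binary trees, in one pass over the tree.
mssTree :: Tree Integer -> Integer
mssTree = snd . go
  where
    -- go returns a pair:
    --   first component:  the best sum of a pruning rooted at this node
    --                     (the label the scan would put here);
    --   second component: the best such label anywhere in this subtree.
    go :: Tree Integer -> (Integer, Integer)
    go Leaf = (0, 0)
    go (Node a l r) = (here, maximum [here, bestL, bestR])
      where
        (hereL, bestL) = go l
        (hereR, bestR) = go r
        here           = max 0 (a + hereL + hereR)
```

Each node is visited once and does a constant amount of work, so the whole computation is linear in the size of the tree.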
Monads versus relations
As the title of their paper suggests, Bird & co carried out their development using the relational approach set out in the Algebra of Programming book; for example, their version of $\mathit{prune}$ is a relation between data structures and their prunings, rather than being a function that takes a structure and returns the collection of all its prunings. There’s a well-known isomorphism between relations and set-valued functions, so their relational approach looks roughly equivalent to the monadic one I’ve taken.
I’ve known their paper well for over a decade (I made extensive use of the “labelled variant” construction in my own papers on generic downwards accumulations), but I’ve only just noticed that although they discuss the maximum segment sum problem, they don’t discuss problems based on other semirings, such as the obvious one of integers with addition and multiplication—which is, after all, the origin of Horner’s Rule. Why not? It turns out that the relational approach doesn’t work in that case!
There’s a hidden condition in the calculation, which relates back to our earlier comment about which collection monad (finite sets, finite bags, lists, etc) to use. When $\mathsf{M}$ is the set monad, distribution over choice ($\oplus/(x \uplus y) = (\oplus/x) \oplus (\oplus/y)$), and consequently the condition $\oplus/ \cdot \mathit{opt}\,b = (b\,\oplus) \cdot \oplus/$ that we used in proving Horner’s Rule, require $\oplus$ to be idempotent, because $\uplus$ itself is idempotent; but addition is not idempotent. For the same reason, the distributivity property does not hold for addition with the set monad. But everything does work out for the bag monad, for which $\uplus$ is not idempotent. The bag monad models a flavour of nondeterminism in which multiplicity of results matters, as it does for the sum-of-products instance of the problem, when two copies of the same segment should be treated differently from just one copy. Similarly, if the order of results matters (if, for example, we were looking for the “first” solution) then we would have to use the list monad rather than bags or sets. Seen from a monadic perspective, the relational approach is programming with just one monad, namely the set monad; if that monad doesn’t capture your effects faithfully, you’re stuck.
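The failure for sets can be seen on a very small example. This sketch models a set crudely by removing duplicates from a list with `nub`, which is an assumption made only to keep the example within the base library; `bagSpec`, `setSpec` and `horner` are names invented here.

```haskell
import Data.List (inits, nub)

-- Horner's Rule for ordinary sums of products.
horner :: [Integer] -> Integer
horner = foldr (\u z -> 1 + u * z) 1

-- Bag-like (or list-like) collection: equal products are all counted.
bagSpec :: [Integer] -> Integer
bagSpec = sum . map product . inits

-- Set-like collection: equal products collapse into one before summing.
setSpec :: [Integer] -> Integer
setSpec = sum . nub . map product . inits
```

For the input `[1, 1]` every initial segment has product 1, so the bag version and Horner's Rule both give 3, while the set version gives 1.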
(On the other hand, there are aspects of the problem that work much better relationally. We have carefully used $\mathit{max}$ only for a linear order, namely the usual ordering of the integers. A partial order is more awkward monadically, because there need not be a unique maximal value. For example, it is not so easy to compute a segment with maximal sum, unless we refine the sum ordering on segments to make it once more a linear order; relationally, this works out perfectly straightforwardly. We can try the same trick of turning the relation “maximal under a partial order” into the collection-valued function “all maxima under a partial order”, but I fear that the equivalent trick on the ordering itself (turning the relation “$\le$” into the collection-valued function “all values less than this one”) runs into problems from taking us outside the world of finite nondeterminism.)