THE DESIGN OF INTERMEDIATE LANGUAGES IN OPTIMIZING COMPILERS

by

LUKE VAN WYCK MAURER

A DISSERTATION

Presented to the Department of Computer & Information Science and the Graduate School of the University of Oregon in partial fulfillment of the requirements for the degree of Doctor of Philosophy

March 2018

DISSERTATION APPROVAL PAGE

Student: Luke Van Wyck Maurer
Title: The Design of Intermediate Languages in Optimizing Compilers

This dissertation has been accepted and approved in partial fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Computer & Information Science by:

Zena M. Ariola, Chair
Michal Young, Core Member
Boyana Norris, Core Member
Mark Lonergan, Institutional Representative

and

Sara D. Hodges, Interim Vice Provost and Dean of the Graduate School

Original approval signatures are on file with the University of Oregon Graduate School.

Degree awarded March 2018

© 2018 Luke Van Wyck Maurer

DISSERTATION ABSTRACT

Luke Van Wyck Maurer
Doctor of Philosophy
Department of Computer and Information Science
March 2018
Title: The Design of Intermediate Languages in Optimizing Compilers

Every compiler passes code through several stages, each a sort of mini-compiler of its own. Thus each stage may deal with the code in a different representation, which may have little to do with the source or target language. We can describe these in-memory representations as languages in their own right, which we call intermediate languages. Each intermediate language is designed to accommodate the stage of compilation that handles it. Those toward the end of the compilation pipeline, for instance, tend to have features expressing low-level details of computation.

A subtler case is that of the optimization stage, whose role is to transform the program so that it runs faster, uses less memory, and so forth. The optimizer faces tradeoffs: The language should provide enough information to guide optimization algorithms, but all of this information must be kept up to date as the program is transformed. Also, establishing invariants in the language can be helpful both in implementing algorithms and in debugging the implementation, but each invariant may complicate desirable transformations or rule them out altogether. Finally, a language where the invariants are obviously correct may have a form too awkward or otherwise unsuited to the compiler's needs.

Given the properties and invariants that we would like the language to provide, we can approach the design task in a way that gives these features without necessarily sacrificing implementability. Namely, begin with a formal language that makes the desired properties obvious, then translate it to one more suitable for implementation. We can even translate theorems about valid transformations in the formal language to derive correct algorithms in the implementation language. This dissertation explores the connections between different intermediate languages and how they can be interderived, then demonstrates how translation led to an improvement to the Glasgow Haskell Compiler optimization engine.

This dissertation includes previously published coauthored material.
v CURRICULUM VITAE NAME OF AUTHOR: Luke Van Wyck Maurer GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED: University of Oregon, Eugene, OR Carleton College, Northfield, MN DEGREES AWARDED: Master of Science, Computer & Information Science, 2012, University of Oregon Bachelor of Arts, Mathematics and Computer Science, 2006, Carleton College AREAS OF SPECIAL INTEREST: Programming Languages Compilation Formal Methods Proof Assistants PROFESSIONAL EXPERIENCE: Junior Software Developer, SingleMind Consulting, September 2006– October 2008 Software Developer, Emberex Custom Software, December 2008–June 2011 Intern, Microsoft Research, summer 2015 GRANTS, AWARDS AND HONORS: Distinguished Paper, 2017 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017) vi PUBLICATIONS: Downen, P., Maurer, L., Ariola, Z. M., & Varacca, D. (2014) Continuations, processes, and sharing. PPDP 2014. Downen, P., Maurer, L., Ariola, Z. M., & Peyton Jones, S. (2016) Sequent calculus as a compiler intermediate language. ICFP 2016. Maurer, L., Downen, P., Ariola, Z. M., & Peyton Jones, S. (2017) Compiling without continuations. PLDI 2017. vii ACKNOWLEDGEMENTS I thank Professors Ariola and Young for their assistance and encouragement (as well as their patience). Thanks also to INRIA and to Microsoft Research for the internships that supported much of this work, and especially Daniele Varacca of INRIA and Simon Peyton Jones of Microsoft Research for their guidance. Special thanks to Paul Downen, Philip Johnson- Freyd, and Zach Sullivan for endless discussions on the finer points of programming-language theory and also nuclear politics. This research was supported by a grant from the National Science Foundation, CCF-1423617, to Prof. Ariola. viii For my parents, who taught me the value of hard work by raising me; and for Mrs. Klinge and all my other teachers at Logos Lab School, who sparked the researcher in me and gave me a taste of grad school at age eleven. (I regret that this project did not include a diorama.) ix TABLE OF CONTENTS Chapter Page I. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1. Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2. Co-Authored Material . . . . . . . . . . . . . . . . . . . . . . 4 II. Intermediate Languages . . . . . . . . . . . . . . . . . . . . . . 5 2.1. Languages for Imperative Optimizations . . . . . . . . . . . . 6 2.2. Languages for Functional Optimizations . . . . . . . . . . . . 12 2.3. The Glasgow Haskell Compiler (GHC) . . . . . . . . . . . . . 29 2.4. Sequent Calculus for GHC . . . . . . . . . . . . . . . . . . . 43 2.5. Join Points in Direct Style . . . . . . . . . . . . . . . . . . . 58 III. Improving Compilation of Haskell . . . . . . . . . . . . . . . . 62 3.1. Motivation and Key Ideas . . . . . . . . . . . . . . . . . . . . 64 3.2. System FJ : Join Points and Jumps . . . . . . . . . . . . . . . 70 3.3. Contification: Inferring Join Points . . . . . . . . . . . . . . . . 81 3.4. Recursive Join Points and Fusion . . . . . . . . . . . . . . . . 84 3.5. Metatheory of FJ . . . . . . . . . . . . . . . . . . . . . . . . . 88 3.6. Join Points in Practice . . . . . . . . . . . . . . . . . . . . . . 94 x Chapter Page IV. Lazy Functions as Processes . . . . . . . . . . . . . . . . . . . 99 4.1. Call-by-Name and Call-by-Value . . . . . . . . . . . . . . . . 102 4.2. A Uniform CPS Transform . . . . . . . . . . . . . . . . . . . 105 4.3. Preservation of Observations . . . . . . . . . . . . . . . . . . 
113 4.4. CPS and Processes . . . . . . . . . . . . . . . . . . . . . . . . . 131 4.5. Call-by-Need and Constructive Update . . . . . . . . . . . . . 137 V. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 References Cited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 xi LIST OF FIGURES Figure Page 1. The control-flow graph for a simple array-sum program. . . . . . . . . 8 2. An example of common-subexpression elimination. . . . . . . . . . . . 9 3. A routine to sum the elements of an array, in SSA form. . . . . . . . . . 11 4. The syntax of the untyped λ-calculus. . . . . . . . . . . . . . . . . . . 13 5. A λ-calculus with data constructors, recursion, and case. . . . . . . . 18 6. A call-by-value CPS transform. . . . . . . . . . . . . . . . . . . . . . . 19 7. A CPS language for the λ-calculus in Fig. 5, as produced by the transform in Fig. 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 8. The ANF language produced by the CPS language in Fig. 35 by an inverse CPS transform. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 9. Two representations of a function to compute a partial sum. . . . . . 26 10. Translating from SSA to CPS or ANF using the dominance tree. . . . 28 11. Modifying the CFG by moving φ-node values into predecessors. . . . . 30 12. The program from Fig. 10a, in ANF. . . . . . . . . . . . . . . . . . . . 31 13. The GHC Core language as of 2006, before coercions were added. . . . 34 14. Types in the GHC Core language. . . . . . . . . . . . . . . . . . . . 36 15. The syntax for Sequent Core. . . . . . . . . . . . . . . . . . . . . . . . 51 16. The readback translation D from Sequent Core back to Core, as defined on commands, bindings, terms, and continuations. . . . . . . . . . . . 57 17. Syntax of System FJ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 18. Type system for System FJ . . . . . . . . . . . . . . . . . . . . . . . . 73 19. Call-by-name operational semantics for System FJ . . . . . . . . . . . . 76 20. Common optimizations for System FJ . . . . . . . . . . . . . . . . . . . 78 21. Contification as a source-to-source transformation. . . . . . . . . . . . . 81 xii Figure Page 22. The call-by-name λ-calculus, λn. . . . . . . . . . . . . . . . . . . . . . 103 23. The call-by-value λ-calculus, λv. . . . . . . . . . . . . . . . . . . . . . 103 24. A uniform CPS transform for call-by-name and call-by-value. . . . . . 106 25. A revised call-by-value λ-calculus. . . . . . . . . . . . . . . . . . . . . 109 26. The syntax and semantics of the CPS λ-calculus, λcps . . . . . . . . . . 109 27. The value-named CPS λ-calculus, λcps,vn . . . . . . . . . . . . . . . . . 112 28. Uniform CPS transform in named form. . . . . . . . . . . . . . . . . . 113 29. Diagrams of Proposition 17 and Lemma 18. . . . . . . . . . . . . . . . 119 30. A summary of the proof of Lemma 18. . . . . . . . . . . . . . . . . . . 126 31. A fragment of the pi-calculus. . . . . . . . . . . . . . . . . . . . . . . . 132 32. CPS, named CPS, and pi-encoding for CBN, CBV. . . . . . . . . . . . 135 33. Uniform abstract machine for CBN and CBV. . . . . . . . . . . . . . 138 34. The call-by-need λ-calculus, λneed . . . . . . . . . . . . . . . . . . . . . 139 35. A call-by-need CPS transform using constructive update. . . . . . . . 143 36. The CPS λ-calculus with constructive update, λ:=1cps . . . . . . . . . . . 144 37. CPS, named CPS, and pi-encoding for call-by-need. . . . . . . . . . . 146 38. 
Reductions for the distance call-by-need λ-calculus, λdneed . . . . . . . 147 39. The annotated call-by-need λ-calculus, λaneed . . . . . . . . . . . . . . . 149 40. Call-by-need CPS transform, Ca. . . . . . . . . . . . . . . . . . . . . . . 151 41. The abstract machine derived from C. . . . . . . . . . . . . . . . . . . 168 xiii LIST OF TABLES Table Page 1. Benchmarks from the spectral, real, and shootout NoFib suites. . 97 xiv CHAPTER I Introduction The compiler is a complex beast. As a program transforms from a source code comprehensible to humans into a machine language suitable for hardware execution, it must pass through several stages, including parsing, optimization, and code generation. At each stage, the code may be translated from one form to another. Some of these forms, such as abstract syntax trees (ASTs), are closely related to the source or target languages, but many fall in between: they are intermediate representations, or IRs. IRs serve several purposes. They are essential for mitigating the complexity of any minimally sophisticated compiler. Also, they make the compiler more flexible, and thus more useful, by abstracting out finer details of the source or target language, thus allowing the same compiler to target different hardware platforms or to implement several source languages. Here we focus on another use for IRs: Since they represent the program more abstractly than either the source language or the target, they are ideal for the sort of manipulation performed by an optimizer. Optimizers must work carefully to avoid changing behavior, so a simpler IR is better: there are fewer opportunities to mess something up. However, if we can’t be confident that it’s improving performance, it’s hardly an optimizer, so a richer IR is better: optimization decisions can be better informed. The obvious tension here calls for careful attention when designing an IR. A good way to have a rich IR while controlling for its complexity is to formalize it as a mathematical calculus. This is less daunting than it may sound: there are decades of results in programming-language theory to build 1 on. A formalized language is as well-specified as one could ask, eliminating any question of undocumented corner cases or inconsistent requirements. Of course, designing a formalized IR still requires ironing out all such wrinkles; we’re not saving any of that work. But a formal semantics certifies that the work is done. A formal specification does far more than merely define a language, however. By establishing the rules for how the language works, it allows theorems to be proved about what modifications can be made to a program without changing its validity or its behavior. In turn, these theorems specify optimization and translation passes that can be implemented with confidence. Thus the formal specification equips compiler writers to implement transforms that are correct but not obviously so. Furthermore, it is natural when writing optimizer code to run into subtle corner cases; writing out the case in the formal language often makes the correct implementation clear, easing both the task of writing the implementation and of documenting it in comments. In this way, a formal IR can help keep compiler code maintainable while implementing highly sophisticated algorithms. One further benefit to a formal language is that it can include types and other invariants in the specification, and these invariants can be checked programmatically. 
This can provide an invaluable debugging tool, as running the optimizer with the internal typechecker enabled will catch most transformation errors as soon as possible—it is rare that a buggy pass produces incorrect yet well-typed code.

Just as different languages serve different purposes in the compiler, different choices of language bring out various aspects of computation. For example, the pure λ-calculus has no clear representation of control flow: the sequence of actions to take is entirely implicit in the syntax. This is a natural perspective for considering the mathematical content of the program, but it is ill-suited for reasoning about sequential execution on hardware. A language in continuation-passing style, in contrast, is more complex but offers constructs that directly correspond to labeled sections of code in the running program.

Translating programs is not difficult—certainly it tends to be simpler than a typical analysis for an optimization pass. Much more difficult is proving that a translation is correct, i.e., that it preserves the meaning of a program. Once we have done so, however, we can now “translate” theorems and other facts from one language to another. In this way, we can reason about code using the best language for proving the theorems we want, then implement an optimizer that uses the most convenient language for the actual program.

1.1 Outline

Chapter II introduces several intermediate languages and discusses their origins, their purposes, and the relationships between them. In Chapter III, we consider GHC's Core language and an extension that adds jumps and labeled blocks for explicit representation of control flow. This gives code-motion techniques more precision and flexibility, providing more opportunities to discover interactions at compile time: moving a case analysis to where a value is produced, for instance, may resolve the decision at compile time, allowing other branches to be pruned and potentially even avoiding the allocation of the produced value. Chapter IV dives deeper into the mechanism of laziness underpinning the GHC run-time system. By drawing a connection to the pi-calculus, a language for describing intercommunicating processes running in parallel, we derive an intermediate language (IL) that describes the in-place update used by GHC-compiled code to cache results of suspended computations. We hope that a well-tailored language that includes this effect may allow updates to participate in code motion as well.

1.2 Co-Authored Material

This dissertation owes much of its content to co-authored papers already published.

– Chapter III was a collaboration with Paul Downen (University of Oregon), Zena M. Ariola (UO), and Simon Peyton Jones (Microsoft Research).
– Chapter IV was a collaboration with Paul Downen, Zena M. Ariola, and Daniele Varacca (Université Paris Diderot).

CHAPTER II
Intermediate Languages

Optimizers are necessarily heuristic and conservative—they need to gather data to make informed decisions, both in the hope of improving the code and to be certain not to introduce errors. Thus, an informative IR is a must. Each property the IR tracks has a cost, however, both in the resources for calculating and storing the property and in the need to maintain it whenever the code is altered. The design space is large, therefore, and the criteria for a good IR depend on the form of the source language (or previous IR), the requirements of the target architecture (or next IR), and the set of algorithms to be performed by the particular compilation stage.
Generally speaking, a good IR is

1. straightforward to build from the incoming code, with a clear mapping for each language construct;
2. sufficiently informative to support optimization, so that common analyses need only be implemented once;
3. sufficiently flexible to allow changes to be made while preserving correctness and consistency; and
4. straightforward to translate into the next representation.

To study the breadth of possibilities, authors and researchers look at IRs not as mere data structures but as programming languages unto themselves, with well-defined semantics independent of a particular implementation. Such an intermediate language (IL) crystallizes the design of an IR, giving it a precise characterization removed from the intricate details of the implementation, and allowing its properties to be stated, proved, and compared to those of other languages. This chapter will look at a few of the most important kinds of ILs for optimization, studying their features and differences, and finally proposing one for use in the Glasgow Haskell Compiler (GHC).

The plan is as follows: Compilers can be broadly categorized by the families of languages they implement. As such, we will look at ILs for imperative languages (Section 2.1) and functional languages (Section 2.2) separately. In Section 2.3, we will look at the specific challenges facing GHC, stemming from the particular features of the lazy functional language Haskell. Then, in Section 2.4, we introduce our own IL, which was implemented as an experimental patch to GHC.

2.1 Languages for Imperative Optimizations

Three-Address Code

A three-address code is a language composed of linear sequences of fixed-size instructions, resembling assembly code. A typical instruction might be A ← B + C, indicating a destructive assignment of B + C to the variable A. Depending on the particular language, A, B, and C might be abstract variables, or they might be well-defined registers or memory locations on a target architecture. In the latter case, we have abstracted away only the precise syntax of assembly language, but no other details of the target hardware, making this an ideal form for peephole optimization (Davidson & Fraser, 1980). This is the lowest-level form of optimization, operating at the level of individual instructions, looking only at a small window (a “peephole”) at a time to find opportunities to combine instructions and save CPU cycles.

Typically, a three-address code expresses control through labels and jumps. Each instruction may carry a label identifying it as a possible target for a jump. A jump can be either conditional, occurring only if some condition is met, or unconditional. Thus a routine in three-address code to sum the members of an array a of length n might be:

        i ← 0
        s ← 0
    loop:
        c ← i − n
        ifge c then done
        t ← a @ i
        s ← s + t
        i ← i + 1
        jump loop
    done:
        return s

Here ifge c then done is a conditional jump to done that executes when c ≥ 0, and jump loop is an unconditional jump. Note that any complex expressions have been broken down—the programmer probably wrote s += a[i], but this three-address code requires the array access and the addition to be separate instructions.

Control-Flow Graphs

Three-address codes are effective for expressing programs, but a simple list of instructions is unwieldy for performing analyses and manipulation. Therefore it is typical to construct a control-flow graph, or CFG, to represent the control flow as edges connecting fragments of the program.
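To make the shape of such a representation concrete, the following Haskell sketch models a tiny three-address instruction set together with a graph of basic blocks. The types and names (Instr, Block, CFG, and so on) are invented for illustration and are not taken from any particular compiler.

    import qualified Data.Map as Map

    type Var   = String
    type Label = String

    data BinOp = Add | Sub deriving Show

    -- One fixed-size instruction, e.g. Bin "c" Sub "i" "n" for  c ← i − n.
    data Instr
      = Bin   Var BinOp Var Var
      | Index Var Var Var          -- t ← a @ i
      | Const Var Int              -- i ← 0
      deriving Show

    -- A basic block: straight-line instructions followed by one control transfer.
    data Terminator
      = Jump   Label
      | IfGE   Var Label Label     -- ifge v then l1 else l2
      | Return Var
      deriving Show

    data Block = Block [Instr] Terminator deriving Show
    type CFG   = Map.Map Label Block

    -- For instance, the entry block of the array-sum routine above:
    entryBlock :: Block
    entryBlock = Block [Const "i" 0, Const "s" 0] (Jump "loop")

    -- Successors are read off the terminator; predecessors are derived by
    -- inverting that relation over the whole graph.
    successors :: Block -> [Label]
    successors (Block _ t) = case t of
      Jump l      -> [l]
      IfGE _ l l' -> [l, l']
      Return _    -> []

    predecessors :: CFG -> Label -> [Label]
    predecessors g l = [ l' | (l', b) <- Map.toList g, l `elem` successors b ]

Keeping the terminator separate from the straight-line instructions makes the basic-block discipline described below (no jumps into the middle of a block) easy to maintain, and predecessor information falls out by inverting the successor relation.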
Note that CFGs do not themselves comprise a language; the CFG is an in-memory representation that holds code in an underlying language, such as a three-address code. Nonetheless, it is important to the study of intermediate 7 i← 0 s← 0 jump loop loop: c← i− n ifge c then done else next next : t← a@ i s← s+ t i← i+ 1 jump loop done: return s FIGURE 1. The control-flow graph for a simple array-sum program. languages because the static single-assignment form, which we consider next, exploits the CFG that is assumed to be available. A vertex in a CFG embodies a basic block, a sequence of instructions that proceeds in a straight line. No instruction in the program jumps into the middle of a basic block, and the only jump in the block can come at the end. Thus all internal jumps go from the end of one basic block to the beginning of another. The CFG then records each block as a vertex and each jump as an edge. For a given block, the blocks that may jump to it are its predecessors and the blocks it may jump to are its successors. See Fig. 1 for an example. The CFG makes reasoning about and manipulating the control flow much easier. For instance, the basic form of dead-code elimination (DCE) finds instructions that will never run and deletes them. Without a CFG, one would have to scan the code for jumps to find labels that are never targeted; with a CFG, one simply checks which blocks have no predecessors. 8 a ← x + y b ← a + 3 c ← b + z d ← a + 3 a ← d + c e ← a + 3 return e ⇒ a ← x + y b ← a + 3 c ← b + z a ← b + c e ← a + 3 return e FIGURE 2. An example of common-subexpression elimination. Static Single-Assignment Form Consider the code in Fig. 2. Clearly the computation of d is redundant; we should remove it and replace references to d with b. This is an important optimization called common-subexpression elimination, or CSE. But note something crucial to this analysis: the value of a did not change between the assignments to b and d. We cannot remove e in the same way, because “a+ 3” is not the same value that it was when b or d computed it. Thus, starting from simple three-address code, any CSE routine must perform some analysis involving so-called available expressions and reaching definitions, walking through the code and working out all the ramifications of each assignment for other instructions (Appel, 1998a; Aho, Sethi, & Ullman, 1986). This need is the basis of dataflow analysis. At its core is the question, “What are the possible values of this computation?” The chief cause of complexity in dataflow analysis is mutability. If there is an intervening assignment to a variable, then two different occurrences of the variable generally don’t refer to the same value, and thus we can’t take an expression “at face value.” In Fig. 2, the expression a + 3 has no consistent, well-defined value, since the value of a changes. Traditionally, it was up to each optimization to take mutability into account so that the optimizer only makes valid changes. This added complexity 9 to many individual algorithms. The static single-assignment form, or SSA form, makes dataflow obvious by eliminating mutability of variables, thus simplifying many algorithms and making new ones more feasible. The code in Fig. 2 is not in SSA form, since there are two assignments to a. However, in this case we can observe that there are really two “versions” of a involved, and each occurrence of a refers unambiguously to one or the other. 
The assignments to b and d refer to the first version, and the assignment to e refers to the second. Therefore we can rename the second version to a′: We have a ← x + y b ← a + 3 c ← b + z a ′ ← b + c e ← a ′ + 3 return e obtained the SSA form, guaranteeing that a+ 3 has a consistent value so long as a is in scope. Renaming suffices for only the simplest cases. Here, for any instruction, we know which “version” of a is active and thus whether to rename each occurrence of a to a′. In the presence of control flow, however, one cannot always know what the “current version” is. Thus we need a way to merge together different possible values for a given variable. The φ-Node The construct at the heart of SSA is the φ-node. A φ-node is an instruction of the form A← φ(B1, . . . , Bn) 10 i0 ← 0 s0 ← 0 loop: i ← φ(i0, i ′) s ← φ(s0, s ′) c ← i − n ifge c then done else next next : t ← a @ i s ′ ← s + t i ′ ← i + 1 jump loop done: return s FIGURE 3. A routine to sum the elements of an array, in SSA form. appearing at the beginning of a basic block. There should be one argument for each predecessor to the block. Operationally, it is understood that A will get the value Bi if the block is entered from its ith predecessor. Thus the conditional update of a variable is modeled by the creation of a new variable whose value depends on the control flow. See Fig. 3 for an example, where the index i will be zero when loop is entered from the beginning of the program but i′ when it is entered from the jump. Since i′ is i + 1, this causes i to be incremented each time through the loop, as expected. The accumulator variable s works similarly. In some ways, SSA form does for dataflow what the CFG does for control flow by making crucial properties obvious. For instance, another form of dead- code elimination concerns dead stores, which happen when a value is written that will never be read. This can happen when two writes are made to the same variable in succession; the first is a dead store and is wasted. Without SSA, finding dead stores requires performing a liveness analysis by scanning backward; with SSA, there cannot be two writes to the same variable, so a dead 11 store is simply a variable that is never read. Typical implementations maintain a def-use chain (Cytron, Ferrante, Rosen, Wegman, & Zadeck, 1991) for each variable, listing the instructions where it is used; finding a dead store is then a simple matter of checking for an empty def-use chain. SSA is powerful enough to make new optimizations practical as well. For instance, one of the original applications (Rosen, Wegman, & Zadeck, 1988) was a generalization of CSE called global value numbering : once we can trust the face value of an expression a + b because the values of a and b can’t change, it becomes practical to, say, identify a+ b with b+ a. Since its inception, SSA has become the dominant form of IL both in the literature and in compilers for imperative languages—GCC, in versions 4.0 onward, uses SSA as its high-level optimization IL (Novillo, 2003; Pop, 2006), and the LLVM framework’s bytecode language is in SSA form (Lattner & Adve, 2004; “LLVM language reference manual,” 2015). 2.2 Languages for Functional Optimizations Compilers for functional languages sometimes use or extend representations from the imperative world. However, especially for high-level optimization, it is more common to employ a simpler functional language, in much the same way that the typical three-address code follows the imperative model. 
The λ-Calculus Three-address codes represent imperative programs by a minimum of constructs. Similarly, Church’s λ-calculus (Barendregt, 1984) boils functional programs down to their essentials: functions, variables, and applications. A function is represented as λx.M , where x is a variable and M is the function 12 Variable: x, y, z, . . . Term: M,N ::= x ∣∣M N ∣∣ λx.M FIGURE 4. The syntax of the untyped λ-calculus. body in which x is bound; application of M to N is written simply as the juxtaposition M N . See Fig. 4. An advantage of the λ-calculus is that its semantics can be given purely as a system of simplification rules, or rewrite rules, on the terms themselves. The crucial one is called the β-rule, which in its most general conventional form is (λx.M)N ⇒M{N/x} This says that to apply a known function λx.M to an argument N , you take the body M and substitute N for the occurrences of x. (We haven’t yet said what terms are allowed as N or how to apply the rule on a subterm of a larger term; these are specified by the evaluation order.) Applying the β-rule is called β-reduction. The other rule is the α-rule, which simply says that we can rename a variable bound by a λ without changing the term’s meaning, so long as we do so consistently. Applying the α-rule is called α-conversion. Because they are subject to α-conversion, variables have local scope, in much the way they do in most programming languages. For instance, since λx. x and λy. y are equivalent by the α-rule, no program’s behavior can depend on the choice of x or y. This is the whole of the syntax of the plain untyped λ-calculus, but the language is already rich enough to have spawned a whole field of research. Indeed, the untyped λ-calculus is universal, that is, Turing-complete—it is expressive enough to encode any program we could want to. For use in a 13 compiler, however, it would be impractical; besides the sheer awkwardness of, say, representing the number five as λs. λz. s(s(s(s(s z)))), as is conventional (Hinze, 2005), the encoding would lose the program’s structure, offering little help to an optimizer that wishes to perform code motion safely and effectively. Nonetheless, there is little we need to add. First, we need literals and primitives to represent arithmetic sensibly; these are no problem. We’ll deal with literals and other constants shortly, and primitives can simply be preallocated variable names. Second, it helps to have a way to declare local variables and functions, so we want a let/in form, including a recursive version let rec. Again, these could be encoded easily in terms of application, but at little gain at the price of lost information.1 The precise semantics of let differs more widely than that of function application; the simplest form is a variant of the β-rule, letx = N inM ⇒M{N/x}, but again, the form of N may be restricted. Also, rather than substitute for all instances of x at once, one can wait for the value of x to be needed (Ariola, Felleisen, Maraist, Odersky, & Wadler, 1995). For let rec, in particular, one has to be careful; see, for instance, (Pierce, 2002, chapters 20–21) for a practical treatment. More significantly, we want structured data and control, namely data constructors and pattern matching. A data constructor is a constant with a particular arity. 
We write Cn for an unknown data constructor with arity n, 1Arguably, we could go without let rec by using fixpoint combinators such as the Y- combinator, but explicit knowledge of which functions are recursive is often beneficial. For example, GHC is aggressive about inlining non-recursive functions because doing so cannot cause the optimizer to loop. 14 but we will often drop the superscript. Applying Cn to n arguments2 packages the values as a tagged record. Pattern matching is then performed by a case statement such as this one: case M of C n x1 . . . xn → N Dm y1 . . . ym → P This expression evaluates M to a constructor and its arguments, then checks whether the constructor is C or D, binds the corresponding variables to the arguments, and evaluates the chosen branch. For example, the functional version of our array-sum example could be: sum = λxs . case xs of Nil → 0 Cons x xs ′ → x + sum xs ′ Here Nil is the empty list and Cons x xs prepends x onto the list xs. Note that Nil has arity zero, as do True and False. Zero-arity, or nullary, constructors thus act as constants. In fact, we can simply treat literals as special syntax for particular nullary constructors, so we don’t need them as a separate construct. The semantics of case is specified by a rule called the case rule: case C n M1 · · · Mn of ... C n x1 · · · xn → N ... ⇒ N {M1/x1} · · · {Mn/xn} Hence, we reduce a case by finding the matching branch and substituting the arguments of the constructor. 2Like Haskell, we use curried constructors. For example, Cons x xs is Cons applied to the arguments x and xs. This avoids needing tuples as a separate construct. 15 We also allow a lone variable to act as a wildcard pattern: case C n M1 · · · Mn of ... x → N ... ⇒ N {C M1 · · · Mn/x} (if no other match) Languages such as ML and Haskell also allow for more complex patterns, but those can be re-expressed using these simple ones (Peyton Jones, 1987). Also, we don’t need a separate if construct, since we can define: if M then N else P , case M of True → N False → P In all, we have the syntax in Fig. 5. Note that for compactness we may use braces and semicolons instead of separate lines in case and let rec expressions. To make the language useful for writing or compiling programs, we need to say a bit more about the semantics than just the rewrite rules as we have seen them. The rules say what to do, but not in what order. Evaluation orders comprise a field of study all of their own (Plotkin, 1975; Ariola et al., 1995), but for our purposes, informal descriptions will suffice. – In all practical languages, evaluation “stops at a lambda.” Bodies of functions are not evaluated until they are called. – In call-by-value languages, the arguments to a function are evaluated before the function is called. This amounts to restricting the β-rule: (λx.M)V ⇒M{V/x} Here V stands for a value, which must be a λ-abstraction, a constant, or a constructor applied to values. The restriction also affects let bindings— 16 the definition is evaluated before the body. Call-by-value let rec is usually allowed to define only λ-abstractions.3 – In lazy languages, function application occurs as soon as the function body is known, while the arguments are still unevaluated. Lazy let bindings similarly go unevaluated at first, and let rec needn’t be restricted, allowing circular or (conceptually) infinite data structures. 
To make laziness practical, implementations use a call-by-need strategy (Ariola et al., 1995), which ensures that each value is still only evaluated once. A related concept is purity. A language is called pure if evaluating an expression never has any side effects, such as overwriting a variable or performing I/O. An impure language is thus very sensitive to evaluation order— move the wrong function call, and Hello World! could become World! Hello. If the language is call-by-value, impurity is manageable, as one can predict from any function body the order in which terms will be evaluated. However, in a call-by-need language, if one writes f M N P , the order in which M , N , and P are evaluated (if at all) depends entirely on the body of f , which may not even be in the same module. Thus laziness practically necessitates purity.4 In an IL for optimization, the major impact of evaluation order and purity is the freedom with which terms can be rearranged. Of course, the purpose of the optimizer is often to change the order in which things are evaluated, but doing so in an impure language is hazardous. By contrast, the order in which 3In a call-by-value language, a variable always stands for a value. So how do we bind the value of a let rec inside its own definition while it’s being computed? There are workarounds, such as using a mutable storage cell, but the need is not great enough to justify additional complication. 4It gets worse: Besides call-by-need, another class of non-strict languages is parallel languages. These evaluate the arguments and the body simultaneously. Thus, in the parallel setting, there is no defined order in which side effects in M , N , and P will run. 17 Variable: x, . . . Constructor: Cn, . . . Pattern: P ::= ∣∣ Cn x1 · · · xn Term: M,N ::= x ∣∣ Cn ∣∣M N ∣∣ λx.M∣∣ caseM of P1 → N1; . . . ; Pn → Nn∣∣ letx = M inN∣∣ let recx1 = M1; . . . ; xn = Mn inN FIGURE 5. A λ-calculus with data constructors, recursion, and case. expressions appear in a lazy language often does not matter, so the compiler has a great deal of freedom. Since it is so rigid, an impure call-by-value λ-calculus is cumbersome as an optimizing IL, even for call-by-value source languages (though it can serve as an “abstract source” language (Appel, 1992, chapter 4)). However, if the source language is lazy, it may happily be optimized using plain λ-terms—in fact, our λ-calculus is essentially an untyped version of GHC’s Core language (Peyton Jones & Launchbury, 1991; Santos, 1995; Peyton Jones & Santos, 1998). Continuation-Passing Style One advantage of a three-address code as an IL for imperative programs is that everything is spelled out: Intermediate results are given names. Every aspect of control flow, including order of operations, is explicit. Manipulating code is made easier by the regularity, and the similarity to assembly language makes code generation a simple syntactic translation. Continuation-passing style (Sussman & Steele, Jr., 1975), or CPS, is a way for the λ-calculus to play much the same role for functional programs. As proved formally by Plotkin (Plotkin, 1975), a term written in CPS effectively specifies its own evaluation order, and any λ-term can be translated into CPS 18 CJxK = λk. k x CJM NK = λk. CJMK(λf. CJNK(λx. f x k)) CJλx.MK = λk. k (λx. λk1. CJMKk1) CJC0K = λk. k C0 CJC1K = λk. k (λx1. λk1. k1 (C1 x1)) CJCnK = λk. k (λx1. λk1. k1 (· · · (λxn. λkn. kn (Cn x1 · · · xn)))) C uwwwv case M of P1 → N1 ... Pn → Nn }~ = λk. CJMK  λx . case x of P1 → CJN1Kk ... 
Pn → CJNnKk  CJletx = M inNK = λk. CJMK(λx. CJNKk) C uwwwwwwwv let rec f1 = λx1.M1 ... fn = λxn .Mn in N }~ = λk . let rec f1 = λx1. λk1. CJM1Kk1 ... fn = λxn . λkn . CJMnKkn in CJN Kk FIGURE 6. A call-by-value CPS transform. using a CPS transform. As it happens, CPS gives a name to each intermediate value, just as a three-address code does. The correspondence to assembly language is not as clear, often necessitating a lower-level IL acting as an abstract machine (Appel, 1992, chapter 13). Nonetheless, λ-terms in CPS have proven a useful IL for optimizations on functional programs. The CPS method is simple enough that we can demonstrate it on a simple language by constructing a “compiler” on paper. We take the source language to be the one in Fig. 5, given call-by-value semantics. The transform is given in Fig. 6. Note that we assume throughout that k, k1, etc., are fresh (they don’t clash with existing names); one could always use names not allowed in the source language to avoid trouble. 19 The CPS transform makes each term into a function taking a continuation, which specifies what to do with the value of the term once it’s computed. Since the continuation is itself a λ-abstraction, the result of each computation ends up bound to a variable. This variable then represents either a register or a location in memory where the value is stored. We will consider the CPS rules in turn. Since the source language is call-by-value, any variable x will be bound to a value that has already been computed; hence, in the variable case, there is nothing more to do and we can pass x directly to the continuation k. The key to the CPS transform is the rule for function calls: CJM NK = λk. CJMK(λf. CJNK(λx. f x k)) Once the CPS term is applied to a continuation k, the first thing to do is to evaluate the function M . The continuation for M binds its result as f , then goes on to evaluate the argument N . That result is bound as x. Finally, we perform the function call proper by invoking f with argument x and the original continuation k. Translating a λ-abstraction is straightforward, though the translated version takes an extra argument for the continuation with which to evaluate the body. Constructors appear somewhat more involved, but they are really merely special cases of functions. It is instructive to derive the CPS form for a 20 constructor applied to arguments5: CJC2M NK = λk. CJC2MK(λf. CJNK(λy. f y k)) = λk. (λk. CJC2K(λf. CJMK(λx. f x k)))(λf. CJNK(λy. f y k)) ⇒ λk. CJC2K(λf. CJMK(λx. f x (λf. CJNK(λy. f y k)))) = λk. (λk0. k0 (λx1. λk1. k1(λx2. λk2. k2(C 2 x1 x2)))) (λf. CJMK(λx. f x (λf. CJNK(λy. f y k)))) ⇒∗ λk. CJMK(λx. CJNK(λy. k(C2 x y))) Once the dust settles, the procedure is to evaluate M , bind it as x, evaluate N , bind it as y, then apply the data constructor and return. The other rules are routine: A case evaluates the scrutinee first, then the continuation chooses the branch. A let evaluates the bound term first, then the body. A let rec simply binds the function literals and moves on (as suggested earlier, we assume that a call-by-value let rec only ever binds λ-abstractions). The IL we get from the CPS transform is the CPS language given in Fig. 7. All function calls must involve both a function and an argument (or two) that are values—compile-time constants, variables standing for intermediate results, and applications of data constructors to values. 
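To see the transform as code rather than equations, here is a small Haskell sketch of the three core rules of Fig. 6 (variables, applications, and λ-abstractions), using invented names (Term, cps, fresh). Constructors, case, and let are omitted, and the continuation is passed as an argument rather than wrapped in an outer λk, which amounts to doing the outermost administrative reduction eagerly.

    -- A sketch of the call-by-value CPS transform restricted to variables,
    -- applications, and lambdas. Assumes source programs never use the
    -- generated names f0, x1, k2, ... (cf. the freshness assumption above).
    data Term
      = Var String
      | App Term Term
      | Lam String Term
      deriving Show

    type Supply = Int                      -- a crude fresh-name counter

    fresh :: String -> Supply -> (String, Supply)
    fresh base n = (base ++ show n, n + 1)

    -- cps m k n  builds the CPS form of m, delivering its value to the
    -- continuation term k.
    cps :: Term -> Term -> Supply -> (Term, Supply)
    -- C[[x]] k = k x: a variable is already a value; pass it along.
    cps (Var x)   k n = (App k (Var x), n)
    -- C[[λx.M]] k = k (λx. λk1. C[[M]] k1): the translated function takes an
    -- extra continuation parameter with which to evaluate its body.
    cps (Lam x m) k n =
      let (k1, n1)   = fresh "k" n
          (body, n2) = cps m (Var k1) n1
      in  (App k (Lam x (Lam k1 body)), n2)
    -- C[[M N]] k = C[[M]] (λf. C[[N]] (λx. f x k)): evaluate the function,
    -- then the argument, then make the call with the original continuation.
    cps (App m e) k n =
      let (f, n1)      = fresh "f" n
          (x, n2)      = fresh "x" n1
          (argCps, n3) = cps e (Lam x (App (App (Var f) (Var x)) k)) n2
      in  cps m (Lam f argCps) n3

Running cps on a small term and pretty-printing the result reproduces the kind of code shown in Fig. 7, up to the administrative redexes that the paper-and-pencil derivation reduces away.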
As an example, consider again the functional analog of our array-sum code: sum xs = case xs of Nil → 0 Cons x xs ′ → x + sum xs ′ Its CPS form (after some simplification) is: 5Application associates to the left, so C 2MN parses as (C 2M )N . 21 Variable: x, f, k, . . . Constructor: Cn, . . . Pattern: P ::= x ∣∣ Cn x1 · · · xn Value: V,W ::= x ∣∣ λx.M ∣∣ λx. λk.M ∣∣ Cn V1 · · · Vn Term: M ::= V W ∣∣ V W1W2∣∣ casexof P1 →M1; . . . ;Pn →Mn∣∣ let rec f1 = λx1.M1; . . . ; fn = λxn.Mn inN FIGURE 7. A CPS language for the λ-calculus in Fig. 5, as produced by the transform in Fig. 6. sum = λxs . λk . case xs of Nil → k 0 Cons x xs ′ → sum xs ′ (λy . k (x + y)) Now sum takes a continuation k as an extra parameter. In the Nil case, the answer is immediately passed to the continuation. In the Cons case, we perform a recursive call. Since, in the original expression x + sum xs, the first computation that would be made is the recursive call sum xs, the code for that call is now at the top of the CPS term. Its continuation will take the result y from the the recursion, add x, and return to the caller. (Here we suppose that + is a primitive in the CPS language and isn’t called using a continuation.) Administrative Normal Form CPS is expressive but heavyweight. A less radical alternative with many, though not all (Kennedy, 2007), of its virtues is the administrative normal form, better known as A-normal form or simply ANF. An ANF term has names for all arguments, and its evaluation order is spelled out, but function calls are not rewritten as tail calls. 22 Variable: x, f, . . . Constructor: Cn, . . . Pattern: P ::= x ∣∣ Cn x1 · · · xn Value: V,W ::= x ∣∣ λx.M ∣∣ Cn V1 · · · Vn Term: M,N ::= V ∣∣ letx = VW inM∣∣ casexof P1 →M1; . . . ; | Pn →Mn∣∣ let rec f1 = λx1.M1; . . . ; fn = λxn.Mn inN∣∣ letx = V inM FIGURE 8. The ANF language produced by the CPS language in Fig. 35 by an inverse CPS transform. ANF emerges naturally by considering the inverse of the CPS transform (Flanagan, Sabry, Duba, & Felleisen, 1993). In order to see what reductions of the CPS term do to the original term, we can consider transforming a λ-term to CPS, performing some reductions, then transforming it back. Not all reductions in CPS terms are interesting, however: the CPS transform introduces many λ-abstractions into the program, and β-reduction on these functions doesn’t correspond to any behavior of the original term. These reductions are called administrative reductions. In order to see what the CPS transform does from the perspective of the source language, then, we can consider translating a term to CPS, performing any administrative reductions we can, then translating back. The net result is a language (Fig. 8) in which all nontrivial values have names, but which is free of the “noise” of administrative reductions. For instance, the ANF for the sum function, taken using the inverse transform of the simplified CPS term above, is: 23 sum = λxs . case xs of Nil → 0 Cons x xs ′ → let y = sum xs ′ in x + y Compared to the original code, the only additional syntax is the binding of y for the partial result from the recursive call. Many algorithms don’t make direct use of the continuation terms in CPS, and thus they apply equally well in ANF. For example, conversion to SSA (to be seen shortly) works broadly the same way (Chakravarty, Keller, & Zadarnowski, 2003). Losing explicit continuations is not without cost, however. 
One downside is that CPS terms are closed under the β-rule, that is, applying the β-rule to a CPS term gives another CPS term, so inlining and substitution can be applied freely in CPS. In contrast, ANF requires some extra renormalization. More seriously, though ANF enjoys a formal correspondence to CPS, this correspondence gives no guarantees about important properties such as code size. We will return to this point in Section 2.4 when we introduce the sequent calculus. Comparing functional and imperative approaches The functional and imperative worlds often express the same concepts in different terms, such as loops vs. recursion. The same is true of ILs—what appear to be radically different approaches are often accomplishing the same things. In this section, we use ANF for comparison due to its lightweight syntax, but everything applies to CPS as well. 24 Hoisting vs. Floating Despite coming from such different programming traditions, functional representations and CFGs share much in common, and many optimizations apply in either case under different guises. For example, consider a variation of our array-sum code (Fig. 9a), where we sum the first k entries of an array starting at index i0. The loop thus exits when i ≥ i0 + k. This being a three- address code, i0 + k is broken out into its own computation, which we’ve named h. Notice, however, that i0 and k never change, and hence neither does h. So it is wasteful to calculate it again each time through the loop, and loop-invariant hoisting moves the assignment to h before the loop (Fig. 9b). Now consider an ANF version (Fig. 9c). It uses a tail-recursive function in place of the block, but it’s otherwise similar. Precisely the same effect is achieved by let floating—since h’s definition has no free variables defined inside loop, we can float it outside (Fig. 9d). Note that, in both cases, we need to be careful that there are no side effects interfering. Converting between functional and SSA In the case of SSA, one can do more than show that certain algorithms accomplish the same goal. SSA, CPS, and ANF are equivalent in the sense that one can translate faithfully between them (Kelsey, 1995; Appel, 1998b). Conceptually, it is unsurprising that these forms should be interderivable. The single-assignment property is fundamental to functional programming, and the correspondence between gotos and tail calls is well known (Reynolds, 1998). There remain two apparent differences: Nested structure As usually understood, the SSA namespace is “flat:” though we insist that each variable is defined once, all blocks can see each 25 i ← i0 s ← 0 loop: h ← i0 + k c ← i − h ifge c then done t ← a @ i s ← s + t i ← i + 1 jump loop done: return s (a) In three-address code. i ← i0 s ← 0 h ← i0 + k loop: c ← i − h ifge c then done t ← a @ i s ← s + t i ← i + 1 jump loop done: return s (b) After hoisting. let rec loop i s = let h = i0 + k c = i − h in if c ≥ 0 then done s else let t = a @ i s ′ = s + t i ′ = i + 1 in loop i ′ s ′ done s = s in loop i0 0 (c) In ANF. let h = i0 + k in let rec loop i s = let c = i − h in if c ≥ 0 then done s else let t = a @ i s ′ = s + t i ′ = i + 1 in loop i ′ s ′ done s = s in loop i0 0 (d) After let floating. FIGURE 9. Two representations of a function to compute a partial sum of the integers in an array, starting at index i0 and including k elements. 26 definition. In contrast, CPS and ANF are defined in terms of nested functions in languages that employ lexical scope. 
Thus we must ensure that each variable’s definition remains visible at each of its occurrences after translation. φ-nodes A φ-node has no obvious meaning in a functional language; we must somehow encode it in terms of the available constructs: functions, variables, applications, and definitions. We will consider each of these points in turn. First we turn to scope. The scoping rule for a functional program is simple: each occurrence of a variable must be inside the body of its binding. This means something slightly different for λ-bound and let-bound variables, but the effect is the same. SSA programs obey a similar invariant: each variable’s definition dominates each of its uses (Appel, 1998a). A block b1 dominates a block b2 if each path in the CFG from the start of the program to b2 goes through b1 (Prosser, 1959). If we make each block a function, then, we should be able to satisfy the scoping invariant, so long as we can nest each block’s function inside the functions for the block’s dominators. For CPS, blocks will naturally translate into continuations. In ANF, we can simply use regular functions; functions used for control flow in this way are often called join points, about which there will be much to say in Section 2.4. Fortunately, dominance does form a tree structure, which we can use directly as the nesting structure for the translated program. As it happens, this dominance tree is something already calculated in the process of efficient translation to SSA form (Cytron et al., 1991). Figure 10 demonstrates the dominance tree for a somewhat tangled section of code (Fig. 10a). The CFG is shown in Fig. 10b and the dominance 27 b0: i0 ← 0 s0 ← 0 b1: i ← φ(i0, i1) s ← φ(s0, s3) c1 ← i − n ifge c1 then b2 else b6 b2: c2 ← i%2 ifz c2 then b3 else b4 b3: t1 ← a@i t2 ← t1 ∗ 2 s1 ← s + t2 jumpb5 b4: t3 ← a@i s2 ← s + t3 ifz t3 then b7 else b5 b5: s3 ← φ(s1, s2) i1 ← i + 1 jumpb1 b6: c3 ← s − 100 ifgt c3 then b7 else b8 b7: s4 ← −1 b8: s5 ← φ(s , s4) returns5 (a) A program in SSA form. b0 b1 b2 b3 b4 b5 b6 b7 b8 (b) The CFG for (a). b0 b1 b2 b3 b4 b5 b6 b7 b8 (c) The dominance tree for (a). FIGURE 10. Translating from SSA to CPS or ANF using the dominance tree. 28 tree in Fig. 10c. Notice the impact of the edge from b4 to b7. Since a dominator must lie on every path from b0, b6 does not dominate b7 or b8. If not for that edge, it would dominate both. Now we tackle φ-nodes. As a programming-language construct, a φ-node is an oddity. It says what a variable’s value should be based on the previous state of control flow.6 If we are to reinterpret blocks as functions, then, each call should cause the φ-node to get a different value. The solution is, of course, that the φ-node should become a parameter to the function, transferring responsibility for the value from the block (function) to its predecessor (invoker). This can be formally justified by altering the way we draw the CFG. Since each entry in a φ-node relates the block to one of its predecessors, in a sense the value “belongs” on the edge of the CFG, not a vertex (see Fig. 11). SSA and CPS then represent the same information, but with the edge labels in different places. The result of translating Fig. 10a to ANF is shown in Fig. 12. 
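As a smaller, directly runnable illustration of the same recipe, here is the array-sum program of Fig. 3 written as nested Haskell functions: each block becomes a local function, each φ-node becomes a parameter, and each jump passes the values the φ-node would have selected. The rendering (names sumArray, loop, next, done; a list standing in for the array) is for illustration only.

    -- Fig. 3 as nested functions: the phi-nodes for i and s become parameters.
    sumArray :: [Int] -> Int
    sumArray a = b0
      where
        n = length a
        -- b0: set up the initial values and "jump" to the loop header
        b0 = loop 0 0
        -- loop: i and s were phi-nodes; callers now supply them
        loop i s
          | i >= n    = done s          -- ifge (i − n) then done
          | otherwise = next i s
        -- next: the loop body, ending in a jump back to the header
        next i s =
          let t  = a !! i               -- t  ← a @ i
              s' = s + t                -- s' ← s + t
              i' = i + 1                -- i' ← i + 1
          in  loop i' s'                -- jump loop with the new values
        -- done: return the accumulated sum
        done s = s

Every call here is a tail call, so treating loop, next, and done as join points recovers exactly the jumps of the original CFG; this is the same observation, at a smaller scale, as the full translation in Fig. 12.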
2.3 The Glasgow Haskell Compiler (GHC) Implementing Lazy Evaluation A compiler for any functional language has different concerns from those of an imperative language: higher-order functions and their closures are of paramount importance, interprocedural analysis is absolutely necessary, and alias analysis is an afterthought at most. But these are matters of emphasis rather than fundamental differences, as function application still works largely 6Oddly, this has been proposed before as a source language construct! A 1969 paper on optimization (Lowry & Medlock, 1969) suggested a novel form of declaration that would insinuate assignments to a variable at given line numbers in a program. Arguably, then, the φ-node was anticipated nearly twenty years beforehand, though of course its suitability in a source language is dubious. (To be sure, this was the era where goto was popular, so perhaps we should not be harsh.) 29 i0 ← 0 s0 ← 0 jump loop loop: i← φ(i0, i′) s← φ(s0, s′) c← i− n ifge c then done elsenext next : t← a@ i s′ ← s+ t i′ ← i+ 1 jump loop done: return s i0 ← 0 s0 ← 0 jump loop loop: c← i− n ifge c then done elsenext next : t← a@ i s′ ← s+ t i′ ← i+ 1 jump loop done: return s i = i0 s = s0 i = i′ s = s′ (b) The same, but moving φ-node values to the edges. i0 ← 0 s0 ← 0 jump loop(i0, s0) loop(i , s): c← i− n ifge c then done elsenext next : t← a@ i s′ ← s+ t i′ ← i+ 1 jump loop(i, s) done: return s (c) A modified CFG form with parameterized labels, similar to continuations in CPS. FIGURE 11. Modifying the CFG for Fig. 3 by moving the φ-node values into the predecessors. 30 λa. λn. let rec b0 = let i0 = 0 s0 = 0 in let rec b1 i s = let c1 = i − n in let rec b2 = let c2 = i % 2 in let rec b3 = let t1 = a @ i t2 = t1 ∗ 2 s1 = s + t2 in b5 s1 b4 = let t3 = a @ i s2 = s + t3 in if t3 = 0 then b7 else b5 s2 b5 s3 = let i1 = i + 1 in b1 i1 s3 in if c2 = 0 then b3 else b4 b6 = let c3 = s − 100 in if c3 > 0 then b7 else b8 s b7 = let s4 = −1 in b8 s4 b8 s5 = s5 in if c1 ≥ 0 then b2 else b6 in b1 i0 s0 in b0 FIGURE 12. The program from Fig. 10a, in ANF. 31 the same way; it is merely that one expects a different sort of function to be common in an ML program from a C program. Haskell is a more radical departure. Application is as important an operation as ever, but lazy evaluation turns it on its head. Variable lookup is put into particularly dramatic relief, as what was a single read could now be an arbitrary computation! Furthermore, this is not a benign change of execution model—a na¨ıve implementation has a disastrous impact on performance. The classical way to implement lazy evaluation is to use a memo-thunk (Hatcliff & Danvy, 1997; Okasaki, Lee, & Tarditi, 1994; Steele, Jr. & Sussman, 1976). This is a nullary function closure that will update itself when it finishes executing; on subsequent invocations, the new version will immediately return the cached answer. It is an effective strategy, and one might think that it merely shifts work from before a function is called to the first time its argument is needed. Unfortunately, this thought overlooks an important fact about modern hardware: an indirect jump, i.e., one to an address stored in memory rather than wired into the program, is much slower than a call to a known function, as it interferes with the pipelining and branch prediction that are crucial to performance. 
And in order to have a variable stand for a suspended computation, it must store the address of some code that will be executed, thus necessitating an indirect jump for each occurrence of each argument. Some improvements can be made—we can use tagged pointers to avoid an indirect call in the already-evaluated case, for instance (Marlow, Yakushev, & Peyton Jones, 2007)—but we cannot avoid at least one indirect function call per evaluated memo-thunk. The GHC optimizer’s fundamental task, then, is to avoid lazy evaluation as much as possible. Like polymorphism before it, laziness is a luxury for programming but a catastrophe for performance. 32 The Core Language There were two important design principles (Peyton Jones & Santos, 1998) behind the design of Core: 1. Provide an operational interpretation. Given the severe penalties associated with lazy evaluation, GHC cannot afford to be blase´ about fine details of evaluation order. Yet we would not want Core to be bogged down by operational details. Ideally, then, we want to choose constructs judiciously, so as to remain focused on the mission to eliminate laziness. 2. Preserve type information. Since type information is erased at run time, it is tempting to throw away types as soon as possible. But some passes, such as strictness analysis, can make good use of types if they are available. Perhaps more importantly, however, a typed IL can be of enormous use in developing the compiler itself by allowing an IL-level type checker to detect bugs early (Peyton Jones & Meijer, 1997): “It is quite difficult to write an incorrect transformation that is type correct.” Though GHC does not, one can also use typed representations for formal verification. If the target language of a transform has a suitable type system, one can prove powerful faithfulness properties such as full abstraction (Plotkin, 1977), which means roughly that a translation does not expose implementation details that could not be observed in the source language. This is an important security property—the detail in question could be someone’s password! For instance, a typed presentation of closure conversion (Minamide, Morrisett, & Harper, 1996) has been proved fully abstract (Ahmed & Blume, 2008), meaning not only that the conversion preserves meaning, but also that code written in the target 33 Variable: x, y, z, . . . Type Variable: α, . . . Data Constructor: Cn, . . . Type: τ ::= (see Fig. 14) Pattern: P ::= ∣∣ x ∣∣ Cn x1 · · · xn Term: M,N ::= x ∣∣ Cn ∣∣ λx : τ.M ∣∣ Λα.M ∣∣M N ∣∣M τ∣∣ letx : τ = M inN∣∣ let recx1 : τ1 = M1; . . . ;xn : τn = Mn inN∣∣ caseM of P1 →M1 | · · · | Pn →Mn FIGURE 13. The GHC Core language as of 2006 (GHC 6.6), before coercions were added. language cannot “cheat” by inspecting a function’s closure (perhaps to find a password). By employing a typed assembly language (Morrisett, Walker, Crary, & Glew, 1999), a whole compiler could, in principle, be proved fully abstract. The first consideration led to refined semantics for let and case and to the use of unboxed types and values. The second consideration led to the use of the polymorphic λ-calculus, better known as System F. Syntax The syntax of Core is given in Fig. 13. Besides the types, there are few surprises at the syntactic level. Semantics The operational reading of a let is that it allocates a thunk. 
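To make that operational reading concrete, the memo-thunk mechanism sketched earlier can be modeled in ordinary Haskell with a mutable cell. This is only a model of the idea, not how GHC's run-time system actually represents closures, which update themselves in place as described above.

    import Data.IORef

    -- A thunk is a cell holding either the suspended computation or, after
    -- the first force, the cached result. (Illustrative model only.)
    newtype Thunk a = Thunk (IORef (Either (IO a) a))

    delay :: IO a -> IO (Thunk a)
    delay suspended = Thunk <$> newIORef (Left suspended)

    force :: Thunk a -> IO a
    force (Thunk ref) = do
      contents <- readIORef ref
      case contents of
        Right v        -> return v       -- already evaluated: reuse it
        Left suspended -> do             -- first demand: run, then update
          v <- suspended
          writeIORef ref (Right v)
          return v

Every occurrence of a lazily bound variable must go through force, which is the per-occurrence indirection discussed above; pinning down what let and case mean operationally is precisely what lets the optimizer prove such cells unnecessary and avoid allocating them.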
Thunks are also allocated for nontrivial function arguments.7 A case of an expression M always forces the evaluation of M down to weak head-normal form, or WHNF, 34 meaning that a pattern match is always done on a λ, a literal, or a constructor application. The semantics of let and case allow GHC to reason about space usage and strictness. An evaluation that might otherwise remain suspended can be forced by putting it in a case, which doesn’t necessarily actually perform any pattern matching—the pattern can simply be some x to bind the result of computation, just as in the CPS form for a call-by-value language. For example, suppose we are compiling a function application f(x + y), where it is known that f is a strict function—that is, one that is certain to force its argument to be evaluated. Since we know that x+ y will be forced, it would be wasteful to allocate a memo-thunk for it. We’d end up with a closure in the heap, only for it to overwrite itself with the result of the addition. Possibly worse, evaluating the thunk would entail an indirect call, when we know right now what the call will be! Much better, then, to force the evaluation early by writing casex+ y of z → f z. Giving let and case these specific meanings relieves Core of needing any explicit constructs for dealing with memory or evaluation order, keeping the syntax very light. Types The Core type system is shown in Fig. 14. The basis of the language is System F, or more specifically System Fω, otherwise known as the higher-order polymorphic λ-calculus. This system describes both types, such as Int and Bool , and type constructors, such as Maybe and List . List is not itself a datatype per 7Some older versions of GHC enforced an ANF-like restriction that arguments be atomic; in these versions of Core, lets were the only terms that allocated thunks. Even in current GHC, however, Core is translated into a lower IL called STG, which does have this restriction, so function arguments that aren’t variables still wind up getting let-bound. 35 Type Variable: α, . . . Type Constructor: T, . . . Type: τ, σ ::= α ∣∣ T ∣∣ τ σ ∣∣ τ → σ ∣∣ ∀α : κ. τ Kind: κ ::= ? ∣∣ # ∣∣ κ1 → κ2 FIGURE 14. Types in the GHC Core language. se; it takes a parameter, so that List Int is a type. Thus types, themselves, have types, which are called kinds. Kinds come in three varieties: The kind ? is the kind of typical datatypes like Int and Bool . Arrow kinds κ1 → κ2, like arrow types, describe type constructors: List and Maybe have kind ? → ?. The kind # is particular to the Core type system; it is the kind of unboxed datatypes. Unboxed Types As mentioned above, a primary objective of Haskell optimization is to reduce laziness. To this end, GHC made an unusual choice for a lazy- language compiler by expressing true machine-level values in its high-level IL (Peyton Jones & Launchbury, 1991). This demonstrates that the difference between “high-level” and “low-level” is often a matter of design—if some aspect of execution on the target machine is absolutely crucial, it can be worth encoding that aspect at the high level. Boxed types are those represented by a pointer to a heap object (often an unevaluated thunk) and include all types that appear in standard Haskell code. Unboxed types are represented by raw memory or registers. For instance, an Int# is a machine-level integer, what C would call an int. This has two major ramifications: 36 1. A term of unboxed type must be computed strictly, rather than lazily. 
An Int# is represented by an actual native integer, not a thunk, so there is no mechanism for lazily computing it. Therefore, an argument or let-binding must be a simple variable or literal (as in ANF), not an expression that performs work8, since these constructs represent suspended (lazy) computations. Work returning an unboxed value must take place under a case, which as always serves to fix evaluation order. 2. Since types are (eventually) erased by the compiler, polymorphic functions rely on the uniform representation (pointer to heap object) of boxed types. Since unboxed types have different representations, they cannot be used as parameters to polymorphic types or functions. Core enforces this restriction by giving unboxed types the special kind #. A type variable or type constructor must be of kind ? or an arrow kind built from ?s (such as (? → ?) → ?). Since List has kind ? → ? and Int# has kind #, then, List Int# is a kind error. The payoff is that, in many ways, Int is just like any other datatype, and the same optimizations that eliminate constructors for other datatypes work to keep arithmetic unboxed. For instance, this is how GHC defines the addition operator in Core:9 data Int = I# Int# 8The actual invariant is a bit more subtle. Some expressions of unlifted type can be let- bound or passed as arguments, but only if they are known to evaluate quickly without side effects or possible errors. Such expressions are called ok-for-speculation, because there is no harm in executing them speculatively in the hopes of saving work later, say by moving them out of a loop. 9Actually, the operator belongs to a type class and is therefore defined for many types besides Int . This is, however, the implementation of (+) for Int . 37 (+) :: Int → Int → Int x + y = case x of I# x# → case y of I# y# → case x# +# y# of s# → I# s# Here +# is the primitive addition operator. The innermost case is necessary because x# +# y# is a term of unboxed type and thus cannot be used directly as the argument to I#. Now consider the optimization of the term x + y + z . A na¨ıve interpretation would compute x + y , getting back a boxed integer s, then compute s + z . Thus the box created for s is thrown out immediately, performing unnecessary work and straining the garbage collector. Let us see, then, what GHC does with it. Since (+) is so small, applications of it will always be inlined, so after inlining (that is, substituting and then β-reducing), we have: (x + y) + z ⇒  case x of I# x# → case y of I# y# → case x# +# y# of s# → I# s#  + z ⇒ case  case x of I# x# → case y of I# y# → case x# +# y# of s# → I# s#  of I# a# → case z of I# z# → case a# +# z# of I# t# → I# t# 38 Next, we perform the case-of-case transform, to be explored in detail in Section 2.4. Briefly, consider what happens to evaluate this term: First we must evaluate the inner case, which will eventually yield the value I# s#. Thus it is I# s# that will be scrutinized by the outer case, and we can rewrite the term to reflect this knowledge: ⇒ case x of I# x# → case y of I# y# → case x# +# y# of s# → case I# s# of I# a# → case z of I# z# → case a# +# z# of t# → I# t# Note that the case-of-case transform has exposed the box-unbox sequence as a redex—case I# s# of . . . can be reduced at compile time. 
Thus we eliminate the case and substitute s# for a#: ⇒ case x of I# x# → case y of I# y# → case x# +# y# of s# → case z of I# z# → case s# +# z# of t# → I# t# Now we have efficient three-way addition, the way one might write it by hand: unbox x and y ; add them; unbox z ; add to previous sum; box the result. Moreover, no special knowledge of (+) was required; merely applying the same algorithms used with all Core code exposed and eliminated the gratuitous boxing. 39 Coercions and System FC A primary requirement of any typed IL is that it can embed the type system of the source language. As type systems become more sophisticated, including features such as dependent types (Xi & Pfenning, 1999; Bove, Dybjer, & Norell, 2009; Brady, 2013), one might see reason for concern: can we retain the benefits of typed ILs without making them unwieldy? Fortunately, the answer appears to be “yes,” at least so far. GHC’s extensions to Haskell have begun to intermingle typing and computation with features such as generalized algebraic datatypes (Peyton Jones, Vytiniotis, Weirich, & Washburn, 2006; Xi, Chen, & Chen, 2003) and type families (Chakravarty, Keller, & Peyton Jones, 2005). These have been accomodated in Core by the addition of a much simpler extension, the coercion (Sulzmann, Chakravarty, Jones, & Donnelly, 2007). This extended λ-calculus is called System FC . The essential idea is that complex features of the type system can be handled entirely by the source-level typechecker, which annotates the generated Core with pre-calculated evidence, i.e., proofs showing that terms have the required types. These annotations take the form of type-safe casts M . γ, where M has some type τ and γ is a coercion of type10 τ ∼ τ ′, proving that τ and τ ′ are equivalent—so far as Core is concerned, they are the same type. 10Somewhat confusingly, the literature sometimes refers to coercions as types whose kinds have the form τ ∼ τ ′. The distinction is not consequential. 40 Example: Well-Typed Expressions For example, consider generalized algebric datatypes (GADTs), a popular extension to standard Haskell allowing datatypes to express constraints succinctly. The traditional example of a GADT is one for well-typed expressions: data Expr a where Lit :: a → Expr a Plus :: Expr Int → Expr Int → Expr Int Eq :: Expr Int → Expr Int → Expr Bool If :: Expr Bool → Expr a → Expr a → Expr a Because different constructors produce terms with different types, a nonsensical term like Plus (Lit 3) (Lit True) is a compile-time error. Even better, we can write a well-typed interpreter: eval :: Expr a → a eval e = case e of Lit a → a Plus x y → eval x + eval y Eq x y → eval x == eval y If b x y → if eval b then eval x else eval y Clearly the GADT representation is convenient, as even though expressions can denote different types, there is no need to tag the returned values or to check for error cases. However, the source-level typechecker’s job is now trickier. Notice that this single case expression has different types in different branches: the Plus branch returns an Int , but the Eq branch returns a Bool . So the typechecker needs to keep track of types that change in different 41 cases. Here it is only the return type that changes, but if eval took a second parameter of type a, that parameter’s type would change as well. We promised that the added complexity of GADTs could be isolated from the Core type system. 
GHC’s typechecker keeps this promise by rewriting a GADT like Expr as a regular datatype: data Expr a where Lit :: a → Expr a Plus :: Expr Int → Expr Int → a ∼ Int → Expr a Eq :: Expr Int → Expr Int → a ∼ Bool → Expr a If :: Expr Bool → Expr a → Expr a → Expr a We have kept the GADT syntax, but Expr is now a traditional datatype— its constructors all return Expr a, no matter what a is chosen to be. The caveat is that Plus requires a proof that a is actually Int , and similarly with Eq , so we still can’t use Plus to create an Expr Bool . Here is how eval is desugared into Core11: eval :: Expr a → a eval e = case e of Lit a → a Plus x y γ → (eval x + eval y) . γ−1 Eq x y γ → (eval x == eval y) . γ−1 If b x y → if eval b then eval x else eval y Now the Plus and Eq branches can access the coercion γ stored in the Expr . In the Plus case, γ has type a ∼ Int . Now, eval x + eval y has type Int , and we must return an a, so γ’s type is “backwards”—we need Int ∼ a. But of course type equivalence is symmetric, so we can always take the inverse γ−1. The Eq case is similar. 42 Typechecking is now straightforward. The only novelty is checking the coercions themselves, but this is not hard. 2.4 Sequent Calculus for GHC As mentioned in Section 2.2, ANF is more concise than CPS, yet it is formally equivalent. Thus reasoning about the observable behavior of a term’s CPS translation carries over to its ANF form (Flanagan et al., 1993). However, an optimizing compiler is concerned with much more than observable behavior— the equivalence can tell us only which transformations are correct, not which are desirable. Thus it is worth considering what we might be trading for the syntactic economy of ANF or plain λ-terms. Case Floating and Join Points When dealing with plain λ-terms, one important operation is called case floating (Santos, 1995; Peyton Jones & Santos, 1998). In general, if the first step of evaluating some term will be to evaluate a case, then the case can be brought to the top of the term. For instance, a case might return a function 11Technically, the if becomes a case in Core. 43 that will be applied immediately:  case b of True → λy . x + y False → λy . x ∗ y  5⇒ case b of True → (λy . x + y) 5 False → (λy . x ∗ y) 5 ⇒ case b of True → x + 5 False → x ∗ 5 As can be seen, the purpose of case floating is generally to bring terms together in the hope of finding a redex, i.e., an opportunity to perform a compile- time computation. Here, the case was always returning a λ, so moving the application inward lets the function call happen at compile time. One possible concern is that the argument has been duplicated. What if this 5 were instead some large expression?  case b of True → λy . x + y False → λy . x ∗ y  〈BIG〉 ⇒⇒ case b of True → x + 〈BIG〉 False → x ∗ 〈BIG〉 -- Oops! The β-reduction may be a Pyrrhic victory if it causes code size to explode— consider that the branches may, themselves, be case expressions, leading to exponential blow-up. The obvious solution, which GHC uses whenever a value is 44 about to become shared, is to introduce a let-binding:  case b of True → λy . x + y False → λy . x ∗ y  〈BIG〉 ⇒⇒ let arg = 〈BIG〉 in case b of True → x + arg False → x ∗ arg The situation is more complex in a call-by-value language: if b were an expression with side effects, we would need to be careful that 〈BIG〉 is still not evaluated until afterward. 
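Concretely, this application-of-case floating can be phrased as a rewrite on a toy expression type. The sketch below covers only the lazy-language version; the datatype and names are illustrative, and it ignores both the need for genuinely fresh binders and the size heuristics a real simplifier would apply.

data Expr
  = Var String
  | Lam String Expr
  | App Expr Expr
  | Let String Expr Expr
  | Case Expr [(Pat, Expr)]
  deriving Show

data Pat = PDefault | PCon String [String]
  deriving Show

-- Only trivial arguments may be duplicated into every branch.
isTrivial :: Expr -> Bool
isTrivial (Var _) = True
isTrivial _       = False

-- (case b of { p -> e; ... }) arg
--   ==>  case b of { p -> e arg; ... }                  if arg is small
--   ==>  let x = arg in case b of { p -> e x; ... }     otherwise
floatAppOfCase :: Expr -> Expr
floatAppOfCase (App (Case scrut alts) arg)
  | isTrivial arg = Case scrut [ (p, App rhs arg) | (p, rhs) <- alts ]
  | otherwise     = Let shared arg
                        (Case scrut [ (p, App rhs (Var shared)) | (p, rhs) <- alts ])
  where shared = "arg"   -- a real implementation would generate a fresh name
floatAppOfCase e = e

In a call-by-value language, the Let in the second case would additionally have to respect the evaluation-order concern just noted.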
In contrast, CPS does not need case floating as a special case, and it will avoid the sharing issue even in a call-by-value language. Here is the call-by- value CPS form of our term: λk .  λk1. case b of True → k1 (λy . λk2. k2 (x + y)) False → k1 (λy . λk2. k2 (x ∗ y))  (λf . C J〈BIG〉K (λy . f y k)) The CPS form of the case expression is now applied to the continuation that serves to evaluate 〈BIG〉 and apply it as an argument to whichever function comes out of the case. A simple β-reduction moves that continuation’s use into the branches, bringing the case to the top without any special rule: 45 λk . let k1 = λf . C J〈BIG〉K (λy . f y k) in case b of True → k1 (λy . λk2. k2 (x + y)) False → k1 (λy . λk2. k2 (x ∗ y)) Here we have used a variation on β-reduction that let-binds the argument rather than substituting it. Eliminating the redex, the way we did with the λ-term, is trickier but doable. As is, we would need to inline k1 at both call sites, which would duplicate 〈BIG〉 all over again. But we can instead give 〈BIG〉 its own binding and (since f was chosen fresh) float it out: λk . let arg = λk . C J〈BIG〉K k k1 = λf . arg (λy . f y k) in case b of True → k1 (λy . λk2. k2(x + y)) False → k1 (λy . λk2. k2(x ∗ y)) ⇒∗ λk . let arg = λk . C J〈BIG〉K k in case b of True → arg (λy . k (x + y)) False → arg (λy . k (x ∗ y)) Since k1 became small, we inlined it in the branches, and now those branches use x and arg directly, just as in the λ-term. 46 None of these procedures were specific to case statements; only floating and careful inlining were required. And floating can be performed aggressively on the CPS form, since there is no danger of changing evaluation order (Plotkin, 1975). What about ANF? Remember that ANF is derived from CPS by performing all administrative reductions, then translating back to direct style. So the na¨ıve ANF transform duplicates the context of any case. In practice, of course, implementations are smarter, avoiding administrative reductions that would duplicate too much code. The leftover continuations, which would have disappeared due to administrative reductions, are then called join points : let j = λf . let arg = 〈BIG〉 in f arg in case b of True → j (λy . x + y) False → j (λy . x ∗ y) Now, in CPS, every function call is a tail call and thus function application, including continuation invocation, is cheap. In ANF, however, function calls generally entail all the usual overhead12, so if we don’t treat j specially somehow, we will introduce that overhead in making j a function. 12It may seem from this discussion that ANF itself imposes function-call overhead compared to CPS. In fact, it merely makes implicit again what CPS expresses explicitly—a continuation closure represents the call stack, including the return pointer, as a λ-term. If code generation in turn uses the call stack to implement continuations, or if CPS code is translated back to direct style before code generation (Kennedy, 2007), then CPS vs. ANF has no lasting significance after the optimizer. Alternatively, the compiler can simply let the continuations be values like any other, using heap-allocated “stack frames” and forgoing the stack entirely. 47 Fortunately, j is special: like a continuation, it is only ever tail-called. Furthermore, it is only used in the context that defined it, and not under a λ- abstraction. Therefore all calls to j will return to the same point. 
Hence there is no need to push a stack frame with a return address in order to invoke it, and we can generate code for it as we would a continuation in a CPS IL. So we can create join points when translating to ANF, and we can recognize join points during code generation so that calling them is efficient. Have we regained everything lost from CPS to ANF? Unfortunately, no. The problem is that join points may not stay join points during optimization. Suppose our term, now in ANF with a join point, is part of a bigger expression: let g = λx . let j = λf . let arg = 〈BIG〉 in f arg in case b of True → j (λy . x + y) False → j (λy . x ∗ y) in case g 1 of 〈HUGE〉 By doing the latter, SML/NJ easily implements the call/cc operator, which allows user code to access its continuation, with little overhead (Appel, 1998a, §5.9). 48 Now suppose that this is the only reference to g (it does not appear in 〈HUGE〉). Thus we want to inline the call g 1 so we can remove the binding for g altogether. We can start by performing the β-reduction: case  let j = λf . let arg = 〈BIG〉 in f arg in case b of True → j (λy . x + y) False → j (λy . x ∗ y)  of 〈HUGE〉 This term is no longer in ANF. To renormalize, first we float out the let: let j = λf . let arg = 〈BIG〉 in f arg in case  case b of True → j (λy . x + y) False → j (λy . x ∗ y)  of 〈HUGE〉 Then, as before, we need to perform case floating. This time we’re doing the case-of-case transformation (Peyton Jones & Santos, 1998; Santos, 1995). It is similar to the previous “app-of-case” case, so we’ll need to make another join point. (Incidentally, here GHC would need to make a join point as well; the trick it used earlier to float out 〈BIG〉 won’t help.) 49 let j = λf . let arg = 〈BIG〉 in f arg j2 = λz . case z of 〈HUGE〉 in case b of True → j2 (j (λy . x + y)) False → j2 (j (λy . x ∗ y)) We’re back in ANF, but notice that j is no longer a join point—it’s now called in non-tail position. Thus the function-call overhead has crept back in. Introducing Sequent Calculus Clearly it is hazardous to represent join points as normal functions and expect to find them later still intact. Thus we would like a representation that treats them fundamentally differently. In particular, it would help to enforce syntactically the invariant that a join point must be tail-called. Needing to systematize the notion of “tail call” leads us to consider an encoding that makes control flow explicit, like CPS. CPS is syntactically heavy, however. More importantly, CPS makes too much control flow explicit. As was its initial purpose (Plotkin, 1975), CPS specifies syntactically the order in which expressions are evaluated, leaving nothing to the evaluation strategy. This would be cumbersome for the GHC optimization engine, which takes full advantage of Haskell’s underspecified evaluation order. Also, since CPS encodes the evaluation strategy, there is a different transform for each evaluation strategy, whereas there is a single correspondence between λ-calculi and sequent 50 Term Variable: x, f, . . . Cont. Variable: k, j, . . . Data Constructor: Cn, . . . Type: τ ::= (see Fig. 14) Coercion: γ ::= (omitted) Pattern: P ::= x ∣∣ Cn x1 . . . xn Term: M ::= x ∣∣ Cn ∣∣ λx : τ.M ∣∣ Λα : κ.M ∣∣ µk : τ.K Continuation: E ::= k ∣∣M · E ∣∣ τ · E ∣∣ .γ · E∣∣ case of P1 → K1; . . . ;Pn → Kn Command: K ::= 〈M |E〉 ∣∣ let recB1; . . . ;Bn inK∣∣ letB inK Binding: B ::= x = M ∣∣ cont k = E FIGURE 15. The syntax for Sequent Core. calculi. 
In other words, the first step to using CPS in GHC would be to worry over which CPS to use, whereas sequent calculus lets us forget about evaluation order until later. The sequent calculus was invented by Gerhard Gentzen in his study of logic, in order to prove properties of his other system, natural deduction (Gentzen, 1935). Decades later, it was realized that natural deduction is intimately related to the λ-calculus by what is now called the Curry–Howard isomorphism (Howard, 1980). More recently, there has been interest in the similar way that the sequent calculus can be seen as a programming language (Herbelin, 1995). We propose an IL called Sequent Core (Fig. 15), based on a lazy fragment of a sequent calculus, Dual System FC , that incorporates the type system of System FC . The sequent calculus divides the expression syntax into three categories: terms, which produce values; continuations, which consume values; and 51 commands, which apply values to continuations. Thus, computation is modeled as the interaction between a value and its context. Most of the term forms are familiar. The novel one is the µ-abstraction, whose syntax is borrowed from Parigot’s λµ-calculus (Parigot, 1992) describing control operators. It is written µk : τ.K, meaning bind the continuation as k and then perform the command K. The µ-abstraction arises by analogy with CPS, which represents each term as a function of a continuation; hence, we distinguish continuation bindings. Keeping this distinction, as well as other syntactic restrictions compared to CPS, makes it simple to convert freely between Core and Sequent Core as needed. The continuations comprise the observations that can be made of a term: we can apply an argument to it (either a value or a type argument); we can cast it using a coercion; we can perform case analysis on it; or we can return it to some context. A command either applies a term to a continuation or allocates using a let binding. Either a term or a continuation may be let-bound, either recursively or non-recursively. Note that recursive continuations don’t arise from translating Core, but just as join points could be recognized before, we can perform contification (Kennedy, 2007; Fluet & Weeks, 2001) to turn a consistently tail-called function (even a recursive one) into a continuation. In Sequent Core, however, we do not risk accidentally “ruining” a continuation. Consider again our problematic term: case b of True → λy . x + y False → λy . x ∗ y  〈BIG〉 Here it is as a term in Sequent Core: 52 µk . 〈 µk1. 〈 b ∣∣∣∣∣∣∣∣ case of True → 〈λy . µk2. 〈(+) | x · y · k2〉 | k1〉 False → 〈λy . µk2. 〈(∗) | x · y · k2〉 | k1〉 〉 ∣∣∣∣∣∣∣∣ 〈BIG〉 · k 〉 As before, we can share the application between the two branches as a join point. The lazy form of the µβ rule for applying a continuation is: 〈µk.K |E〉 ⇒ let cont k = E inK Thus we can perform a µβ-reduction (renaming k1 as j): µk . let cont j = 〈BIG〉 · k in〈 b ∣∣∣∣∣∣∣∣ case of True → 〈λy . µk2. 〈(+) | x · y · k2〉 | j 〉 False → 〈λy . µk2. 〈(∗) | x · y · k2〉 | j 〉 〉 Now, what happens in the troublesome case-of-case situation in Sequent Core? Suppose, again, there is a larger context: case  case b of True → λy . x + y False → λy . x ∗ y  〈BIG〉 of 〈HUGE〉 Our simplified Sequent Core term becomes: µk0. 〈  µk . let cont j = 〈BIG〉 · k in〈 b ∣∣∣∣∣∣∣∣ case of True → 〈λy . µk2. 〈(+) | x · y · k2〉 | j 〉 False → 〈λy . µk2. 
〈(∗) | x · y · k2〉 | j 〉 〉  ∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣ case of 〈HUGE〉 〉 Recall that GHC’s case-of-case transform would be pulling the outer case into both branches of the inner case. It would make a join point to avoid 53 duplicating 〈HUGE〉, but then it would still “ruin” j. But here, a simple µβ- reduction, subsituting for k afterward, gives: µk0. let cont j = 〈BIG〉 · case of 〈HUGE〉 in〈 b ∣∣∣∣∣∣∣∣ case of True → 〈λy . µk2. 〈(+) | x · y · k2〉 | j 〉 False → 〈λy . µk2. 〈(∗) | x · y · k2〉 | j 〉 〉 We pull the case into j, where it has a chance to interact with 〈BIG〉, perhaps by matching a known constructor. Observe that this would be a very unnatural code transformation on Core: normally, it would never make sense to move an outer context into some let-bound term. If nothing else, it would change the return type of j, from that of 〈BIG〉 to that of the branches 〈HUGE〉, which a correct transformation rarely does. But the invariants of Sequent Core make case-of-case a simple substitution like any other. In particular, since continuations are only typed according to their argument type, j has no “return type,” so substituting the outer context for k preserves types. Also, invariants about how k must be used (see below) ensure that we haven’t changed the outcome of any code path. Type and Scope Invariants A command is simply some code that runs; it has no type of its own. A well-typed command is simply one whose term and continuation have the same type. Similarly, a continuation takes its argument and runs, so it doesn’t have an “outgoing type” any more than a term has an “incoming type.” This may be worrisome—have we allowed control effects into our language? Haskell (and hence Core) is supposed to be a lazy language whose 54 evaluation order is loosely specified, yet so far, it seems we would allow terms that discriminate according to evaluation order. For example: µk . let cont j1 = case of { → 〈“Left first” | k〉 } cont j2 = case of { → 〈“Right first” | k〉 } in 〈(+) | (µk1. 〈() | j1〉) · (µk2. 〈() | j2〉) · k〉 In this term, whichever operand to (+) is evaluated first will pass a string directly to the continuation k, interrupting the whole computation and revealing the evaluation order. We have already seen how freely GHC rearranges terms because such changes are not supposed to be observable to Haskell programs; thus allowing programs such as this would be disastrous. If nothing else, the above term has no counterpart in Core, suggesting that the difference between Core and Sequent Core is so profound as to require rethinking the entire optimization pipeline. On the other hand, we do not want to compromise flexibility. A rule such as “k must occur free in each branch” would disallow having different branches return through different join points. Selective inlining and known- case optimizations can cause branches to diverge dramatically. Indeed, there is nothing objectionable about this term, which is superficially similar to the one above: µk . let cont j1 = case of { → 〈“It was true” | k〉 } cont j2 = case of { → 〈“It was false” | k〉 } in〈 b ∣∣∣∣∣∣∣∣ case of True → 〈() | j1〉 False → 〈() | j2〉 〉 55 The solution is simple, as suggested by (Kennedy, 2007). We impose the scoping rule that continuation variables may not appear free in terms. Thus, we say terms are continuation-closed. This forbids j1 and j2 from being invoked from argument terms, where they constitute an impure computation, but not from case branches, where they are properly used to describe control flow. 
It should be clear, then, that if the evaluation of the term µk.K completes normally, k must be invoked. An informal proof: If K has no let cont bindings, then invoking k is “the only way out.”13 If there is a local continuation (i.e., a join point) declared, then it can only recurse to itself or invoke k, and if it only recurses then computation does not complete normally. If there is a join point j1 and then another join point j2, then it may be that j2 is invoked, but it must eventually defer to k or to j1 (and thus eventually to k) or else loop forever; and so on. By induction, either execution fails, or it succeeds through k. The “inevitability” of a µ-bound continuation makes translating from Sequent Core back into Core easy. If we see the command 〈M |E〉, we can say confidently that the normal flow of control passes to E, so we translate E as the context of M . That is, we translate it as a fragment of syntax to be wrapped around M . For instance, writing D (for “direct style”) for the translation to Core, we have: D uwwv 〈 M ∣∣∣∣∣∣∣∣ case of True → K1 False → K2 〉}~ = caseD JM K of True → D JK1K False → D JK2K 13Note that, in general, a command has the structure of a tree of case expressions with continuation variables at the leaves. Without let cont, there is no way to bring new continuation variables into scope, and no terms (including arguments in continuations) may have free continuation variables, so only k may occur free at all. There may be branches with no continuation variables, since an empty case statement is allowed (typically when it is known that a term crashes or loops, so its continuation is dead code). Hence the stipulation at the start that computation “computes normally.” 56 D J〈M |E 〉Kk = D JE K [D JM Kk ] D Jlet B in K Kk = letD JBKk inD JK Kk D r let rec #» B in K z k = let rec # »D JBKk inD JK Kk D Jx = M Kk = x = D JM K D Jcont j = E Kk = j = λx → D JE Kk [x ] D JxK = x D JC K = C D Jλ x . M K = λ x .D JM K D JΛ α. M K = Λ α.D JM K D Jµ k . K K = D JK Kk D JkKk =  D Jj Kk = j  (j 6= k) D JM · E Kk = D JE Kk [D JM K ] D Jτ · E Kk = D JE Kk [τ ] D J. γ · E Kk = D JE Kk [ . γ] D r case of # » P → K z k = case of { # »P → D JKKk } FIGURE 16. The readback translation D from Sequent Core back to Core, as defined on commands, bindings, terms, and continuations. 57 Thus, we can easily return to Core after working in Sequent Core. The full definition of D is given in Fig. 16. The extra argument k is carried so that, when translating µk.K, we can distinguish between the k that marks the current context from other continuation variables introduced as join points. (Since terms are continuation-closed, only one such k can be µ-bound at a time.) There is some overhead in translating back and forth, but early experience suggests it is tolerable. 2.5 Join Points in Direct Style While the prototype implementation demonstrated Sequent Core’s feasibility, the sheer size and disruptiveness of the change posed major challenges in a production compiler. The central Expr datatype became three mutually-recursive datatypes for terms, continuations, and commands. Thus every module being adapted to the Sequent Core world had to have each pass rewritten, often with one loop turning into three. No mature project undertakes such disruption unless absolutely necessary—code review alone would have taken countless hours, to say nothing of the torrent of new bugs. Thus it is worth seeking a more modest solution providing the same benefits. 
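To give a sense of the scale of that change, here is roughly what the grammar of Fig. 15 looks like when rendered as Haskell datatypes. This is only a sketch with illustrative constructor names, not the prototype's actual definitions; types, coercions, literals, and type abstraction are omitted.

type Var  = String
type KVar = String              -- continuation (join-point) variables
data DataCon = DataCon String

-- One recursive Expr type becomes three mutually recursive types.
data Term                       -- produces a value
  = TVar Var
  | TCon DataCon [Term]
  | TLam Var Term
  | TCompute KVar Command       -- the mu-abstraction: name the continuation, run a command

data Kont                       -- consumes a value
  = Return KVar                 -- return to a named continuation
  | ApplyTo Term Kont           -- push an argument, then continue
  | CaseOf [(Pat, Command)]     -- case analysis on the incoming value

data Command                    -- a value meets its context
  = Eval Term Kont              -- <M | E>
  | LetIn Bind Command

data Bind = NonRec BindPair | Rec [BindPair]
data BindPair
  = BindTerm Var Term           -- ordinary value binding
  | BindKont KVar Kont          -- 'cont k = E': a join point

data Pat = PDefault | PCon DataCon [Var]

Every pass that used to traverse Expr must now traverse Term, Kont, and Command together, which is where one loop turns into three.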
The fact that we can translate back and forth offers a clue: by seeing what is preserved in the round trip and what isn’t, we may hope to find what we could add to Core so that we might perform the same optimizations on Core that we were performing on Sequent Core. To illustrate the need for care, consider: 〈 µ k . let cont j = E1 in K | E2 〉 ⇒ let cont j = E1 {E2/k} in K {E2/k} 58 This is a crucial fact, for it deals with the situation where we would be tempted to ruin a join point. If we translate both sides back to Core using D and apply a tempting “substitution lemma,” however, we get something nonsensical (here E stands for an evaluation context): E [let j = M in N ] ⇒ let j = E [M ] in E [N ] This would allow us to derive14: even ( let b = True in if b then 0 else 1) ⇒ let b = even True in even ( if b then 0 else 1) Not only is this wrong, but even True doesn’t so much as typecheck. What has gone wrong? Observe from Fig. 16 that D turns both normal let bindings and continuation bindings into Core let bindings. If we came from Sequent Core, our “rule” can’t tell whether j came from a join point or not. Worse, the putative substitution lemma fails, for reasons turning precisely on whether or not a given identifier is a join point. This is what we would like to write: D JK{E/k}Kk ⇔ D JE Kk [D JK Kk ] (Here ⇔ means equality up to reductions in either direction.) The intuition is that substituting a continuation for k corresponds to adding more context to the top level of a term. The “lemma” breaks down completely, however, when K is an invocation of a join point: D J〈M | j 〉 {E/k}Kk = D J〈M | j 〉Kk = j D JM K 14We can use even  as an evaluation context because even is strict. 59 6⇔ D JE Kk [j D JM K] = D JE Kk [D J〈M | j 〉Kk ] The problem is that the continuation k does not appear in the invocation of a join point in Sequent Core. This makes sense, since a join point is an alternate continuation that retains its own context. But it means that remembering what’s a join point and what isn’t is crucial for maintaining any connection between Core and Sequent Core strong enough to start deriving useful relations. Hence, anticipating the syntax introduced in Chapter III, we define D′ to be the same as D except: D′ Jcont j = E Kk = join j x = D′ JE Kk [x ] D′ Jj Kk = jump j  Practically, all we have done is added a Boolean flag to each binding. Yet this is very powerful information, as the rule E [let join j x = M in N ] ⇒ let join j x = E [M ] in E [N ] is not only sound but the key to avoiding ruining join points. To deal with our trouble with invoking a join point, we will also have this rule: E [jump j M ] ⇒ jump j M Together with the usual rules for case floating, we can now give a substitution lemma to make the desired rules derivable from D: Lemma 1. For any Sequent Core command K and continuation E: D′ JK{E/k}Kk ⇔ D′ JE Kk [D′ JK Kk ] 60 Thus many of the rules in Chapter III were derived by translating from Sequent Core. Since we also translated the type system, which enforces the restrictions on join points, we know that the translated rules are sound. This is the key advantage of using translation as a strategy for language extension: delicate matters such as what exactly the constraints on join points are or how case floating can work with them can be handled simply by appealing to translation. 
Since the sequent calculus handles join points so naturally, the soundness of the rule can be proved easily, and then translation saves us the implementation trouble of using a brand-new intermediate language for optimization. 61 CHAPTER III Improving Compilation of Haskell Most of the text in this chapter comes from (Maurer, Downen, Ariola, & Peyton Jones, 2017), which was a collaboration with Paul Downen (UO), Zena M. Ariola (UO), and Simon Peyton Jones (MSR). I led the research and did most of the writing. I also performed the software development (namely the extension to GHC), under Simon’s helpful guidance. Consider this code, in a functional language: if (if e1 then e2 else e3) then e4 else e5 Many compilers will perform a commuting conversion (Girard, Taylor, & Lafont, 1989), which na¨ıvely would produce: if e1 then (if e2 then e4 else e5) else (if e3 then e4 else e5) Commuting conversions are tremendously important in practice (Sec. 3.1), but there is a problem: the conversion duplicates e4 and e5. A natural countermeasure is to name the offending expressions and duplicate the names instead: let { j4 () = e4; j5 () = e5 } in if e1 then (if e2 then j4 () else j5 ()) else (if e3 then j4 () else j5 ()) We describe j4 and j5 as join points, because they say where execution of the two branches of the outer if joins up again. The duplication is gone, but a new problem has surfaced: the compiler may allocate closures for locally-defined functions like j4 and j5. That is bad because allocation is expensive. And it 62 is tantalizing because all we are doing here is encoding control flow: it is plain as a pikestaff that the “call” to j4 should be no more than a jump, with no allocation anywhere. That’s what a C compiler would do! Some code generators can cleverly eliminate the closures, but perhaps not if further transformations intervene. The reader of Appel’s inspirational book (Appel, 1992) may be thinking “Just use continuation-passing style (CPS)!” When expressed over CPS terms, many classic optimizations boil down to β-reduction (i.e.function application), or arithmetic reductions, or variants thereof. And indeed it turns out that commuting conversions fall out rather naturally as well. But using CPS comes at a fairly heavy price: the intermediate language becomes more complicated, some transformations are harder or out of reach, and (unlike direct style) CPS commits to a particular evaluation order. Inspired by Flanagan et al. (Flanagan et al., 1993), the reader may now be thinking “OK, just use administrative normal form (ANF)!” That paper shows that many transformations achievable in CPS are equally accessible in direct style. ANF allows an optimizer to exploit CPS technology without needing to implement it. The motto is: Think in CPS; work in direct style. But alas, a subsequent paper by Kennedy shows that there remain transformations that are inaccessible in ANF but fall out naturally in CPS (Kennedy, 2007). So the obvious question is this: could we extend ANF in some way, to get all the goodness of direct style and the benefits of CPS? In this paper we say “yes!”, making the following contributions: – We describe a modest extension to a direct-style λ-calculus intermediate language, namely adding join points (Sec. 3.2). We give the syntax, type system, and operational semantics, together with optimising transformations. 63 – We describe how to infer which ordinary bindings are in fact join points (Sec. 3.3). 
In a CPS setting this analysis is called contification (Kennedy, 2007), but it looks rather different in our setting. – We show that join points can be recursive, and that recursive join points open up a new and entirely unexpected (to us) optimization opportunity for fusion (Sec. 3.4). In particular, this insight fully resolves a long- standing tension between two competing approaches to fusion, namely stream fusion (Coutts, Leshchinskiy, & Stewart, 2007) and unfold/destroy fusion (Svenningsson, 2002). – We give some metatheory in Sec. 3.5, including type soundness and correctness of the optimizing transformations. We show the safety of adding jumps as a control effect by establishing an equivalence with System F. – We demonstrate that our approach works at scale, in a state-of-the-art optimizing compiler for Haskell, GHC (Sec. 3.6). As hoped, adding join points turned out to be a very modest change, despite GHC’s scale and complexity. Like any optimization, it does not make every program go faster, but it has a dramatic effect on some. Overall, adding join points to ANF has an extremely good power-to-weight ratio, and we strongly recommend it to any direct-style compiler. Our title is somewhat tongue-in-cheek, but we now know of no optimizing transformation that is accessible to a CPS compiler but not to a direct-style one. 3.1 Motivation and Key Ideas We review compilation techniques for commuting conversions, to expose the challenge that we tackle in this paper. For the sake of concreteness we 64 describe the way things work in GHC. However, we believe that the whole paper is equally applicable to a call-by-value language. Case-of-Case Transformation Consider these function definitions: isNothing :: Maybe a -> Bool isNothing x = case x of Nothing -> True Just _ -> False mHead :: [a] -> Maybe a mHead ps = case ps of [] -> Nothing (p:_) -> Just p null :: [a] -> Bool null as = isNothing (mHead as) Here null is a simple composition of the library functions isNothing and mHead. When the optimizer works on null, it will inline both isNothing and mHead to yield: null as = case (case as of [] -> Nothing (p:_) -> Just p) of { Nothing -> True; Just _ -> False } Executed directly, this would be terribly inefficient; if the argument list is non- empty we would allocate a result Just p only to immediately decompose it. We want to move the outer case into the branches of the inner one, like this: null as = case as of [] -> case Nothing of Nothing -> True 65 Just z -> False p:_ -> case Just p of Nothing -> True Just _ -> False This is a commuting conversion, specifically the case-of-case transformation. In this example, it now happens that both inner case expressions scrutinize a data constructor, so they can be simplified, yielding null as = case as of { [] -> True; _:_ -> False } which is exactly the code we would have written for null from scratch. GHC does a tremendous amount of inlining, including across modules or even packages, so commuting conversions like this are very important in practice: they are the key that unlocks a cascade of further optimizations. Join Points Commuting conversions have a problem, though: they often duplicate the outer case. In our example that was OK, but what about case (case v of { p1 -> e1; p2 -> e2 }) of { Nothing -> BIG1; Just x -> BIG2 } where BIG1 and BIG2 are big expressions? 
We do not want to duplicate these large expressions, or we would risk bloating the size of the compiled code, perhaps exponentially when case expressions are deeply nested (Lindley, 2005). It is easy to avoid this duplication by first introducing an auxiliary let binding: let { j1 () = BIG1; j2 x = BIG2 } in case (case v of { p1 -> e1; p2 -> e2 }) of { Nothing -> j1 (); Just x -> j2 x } 66 Now we can move the outer case expression into the arms of the inner case, without duplicating BIG1 or BIG2, thus: let { j1 () = BIG1; j2 x = BIG2 } in case v of p1 -> case e1 of Nothing -> j1 () Just x -> j2 x p2 -> case e2 of Nothing -> j1 () Just x -> j2 x Notice that j2 takes as its parameter the variable bound by the pattern Just x, whereas j1 has no parameters1. Compiling Join Points Efficiently We call j1 and j2 join points because you can think of them as places where control joins up again, but so far they are perfectly ordinary let-bound functions, and as such they will be allocated as closures in the heap. But that’s ridiculous: all that is happening here is control flow splitting and joining up again. A C compiler would generate a jump to a label, not a call to a heap- allocated function closure! So, right before code generation, GHC performs a simple analysis to identify bindings that can be compiled as join points. This identifies let-bound functions that will never be captured in a closure or thunk, and will only be tail- called with exactly the right number of arguments. (We leave the exact criteria for Sec. 3.3.) These join-point bindings do not allocate anything; instead a tail call to a join point simply adjusts the stack and jumps to the code for the join point. 1The dummy unit parameter is not necessary in a lazy language, but it is in a call-by-value language. 67 The case-of-case transformation, including the idea of using let bindings to avoid duplication, is very old; for example, both are features of Steele’s Rabbit compiler for Scheme (Steele, 1978). In Rabbit the transformation is limited to booleans, but the discussion above shows that it generalizes very naturally to arbitrary data types. In this more general form, it has been part of GHC for decades (Peyton Jones & Santos, 1998). Likewise, the idea of generating different (and much more efficient) code for non-escaping let bindings is well established in many other compilers (Tolmach & Oliva, 1998; Reppy, 2002; Keep, Hearn, & Dybvig, 2021) as well as GHC. Preserving and Exploiting Join Points So far so good, but there is a serious problem with recognizing join points only in the back end of the compiler. Consider this expression: case (let j x = BIG in case v of { A -> j 1; B -> j 2; C -> True } of { True -> False; False -> True } Here j is a join point. Now suppose we do case-of-case on this expression. Treating the binding for j as an ordinary let binding (as GHC does today), we move the outer case past the let, and duplicate it into the branches of the inner case, yielding let j x = BIG in case v of A -> case (j 1) of { True -> False; False -> True } B -> case (j 2) of { True -> False; False -> True } C -> case True of { True -> False; False -> True } The third branch simplifies nicely, but the first two do not. There are two distinct problems: 68 1. The binding for j is no longer a join point (it is not tail-called), so the super-efficient code generation strategy does not apply, and the compiler will allocate a closure for j at runtime. 
This happens in practice: we have cases in which GHC's optimizer actually increases allocation because it inadvertently destroys a join point.
2. Even worse, the two copies of the outer case now scrutinize an uninformative call like (j 1). So the extra code bloat from duplicating the outer case is entirely wasted. And it's a huge lost opportunity, as we shall see.
So it is not enough to generate efficient code for join points; we must identify, preserve, and exploit them. In our example, if the optimizer knew that the binding for j is a join point, it could exploit that knowledge to transform our original expression like this:
let j x = case BIG of True  -> False
                      False -> True
in case v of
     A -> j 1
     B -> j 2
     C -> case True of { True -> False; False -> True }
This is much, much better than our previous attempt:
– The outer case has moved into the right-hand side of the join point, so it now scrutinizes BIG. That's good, because BIG might be a data constructor or a case expression (which would expose another case-of-case opportunity). So the outer case now scrutinizes the actual result of the expression, rather than an uninformative join-point call. That solves problem (2).
– The A and B branches do not mention the outer case, because it has moved into the join point itself. So j is still tail-called and remains an efficiently-compiled join point. That solves problem (1).
– The outer case still scrutinizes the branches that do not finish with a join-point call, e.g. the C branch.
The Key Idea Thus motivated, in the rest of this paper we explore the following very simple idea:
– Distinguish certain let bindings as join-point bindings, and their (tail-)call sites as jumps.
– Adjust the case-of-case transformation to take account of join-point bindings and jumps.
– In all the other transformations carried out by the compiler, ensure that join points remain join points.
Our key innovation is that, by recognizing join points as a language construct, we both preserve join points through subsequent transformations and exploit join points to make other transformations more effective. Next, we formalize this approach; subsequent sections develop the consequences.
3.2 System FJ: Join Points and Jumps
We now formalize the intuitions developed so far by describing System FJ, a small intermediate language with join points. FJ is an extension of GHC's Core intermediate language (Peyton Jones & Santos, 1998). We omit existentials, GADTs, and coercions (Sulzmann et al., 2007), since they are largely orthogonal to join points.
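Before the formal definitions, the shape of the extension can be sketched in a few lines of Haskell: a direct-style expression type gains a join-binding form and a jump form, and the adjusted case-of-case transformation pushes an outer case through tail positions while leaving jumps alone. The sketch is untyped and uses illustrative names, so the type argument that a jump carries in FJ is omitted, as are capture-avoidance and code-size concerns.

type Name = String

data Expr
  = Var Name
  | Lit Int
  | Lam Name Expr
  | App Expr Expr
  | Let Name Expr Expr
  | Case Expr [Alt]
  | Join Name [Name] Expr Expr     -- join j xs = rhs in body
  | Jump Name [Expr]               -- jump j args

data Alt = Alt Pat Expr
data Pat = PDefault | PCon Name [Name]

-- Push an outer 'case . of alts' into the tail positions of an expression.
-- A join binding receives the context in both its right-hand side and its
-- body; a jump simply discards it; anything else is wrapped in the case.
pushCase :: [Alt] -> Expr -> Expr
pushCase outerAlts = go
  where
    go (Join j xs rhs body) = Join j xs (go rhs) (go body)
    go (Jump j args)        = Jump j args
    go (Let x rhs body)     = Let x rhs (go body)
    go (Case scrut alts)    = Case scrut [ Alt p (go e) | Alt p e <- alts ]
    go e                    = Case e outerAlts

This is a rough composite of the float, jfloat, casefloat, and abort axioms given later in Fig. 20, with the evaluation context fixed to a single case.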
τ | T Datatype Frames, evaluation contexts, and stacks F ::=  v |  τ Application | caseof # »p→ u Case scrutinee | join jb in Join point E ::=  | F [E] Evaluation contexts s ::= ε | F : s Stacks Tail contexts L ::=  Empty unary context | case eof # »p→ L Case branches | let vb inL Body of let | join j #»a # »x:σ = L inL′ Join point, body | join rec # »j #»a # »x:σ = L inL′ Rec join points, body Miscellaneous C ∈ General single-hole term contexts Σ ::= · | Σ, x:σ = v Heap c ::= 〈e; s; Σ〉 Configuration FIGURE 17. Syntax of System FJ . 71 Syntax System FJ is a simple λ-calculus language in the style of System F, with let expressions, data type constructors, and case expressions; its syntax is given in Fig. 17. System FJ is an explicitly-typed language, so all binders are typed, but in our presentation we will often drop the type annotations. The join-point extension is highlighted in the figure and consists of two new syntactic constructs: – A join binding that declares a join point. Each join point has a name, a list of type parameters, a list of value parameters, and a body. – A jump expression that invokes a join point, passing all indicated arguments as well as an additional type argument (as discussed below). Although we use curried syntax for jumps, join points are polyadic; partial application is not allowed. Static semantics The type system for System FJ is given in Fig. 18, where typeof gives the type of a constructor and ctors gives the set of constructors for a datatype. The typing judgement carries two environments, Γ and ∆, with ∆ binding join points. The environment ∆ is extended by a join (rules JBind and RJBind) and consulted at a jump. Note that we rely on scoping conventions in some places: if Γ; ∆ ` e : τ , then every variable (type or term) free in e or τ appears in Γ, and the symbols in Γ are unique. Similarly, every label free in e appears in ∆. To enforce that jumps are not used as side effects, ∆ is reset in every premise for a subterm whose runtime context is not statically known. For example, consider join j x = RHS in f (jump j True Int). Here the context in which the jump is invoked is not statically known—in a lazy language it 72 Γ; ∆ ` e : τ (x:τ) ∈ Γ Γ; ∆ ` x : τ Var typeof(K) = ∀ #»a . #»σ → T #»a # » Γ; ε ` u : σ{a/ϕ} Γ; ∆ ` K #»ϕ #»u : T #»ϕ Con Γ, (x:σ); ε ` e : τ Γ; ∆ ` λ(x:σ). e : σ → τ Abs Γ, a; ε ` e : τ Γ; ∆ ` Λa. e : ∀a. τ TAbs Γ; ∆ ` e : σ → τ Γ; ε ` u : σ Γ; ∆ ` e u : τ App Γ; ∆ ` e : ∀a. τ Γ; ∆ ` e ϕ : τ{a/ϕ} TApp (j:∀ #»a . #»σ → ∀r. r) ∈ ∆ # » Γ; ε ` u : σ # »{a/ϕ} Γ; ∆ ` jump j #»ϕ #»u τ : τ Jump Γ; ε ` u : σ Γ, x:σ; ∆ ` e : τ Γ; ∆ ` letx:σ = u in e : τ VBind # » Γ, # »x:σ; ε ` u : σ Γ, # »x:σ; ∆ ` e : τ Γ; ∆ ` let rec # »x:σ = u in e : τ RVBind Γ, #»a , # »x:σ; ∆ ` u : τ Γ; ∆, (j:∀ #»a . #»σ → ∀r. r) ` e : τ Γ; ∆ ` join j #»a # »x:σ = u in e : τ JBind # » Γ, #»a , # »x:σ; ∆, # » j:∀ #»a . #»σ → ∀r. r ` u : τ Γ; ∆, # »j:∀ #»a . #»σ → ∀r. r ` e : τ Γ; ∆ ` join rec # »j #»a # »x:σ = u in e : τ RJBind Γ; ∆ ` e : T #»ϕ # »typeof(K) = ∀ #»a . #»σ → T #»a # » #»ν = #»σ # »{a/ϕ} # » Γ, # »x:ν ; ∆ ` u : τ ctors(T ) = { #»K} Γ; ∆ ` case eof # »K # »x:ν → u : τ Case FIGURE 18. Type system for System FJ . 73 depends on how f uses its argument—so it cannot be compiled to “adjust the stack and jump.” So j is not a valid join point. We exclude such terms by resetting ∆ to ε when typechecking the argument in rule App. Nevertheless, the typing of join points is a little bit more flexible than you might suspect. 
Consider this expression:  join j x = RHS in case v of A → jump j True C2C B → jump j False C2C C → λc. c  ’x’ where C2C = Char → Char . This is certainly well typed. A valid transformation is to move the application to ’x’ into both the body and the right hand side of the join, thus: join j x = RHS ’x’ in  case v of A → jump j True C2C B → jump j False C2C C → λc. c  ’x’ Now we can move the application into the branches: join j x = RHS ’x’ in case v of A → (jump j True C2C) ’x’ B → (jump j False C2C) ’x’ C → (λc. c) ’x’ Should this be well typed? The jumps to j are not exactly tail calls, but they can (and indeed must) discard their context—here the application to ’x’—and 74 resume execution at j. We will see shortly how this program can be further transformed to remove the redundant applications to ’x’, but the point here is that this intermediate program is still well typed, as reflected by the fact that ∆ is not reset in the function part of an application (rule App). The types given to join points themselves deserve some attention. A join point that binds type variables #»a and value arguments of types #»σ is given the type ∀ #»a . #»σ → ∀r. r (rule JBind). The return type indicated, namely ∀r. r, is often written ⊥, and it indicates a non-returning function: a function which does not actually return can be safely given any return value. This is similar to how Haskell’s error function has type ∀a. String → a. We have merely moved the universal quantification to the end for consistency with the join syntax, which does not (and must not2) bind this “return-type parameter.” So a join point’s type does not reflect the value of its body, and a jump can have any type whatsoever. What then keeps a join point from returning arbitrary values? It is the JBind rule (or its recursive variant) that checks the right hand side of the join point, making sure it is the same as that of the entire join expression. Thus we cannot have join j = "Gotcha!" in if b then jump j Int else 4 because j returns a String but the body of the join returns an Int . In short, the burden of typechecking has moved: whereas a function can be declared to return any type but can only be invoked in certain contexts, a join point can be invoked in any context but can only return a certain type. 2When we introduce the abort axiom (Sec. ), it will need to change this type argument arbitrarily, which it can only safely do if the type is never actually used in the other parameters. 75 〈e; s; Σ〉 7→〈e′; s′; Σ′〉 〈F [e]; s; Σ〉 7→ 〈e; F : s; Σ〉 (push) 〈λx. e;  v : s; Σ〉 7→ 〈e; s; Σ, x = v〉 (β) 〈Λa. e; ϕ : s; Σ〉 7→ 〈e{a/ϕ}; s; Σ〉 (βτ ) 〈let vb in e; s; Σ〉 7→ 〈e; s; Σ, vb〉 (bind) 〈x; s; Σ[x = v]〉 7→ 〈v; s; Σ[x = v]〉 (look)〈 K #»ϕ #»v ; caseof # »alt : s; Σ 〉 7→ 〈u; s; Σ, # »x = v〉 if (K #»x → u) ∈ # »alt (case) 〈 jump j #»ϕ #»v τ ; s′++(join jb in : s); Σ 〉 7→ 〈 u # »{a/ϕ}; join jb in : s; Σ, # »x = v 〉 (jump) if (j #»a #»x = u) ∈ jb〈 A; join jb in : s; Σ 〉 7→ 〈A; s; Σ〉 (ans) FIGURE 19. Call-by-name operational semantics for System FJ . Finally, the reader may wonder why join points are polymorphic (apart from the result type). In FJ as presented here, we could manage with monomorphic join points, but they become absolutely necessary when we add data constructors that bind existential type variables. We omitted existentials from this paper for simplicity, but they are very important in practice and GHC certainly supports them. Operational Semantics We give System FJ an operational semantics (Fig. 
19) in the style of an abstract machine. A configuration of the machine is a triple 〈e; s; Σ〉 consisting of an expression e which is the current focus of execution; a stack s representing the current evaluation context (including join-point bindings); and a heap Σ of value bindings. The stack is a list of frames, each of which is an argument to apply, a case analysis to perform, or a bound join point (or recursive group). 76 Each frame is moved to the stack via the push rule. Most of the rules are quite conventional. We describe only call-by-name evaluation here, as rule look shows; switching to call-by-need by pushing an update frame is absolutely standard. Note that only value bindings are put in the heap. Join points are stack- allocated in a frame: they represent mere code blocks, not first-class function closures. As expected, a jump throws away its context (the jump rule); it does so by popping all the frames from the stack to the binding (as usual, ++ stands for the concatenation of two stacks): 〈 join j x = x in case (jump j 2 (Int → Bool)) 3of . . .; ε; ε 〉 7→? 〈 jump j 2 (Int → Bool);  3 : caseof . . . : join j x = x in : ε; ε 〉 7→〈x; join j x = x in : ε; x = 2〉 Here three frames are pushed onto the stack: the join-point binding, the case analysis, and finally the application of 3 to the jump. Then the jump is evaluated, popping the latter two frames, replacing the term with the one from the join point, and binding the argument. The ans rule removes a join-point binding from the context once an answer A (see Fig. 17) is computed; note that a well-typed answer cannot contain a jump, so at that point the binding must be dead code. Continuing our example: 〈x; join j x = x in : ε; x = 2〉 7→? 〈2; ε; x = 2〉 Optimizing Transformations The operational semantics operates on closed configurations. An optimizing compiler, by contrast, must transform open terms. To describe 77 e = e′ (λx:σ. e) v = letx:σ = v in e (β) (Λa. e)ϕ = e{a/ϕ} (βτ ) let vb inC[x] = let vb inC[v] (inline) if (x:σ = v) ∈ vb let vb in e = e (drop) join jb inL[ #»e , jump j #»ϕ #»v τ, #» e′ ] = join jb inL[ #»e , let # »x:σ = v inu # »{a/ϕ}, #»e′ ] (jinline) if (j #»a # »x:σ = u) ∈ jb join jb in e = e (jdrop) caseK #»ϕ #»v of # » alt = let # »x:σ = v in e (case) if (K # »x:σ → e) ∈ # »alt E[case eof # » K #»x → u] = case eof # »K #»x → E[u] (casefloat) E[let vb in e] = let vb inE[e] (float) E[join jb in e] = joinE[jb] inE[e] (jfloat) E[jump j #»ϕ #»e τ ] : τ ′ = jump j #»ϕ #»e τ ′ (abort) FIGURE 20. Common optimizations for System FJ . possible optimizations, then, we separately develop a sound equational theory (Fig. 20), which lays down the “rules of the game” by which the optimizer is allowed to work. It is up to the optimizer to determine how to apply the rules to rewrite code. All the axioms carry implicit scoping restrictions to avoid free- variable capture. (For example, drop requires that nothing bound by vb occurs free in e.) The β, βτ , and case axioms are analogues of the similarly-named rules in the operational semantics. Since there is no heap, β and case create let expressions instead. Compile-time substitution, or inlining, is performed for values by inline and for join points by jinline. If a binding is inlined exhaustively, it becomes dead code and can be eliminated by the drop or jdrop axiom. Values may be substituted anywhere3, which we indicate using a general 78 single-hole context C in inline. Inlining of join points is a bit more delicate. 
A jump indicates both that we should execute the join point and that we should throw out the evaluation context up to the join point’s declaration. Simply copying the body accomplishes the former but not the latter. For example: join j (x : Int) = x+ 1 in (jump j 2 (Int → Int)) 3 If we na¨ıvely inline j here, we end up with the ill-typed term: join j (x : Int) = x+ 1 in (2 + 1) 3 Inlining is safe, however, if the jump is a tail call, since then there is no extra evaluation context to throw away. To specify the allowable places to inline a join point, then, we use a syntactic notion called a tail context. A tail context L (see Fig. 17) is a multi-hole context describing the places where a term may return to its evaluation context. Since  3 is not a tail context, the jinline axiom fails for the above term. The casefloat , float , and jfloat axioms perform commuting conversions. Of the three, jfloat is novel. It does the transformation we wanted to perform in Sec. to avoid destroying a join point. It relies on a simple meta-syntactic function E[·] to push E into a join-point binding: E[j #»a #»x = u] , (j #»a #»x = E[u]) E[rec # » j #»a #»x = u] , (rec # »j #»a #»x = E[u]) 3For brevity, we have omitted rules allowing inlining a recursive definition into the definition itself (or another definition in the same recursive group). 79 Consider again the example at the beginning of Sec. . With our new syntax, we can write it as: case  join j x = BIG in case v of A → jump j 1 Bool B → jump j 2 Bool C → True  of {True → False; False → True} We can use jfloat to move the outer case into both the right hand side of the join binding and into its body; use casefloat to move the outer case into the branches of the inner case; use abort to discard the outer case where it scrutinizes a jump; and use case to simplify the C alternative. The result is just what we want: join j x = caseBIGof {True → False; False → True} in case v of A → jump j 1 Bool B → jump j 2 Bool C → False The commute Axiom The left-hand sides of axioms float , jfloat , and casefloat enumerate the forms of a tail context. That suggests that the three axioms are all instances of a single more general (yet equivalent) form: E[L[ #»e ]] = L[ # » E[e]] (commute) To apply commute (forward) is to move the evaluation context into each hole of the tail context. Since the tail context describes the places where something is 80 e = e′ let f = Λ #»a . λ #»x . u inL[ #»e ] : τ = join j #»a #»x = u inL[ # » tailρ(e)] (contify) if ρ(f #»a #»x ) = jump j #»a #»x τ and f /∈ fv(L), u : τ let rec # » f = Λ #»a . λ #»x . L[ #»u ] inL′[ #»e ] : τ = join rec # » j #»a #»x = L[ # » tailρ(u)] inL ′[ # » tailρ(e)] (contifyrec) if # » ρ(f #»a #»x ) = jump j #»a #»x τ and # » f /∈ fv( #»L ), f /∈ fv(L′), # »L[ #»u ] : τ tailρ(f #»σ #»u ) , e # »{a/σ} # »{x/u} if ρ(f #»a #»x ) = e and dom(ρ) ∩ fv( #»u ) = ∅ tailρ(e) , e if dom(ρ) ∩ fv(e) = ∅ tailρ(e) , undefined otherwise FIGURE 21. Contification as a source-to-source transformation. returned to the evaluation context, commute “substitutes” the context into the places where it is invoked.4 We can also derive new axioms succinctly using tail contexts. For example, our commuting conversions as written do quite a bit of code duplication by copying E arbitrarily many times (into each branch of a case and each join point). 
Of course, in a real implementation, we would prefer not to do this, so instead we might use a different axiom: E[L[ #»e ] : τ ] = join j x = E[x] inL[ # » jump j e τ ] This can be derived from commute by first applying jdrop and jinline backward. 3.3 Contification: Inferring Join Points Not all join points originate from commuting conversions. Though the source language doesn’t have join points or jumps, many let-bound functions can be converted to join points without changing the meaning of the program. In particular, if every call to a given function is a tail call, and we turn the calls 4In fact, from a CPS standpoint, commute is precisely a substitution operation. 81 into jumps, then whenever one of the jumps is executed, there will be nothing to drop from the evaluation context (the s′ in the jump rule will be empty). The process is a form of contification (Kennedy, 2007) (or continuation demotion), which we describe in Fig. 21, where fv(e) means the set of free variables of e (and similarly fv(L) for tail contexts), and dom(ρ) means the domain of the environment ρ (to be described shortly). The non-recursive version, contify , attempts to decompose the body of the let (i.e.the scope of f) into a tail context L and its arguments, where the arguments contain all the occurrences of f , then attempts to run the special partial function tail on each argument to the tail context. This function will only succeed if there are no non-tail calls to f . The tail function takes an environment ρ mapping applications of contifiable variables f to jumps to corresponding join points j. For each expression that matches the form of a saturated call to such an f , then, tail turns the call into a jump to its j, provided that none of the arguments to the function contains a free occurrence of a variable being contified—an occurrence in argument position is disallowed by the typing rules. For any other expression, tail changes nothing but does check that no variable being contified appears; otherwise, tail fails, causing the contify axiom not to match. There is one last proviso in the contify and contifyrec axioms, which is that the body of each function to be contified must have the same type as the body of the let. This can fail to occur if some function f is polymorphic in its return type (Downen, Maurer, Ariola, & Peyton Jones, 2016). Finding bindings to which contify or contifyrec will apply is not difficult. Our implementation is essentially a free-variable analysis that also tracks whether each free variable has appeared only in the holes of tail contexts. This is much simpler than previous contification algorithms because we only look 82 for tail calls. We invite the reader to compare to (Fluet & Weeks, 2001) or to Sec. 5 of (Kennedy, 2007), which both allow for more general calls to be dealt with. Yet we claim that, in concert with the simplifier and the Float In pass, our algorithm covers most of the same ground. To demonstrate, a convenient point of comparison is the local CPS transformation in Moby (Reppy, 2002), which produces mutually tail-recursive functions to improve code generation in much the same way GHC does. Note that Moby uses a direct-style intermediate representation, though its contification pass is expressed in terms of a CPS transform. In essence, the final effect of Moby’s local CPS transform is to turn let f x = ... in E[... f y ... f z ...] (where the calls to f are tail calls within E) into let { j x = E[x]; f x = j } in ...f y...f z... 
where the tail calls to f are now compiled as efficient jumps. Note that f now matches the contify axiom, but it did not before because of the E in the way. Nonetheless, our extended GHC achieves the same effect as Moby, only in stages. Starting with: let f x = rhs inE[. . . f y . . . f z . . .] First, applying float from right to left floats f inward: E[let f x = rhs in . . . f y . . . f z . . .] 83 Next, contify applies, since the calls to f are now tail calls: E[join f x = rhs in . . . jump f y τ . . . jump f z τ . . .] And now jfloat pushes E into the join point f and the body: join f x = E[rhs] in . . . E[jump f y τ ] . . . E[jump f z τ ] . . .] From here, abort removes E from the jumps, and we can abstract E by running jdrop and jinline backward: join {j x = E[x]; f x = jump j rhs τ} in . . . f y . . . f z . . . Thus we achieve the same result without any extra effort5. Naturally, contification is more routine and convenient in CPS-based compilers (Fluet & Weeks, 2001; Kennedy, 2007). The ability to handle an intervening context comes nearly “for free” since contexts already have names. Notably, it is still possible to name contexts in direct style (the Moby paper (Reppy, 2002) does so using labelled expressions), so it is only a matter of convenience, not feasibility. 3.4 Recursive Join Points and Fusion We have mentioned, without stressing the point, that join points can be recursive. We have also shown that it is rather easy to identify let-bindings that can be re-expressed (more efficiently) as join points. To our complete surprise, we discovered that the combination of these two features allowed us to solve a long-standing problem with stream fusion. 5The parts of this sequence not specifically to do with join points were already implemented before in GHC: The Float In pass applies float in reverse, and the Simplifier regularly creates join points to share evaluation contexts (except that previously they were ordinary let bindings). 84 Recursive Join Points Consider this program, which finds the first element of a list that satisfies a predicate p: find = Λa. λ(p : a→Bool)(xs : [a]). let go xs = case xs of x : xs ′ → if p x then Just x else go xs ′ [] → Nothing in go xs0 Programmers quite often write loops like this, with a local definition for go, perhaps to allow find to be inlined at a call site. Our first observation is this: go is a (recursive) join point! The contification transformation of will identify go as a join point, and will transform the let to a join, and each call to go into a jump. Moreover, the transformed function is much more efficient because there is no longer a heap-allocated closure for go. But it gets better! Because go is a join point, it can participate in a commuting conversion. Suppose, for example, that find is called from any like this: any = Λa. λ(p : a→Bool)(xs : [a]). casefind p xs of Just → True Nothing → False 85 The call to find can be inlined: any = Λa. λ(p : a→Bool)(xs : [a]). case  join go xs = case xs of x : xs ′ → if p x then Just x else jump go xs ′ (Maybe a) [] → Nothing in jump go xs (Maybe a)  of {Just → True; Nothing → False} Now, we have a case scrutinizing a join so we can apply axiom jfloat from Figure 20. After some easy further transformations, we get any = Λa. λ(p : a→Bool)(xs : [a]). 
join go xs = case xs of x : xs ′ → if p x thenTrue else jump go xs ′ Bool [] → False in jump go xs Bool Look carefully at what has happened here: the consumer (any) of a recursive loop (go) has moved all the way to the return point of the loop, so that we were able to cancel the case in the consumer with the data constructor returned at the conclusion of the loop. 86 Stream Fusion It turns out that this new ability to move a consumer all the way to the return points of a tail-recursive loop has direct implications for a very widely used transformation: stream fusion. The key idea of stream fusion is to represent a list (or array, or other sequence) by a pair of a state and a stepper function, thus:6 data Stream a where MkStream :: s -> (s -> Step s a) -> Stream a There are two competing approaches to the Step type. In unfold/destroy fusion, first described by Svenningsson (Svenningsson, 2002), we have: data Step s a = Done | Yield s a Hence a stepper function takes an incoming state and either yields an element and a new state or signals the end. Now a pipeline of list processors can be rewritten as a pipeline of stepper functions, each of which produces and consumes elements one by one. A typical stepper function for a stream transformer looks like: next s = case of Yield s’ a -> Done -> When composed together and inlined, the stepper functions become a nest of cases, each scrutinizing the output of the previous stepper. It is crucial for performance that each Yield or Done expression be matched to a case, much as we did with Just and Nothing in the example that began Sec. 3.1. Fortunately, 6Note that Stream is an existential type, so as to abstract the internal state type s as an implementation detail of the stream. 87 case-of-case and the other commuting conversions that GHC performs are usually up to the task. Alas, this approach requires a recursive stepper function when implementing filter, which must loop over incoming elements until it finds a match. This breaks up the chain of cases by putting a loop in the way, much as our any above becomes a case on a loop. Hence until now, recursive stepper functions have been un-fusible. Coutts et al. (Coutts et al., 2007) suggested adding a Skip construtor to Step, thus: data Step s a = Done | Yield s a | Skip s Now the stepper function can say to update the state and call again, obviating the need for a loop of its own. This makes filter fusible, but it complicates everything else! Everything gets three cases instead of two, leading to more code and more runtime tests; and functions like zip that consume two lists become more complicated and less efficient. But with join points, just as with any , Svenningsson’s original Skip-less approach fuses just fine! Result: simpler code, less of it, and faster to execute. It’s a straight win. 3.5 Metatheory of FJ Type Safety The way to “run” a program on our abstract machine is to initialize the machine with an empty stack and an empty store. Type safety, then, says that once we start the machine, the program either runs forever or successfully returns an answer. Theorem 2 (Type safety). If ε; ε ` e : τ , then either: 1. The initial configuration 〈e; ε; ε〉 diverges, or 88 2. 〈e; ε; ε〉 7→? 〈A; ε; Σ〉, for some store Σ and answer A. We did not give a type system for configurations, so the off-the-shelf proof of progress and preservation is not quite applicable. However, we can adapt easily enough by annotating each configuration with a well-typed term that corresponds to it. 
Write 〈e/s; Σ; e〉 (or 〈e/c〉) for an annotated configuration. We will need to track the connection between e and c, for which we need a few tools. Let B be a binding context, that is, series of let bindings surrounding a hole. Then write JBK for the store containing those same bindings (but with recursive groups flattened). Also, let JEK translate the evaluation context E to a stack (which is of course just another syntax for the same structure). Then let ∼ relate terms to configurations such that B[E[e]] ∼ 〈e; JEK; JBK〉. Finally, write 〈e/c〉 : τ when e ∼ c and ε; ε ` e : τ , and write 〈e/c〉 7→〈e′/c′〉 if c 7→c′. We need a few utilities before we tackle the proof. Proposition 3. Substitution 1. If Γ, x : σ; ∆ ` e : τ and Γ; ∆ ` v : σ, then Γ,∆ ` e{x/v} : τ . 2. If Γ, a; ∆ ` e : τ , then Γ; ∆ ` e{a/σ} : τ{a/σ}. 89 Lemma 4. If Γ; ∆ ` E[B[e]] : τ and variables bound by B aren’t free in E, then Γ; ∆ ` B[E[τ ]]. Proof. By induction on E and then B.  Now we are ready: Lemma 5 (Progress and preservation). If 〈e/c〉 : τ , then either 1. c ≡ 〈A; s; Σ〉, where A is an answer, or 2. 〈e/c〉 7→〈e′/c′〉 for some e′ and c′ with 〈e′/c′〉 : τ . Proof. Let c ≡ 〈e0; s; Σ〉; proceed by case analysis on e0. – For e0 ≡ anyF [e1], the push rule applies, and we can take e′ ≡ e, so e′ : τ holds by assumption. – For e0 ≡ let vb in e1, the bind rule applies, and we finish with Lemma 4. – For e0 ≡ jump j #»ϕ #»v τ , note that there must be a matching join in s (provable by induction on s). Then that matching join must have given the matching body u the same type that all of e has (by induction on s and then Σ). Thus Jump applies and we finish by Prop. 3. – For e0 ≡ A, examine s: ∗ If s ≡ ε, we are done (case 1). ∗ If s ≡ join jb in s′, then ans rule applies. The reduct typechecks by a standard strengthening lemma, since no label can appear free in an answer. ∗ Otherwise, the outermost frame must be of the correct form according to the type of e0, so one of β, βτ , or case applies. In each case we finish with either Lemma 4 or Prop. 3. 90 – For e0 ≡ x, we must have that B : Γ, so bind applies; by induction on B, we can find the proof that u : τ .  Proof (of Theorem 2). Generalize from the initial configuration to any 〈e/c〉 : τ , since clearly e ∼ 〈e; ε; ε〉 and hence 〈e/ε; ε; e〉 : τ . Proceed by coinduction. By Lemma 5, either c is an answer configuration (proving case 2) or 〈e/c〉 7→〈e′/c′〉 where 〈e′/c′〉 : τ . This may proceed forever, proving case 1, or else eventually there must be an answer.  Correctness of the Optimization Rules To establish the correctness of our rewriting axioms, we first define a notion of observational equivalence. Definition 6. Two terms e and e′ are observationally equivalent, written e ∼= e′, if, given any stack s and store Σ, either – both 〈e; s; Σ〉 and 〈e′; s; Σ〉 diverge, or – for some Σ′1, A1, Σ ′ 2, and A2, 〈e; s; Σ〉 7→? 〈A1; ε; Σ′1〉 and 〈e′; s; Σ〉 7→? 〈A2; ε; Σ′2〉. The equational theory is sound with respect to observational equivalence: Proposition 7. If e = e′, then e ∼= e′. Equivalence to System F The best way to be sure that FJ can be implemented without any headaches is to show that it is equivalent to GHC’s existing System F-based language. This would suggest that the introduction of join points does not allow us to write any new programs, only to implement existing programs more efficiently. To prove the equivalence, we establish an erasure procedure that removes all join points from an FJ term, leaving an equivalent System F term. 
91 To erase the join points, we want to apply the contify axiom (or its recursive variant) from right to left. However, we cannot necessarily do so immediately for each join point, since contify only applies when all invocations are in tail position. For example, we cannot de-contify j here: join j x = x+ 1 in (jump j 1 (Int → Int)) 2 Simply rewriting the join point as a function and the jump as a function call would change the meaning of the program—in fact, it would not even be well- typed: let f = λx. x+ 1 in f 1 2 However, if we apply abort first: join j x = x+ 1 in jump j 1 Int Now the jump is a tail call, so contify applies. The abort axiom is not enough on its own, since the jump may be buried inside a tail context: join j x = x+ 1 in  case bof True → jump j 1 (Int → Int) False → jump j 3 (Int → Int)  2 However, this can be handled by a commuting conversion: join j x = x+ 1 in case bof True → (jump j 1 (Int → Int)) 2 False → (jump j 3 (Int → Int)) 2 And now abort applies twice and j can be de-contified. 92 Lemma 8. For any well-typed term e, there is an e′ such that e′ = e and every jump in e′ is in tail position. By “tail position,” we mean one of the holes in a tail context that starts with the binding for the join point being called. In other words, given a term join j #»a #»x = u inL[ #»e ], the terms #»e are in tail position for j. The proof of Lemma 8 relies on the observation that the places in a term that may contain free occurrences of labels are precisely those appearing in the hole of either an evaluation or a tail context. For example, the Case typing rule propagates ∆ into both the scrutinee and the branches; note that caseof # »alt is an evaluation context and case eof # » p→  is a tail context. But e  is (in call-by-name) neither an evaluation context nor a tail context, and App does not propagate ∆ into the argument. Thus any expression can be written as: L[ # » E[L′[ # » E′[. . . [L(n)[ # » E(n)[e]]] . . .]]]], (3.1) which is to say a tree of tail contexts alternating with evaluation contexts, where all free occurrences of join points are at the leaves. By iterating commute and abort , we can flatten the tree, rewriting (3.1) to say that any expression can be written L[ #»e ], where each ei is a leaf from the tree in (3.1). Hence no ei can be expressed as E[L[. . .]] for nontrivial, non-binding7 E and nontrivial L, and every jump to a free occurrence of a label is some ei. Let us say a term in the above form is in commuting-normal form8. By commute and abort , every 7A join can be treated as either an evaluation context or a tail context; using commute to push a join inward is not necessarily helpful, however. 93 term has a commuting-normal form, and by construction, every jump in a commuting-normal form is a tail call. Thus every label can be decontified, and we have: Theorem 9 (Erasure). For any closed, well-typed FJ term e, there is a System F term e′ such that e′ = e. 3.6 Join Points in Practice Is is one thing to define a calculus, but quite another to use it in a full- scale optimising compiler. In this section we report on our experience of doing so in GHC. Implementing Join Points in GHC We have implemented System FJ as an extension to the Core language in GHC. Rather than adding two new data constructors for join and jump to the Core data type, we instead re-use ordinary let-bindings and function applications, distinguishing join points only by a flag on the identifier itself. 
Thus, with no code changes, GHC treats join-point identifiers identically to other identifiers, and join-point bindings identically to ordinary let bindings. This is extremely convenient in practice. For example, all the code that deals with dropping dead bindings, inlining a binding that occurs just once, inlining a binding whose right-hand side is small, and so on, all works automatically for join points too. With the modified Core language in hand, we had three tasks. First, GHC has an internal typechecker, called Core Lint, that (optionally) checks the type- correctness of the intermediate program after each pass. We augmented Core Lint for FJ according to the rules of Fig. 18. 8ANF is simply commuting-normal form with named intermediate values. 94 Second, we added a simple new contification analysis to identify let- bindings that can be converted into join points (see Sec. 3.3). Since the analysis is simple, we run it frequently, whenever the so-called occurrence analyzer runs. Finally, the new Core Lint forensically identified several existing Core-to- Core passes that were “destroying” join points (see Sec. ). Destroying a join point de-optimizes the program, so it is wonderful now to have a way to nail such problems at their source. Moreover, once Lint flagged a problem, it was never difficult to alter the Core-to-Core transformation to make it preserve join points. Here are some of the specifics about particular passes: The Simplifier is a sort of partial evaluator responsible for many local transformations, including commuting conversions and inlining (Peyton Jones & Santos, 1998). The Simplifier is implemented as a tail- recursive traversal that builds up a representation of the evaluation context as it goes; as such, implementing the jfloat and abort axioms (Sec. ) requires only two new behaviors: – (jfloat) When traversing a join-point binding, copy the evaluation context into the right-hand side. – (abort) When traversing a jump, throw away the evaluation context. The Float Out pass moves let bindings outwards (Peyton Jones, Partain, & Santos, 1996). Moving a join binding outwards, however, risks destroying the join point, so we modified Float Out to leave join bindings alone in most cases. The Float In pass moves let bindings inwards. It too can destroy join points by un-saturating them. For example, given let j x y = ... in j 1 2, the Float In pass wants to narrow j’s scope as much as possible: (let j x y = ... in j) 1 2. We modified Float In so that it never un-saturates a join point. 95 Strictness analysis is as useful for join points as it is for ordinary let bindings, so it is convenient that join bindings are, by default, treated identically to ordinary let bindings. In GHC, the results of strictness analysis are exploited by the so-called worker/wrapper transform (Peyton Jones & Santos, 1998; Gill & Hutton, 2009). We needed to modify this transform so that the generated worker and wrapper are both join points. We found that GHC’s constructed product result (CPR) analysis (Baker-Finch, Glynn, & Peyton Jones, 2004) caused the wrapper to invoke the worker inside a case expression, thus preventing the worker from being a join point. We simply disable CPR analysis for join points; it turns out that the commuting conversions for join points do a better job anyway. Benchmarks The reason for adding join points is to improve performance; expressiveness is unchanged (Sec. 3.5). So does performance improve? 
Table 1 presents benchmark data on allocations, collected from the standard spectral, real and shootout NoFib benchmark suites.9 We ran the tests on our modified GHC branch, and compared them to the GHC baseline to which our modifications were applied. Remember, the baseline compiler already recognises join points in the back end and compiles them efficiently (Sec. ); the performance changes here come from preserving and exploiting join points during optimization. We report only heap allocations because they are a repeatable proxy for runtime; the latter is much harder to measure reliably. All tests omitted from the tables had an improvement in allocations, but less than 0.3%.

9 The imaginary suite had no interesting cases. We believe this is because join points tend to show up only in fairly large functions, and the imaginary tests are all micro-benchmarks.

  spectral
    Program         Allocs
    fibheaps         -1.1%
    ida              -1.4%
    nucleic2         +0.2%
    para             -4.3%
    primetest        -3.6%
    simple           -0.9%
    solid            -8.4%
    sphere           -3.3%
    transform        +1.1%
    (45 others)
    Min              -8.4%
    Max              +1.1%
    Geo. Mean        -0.4%

  real
    Program         Allocs
    anna             +0.5%
    cacheprof        -0.5%
    fem              +3.6%
    gamteb           -1.4%
    hpg              -2.1%
    parser           +1.2%
    rsa              -4.7%
    (18 others)
    Min              -4.7%
    Max              +3.6%
    Geo. Mean        -0.2%

  shootout
    Program         Allocs
    k-nucleotide    -85.9%
    n-body         -100.0%
    spectral-norm    -0.8%
    (5 others)
    Min            -100.0%
    Max              +0.0%
    Geo. Mean         n/a

TABLE 1. Benchmarks from the spectral, real, and shootout NoFib suites.

There are some startling figures: using join points eliminated all allocations in n-body and 85.9% in k-nucleotide. We caution that these are highly atypical programs, already hand-crafted to run fast. Still, it seems that our work may make it easier for performance-hungry authors to squeeze more performance out of their inner loops.

The complex interaction between inlining and other transformations makes it impossible to give guaranteed improvements. For example, improving a function f might make it small enough to inline into g, but this may cause g to become too large to inline elsewhere, and that in turn may lose the optimization opportunities previously exposed by inlining g. GHC's approach is heuristic, aiming to make losses unlikely, but they do occur, including a 1.1% increase in allocations in spectral/transform and a 3.6% increase in real/fem.

Beyond Benchmarks

These benchmarks show modest but fairly consistent improvements for existing, unmodified programs. But we believe that the systematic addition of join points may have a more significant effect on programming patterns. Our discussion of fusion in Sec. 3.4 is a case in point: with join points we can use skip-less unfoldr/destroy streams without sacrificing fusion. That knowledge in turn affects the way in which libraries are written: they can be smaller and faster.

Moreover, the transformation pipeline becomes more robust. In GHC today, if a "join point" is inlined we get good fusion behavior, but if its size grows to exceed the (arbitrary) inlining threshold, suddenly behavior becomes much worse. An innocuous change in the source program can lead to a big change in execution time. That step-change problem disappears when we formally add join points.

CHAPTER IV

Lazy Functions as Processes

Much of the text in this chapter comes from (Downen, Maurer, Ariola, & Varacca, 2014), which was a collaboration with Paul Downen (UO), Zena M. Ariola (UO), and Daniele Varacca (UPD). I was the primary designer of the new language.
In Chapter III, we saw that we could improve the results of GHC’s optimizer by giving it more opportunities for code motion. However, code motion is inherently limited by the level of abstraction presented by the language. By design, Core leaves implicit the operations that make lazy evaluation efficient, namely the update to each memo-thunk once its value has been computed. Conceivably, being able to move these updates around might prove beneficial. For instance, suppose we have x = fst y and y is demanded before x. As soon as y is updated, x becomes such a quick operation that we might as well update it right away, thus avoiding an indirect jump the first time x is accessed. But in order to group together updates in this way, we would first need to extend Core to make the memo-thunk updates explicit. While actually implementing this in GHC remains future work, this chapter demonstrates one possible approach by showing how to take a fragment of a language where the update is easily expressed and translate it into a more practical form for implementation in an optimizing compiler. Continuations and continuation-passing style (CPS) provide powerful and versatile tools for understanding programming languages (Reynolds, 1993). By representing the “future of the program” as a first-class entity, a CPS transform gives a denotational semantics for a programming language in terms of a simple, well-understood low-level language. In a particularly influential use 99 of continuations, Plotkin (Plotkin, 1975) demonstrated how a CPS transform can weave an implementation strategy for a program into the syntax of the program itself. This methodology gave rise to the call-by-value λ-calculus and was instrumental in closing the gap between the theory and practice of functional languages. Since that time, CPS transforms have continued to further our understanding of programming languages. The call-by-value CPS transform was more descriptive than Plotkin’s original call-by-value λ-calculus, motivating a more thorough study of strict functional languages; in turn, this lead to more advanced techniques for reasoning about programs in continuation-passing style and to a more complete development of the call-by-value λ-calculus (Sabry & Felleisen, 1993; Sabry & Wadler, 1997). CPS transforms and related techniques have also provided a formal method for reasoning about effects, such as mutable references and non-local jumps, that lie outside of the pure model of the λ- calculus. Of particular note is delimited control, especially the shift and reset operators (Danvy & Filinski, 1989), which were originally developed by defining them in continuation-passing style. Flexible as CPS transforms are, they inherit some of the limitations of the λ-calculus. A λ-term describes a sequential and (due to confluence) determinate computation. Features such as parallelism, distributed computation, and nondeterminacy have no natural expression in the base λ-calculus. We can extend the λ-calculus with such features, but we argue that it would be better to find a different target language altogether, one with these features “out-of- the-box.” A good candidate for such a target language is the pi-calculus (Milner, Parrow, & Walker, 1992; Sangiorgi & Walker, 2003), a process calculus describing interacting systems running in parallel. Early in the study of the pi- 100 calculus, Milner proved (Milner, 1992) that the pi-calculus was powerful enough to embed the λ-calculus by a simple interpretation function. 
His pi-encoding was formulated from scratch, but later Sangiorgi (Sangiorgi, 1999) discovered that we could also derive a pi-encoding from a CPS transform, if we can account for the first-order nature of the pi-calculus—processes don’t transmit processes to each other the way higher-order functions pass functions as arguments. As an alternative, Amadio (Amadio, 2011) showed that we could instead translate a CPS term into a language of first-order functions and bound names, which then corresponds directly to the pi-calculus. This analysis has been applied to CPS transforms for both call-by- name and call-by-value evaluation. In this paper, we extend it to two calculi modelling real-world implementations: first, a variation on call-by-value that reflects the way a typical interpreter performs variable lookup; and second, the call-by-need λ-calculus (Ariola et al., 1995; Ariola & Felleisen, 1997), which models implementations of call-by-name languages that cache each value computed. We start, in Section 4.1, with an introduction of the call-by-name and call- by-value lambda-calculi. We present their operational semantics and a uniform CPS transform, from which we derive CBN and CBV pi-encodings, along with abstract machines. In Section 4.5, we consider call-by-need evaluation; we present a novel call-by-need CPS transform, which leads to an interesting concept in its own right: the notion of constructive update. From the new CPS transform, we derive the call-by-need pi-encoding and an abstract machine in much the same way as we did for CBN and CBV. 101 4.1 Call-by-Name and Call-by-Value The λ-calculus, defined by Church (Church, 1932) in the 1930s, is a simple yet powerful model of computation. It consists of only three parts: functions from inputs to outputs, variables that stand for the inputs, and applications that invoke the functions: Terms: M,N ::= λx.M ∣∣ x ∣∣MN As the λ-calculus can be a foundation of both strict and lazy languages, a λ- term M can be evaluated according to different evaluation strategies, which dictate the operation to be performed first. The two most studied evaluation strategies are call-by-name (CBN) and call-by-value (CBV). In CBN evaluation, the argument to a function is kept unevaluated as long as possible, then evaluated each time its value is required for computation to continue. In CBV evaluation, the argument is always evaluated before the function receives it. It is usually better to precompute arguments, as CBV does, since then an argument will not be evaluated more than once; however, if the function does not actually use the value, computing it is wasteful. In the extreme case, if the argument diverges (that is, loops forever) but is never used by the function, CBV diverges when CBN does not. The distinct CBN and CBV reduction strategies are captured by the two reduction rules, βn and βv (see Figs. 22 and 23). A βn-reduction starts from a term (λx.M)N , then proceeds by substituting the unevaluated argument N into the body wherever x appears. Thus the function call precedes the evaluation of the argument. In contrast, a βv-reduction evaluates the argument first: Only a value may be substituted into the body. A value is either a variable or a function literal (also called a λ-abstraction). 102 Evaluation Contexts: E ::=  ∣∣ EM (λx.M)N →M{N/x} βn FIGURE 22. The call-by-name λ-calculus, λn. Values: V ::= x ∣∣ λx.M Evaluation Contexts: E ::=  ∣∣ EM ∣∣ V E (λx.M)V →M{V/x} βv FIGURE 23. The call-by-value λ-calculus, λv. 
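To make the difference between these two strategies concrete, the following is a small executable sketch in Haskell of the calculi of Figs. 22 and 23. It is purely illustrative and not part of the dissertation's formal development; the data type Term, the functions subst, stepName, stepValue, and eval, and the sample term are names introduced here. Substitution is deliberately naive: it assumes the usual convention that bound variables are named apart from free variables, which suffices for the closed terms considered in this chapter.

    -- Illustrative sketch only: an abstract syntax for lambda-terms and
    -- one-step evaluators for the calculi of Figs. 22 and 23.
    data Term = Var String | Lam String Term | App Term Term
      deriving Show

    -- M{N/x}: replace free occurrences of x in M by N.  Naive: we assume
    -- bound variables are named apart from free ones, so no capture occurs.
    subst :: String -> Term -> Term -> Term
    subst x n (Var y)   | x == y    = n
                        | otherwise = Var y
    subst x n (Lam y m) | x == y    = Lam y m              -- x is shadowed
                        | otherwise = Lam y (subst x n m)
    subst x n (App m m')            = App (subst x n m) (subst x n m')

    -- Values of Fig. 23: variables and lambda-abstractions.
    isValue :: Term -> Bool
    isValue (App _ _) = False
    isValue _         = True

    -- One step of call-by-name evaluation (contexts E ::= [] | E M).
    stepName :: Term -> Maybe Term
    stepName (App (Lam x m) n) = Just (subst x n m)         -- beta_n
    stepName (App m n)         = (`App` n) <$> stepName m   -- reduce in E M
    stepName _                 = Nothing                    -- answer or stuck

    -- One step of call-by-value evaluation (contexts E ::= [] | E M | V E).
    stepValue :: Term -> Maybe Term
    stepValue (App (Lam x m) v)
      | isValue v       = Just (subst x v m)                -- beta_v
    stepValue (App m n)
      | not (isValue m) = (`App` n) <$> stepValue m         -- reduce in E M
      | otherwise       = App m <$> stepValue n             -- reduce in V E
    stepValue _         = Nothing

    -- Iterate a step function until no standard reduction applies.
    eval :: (Term -> Maybe Term) -> Term -> Term
    eval step m = maybe m (eval step) (step m)

    -- The term used in Example 10 below: (\x. x x) ((\y. y) (\z. z)).
    example :: Term
    example = App (Lam "x" (App (Var "x") (Var "x")))
                  (App (Lam "y" (Var "y")) (Lam "z" (Var "z")))

Running eval stepName example and eval stepValue example both yield the identity function λz. z, with the call-by-value evaluator taking one fewer step, just as in the traces of Example 10 below.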
A term that pattern-matches the left-hand side of a reduction rule is called a redex, short for reducible expression. For instance, (λx. xx)((λy. y)(λz. z)) is a call-by-name redex. However, it is not a call-by- value redex, because the argument (λy. y)(λz. z) is not a value. If a redex somewhere in M is reduced, and the resulting term is N , we write M → N and say that M reduces to N ; we also write →? for zero or more reductions and →+ for one or more reductions. To complete the semantics, one has to specify where a reduction should take place. Felleisen and Friedman (Felleisen & Friedman, 1986) introduced a concise way to do so, using evaluation contexts. A context is a “term with a hole”: It is the outer portion of some term, surrounding a single occurrence of the symbol . A language’s evaluation contexts delineate the places in a term where evaluation may take place. For instance, consider the grammar for evaluation contexts in Fig. 22: An evaluation context can be either just a hole (the trivial or top-level context, ) or EM , a subcontext applied to an argument. Thus CBN evaluation can take place either at the top level or 103 within the operator in a function application, but not within the argument (since arguments are left unevaluated). The CBV calculus also has contexts V E, so once the function in an application has become a value, CBV evaluation proceeds within the argument. Given an evaluation context E, we write E[M ] for E with the term M in place of the hole, and we say M is plugged into E. (This notation applies to general contexts as well.) The inverse of the “plugging in” operation is decomposition, and it plays a critical role in evaluation by finding where to perform the next reduction. For example, we can decompose (λx. xx)((λy. y)(λz. z)) in CBN as E[M ] where E is the top-level context  and M is the entire term. In CBV, we can decompose it with E being (λx. xx) and M being (λy. y)(λz. z), since the next step of CBV evaluation is to reduce the argument. If M ≡ E[M ′] and M ′ is a redex reducing to N ′, then we write M 7→ N where N , E[N ′]. In other words, 7→denotes reduction only within an evaluation context, which we call standard reduction or evaluation. Often we will say that M steps or takes a step in this case. As before, we write 7→? for the reflexive and transitive closure and 7→+ for the transitive closure. Finally, we introduce notations for the possible observations one can make about a term. These are the potential outcomes of computation, without regard to the particular steps taken. If M is a λ-abstraction, which we also call an answer, we write AM . If M evaluates to an answer (perhaps because it is one), we write M ⇓. A term M with no possible evaluation step, but which is not an answer, is called stuck, written sM ; a term M that evaluates to a stuck term gets stuck, written M 6⇓. If M never finishes evaluating—that is, it takes infinitely many evaluation steps—it is said to diverge, written M ⇑. 104 Example 10. Consider the term (λx. xx)((λy. y)(λz. z)). CBN and CBV evaluate the term differently (the redex at each step is shaded): (λx. xx)((λy. y)(λz. z)) (λx. xx)((λy. y)(λz. z)) 7→((λy. y)(λz. z))((λy. y)(λz. z)) 7→(λx. xx)(λz. z) 7→(λz. z)((λy. y)(λz. z)) 7→(λz. z)(λz. z) 7→(λy. y)(λz. z) 7→(λz. z) 7→(λz. z) CBN reduces the outer β-redex immediately, substituting the argument as is. This duplicates work, since the βn-redex (λy. y)(λz. z) now appears for each x in the body of λx. xx. 
Instead, CBV evaluates the argument first, reducing it to a value before substituting, thus saving one reduction. 4.2 A Uniform CPS Transform As an alternative to specifying the semantics of a language in terms of rewrite rules for programs, one can specify a function that “compiles,” or transforms, programs into some lower-level form. The advantage is that analyzing a lower-level form is easier, since the syntax itself prescribes how a program should be executed, just as assembly code specifies not only calculations but which registers, including the program counter, to use to perform them. A transform into continuation-passing style, called a CPS transform, is an example of such a compilation function. It produces λ-terms whose evaluation order is predetermined: The same calculations will be performed, in the same order, by call-by-name or call-by-value evaluation. The 105 CJxK , λk. xk CJλx.MK , λk. k(λ(x, k′). CJMKk′) CJMNK , {λk. CJMK(λv. v(λk′. CJNKk′, k)) CBN λk. CJMK(λv. CJNK(λw. v(λk′. k′w, k))) CBV FIGURE 24. A uniform CPS transform for call-by-name and call-by-value. trick is to pass only precomputed values as arguments to functions, making the question of when to evaluate arguments moot. Then, rather than returning its result in the usual way, a CPS function passes the result to one of its arguments, the so-called continuation. A continuation represents the evaluation context in which a function was invoked; hence it plays a similar role to the call stack used in most computer architectures. Since their evaluation contexts differ, we can elucidate the difference between CBN and CBV evaluation by translating each to continuation-passing style. We focus on a uniform CPS transform C, given in Fig. 24, so called because the translations for variables and abstractions are the same between CBN and CBV. This uniformity highlights the differences in evaluation order by varying only the translation of applications. Specifically, once M has evaluated to a function v, the continuation in the CBN transform invokes v immediately, passing it the unevaluated CPS term λk′. CJNKk′ as x. Evaluating a variable is done by invoking it with a continuation, so each invocation of x within the body of v will evaluate the argument. The CBV transform evaluates M the same way, but its continuation does not use v immediately; instead, it evaluates N to a function w, and only its continuation invokes v. This time, the x argument is a function that immediately passes w to the continuation; therefore each invocation of x within the body of v immediately returns the precomputed argument value w. 106 Example 11. Consider the term (λx. xx)((λy. y)(λz. z)) from above, calling it M for now. In CBN (the redex is always the whole CPS term, so we omit the shading): CJMKk , CJλx. xxK(λv. v(λk′. CJ(λy. y)(λz. z)Kk′, k)) , (λk. k(λ(x, k′). CJxxKk′))(λv. v(λk′. CJ(λy. y)(λz. z)Kk′, k)) 7→(λv. v(λk′. CJ(λy. y)(λz. z)Kk′, k)(λ(x, k′). CJxxKk′) This last reduction step duplicates work, as the evaluation of CJ(λy. y)(λz. z)K must now occur twice. Now for CBV: CJMKk , CJλx. xxK(λv. CJ(λy. y)(λz. z)K(λw. v(λk. kw, k))) , (λk. k(λ(x, k′). CJxxKk′))(λv. CJ(λy. y)(λz. z)K(λw. v(λk. kw, k))) 7→(λv. CJ(λy. y)(λz. z)K(λw. v(λk. kw, k)))(λ(x, k′). CJxxKk′) The function has evaluated to v, but this time we evaluate the argument next: 7→CJ(λy. y)(λz. z)K(λw. (λ(x, k′). CJxxKk′)(λk. 
kw, k)) Once the argument is computed as w, then the function will be invoked, but this time with x being a function that immediately passes along the precomputed value w. The uniform CPS transform reflects the behavior of common language implementations: Evaluation always stops at a λ, and a variable always causes a lookup (hence a free variable halts execution). However, these behaviors don’t 107 faithfully represent the full theory of the λ-calculus. For instance, the calculus is often considered with the η rule in addition to β. An η-reduction takes λx.Mx to just M whenever x does not appear in M . For a free variable y, then, the term λx. yx would reduce to y and then become stuck, whereas the uniform CPS transform gives a term that immediately returns the value λx. yx rather than becoming stuck. The η rule is often considered unimportant for language implementations: Nearly all compilers and interpreters for functional languages stop evaluating when they find a λ. In fact, Plotkin’s CBN CPS transform (Plotkin, 1975), which is very similar to the CBN fragment of our uniform transform, does not validate η either. The CBV fragment, however, differs more fundamentally: It doesn’t follow the conventional βv rule, either. In the CBV λ-calculus, a variable is considered a value, yet few compilers or interpreters operate this way: The term (λx. λy. y)z should reduce to λy. y, but a typical implementation would attempt to evaluate z and raise an “unbound variable” error. Accordingly, the CBV portion of the uniform CPS transform produces a term that becomes stuck on z rather than reducing to a value. Therefore the CBV language truly implemented by the uniform CPS transform is not the one given in Fig. 23. Rather, it implements a calculus that further restricts the βv rule to apply only to a λ-abstraction as an argument. Equivalently, this revised calculus consideres only a λ-abstraction to be a value. From now on, then, when we speak of the call-by-value calculus, we will refer to the version in Fig. 25. In particular, 7→will refer to the restricted notion of evaluation context. In most cases, our departure from orthodoxy will make no difference. In standard reductions of closed terms, it never happens that a free variable 108 Values: V ::= λx.M Evaluation Contexts: E ::=  ∣∣ EM ∣∣ V E (λx.M)V →M{V/x} βv FIGURE 25. A revised call-by-value λ-calculus. Terms: M,N ::= V (W+) Values: V,W ::= x ∣∣ ret ∣∣ λ(x+).M Evaluation Contexts: E ::=  (λ(x+).M)(V +)→M{V +/x+} β FIGURE 26. The syntax and semantics of the CPS λ-calculus, λcps . appears as an argument, and thus it does not matter whether we consider it a value or not. The CPS Language λcps The terms produced by the uniform CPS transform comprise a restricted λ-calculus. The grammar is given in Fig. 26. In an application, the function must be a value, and it can take one or two arguments, which must also be values; we denote this V (W+) (we will omit parentheses when there is one argument). A value is a variable, a λ-abstraction, or the constant ret. Note that the body of an abstraction must again be a CPS term—that is, an application. In CPS, a function never returns to its caller; it only performs more function calls. Accordingly, the only evaluation context is the trivial context , as the redex is always at the top level. There are three kinds of value that appear in a CPS term: 109 Thunks A thunk is a suspended computation. 
In CPS terms, this is a function λk.M that takes a continuation, calculates a result, then passes the result to the continuation. Each term of the form CJMK is a thunk. In the uniform CPS transform, variables are represented by thunks. Continuations A continuation is a handler for a result; it has the form λv.M . It takes a computed source value and performs the next step of evaluation. We can see it as a reification of a term’s evaluation context from the source language. Source Values Each value from the source calculus has a CPS encoding. As we are translating from calculi having only functions as values, we need only consider how to encode a function. Namely, a source function becomes a binary function λ(x, k).M that takes a thunk x for computing the argument and a continuation k to invoke with the result. Before we consider observations, we should consider what it means for a CPS program to be evaluated. A term CJMK is an inert λ-abstraction; it must be given a continuation as its argument for evaluation to occur. This argument represents the context in which to evaluate M . If we consider M to be the whole program, we need an initial continuation to represent the top-level context. Thus we introduce the constant ret; to evaluate M as a CPS program, then, one writes CJMK ret. If one thinks of a term CJMK as meaning “evaluate M and then,” CJMK ret then reads “evaluate M and then return.” Thus ret is analogous to the C function exit, which terminates execution and yields its argument as the result of the program. Since an answer is the result of successful computation, then, a CPS answer is a term of the form retV , for a λ-abstraction V .1 Thus AM means that M has the form ret(λ(x+). N), and M ⇓ means that M evaluates to such a 110 term. As before, sM means that M is stuck (hence not an answer); M 6⇓ means that M gets stuck; and M ⇑ means that M diverges. Environment-Based CPS Transform So far, we have expressed argument passing by substitution: Each β- reduction substitutes the arguments for the free occurences of the corresponding variables. Effectively, the term is rewritten, with a copy of the argument in place of each occurrence. Interpreters typically operate differently: Each argument is put into an environment, indexed by the variable it is bound to. Then, when a variable appears as a function being invoked, its value is retrieved from the environment. We can simulate this mechanism by giving a name to each abstraction in argument position, substituting only names during β-reduction, and copying the value only as necessary. This is analogous to graph rewriting and can be captured by extending the syntax with a let construct (Ariola & Klop, 1996): A bound name identifies a node in a graph. However, we prefer an alternative syntax which expresses the dynamic allocation of names. We write νx. x := λ(x+).M in N to indicate that a new name x is generated and a λ- abstraction is bound to it. Note that we will always bind an abstraction to a name immediately after allocation and only then. The value-named CPS λ-calculus, λcps,vn , is given in Fig. 27. Each term is now an application inside some number of bindings, which effectively serve as the environment. Each argument to an application must be a variable. Note that we now have nontrivial evaluation contexts, unlike with λcps , whose only evaluation context was . However, the contexts in λcps,vn do not 1In principle, we could avoid adding a constant by simply using some free variable k for the initial continuation. 
We would then have to have a predicate ⇓k for each name k, and correctness theorems would be quantified over k. Thus ret is merely a convenience. 111 Terms: M,N ::= V (x+)∣∣ νx. x := λ(x+).M inN Values: V ::= x ∣∣ ret ∣∣ λ(x+).M Binding Contexts: B ::= [] ∣∣ νx. x := λ(x+).M inB Eval. Contexts: E ::= B (λ(x+).M)(y+)→M{y+/x+} β νf. f := λ(x+).M in E[f(y+)] → νf. f := λ(x +).M in E[(λ(x+).M)(y+)] deref FIGURE 27. The value-named CPS λ-calculus, λcps,vn . specify work to be done but simply bindings for variables. To emphasize this, we call a context providing only bindings a binding context, and say that a CPS calculus has only binding contexts as evaluation contexts. To convert an unnamed term to a named term, we introduce a naming transform, N . The naming transform goes through all arguments appearing in a term, moving each λ-abstraction into a new variable. N JV (λ(x+).M)K , νy. y := λ(x+).N JMK inN JV (y)K N JV (λ(x+).M,W )K , νy. y := λ(x+).N JMK inN JV (y,W )K N JV (y, λ(x+).M)K , νz. z := λ(x+).N JMK inN JV (y, z)K N J(λ(x+).M)(y+)K , (λ(x+).N JMK)(y+) N Jf(y+)K , f(y+) For clarity, here we assume that each function has at most two arguments, as is true for our CPS terms; N generalizes straightforwardly by iteration. The uniform CPS transform under the naming transform is given in Fig. 28. 112 CvnJxK , λk. xk CvnJλx.MK , λk. νf. f := λ(x, k′). CvnJMKk′ in kf CvnJMNK ,  λk. νk′. k′ := ( λv. νx. x := λk′′. CvnJNKk′′ in v(x, k) ) in CvnJMKk′ CBN λk. νk′. k′ := ( λv. νk′′. k′′ := (λw. νx. x := λk′. k′w in v(x, k)) in CvnJNKk′′) in CvnJMKk′ CBV FIGURE 28. Uniform CPS transform in named form. Proposition 12. CvnJMK ≡ N JCJMKK. Proof. Straightforward induction on M .  4.3 Preservation of Observations We show correctness of the value-named uniform CPS transform in two steps. We start with the correctness of the unnamed transform, then prove the correctness of the naming step. Proof Methodology For a CPS transform to be considered correct, we would want it to preserve termination (Meyer & Cosmadakis, 1988): Criterion 13. M ⇓ iff CJMK ret ⇓. In order to prove Criterion 13, we want to proceed by induction on the evaluation steps. However, in order for the induction to go through, we need to establish an invariant : Something that is true at the beginning of evaluation and remains true after each step. For C, the simplest invariant one can imagine 113 would be this: M N CJMKK CJNKK (4.1) In words, for any continuation K, whenever M reduces to N , CJMKK reduces to CJNKK.2 However, this invariant does not hold. One reason is that the CPS transform introduces many administrative redexes into the term. These are intermediate computations that do not correspond to actual β-reductions in the source language. (Non-administrative redexes are called proper.) Hence one step for M may correspond to many in CJMKK. Thus consider: M N CJMKK CJNKK (4.2) Unfortunately, there is a more serious issue with (4.2). As noted by Plotkin (Plotkin, 1975), administrative reductions do not line up with the CPS transform in this way. Because CJNKK introduces administrative redexes of its own, the true situation is this: M N CJMKK P CJNKK (4.3) Plotkin’s solution was to derive a new transform that eliminated these initial administrative redexes, thus regaining (4.2). The problem with this and similar solutions (Danvy & Filinski, 1992; Danvy & Nielsen, 2003) is that the resulting transforms are more complex and difficult to reason about than the 2Ultimately, of course, we observe what happens when K is ret. 
But there is nothing special about ret; we expect our diagrams to hold for any K. 114 original CPS transform. For instance, usually such administration-free CPS transforms are non-compositional (Danvy & Nielsen, 2003). Instead of changing the transform, we can further loosen the invariant using the bisimulation technique. Bisimulation is an alternative approach to soundness and completeness that requires only that we find some suitable relation to act as the invariant. Given a relation ∼, we can use it to prove Criterion 13 so long as the following hold:3 M CJMK ret∼ M N P Q ∼ ∼ M N P Q ∼ ∼ M ↓ P AQ ∼ M AN Q ↓ ∼ (4.4) So, M is related to its image under the transform; when either related term takes a step, the other can take some number of steps to remain in the relation; and if either is an answer, the other evaluates to an answer. Once we have that evaluation to an answer is preserved, what can we say about a stuck term? It could happen that M is stuck but CJMK ret loops forever, or vice versa. Thus we consider an additional criterion: Criterion 14. M 6⇓ iff CJMK ret 6⇓. 3Technically, it is the second and third diagrams that characterize a bisimulation. The others are additional properties that we need in order to finish the proof. The fourth and fifth are very similar to the requirements on a barbed bisimulation (Milner & Sangiorgi, 1992). 115 To prove Criterion 14, we require two more properties of the simulation ∼, in addition to those in (4.4): M 6 ↓ M sN P sQ P 6 ↓ ∼ ∼ (4.5) In words, if either M or P is stuck, then the other must get stuck. The final observation that we want to preserve is divergence: Criterion 15. M ⇑ iff CJMK ret ⇑. However, because evaluation is deterministic in our calculi, we can get Criterion 15 “for free” from Criteria 13 and 14: If one term diverges, it can neither reduce to an answer nor get stuck, and hence the other term can only diverge. We can further simplify the proof methodology thanks to Leroy’s observation (Leroy, 2009) that if the source language is deterministic, the forward simulation is sufficient, so long as each source evaluation step maps to at least one CPS step.4 In short, it will suffice to show: M CJMK ret∼ M N P Q ∼ ∼ + M ↓ P AQ ∼ M 6 ↓ P sQ ∼ (4.6) From these properties and determinacy, we can prove both directions of Criteria 13 to 15. The forward direction follows directly by induction. For the 4Otherwise, a diverging source term could translate to an answer. 116 backward direction of Criterion 13, we can argue by contraposition: If M does not reduce to an answer, then it must either diverge or get stuck. If it diverges, then by the second diagram in (4.6), it must hold that CJMK ret diverges, and hence by determinacy it cannot reduce to an answer. Similarly, if M gets stuck, CJMK ret must get stuck, and hence cannot reduce to an answer. The reasoning for the backward directions of Criteria 14 and 15 is similar. Correctness of the CPS Transform We define the simulation ∼ by comparing terms in a way that ignores all administrative reductions. We consider a λ-abstraction administrative when it always forms an administrative redex. We mark these administrative λ- abstractions by placing a line over the λ, as in λ¯k.M . The explicitly marked uniform CPS transform is then: CJxK , λ¯k. xk CJλx.MK , λ¯k. k(λ(x, k′). CJMKk′) CJMNK ,  λ¯k. CJMK(λ¯v. v(λ¯k′. CJNKk′, k)) CBN λ¯k. CJMK(λ¯v. CJNK(λ¯w. v(λ¯k′. 
k′w, k))) CBV Notice that the only proper λ-abstractions are the ones that correspond to a λ-abstraction from the original term, since these are the abstractions whose reductions correspond to the actual β-reductions in the source language. To distinguish administrative computation, we introduce the reduction relation →ad , defined by the administrative β-rule: (λ¯(x+).M)(V +)→ad M{V +/x+} βad 117 Keeping to our notational conventions, the reflexive and transitive closure of →ad is →?ad and its transitive closure is →+ad . Also, its reflexive, symmetric, and transitive closure is =ad ; in other words, =ad extends →?ad by allowing rules to be applied in reverse (as expansions rather than reductions). Furthermore, 7→ad stands for a standard administrative reduction, which is to say an administrative reduction in the empty context (at top level), and 7→?ad are 7→+ad are the usual closures. We will also use the subscript pr in place of ad to denote a proper reduction. Finally, 7→+pr1 is short for 7→?ad 7→pr 7→?ad , which is to say, some number of standard reductions, exactly one of which is proper. If a term cannot take an administrative standard reduction, then for the moment, the administrative work in that term is finished. Hence, if we consider the administrative subcalculus of λcps , such a term is the result, or answer, of administrative computation. Therefore let a term with no administrative standard reduction be called an administrative answer. In the following, we rely on some known properties of the λ-calculus, which also apply to the administrative subset of the CPS λ-calculus. Proposition 16. Administrative reduction in λcps : 1. is confluent, so that if M =ad M ′, then there is some N such that M →?ad N and M ′ →?ad N ; and 2. has the standardization property, so that if M →?ad N and N is an administrative answer, then there is an administrative answer M ′ such that M 7→?ad M ′ →?ad N . Next we need to know how non-standard, or internal, administrative reductions interact with proper standard reductions. In short, they don’t— administrative reductions commute with proper standard reductions (see Fig. 29). 118 M N ′ M ′ N pr ad ad pr (a) Proposition 17.1 M N ′ M ′ N pr ad pr ad (b) Proposition 17.2 M N M ′ N ′ pr ad ad pr + (c) Lemma 18 FIGURE 29. Diagrams of Proposition 17 and Lemma 18. Proposition 17. 1. If M →?ad M ′ 7→pr N and M is an administrative answer, then there is N ′ with M 7→pr N ′ →?ad N . 2. If M ←?ad M ′ 7→pr N , then there is N ′ with M 7→pr N ′ ←?ad N . Proof. 1. If M is an administrative answer, then it is a proper β-redex; let M , (λ(x+). P )(V +). Any administrative reductions in M must take place either in P or in V +; in general, they could take P to some P ′ and V + to some V ′+. Hence M ′ ≡ (λ(x+). P ′)(V ′+). Since M ′ 7→pr N , this means N ≡ P ′{V ′+/x+}, and we take N ′ , P{V +/x+}. 2. Similar.  As a consequence, we have that =ad , which can involve arbitrary administrative reductions in either direction, commutes with proper standard reduction. Lemma 18. If M =ad M ′ 7→pr N , then there is N ′ such that M 7→+pr1 N ′ =ad N . This is a crucial lemma; we will prove it later. 119 Now that we know what administrative reductions don’t do, we should see what they can do: They serve to bring the standard redex in the source term to the top of the CPS term. This entails reifying the evaluation context as a continuation, so that we begin with CJE[M ]K ret and build toward CJMKK, where K is a continuation that “represents” E somehow. 
We can formalize this intuition: Proposition 19. For each evaluation context E in λn or λv and each continuation K, there is a continuation K ′ such that for every term M we have CJE[M ]KK 7→?ad CJMKK ′. Proof. By induction on the structure of E in each calculus. For call-by-name, we have two cases: – If E ≡ , take K ′ , K. – For E ≡ E ′N , we have: CJE ′[M ]NKK 7→ad CJE ′[M ]K(λ¯v. v(CJNK, K)) 7→?ad CJMKK ′ (by I.H.) For call-by-value, we have three cases: – If E ≡ , take K ′ , K. – For E ≡ E ′N , we have: CJE ′[M ]NKK 7→ad CJE ′[M ]K(λ¯v. CJNK(λ¯w. v((λ¯k′. k′w), K))) 7→?ad CJMKK ′ (by I.H.) 120 – Finally, suppose E ≡ V E ′. In our modified CBV calculus, V must be a λ-abstraction and not a variable, so we have: CJ(λx.N)(E ′[M ])KK 7→ad (λ¯k. k(λ(x, k′). CJNKk′))(λ¯v. CJE ′[M ]K(λ¯w. v((λ¯k′′. k′′w), K))) 7→?ad CJE ′[M ]K(λ¯w. (λ¯k. k(λ(x, k′). CJNKk′))((λ¯k′′. k′′w), K)) Note that if V could be a variable, then the CPS transformation would get stuck after the first step, and we would not be able to bring E ′[M ] to the top of the transformed term.  We now want to show that observations in the λn and λv calculi line up with observations of the CPS-transformed terms. In other words, we prove that C meets Criteria 13 to 15. We define our forward simulation ∼ as follows: Definition 20. For a λ-term M (either CBN or CBV) and CPS term P , let M ∼ P when CJMK ret =ad P . Our task is to prove that ∼ satisfies the diagrams in (4.6), making it a forward simulation. To begin, we first prove that “answerness” and “stuckness” are preserved by administrative operations: Proposition 21. If P =ad P ′ and AP , then P ′ ⇓. Proof. By confluence, there must be a term Q such that P →?ad Q and P ′ →?ad Q. Since P is an answer it must have the form retV , therefore the reductions in →?ad must have been within V , so Q must have the form retV ′. Finally, by standardization, since P ′ →?ad retV ′, there must be R with P ′ 7→?ad R →?ad retV ′; since non-standard reductions cannot disturb the top redex, we must have R ≡ retV ′′, so P ′ ⇓.  121 Proposition 22. If P =ad P ′ and sP , then P ′ 6⇓. Proof. Similar to Proposition 21, again invoking confluence and standardization.  We are now ready to prove the third and forth commuting diagrams of Eq. (4.6), showing that ∼ relates answers to answers and stuck terms to stuck terms (up to some remaining steps in the CPS term): Now we can prove the third and fourth commuting diagrams of (4.6): Lemma 23. If M ∼ P and AM , then P ⇓. Proof. Let M , λx.N for some N . Since (λx.N) ∼ P , we know that P =ad CJλx.NK ret =ad ret(λ(x, k′). CJKNk′), so the result is immediate by Proposition 21.  Lemma 24. If M ∼ P and sM , then P 6⇓. Proof. A stuck λ-term, in either CBN or CBV, is one of the form E[x] for some free x. By Proposition 19, CJE[x]K ret 7→?ad CJxKK 7→ad sxK. So CJE[x]K ret 6⇓, and hence by Proposition 22, P 6⇓.  We also have the first commuting diagram, showing that ∼ relates a source term to its translation: Lemma 25. M ∼ CJMK ret. Proof. Unfolding definitions, we need that CJMK ret =ad CJKM ret, which is immediate since =ad is reflexive.  To show correctness of C it remains to satisfy the second commuting diagram, showing that our invariant ∼ is preserved under reduction. We prove this in three steps: 122 1. Show that, if M 7→N by a β-reduction in the empty context, then we have CJMKK 7→?=ad CJNKK (Proposition 27). 2. Allow the reduction M 7→N to occur in any evaluation context, not only at the top of the term (Proposition 28). 3. 
Let the CPS term be any P =ad CJMK ret (Lemma 29). We can get the first step using a simple proposition concerning substitution: Proposition 26. 1. CJMK{CJNK/x} →?ad CJM{N/x}K 2. CJMK{(λ¯k. CJNKk)/x} →?ad CJM{N/x}K Proof. 1. By induction on the structure of M . The most interesting case is when M ≡ x, and thus we actually substitute N for x: CJxK{CJNK/x} , λ¯k. CJNKk By inspection of C, CJNK must be some administrative λ-abstraction of the form (λ¯k′. P ): , λ¯k. (λ¯k′. P )k →ad λ¯k. (P{k/k′}) ≡ λ¯k′. P ≡ CJNK 123 2. Similar, only with one extra reduction at the beginning: CJxK{λ¯k. CJNKk/x} , (λ¯k. xk){(λ¯k. CJNKk)/x} , (λ¯k. (λ¯k. CJNKk)k) →ad λ¯k. CJNKk →ad CJNK (as before)  And now: Proposition 27. If M 7→ N by a reduction in the empty context, then CJMKK 7→+pr1=ad CJNKK. Proof. Since the reduction takes place at the top level, we must have that M is a redex. From here, we must consider CBN and CBV separately: – In CBN, M must have the form (λx.M ′)N ′ with N ≡M ′{N ′/x}, and: CJMKK ≡ CJ(λx.M ′)N ′KK 7→?ad (λ(x, k′). CJMKk′)(λk. CJNKk,K) 7→pr (CJMKk){λk. CJNKk/x} →?ad CJM ′{N ′/x}KK (by Proposition 26) ≡ CJNKK 124 – In CBV, M must have the more specific form (λx.M ′)V with N ≡ M ′{V/x}. Let V , λy.N ′. CJMKK ≡ CJ(λx.M ′)(λy.N ′)KK 7→?ad (λ(x, k′). CJM ′Kk′)((λ¯h. h(λ(y, h′). CJN ′Kh′)), K) , (λ(x, k′). CJM ′Kk′)(CJλy.N ′K, K) , (λ(x, k′). CJM ′Kk′)(CJV K, K) 7→pr CJM ′K{CJV K/x}K →?ad CJM ′{V/x}KK (by Proposition 26) ≡ CJNKK  The second step is to show that a reduction in an evaluation context is performed faithfully by the CPS-transformed term. Proposition 28. If M 7→N , then CJMKK 7→+pr1=ad CJNKK. Proof. By definition of 7→, we have that M ≡ E[M ′] and N ≡ E[N ′], where M ′ 7→N ′ at top level. CJMK ret ≡ CJE[M ′]KK 7→?ad CJM ′KK ′ (by Proposition 19) 7→+pr1=ad CJN ′KK ′ (by Proposition 27) ←[?ad CJE[N ′]K ret (property of K ′ from Proposition 19) ≡ CJNKK  The third step of the proof is the most difficult: We must generalize the hypothesis of Proposition 28 so that it may be chained through multiple reduction steps. This hinges on the commutation lemma, which we now prove: 125 M M ′ Nad pr (a) Hypothesis M ′ N M M ′′ pr ad ad ad (b) Confluence M ′ N M M ′′ M ′′′ pr ad ad ad ad ad (c) Standardization M ′ N M M ′′ · M ′′′ N ′ pr ad ad ad ad ad pr ad pr ad (d) Commutation FIGURE 30. A summary of the proof of Lemma 18. Proof (of Lemma 18). By Proposition 16.1, we know there must be some M ′′ such that M →?ad M ′′ ←?ad M ′. By Proposition 17, M ′′ can take a proper standard reduction, so Proposition 16.2 applies, giving M ′′′ with M 7→?ad M ′′′ →?ad M ′′. From there, we use Proposition 17 to “fill in the arrows” (see Fig. 30).  From there, we have that reduction preserves the invariant ∼: Lemma 29. If M 7→N and M ∼ P , then there is Q such that P 7→+ Q and N ∼ Q. Proof. Since M ∼ P , we have P =ad CJKM ret. By Proposition 28, then, CJMK ret 7→+=ad CJKN ret, so P =ad 7→+pr1=ad CJKN ret. Since 7→+pr1 is short for 7→?ad 7→pr 7→?ad , we have P =ad 7→pr=ad CJKN ret. By Lemma 18, we have Q such that P 7→+Q =ad CJKN ret, so P 7→+Q and N ∼ Q.  With all the pieces of (4.6) in hand, we can show that the uniform CPS transform preserves observations forward: Theorem 30. For any λ-term M and variable k: 126 1. If M ⇓ then CJMK ret ⇓. 2. If M ⇑ then CJMK ret ⇑. 3. If M 6⇓ then CJMK ret 6⇓. Proof. By Lemma 25, we can generalize CJMK ret to any P with M ∼ P . From there, 1 and 3 follow by induction on the reduction sequence and Lemmas 23, 24, and 29. 
2 is immediate from Lemma 29, since P must take at least as many steps as M .  Corollary 31. For any λ-term M and variable k: 1. M ⇓ iff CJMK ret ⇓. 2. M ⇑ iff CJMK ret ⇑. 3. M 6⇓ iff CJMK ret 6⇓. Proof. The forward directions are Theorem 30. For the backward direction of 1, assume M does not reduce to an answer. Then it must either diverge or become stuck. By Theorem 30, in either case, CJMK ret must do the same, and thus (by determinism) it cannot reduce to an answer; by contraposition, CJMK ret ⇓ implies M ⇓. The other clauses are similar.  Correctness of the Naming Transform Passing names instead of values has a subtle effect on the execution of a program: The CPS terms now express sharing. Since values aren’t copied but are shared among subterms, relating reductions of unnamed terms to those of named terms requires care. For instance, consider this CPS term: M , (λx. f(x, x))(λx.N) 127 It β-reduces and duplicates (λx.N): M 7→f(λx.N, λx.N) Now consider M under the naming transform: N JMK , νy. y := λx.N JNK in (λx. f(x, x))y It only duplicates the name y: N JMK 7→νy. y := λx.N JNK in f(y, y) Notice, however, that if we reduce M and then translate, we get something different: N Jf(λz.N, λz.N)K , νx. x := λz.N JNK in νy. y := λz.N JNK in f(x, y) Now there is no sharing of the value λz.N . In short, reduction does not commute with naming: Reducing the named term can produce shared references that do not appear when naming the reduced term. However, differences in sharing do not affect the outcome of the computation. Therefore we seek a way to reason up to sharing—that is, we want to consider a term with the same computational content, but more sharing, as “close enough.” A straightforward way to remove sharing from the picture is to consider terms under a readback function that “flattens” a term’s bound 128 variables, returning it to the unnamed form: N−1〈〈f(x+)〉〉 , f(x+) N−1〈〈νx. x := λ(x+). N inM〉〉 , N−1〈〈M〉〉{λ(x+).N−1〈〈N〉〉/x} We can now define our invariant as follows: Definition 32. For an unnamed CPS term M and a named CPS term P , let M ∼ P when M ≡ N−1〈〈P 〉〉. To show correctness of the naming step we will show that ∼ is a backward simulation: If M ∼ P and P 7→P ′ then there is M ′ with M 7→?M ′ and M ′ ∼ P ′. Note that, because deref reductions disappear under N−1, the correspondence between reductions is not one-to-at-least-one, as it was in (4.6). We relied on this property in proving Theorem 30.2, so we will have to adjust our reasoning this time. First, we prove that reductions are preserved: Proposition 33. Given a λcps,vn term P : 1. If P 7→β P ′, then N−1〈〈P 〉〉 7→N−1〈〈P ′〉〉. 2. If P 7→deref P ′, then N−1〈〈P 〉〉 ≡ N−1〈〈P ′〉〉. Proof. Given any context E in the named CPS language, let σE be the substitution built up by the unnaming function as it traverses E. (In other words, σE is the substitution such that N−1〈〈E[P ]〉〉 ≡ N−1〈〈P 〉〉σE.) 129 1. If P 7→P ′ by the β rule, we have that: P ≡ E[(λ(x+). Q)(y+)] P ′ ≡ E[Q{y+/x+}] N−1〈〈P 〉〉 , (λ(x+).N−1〈〈Q〉〉)(y+)σE 7→N−1〈〈Q〉〉{y+/x+}σE ≡ N−1〈〈E[Q{y+/x+}]〉〉 ≡ N−1〈〈P ′〉〉 2. If P 7→P ′ by the deref rule, we have that: P ≡ E[νf. f := λ(x+). Q in E ′[f(y+)] P ′ ≡ E[νf. f := λ(x+). Q in E ′[(λ(x+). Q(y+)]] N−1〈〈P 〉〉 , f(y+)σE′{λ(x+). Q/f}σE ≡ (λ(x+). Q)(y+)σE′{λ(x+). Q/f}σE ≡ N−1〈〈P ′〉〉  Now we can show that N−1 preserves observable behavior: Theorem 34. 1. If N JMK ⇓ then M ⇓. 2. If N JMK ⇑ then M ⇑. 3. If N JMK 6⇓ then M 6⇓. Proof. 
The initial condition M ∼ N JMK follows from the fact that N and N−1 form a retraction pair: for any unnamed CPS term M , N−1〈〈N JMK〉〉 ≡M . The 130 invariant holds by Proposition 33. Since answers and stuck terms are virtually the same between unnamed and named terms, the final conditions are trivial. It only remains to show that divergence is preserved. This is easy, however, as each deref -reduction always produces a β-redex, so M and N JMK must take the same number of β-reductions.  As we did for the uniform CPS transform (only in reverse), we get the forward directions from the backward ones, completing the correctness proof. Corollary 35. 1. M ⇓ iff N JMK ⇓. 2. M ⇑ iff N JMK ⇑. 3. M 6⇓ iff N JMK 6⇓. Proof. The backward directions are Theorem 34. The forward directions use case analysis and determinism in the same way as Corollary 31.  4.4 CPS and Processes The pi-calculus describes computation as the exchange of simple messages by independent agents, called processes. Each term in the pi-calculus describes a process, and processes are built by composing them together in parallel, prefixing them with I/O actions, and replicating them. Communication takes place over channels, each of which has a name; processes interact when one is writing to a channel and, in parallel, another is reading from it. The values sent over the channels are themselves channel names, so processes can discover each other dynamically. Names act much like variables in the λ-calculus, with α-equivalent terms identified in the same way. They can be allocated by the ν construct, which guarantees that the name it binds will be distinct from any other allocated or free name. 131 Processes: P,Q ::= x〈y+〉 ∣∣ x(y+). P ∣∣ (P |Q) ∣∣ !P ∣∣ νz P x〈y+〉 | x(z+). P →P{y+/z+} P ≡ P ′ P ′ →Q′ Q′ ≡ Q P →Q P →P ′ P |Q→P ′ |Q P →P ′ νz P →νz P ′ P |Q ≡ Q | P !P ≡ !P | P (P |Q) |R ≡ P | (Q |R) (νz P ) |Q ≡ νz (P |Q) νx νy P ≡ νy νx P if z not free in Q FIGURE 31. A fragment of the pi-calculus. The syntax and semantics for the fragment of the pi-calculus we are considering are given in Fig. 31. This fragment is called the asynchronous pi- calculus because there are no processes of the form x〈y〉. P . In other words, no process is ever blocked waiting for a write operation to complete. This property reflects the behavior of CPS terms: They never wait for a subterm to compute, instead providing a continuation that performs the remaining work. Processes in the pi-calculus are meant to be considered up to a relation called structural congruence, which we write as ≡.5 The rules (other than those making ≡ an equivalence relation and a congruence) are given in Fig. 31. Reductions are closed up to structural congruence (as well as parallel composition and name allocation). Besides eliminating unimportant differences such as the order of parallel composition, structural congruence accounts for the spawning of replicated processes and the scoping of allocated names. 5In the pi-calculus literature, structural congruence is usually written as ≡. 132 pi from CPS Despite the radically different approaches to expressing computation, CPS transforms and pi-encodings are interrelated. In particular, given a CPS transform, one can systematically derive a pi-encoding from it (Sangiorgi, 1999; Sangiorgi & Walker, 2003). 
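For concreteness, the fragment of Fig. 31 can be transcribed as a small Haskell datatype. The sketch below is only illustrative: names are strings, the constructor names are our own, and only the top-level communication axiom is implemented, leaving out closure under structural congruence, parallel composition, and ν, as well as capture-avoiding renaming.

type Name = String

data Proc
  = Send Name [Name]        -- x<y+> : asynchronous output of y+ on channel x
  | Recv Name [Name] Proc   -- x(y+). P : input on x, binding y+ in P
  | Par Proc Proc           -- P | Q
  | Repl Proc               -- !P
  | New Name Proc           -- νz P
  deriving Show

-- The communication axiom, applied only to a top-level parallel pair:
--   x<y+> | x(z+). P  →  P{y+/z+}
step :: Proc -> Maybe Proc
step (Par (Send x ys) (Recv x' zs p))
  | x == x' && length ys == length zs = Just (subst (zip zs ys) p)
step _ = Nothing

-- Name-for-name substitution; we assume the substituted names are not
-- rebound inside the term, so no capture-avoiding renaming is attempted.
subst :: [(Name, Name)] -> Proc -> Proc
subst s pr = case pr of
  Send x ys   -> Send (ren x) (map ren ys)
  Recv x zs p -> Recv (ren x) zs (subst (without zs) p)
  Par p q     -> Par (subst s p) (subst s q)
  Repl p      -> Repl (subst s p)
  New z p     -> New z (subst (without [z]) p)
  where
    ren n      = maybe n id (lookup n s)
    without bs = [ (a, b) | (a, b) <- s, a `notElem` bs ]

For example, step (Par (Send "x" ["a"]) (Recv "x" ["z"] (Send "z" ["z"]))) yields Just (Send "a" ["a"]).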
The major difficulty in deriving a pi-encoding from a CPS transform arises from an important difference: Functions in the λ-calculus can take functions as arguments, but processes in the pi-calculus do not send processes over channels, only names. In other words, λ is higher-order, but pi is first-order. But we have already addressed this mismatch: The value-named CPS language λcps,vn is “first-order” in much the same way. In fact, nearly every construct in the named CPS calculus λcps,vn corresponds directly to a construct in the pi-calculus: – An application x(y+) becomes a process x〈y+〉, which performs a write on channel x, then halts. The tuple (y+) is transmitted over x. – Each binding νx. x := λ(y+). N inM becomes a process of the form νx (P | !x(y+). Q). This process allocates a fresh channel name x, then runs a process P in parallel with the process !x(y+). Q. The latter acts as a “server”: It listens on the channel x for a request, then runs the process Q with the request’s values as arguments. The ! makes the server process replicated, so that it handles any number of requests over time. The only terms without couterparts are applications with λ-abstractions in head position—that is, β-redexes. But we can handle these by reducing them during translation. 133 Thus we can faithfully translate λcps,vn to the pi-calculus: PJV (y+)K , PJV Ky+ PJνx. x := λ(y+). N inMK , νx (PJMK | !x(y+).PJNK) PJfKy+ , f〈y+〉 PJretKy+ , ret〈y+〉 PJλ(x+).MKy+ , PJMK{y+/x+} The subscripted form of P translates a term, given the arguments it is being applied to, performing β-reduction as needed. To translate ret, we simply assume some fresh pi-calculus channel name ret. Finally, we obtain the pi-calculus encoding (Fig. 32) by running the uniform CPS transform C through the naming transform N and then through the pi-calculus translation P. The final product coincides with the established uniform pi-encoding (Sangiorgi & Walker, 2003).6 Correctness For a pi-calculus term P , if P is capable of performing a write on the free channel name k (possibly after some reductions), we write P ⇓k. A named CPS term signals termination by invoking initial continuation ret, which is translated to a write on ret. Hence we expect the pi-encoded term to write on ret if and only if the named CPS term would invoke ret: Lemma 36. For any λcps,vn term M , M ⇓ iff PJMK ⇓ret. 6In fact, the final transform in Fig. 32 differs slightly from the uniform pi-encoding in the literature, in that all input processes are replicated, even those used at most once (e.g.continuation processes). However, this is harmless, as garbage collection is sound in the pi-calculus (up to bisimulation). The call-by-need pi-encoding will keep some processes unreplicated, as this is necessary for correctness. 134 CJxK , λk. xk CJλx.MK , λk. k(λ(x, k′). CJMKk′) CJMNK , {λk. CJMK(λv. v(λk′. CJNKk′, k)) CBN λk. CJMK(λv. CJNK(λw. v(λk′. k′w, k))) CBV CvnJxK , λk. xk CvnJλx.MK , λk. νf. f := λ(x, k′). CvnJMKk′ in kf CvnJMNK ,  λk. νk′. k′ := (λv. νx. x := λk′. CvnJNKk′ in v(x, k)) in CvnJMKk′ CBN λk. νk′. k′ := λv. νk′′. k′′ := (λw.νx. x := λk′. k′w in v(x, k)) in CvnJNKk′′  in CvnJMKk′ CBV EJxKk , x〈k〉 EJλx.MKk , νf (k〈f〉 | !f(x, k). EJMKk) EJMNKk ,  νk′ (EJMKk′ | !k′(v).νx (v〈x, k〉 | !x(k′′). EJNKk′′)) CBN νk′ (EJMKk′ | k′(v).νk′′ (EJNKk′′ | !k′′(w).νx (v〈x, k〉 | !x(k′). k′〈w〉))) CBV FIGURE 32. The CPS transform, named CPS transform, and pi-encoding for CBN and CBV. 135 Proof. 
We need only that our syntactic embedding of λcps,vn into pi is also a semantic embedding—in other words, that reductions in the CPS term correspond to reductions in the pi-term and vice versa. The only catch is that β-redexes from the CPS term disappear under P; however, P is still a bisimulation. Note that any CPS term E[M ] will translate to processes representing the bindings in E in parallel with the process representing M . Thus we can consider translating E and M separately. Then: νf. f := λ(x+).M in E[f(y+)] 7→? νf. f := λ(x+).M in E[M{y+/x+}] corresponds to νf (f〈y+〉 | PJEK | !f(x+).PJMK) 7→νf (M{y+/x+} | PJEK | !f(x+).PJMK). It is straightforward to construct a grammar of possible pi-terms produced by P to show that the correspondence works in both directions.  Finally, we have the complete proof of the correctness of the uniform pi-encoding, simply by composing: Theorem 37. For any λn or λv term M , M ⇓ if and only if EJMKret ⇓ret. Proof. Note that EJMKret ≡ PJN JCJMK retKK; the result follows by Corollaries 31 and 35 and Lemma 36.  136 Uniform Abstract Machine In addition to the pi-encoding, we can derive other artifacts from the uniform CPS transform. In particular, through the functional correspondence (Ager, Danvy, & Midtgaard, 2004), we obtain a uniform abstract machine for CBN and CBV. See Fig. 33. The CBN fragment of the machine is essentially the same as the well- known Krivine machine for CBN evaluation (Krivine, 2007). On closed terms, the CBV fragment strongly resembles the CEK machine (Felleisen & Friedman, 1986) (without control operators). Unlike most environment-based abstract machines, ours does not exclude open terms, and thus its behavior can meaningfully be more finely specified: Variable lookup happens when a variable is evaluated, and only λ-abstractions are treated as values. We could instead delay the variable lookup until a λ-abstraction is required; this machine would implement the full CBV β-rule. Much as with the uniform CPS transform, we can observe the difference using a term such as (λx. λy. y)z, which is stuck according to the uniform abstract machine. 4.5 Call-by-Need and Constructive Update As we have seen, CBV usually takes fewer evaluation steps to reach an answer than CBN. However, CBV evaluation wastes work whenever a function does not use its argument. The call-by-need λ-calculus (Ariola et al., 1995; Ariola & Felleisen, 1997) is efficient in both cases: Unneeded arguments are never computed, yet each argument is evaluated at most once. Hence call-by- need models efficient implementations of lazy evaluation, which memoize, or cache, each computed value. The syntax and semantics are given in Fig. 34. Rather than perform a substitution, the βneed rule suspends the argument in a let binding. The grammar for evaluation contexts expresses lazy evaluation order: Given a term 137 Terms: M,N ::= x ∣∣ λx.M ∣∣MN Continuations: k ::= KRet ∣∣ KAppNM, k, ρ∣∣ KAppV 1M,k, ρ∣∣ KAppV 2λx.M, k, ρ Environments: ρ ::=  ∣∣ ρ[x = KClosM, ρ] States: S ::= 〈M,k, ρ〉M ∣∣ 〈k, λx.M, ρ〉K∣∣ 〈λx.M, ρ〉H M 7→〈M,KRet, 〉M 〈x, k, ρ〉M 7→〈M,k, ρ′〉M where ρ(x) ≡ KClosM, ρ′ 〈λx.M, k, ρ〉M 7→〈k, λx.M, ρ〉K 〈MN, k, ρ〉M 7→〈M,k′, ρ〉M where k′ , { KAppNN, k, ρ CBN KAppV 1N, k, ρ CBV 〈KRet, λx.M, ρ〉K 7→〈λx.M, ρ〉H 〈KAppNN, k, ρ′, λx.M, ρ〉K 7→〈M,k, ρ′′〉M where ρ′′ , ρ[x = KClosN, ρ′] 〈KAppV 1N, k, ρ′, λx.M, ρ〉K 7→〈N, k′, ρ′〉M where k′ , KAppV 2λx.M, k, ρ 〈KAppV 2λy.N, k, ρ′, λx.M, ρ〉K 7→〈N, k, ρ′′〉M where ρ′′ , ρ′[y = KClosλx.M, ρ] FIGURE 33. Uniform abstract machine for CBN and CBV. 
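The machine of Fig. 33 can likewise be transcribed directly into Haskell. The sketch below makes the usual representation choices (strings for variables, association lists for environments) and is not part of the formal development; as in the figure, variable lookup happens when a variable is evaluated, only λ-abstractions are treated as values, and a free variable leaves the machine stuck, which the sketch models as Nothing.

data Strategy = CBN | CBV

data Term = Var String | Lam String Term | App Term Term
  deriving Show

data Clos = Clos Term Env                  -- KClos<M, ρ>
  deriving Show
type Env  = [(String, Clos)]

data Kont
  = KRet
  | KAppN  Term Kont Env                   -- CBN: the unevaluated argument
  | KAppV1 Term Kont Env                   -- CBV: evaluate the argument next
  | KAppV2 Term Env Kont                   -- CBV: the function value awaiting its argument

run :: Strategy -> Term -> Maybe (Term, Env)
run st m0 = eval m0 KRet []
  where
    -- <M, k, ρ>_M : examine a term
    eval (Var x) k env = case lookup x env of
      Just (Clos m env') -> eval m k env'
      Nothing            -> Nothing        -- free variable: the machine is stuck
    eval v@(Lam _ _) k env = apply k v env
    eval (App m n)   k env = case st of
      CBN -> eval m (KAppN  n k env) env
      CBV -> eval m (KAppV1 n k env) env

    -- <k, λx.M, ρ>_K : return a value to the continuation
    apply KRet                      v env = Just (v, env)                 -- halt state
    apply (KAppN n k env')  (Lam x m) env = eval m k ((x, Clos n env') : env)
    apply (KAppV1 n k env')         v env = eval n (KAppV2 v env k) env'
    apply (KAppV2 (Lam y m) envF k) v env = eval m k ((y, Clos v env) : envF)
    apply _ _ _ = Nothing

With this sketch, run CBV (App (Lam "x" (Lam "y" (Var "y"))) (Var "z")) is Nothing, since the CBV fragment insists on looking up the free variable z, whereas run CBN on the same term halts with the closure for λy. y.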
138 Expressions: M,N ::= x ∣∣ λx.M ∣∣MN ∣∣ letx = M inN Values: V ::= λx.M Answers: A ::= V ∣∣ letx = M inA Evaluation Contexts: E,F ::= [] ∣∣ EM ∣∣ letx = E inF [x]∣∣ letx = M inE (λx.M)N →letx = N inM βneed (let y = L inA)N →let y = L inAN lift letx = V inE[x]→letx = V inE[V ] deref letx = (let y = L inA) inE[x]→let y = L in letx = A inE[x] assoc FIGURE 34. The call-by-need λ-calculus, λneed , of Ariola et al. (Ariola, Felleisen, Maraist, Odersky, & Wadler, 1995). letx = M inN , we evaluate N until it becomes either a value or a term of the form F [x] for some evaluation context F . Such a term needs the value of x to continue, so now we evaluate the term M that x is bound to. But since this computation is done in place, it only needs to be done once: After M becomes a value, this value will simply be substituted directly (by the deref rule) if x is needed again. If N instead becomes a value, then we say the whole term is an answer—a value surrounded by a number of let bindings. An answer in call-by-need is “almost a value.” Evaluation stops when a term becomes an answer, but it’s not a value for the purposes of the β or deref rule. When a subterm evaluates to an answer, either the lift or the assoc rule moves each binding into the outer environment. 139 To illustrate, we turn to our previous example, (λx. xx)((λy. y)(λz. z)). It reduces as follows: (λx. xx)((λy. y)(λz. z)) 7→βneed letx = (λy. y)(λz. z) inxx 7→βneed letx = (let!y = λz. z in y) inxx 7→deref letx = (let y = λz. z inλz. z) inxx 7→assoc let y = λz. z in let!x = λz. z inxx 7→deref let y = λz. z in letx = λz. z in (λz. z)x 7→βneed let y = λz. z in let!x = λz. z in let z = x in z 7→deref let y = λz. z in letx = λz. z in let!z = λz. z in z 7→deref let y = λz. z in letx = λz. z in let z = λz. z inλz. z Evaluation begins in a call-by-name manner, in the sense that the outer β-redex is reduced immediately. The argument is suspended in a let binding. Next, since x is in head position in the body, we need to evaluate it, and so we reduce the inner β-redex. However, since the reduction is done in place, this step will only be done once. In all, three β-reductions are performed, as few as is done by call-by-value. Unlike call-by-value, when a function ignores its argument, call-by-need does not waste work. In extreme cases, call-by-value never finishes when call-by- name or call-by-need would. For example, consider the term (λx. λy. y)Ω, 140 where Ω is a term (λx. xx)(λx. xx) that diverges. Call-by-name substitutes the Ω immediately (the substitution is trivial since x does not occur in the body). Call-by-need similarly suspends the Ω without attempting to evaluate it. Call-by-value insists on computing the argument first, and thus is caught in an infinite loop. Continuation-Passing Style There does exist a call-by-need CPS transform due to Okasaki et al. (Okasaki et al., 1994) It requires mutable storage, which our CPS languages do not support. However, suppose we borrow the assignment syntax from the named CPS language λcps,vn . Then we can build on the uniform CPS transform (Fig. 24) and use a call-by-need application rule: CJMNK , λk. CJMK(λv. νx. x:=( λk′. CJNK(λw. x := λk′′. k′′w in k′w)) in v(x, k)) This is roughly the same as in the Okasaki CPS transform. Unfortunately, this is not valid λcps,vn syntax, as the assignment operator := is only allowed immediately inside a ν binding for the variable assigned to. As a result, λcps,vn only allows an assignment to a variable that presently has no value. The inner continuation, λw. 
x := λk′′. k′′w in k′w, violates this restriction by attempting to “overwrite” x. We will call this a double assignment. Of course, this is precisely what we wish to happen: The term bound to x should change, in order to cache the computed value. But we don’t need the full power of mutable storage; a much weaker effect will suffice. To see this, suppose for a moment we allow := anywhere, with the semantics of destructive update (that is, each assignment overwrites any 141 previous one). Inspecting the rule, we see that each variable is now assigned to (at most) twice: Once when it is initialized with a thunk, and again when the thunk’s result is memoized. However, after the second assignment, the stored value never changes again. Furthermore, note that the initial thunk cannot refer to x, even indirectly, as x is not in the scope of the computation (our let is not recursive). Therefore the initial thunk is only used once; since that very thunk performs the second assignment, the first lookup must precede the second assignment, with no other accesses in between. In the language of data-flow analysis, after the first lookup, x cannot be live. Hence its value does not matter. In other words, it may as well have no value. If we clear x after the first lookup, then the second assignment is just like the first: It is giving a value to a variable that currently has none. There is no double assignment. This analysis suggests a special assignment operation that always clears the variable the next time it is used. The assigned value will therefore only be used once, and thus the assignment is ephemeral, as opposed to permanent. After a permanent assignment, the variable will never be cleared, so permanent assignments are final. The Transform Writing x := M inN for a permanent assignment and x :=1 M inN for an ephemeral assignment, we can modify the call-by-need CPS transform so that it does not require destructive update: CJMNK , λk. CJMK(λv. νx. x :=1( λk′. CJNK(λw. x := λk′′. k′′w in k′w)) in v(x, k)) 142 CJxK , λk. xk CJλx.MK , λk. k(λ(x, k′). CJMKk′) CJMNK , λk. CJMK(λv. νx. x :=1( λk′. CJNK(λw. x := λk′′. k′′w in k′w)) in v(x, k)) CJletx = L inMK , λk. νx. x :=1( λk′. CJLK(λw. x := λk′′. k′′w in k′w)) in CJMKk FIGURE 35. A call-by-need CPS transform using constructive update. Since the initial thunk is now assigned ephemerally, there is never a double assignment. In fact, we can prove so: By the above data-flow analysis, x is unassigned before the second assignment. Each assignment is performed by a term that is used at most once, and thus no further assignments will be attempted. Note what has happened here: x takes on different values over time, due to multiple assignments. Therefore it is fair to say it was updated. However, no previous value was destroyed by any update, and in fact a previous value cannot be destroyed. In this language, updates only construct, never destroy; hence we call the phenomenon constructive update. This CPS transform, summarized in Fig. 35, is the one we will relate to the pi-calculus. The syntax and semantics for permanent and ephemeral assignment are given in Fig. 36. The deref 1 rule is similar to deref , only it removes the ephemeral assignment. We have shown that terms produced by the call-by-need CPS transform never attempt a double assignment—that is, they never reduce to a term such as x := V in x := W inM . 
Let us call such terms safe: 143 Terms: M,N ::= V (V +) ∣∣ νx.M∣∣ x := λ(x+).M inN∣∣ x :=1 λ(x+).M inN Values: V ::= x ∣∣ ret ∣∣ λ(x+).M Binding Contexts: B ::=  ∣∣ νx.B∣∣ x := λ(x+).M inB∣∣ x :=1 λ(x+).M inB Evaluation Contexts: E ::= B (λ(x+).M)(V +)→M{V +/x+} β f := λ(x+).M in E[f(V +)] → f := λ(x +).M in E[(λ(x+).M)(V +)] deref f :=1 λ(x +).M in E[f(V +)] →E[(λ(x+).M)(V +)] deref 1 FIGURE 36. The CPS λ-calculus with constructive update, λ:=1cps . Definition 38. A λ:=1cps term M is safe when it does not reduce to a term with a subterm of the form x :=∗ V in E[x :=∗W inN ], where :=∗ stands for either := or :=1 in each appearance. Proposition 39. For any M and K, if K is a variable, ret, or λv. P where P is safe, then CJMKK is safe. Proof. See the above data-flow analysis.  Naming and pi-Encoding Now that we have the call-by-need CPS transform, we need only adapt the development in Sections and 4.4 to derive the pi-calculus encoding. 144 In Section 4.1, we were somewhat sloppy in deriving a pi-encoding from the CPS—all input processes in the encoding were replicated, even those that are provably used at most once. Now that we have ephemeral assignment, we can be more precise. Some λ-abstractions are affine, i.e. never duplicated; these are the ones representing continuations or suspended computations. We can mark the λs in the CPS transform to indicate which values are affine: CJxK , λk. xk CJλx.MK , λk. k(λ(x, k′). CJMKk′) CJMNK , λk. CJMK(λ1v. νx. x :=1( λk′. CJNK(λ1w. x := λk′′. k′′w in k′w)) in v(x, k)) CJletx = L inMK , λk. νx. x :=1( λk′. CJLK(λ1w. x := λk′′. k′′w in k′w)) in CJMKk Now we augment the naming transform to treat affine values specially: N JV (λ1(x+).M)K , νy. y :=1 λ(x+).N JMK inN JV (y)K Finally, the pi-calculus translation P should use an unreplicated process to simulate ephemeral assignment: PJx :=1 λ(y+).M inNK , PJNK | x(y+).PJMK 145 CJxK , λk. xk CJλx.MK , λk. k(λ(x, k′). CJMKk′) CJMNK , λk. CJMK(λv. νx. x :=1 ( λk′. CJNK(λw. x := λk′′. k′′w in k′v)) in v(x, k)) CJletx = L inMK = λk. νx. x :=1 (λk′. CJLK(λw. x := λk′′. k′′w in kw)) in CJMKk CvnJxK , λk. xk CvnJλx.MK , λk. νf. f := λ(x, k′). CvnJMKk′ in kf CvnJMNK , λk. νh. h :=1 λv. νx. x :=1 λk ′. νh′. h′ :=1 λv. x := λk′′. k′′v in k′v in CvnJNKh′  in v(x, k) in CvnJMKh CvnJletx = L inMK , λk. νx. x :=1 λk. νh ′. h′ :=1 λv. x := λk. kv in kv in CvnJLKh′  in CvnJMKk EJxKk , x〈k〉 EJλx.MKk , νf (k〈f〉 | !f(x, k). EJMKk) EJMNKk , νh (EJMKh | h(v).νx (v〈x, k〉 | x(k′).νh′ (EJNKh′ | h′(w). (k′〈w〉 | !x(k′′). k′′〈w〉)))) EJletx = L inMKk , νx (EJMKk | x(k′).νh′ (EJLKh′ | h′(w). (k′〈w〉 | !x(k′′). k′′〈w〉))) FIGURE 37. The CPS transform, named CPS transform, and pi-encoding for call-by-need. Putting these transforms together (Fig. 37), we arrive at the same call-by- need pi-encoding found in the literature (Brock & Ostheimer, 1995; Sangiorgi & Walker, 2003). Correctness Now we establish the correctness of the call-by-need CPS transform C. We do so by further decomposing it into three steps: A switch to a call-by-need calculus with rules that act at a distance; an annotation step; and simulation proofs for the CPS transform on annotated terms. 146 Binding Contexts: B ::= [] ∣∣ letx = M inB B[λx.M ]N →B[letx = N inM ] βd letx = B[V ] inE[x]→B[letx = V inE[V ]] derefd FIGURE 38. Reductions for the distance call-by-need λ-calculus, λdneed . Distance Rules As presented, λneed (Fig. 34) has certain reductions—the lift and assoc rules—that hardly seem to perform any computation. 
They only shuffle bindings around in preparation for a βneed - or deref -reduction, respectively. In fact, if CJMK is always an administrative λ-abstraction7, then lift- and assoc- reductions are simulated as administrative reductions alone. In a sense, the lift and assoc rules are administrative: They only serve to bring the parts of a redex together. We can avoid administrative work in the source calculus by using a suggestion of Accattoli (Accattoli, 2013) for the pi-calculus: We express λneed using rules that apply at a distance, that is, where parts of a redex are separated by an evaluation context.8 The new calculus, λdneed , supplants the fine- grained lift and assoc rules with coarser β and deref rules. The syntax is the same as for λneed (Fig. 34), except that we specify that some evaluation contexts are binding contexts ; the reductions are given in Fig. 38. 7We could always not mark these as administrative, but then we would lose the flexibility that administrative congruence provides. 8Of course, the deref rule already works this way in part. An alternative is suggested by Chang and Felleisen (Chang & Felleisen, 2012), who use distance rules and manage to do away with the deref rule as well, but at the cost of a more complex β-rule. 147 Proposition 40. An λneed term reduces to an answer, diverges, or gets stuck by the distance rules if and only if it does so by the original rules. Proof. Since 7→βd is the same as 7→?lift 7→βneed , 7→derefd is the same as 7→?assoc 7→deref , and the language is deterministic, it suffices to define a backward simulation that compares terms up to lift and assoc.  Annotations The single let construct does not tell the full story of a standard reduction sequence in λneed : There is an implicit statefulness that is made manifest by the CPS transform. Specifically, there are three stages in the life cycle of a let binding: Suspended Initially, the binding letx = M inN represents a suspended computation. Computation takes place within N . Active For x to be demanded, N must reduce to the form E[x]. Then the binding becomes active, with the form letx = M inE[x], and computation takes place within M . Memoized Eventually, M becomes an answer B[V ], and the body E[x] receives V while the bindings in B are added to the environment. Subsequently, the binding is letx = V inN , and computation takes place within N . The CPS translation exposes this state, which makes it difficult to relate a term to its CPS form: In what state is letx = V inE[x]? In fact, it could be in any of the three states, and thus the running CPS program could have any of three forms. Therefore we annotate each let, giving it a subscript s, a, or v (for suspended, active, or value, respectively). We will need new reduction rules, 148 Expressions: M,N ::= x ∣∣ V ∣∣MN∣∣ lets x = M inN∣∣ leta x = M inE[x]∣∣ letv x = V inN Evaluation Contexts: E,F ::=  ∣∣ EM∣∣ lets x = M inE∣∣ leta x = E inF [x]∣∣ letv x = V inE Binding Contexts: B ::=  ∣∣ lets x = M inB∣∣ letv x = V inB B[λx.M ]N →B[lets x = N inM ] βa lets x = M inE[x]→leta x = M inE[x] act leta x = B[V ] inE[x]→B[letv x = V inE[V ]] deact letv x = V inE[x]→letv x = V inE[V ] deref a FIGURE 39. The annotated call-by-need λ-calculus, λaneed . which we call act and deact , to represent binding state transitions; from the perspective of λdneed , these will be administrative. See Fig. 39 for the resulting calculus λaneed . We write AJMK for the annotation of a λdneed term M , which consists simply of tagging each let with s. 
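Concretely, the annotated syntax and the annotation function can be sketched in Haskell as follows; the constructor names are our own and the sketch is only illustrative.

data Expr                                  -- λneed terms
  = Var String
  | Lam String Expr
  | App Expr Expr
  | Let String Expr Expr                   -- let x = M in N
  deriving Show

data Stage = Susp | Active | Memo          -- the s, a, and v subscripts of Fig. 39
  deriving Show                            -- (suspended, active, memoized)

data AExpr                                 -- annotated terms
  = AVar String
  | ALam String AExpr
  | AApp AExpr AExpr
  | ALet Stage String AExpr AExpr          -- let_stage x = M in N
  deriving Show

-- A: every let of the source term is tagged as suspended.
annotate :: Expr -> AExpr
annotate (Var x)     = AVar x
annotate (Lam x m)   = ALam x (annotate m)
annotate (App m n)   = AApp (annotate m) (annotate n)
annotate (Let x m n) = ALet Susp x (annotate m) (annotate n)

The act and deact rules of Fig. 39 are then the only rules that change the stage of a binding: act moves it from Susp to Active, and deact from Active to Memo.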
Our λaneed is much like previous languages with similar goals. In particular, both Brock and Ostheimer (Brock & Ostheimer, 1995) and Danvy and Zerny (Danvy & Zerny, 2013) include versions of what we call leta, the “active let.” Proposition 41. 1. M ⇓ if and only if AJMK ⇓. 2. M ⇑ if and only if AJMK ⇑. 3. M 6⇓ if and only if AJMK 6⇓. 149 Annotated CPS Transform Now we must show that the call-by-need CPS transform, as a function from λaneed to λ :=1 cps , is correct. First, we need to extend the CPS transform to the annotated terms. See Fig. 40, where we have also marked which λ-abstractions are administrative. Most of the first part of Ca is unsurprising: A lets translates as a suspended computation, as appears in CaJMNK. A letv translates as a memo-thunk. However, leta is a challenge. To see why, consider a suspended computation: CaJlets x = M inNKK =ad νx. x :=1 λ¯k. · · · in CaJNKK The computation of x is suspended, pending its need. Then, x will be needed when N reduces to a term of the form E[x]. At that point in the computation, we expect CaJE[x]K to have evaluated to some term E ′[xK ′], where E ′ is a CPS evaluation context and K ′ is a continuation. CaJlets x = M inE[x]KK =ad νx. x :=1 λ¯k. · · · in E ′[xK ′] What happens next is that the deref 1 rule fires: νx. x :=1 λ¯k. · · · in E ′[xK ′] 7→νx.E ′[(λ¯k. · · · )K ′] Since this deref 1-reduction is what activates the computation of x, we expect that νx.E ′[(λ¯k. · · · )K ′] should be the shape of the CPS term corresponding to an active let. But this means we must be able to translate evaluation contexts as well as terms. For the uniform CPS transform, we were content to have a 150 CaJxK , λ¯k. xk CaJλx.MK , λ¯k. k(λ¯(x, k′). CaJMKk′) CaJMNK , λ¯k. CaJMK(λv. νx. x :=1 λ¯k ′. CaJNK(λv. x := λk′′. k′′v in k′v) in v(x, k)) CaJlets x = M inNK , λ¯k. νx. x :=1 λ¯k′. CaJMK(λv. x := λ¯k′′. k′′v in k′v) in CaJNKk CaJleta x = M inE[x]K , λ¯k. νx. CaLEM[λ¯k′. CaJMK(λv. x := λ¯k′′. k′′v in k′v)]k CaJletv x = V inNK , λ¯k. νx. x := CaJV K in CaJNKk CaLM ,  CaLENM , λ¯k. CaLEM(λv. νx. x :=1 λ¯k ′. CaJNK(λv. x := λ¯k′′. k′′v in k′v) in v(x, k)) CaLlets x = M inEM , λ¯k. νx. x :=1 λ¯k′. CaJMK(λv. x := λ¯k′′. k′′v in k′v) in CaLEMk CaLleta x = E inF [x]M , λ¯k. νx. CaLF M[λ¯k′. CaLEM(λv. x := λ¯k′′. k′′v in k′v)]k CaLletv x = V inEM , λ¯k. νx. x := CaJV K in CaLEMk FIGURE 40. The call-by-need CPS transform on annotated terms and contexts, Ca. 151 lemma (Proposition 19) that merely asserted that there was some K that served to translate a source evaluation context. Now that contexts are a crucial part of the CPS transform, we need to make the translation explicit. Hence the context part of the CPS transform, which we write using round brackets, like CaLEM. Its definition is easily derived from that of Ca, so that we have: Proposition 42. CaJE[M ]K ≡ CaLEM[CaJMK] Proof. An easy induction on E.  Again we define an administrative congruence relation. We will want to be able to rearrange bindings when it is safe to do so; accordingly, we adopt a lift rule, a generalization of the lift rule from call-by-need: E[E ′[C]]→ad E ′[E[C]] lift (λ¯(x+).M)(V +)→ad M{V +/x+} βad As usual, the transitive closure of →ad is →+ad , and its reflexive and transitive closure is →?ad . Its reflexive, symmetric, and transitive closure, restricted9 to safe terms, is =ad . We will also use 7→ad to refer to an invocation of βad at the top level, which we call an administrative standard reduction. (A lift-reduction is never proper, as it is never necessary.) 
As before, an administrative answer is a term that cannot take an administrative standard reduction. The confluence, standardization, and commutativity results we need are similar to before: Proposition 43. The relation =ad 9The restriction is necessary because safety is not closed backward, i.e.there are unsafe terms that reduce to safe ones. 152 1. is confluent, so that if M =ad M ′, then there is some N such that M →?ad N and M ′ →?ad N ; and 2. has the standardization property, so that if M →?ad N and N is an administrative answer, then there is an administrative answer M ′ such that M 7→?ad M ′ →?ad N . Proof. 1. Because lift and βad do not overlap and they are both left-linear, we can use modularity (Appel, van Oostrom, & Simonsen, 2010) to prove confluence from the confluence of each rule separately. The lift rule is symmetric and thus trivially confluent. The βad rule is simply the β rule restricted to a subcalculus, so it is also confluent. 2. Since lift and βad trivially commute (they do not even interact), we can perform the lift steps last, giving M →?βad N ′ →?lift N . Since →?lift cannot create a proper standard reduction, N ′ must also be an administrative answer; hence standardization of β-reduction applies, completing the proof.  Proposition 44. 1. If M →?ad M ′ 7→pr N and M is an administrative answer, then there is N ′ with M 7→pr N ′ →?ad N . 2. If M ←?ad M ′ 7→pr N , then there is N ′ with M 7→pr N ′ ← [?ad N . Proof. The lift rule can neither create nor destroy any redexes, and internal reductions still cannot destroy a standard redex, so the proof is essentially the same as that of Proposition 17.  153 Lemma 45. If M =ad 7→pr N , then M 7→+pr1=ad N . Proof. The same as the proof of Lemma 18, this time using Propositions 43 and 44.  For the uniform CPS transform, we characterized the action of translation on contexts as Proposition 19: CaJE[M ]KK will reduce to CJMKK ′, where K ′ is some continuation that represents E. This case will be more complex, however: In λneed , an evaluation context contains both bindings and work to be done. 10 A continuation alone only captures the latter. Therefore, for call-by-need, we will need to translate the bindings as well. Our approach is to split context translation into a function B providing a CPS binding context and a function K providing a continuation (in which the bindings from B are in scope). Putting them together, we will be able to relate the original context transform CaL−M to the split transform. BLM , [] BLEMM , BLEM BLlets x = M inEM , νx. x :=1 λ¯k′. CaJMK(λv. x := λ¯k′′. k′′v in k′v) in BLEM BLleta x = E inF [x]M , νx.BLF M[BLEM] BLletv x = V inEM , νx. x := CaJV K in BLEM 10Danvy and Zerny (Danvy & Zerny, 2013) make a similar observation about call-by-need evaluation contexts; they then derive a call-by-need language that separates the two parts. 154 KLM : K , K KLEMM : K , KLEM : λv. νx. x :=1 λ¯k′. CaJMK(λv. x := λ¯k′′. k′′v in k′v) in v(x,K) KLlets x = M inEM : K , KLEM : K KLleta x = E inF [x]M : K , KLEM : λv. x := λ¯k′′. k′′v in (KLF M : K)v KLletv x = V inEM : K , KLEM : K The following proposition shows the relationship between Ca, B, and K. Note that, like Proposition 19, it demonstrates the action of administrative reductions: In this case, they serve to bring the continuation inward while preserving the bindings in the context. We use T here to denote a value that is, in particular, a thunk of the form λk.M . Proposition 46. CaLEM[T ]K ≡ BLEM[T (KLEM : K)] Proof. By induction on E. 
First, some shorthand will clarify: let` x = V inP , νx. x :=1 λ¯k. V (λv. x := λ¯k. kv in kv) in P Now: 155 – For E ≡ : CaLEM[T ]K ≡ [T ]K , TK , [TK] , BLM[T (KLM : K)] ≡ BLEM[T (KLEM : K)] – For E ≡ E ′M : CaLEM[T ]K ≡ CaLE ′MM[T ]K , (λ¯k. CaLE ′M(λv. let` x = CaJMK in v(x, k)))[T ]K ≡ (λ¯k. CaLE ′[T ]M(λv. let` x = CaJMK in v(x, k)))K 7→ad CaLE ′[T ]M(λv. let` x = CaJMK in v(x,K)) 7→?ad BLE ′M[T (KLE ′M : λv. let` x = CaJMK in v(x,K))] (by I.H.) , BLE ′MM[T (KLE ′MM : K)] ≡ BLEM[T (KLEM : K)] 156 – For E ≡ lets x = M inE ′: CaLEM[T ]K ≡ CaLlets x = M inE ′M[T ]K , (λ¯k. let` x = CaJMK in CaLE ′Mk)[T ]K ≡ (λ¯k. let` x = CaJMK in CaLE ′[T ]Mk)K 7→ad let` x = CaJMK in CaLE ′[T ]MK 7→?ad let` x = CaJMK inBLE ′M[T (KLE ′M : K)k] (by I.H.) , BLlets x = M inE ′M[T (KLlets x = M inE ′M : K)] ≡ BLEM[T (KLEM : K)] – For E ≡ leta x = E ′ inF [x]: CaLEM[T ]K ≡ CaLleta x = E ′ inF [x]M[T ]K ≡ (λ¯k. νx. CaLF M[λ¯k′. CaLE ′M(λv. x := λ¯k′′. k′′v in k′v)]k)[T ]K ≡ (λ¯k. νx. CaLF M[λ¯k′. CaLE ′M[T ](λv. x := λ¯k′′. k′′v in k′v)]k)K 7→ad νx. CaLF M[λ¯k′. CaLE ′M[T ](λv. x := λ¯k′′. k′′v in k′v)]K 7→?ad νx.BLF M[(λ¯k′. CaLE ′M[T ](λv. x := λ¯k′′. k′′v in (KLF M : k′)v))K] (by I.H.) 7→ad νx.BLF M[CaLE ′M[T ](λv. x := λ¯k′′. k′′v in (KLF M : K)v)] 7→?ad νx.BLF M[BLE ′M[T (KLE ′M : λv. x := λ¯k′′. k′′v in (KLF M : K)v))]] (by I.H.) , νx.BLleta x = E ′ inF [x]M[T (KLleta x = E ′ inF [x]M : K)] ≡ BLEM[T (KLEM : K)] 157 – For E ≡ letv x = V inE ′: CaLEM[T ]K ≡ CaLletv x = V inE ′M[T ]K , (λ¯k. νx. x := CaJV K in CaLE ′Mk)[T ]K ≡ (λ¯k. νx. x := CaJV K in CaLE ′[T ]Mk)K 7→ad νx. x := CaJV K in CaLE ′[T ]MK 7→?ad νx. x := CaJV K in BLE ′M[T (KLE ′M : K)] (by I.H.) , BLletv x = V inE ′M[T (KLletv x = V inE ′M : K)] ≡ BLEM[T (KLEM : K)]  Now we can show how a term in a context is transformed: Corollary 47. CaJE[M ]KK 7→?ad BLEM[CaJMK(KLEM : K)]. Proof. From Propositions 42 and 46.  Not all λneed contexts affect the continuation. In particular, binding contexts never alter the continuation at all. This is not surprising, since binding contexts are precisely those that do not affect the flow of control. Proposition 48. For any λaneed binding context B and λ :=1 cps continuation K, KLBM : K ≡ K. Proof. Trivial induction on B. Note that the clauses of K that do nothing but recurse are precisely those for the parts of a binding context.  Corollary 49. For any λaneed binding context B, λ a need term M , and λ :=1 cps continuation K, CaJB[M ]KK 7→?ad BLBM[CaJMKK]. Proof. Immediate from Corollary 47 and Proposition 48.  158 Now we turn to correctness. Our forward simulation follows the same construction as that of the uniform CPS transform: Definition 50. For a λaneed term M and λ :=1 cps term P , let M ∼ P when CaJMK ret =ad P . To prove that ∼ is a forward simulation, we go through the diagrams in Eq. (4.6) once again. The first, third, and fourth are exactly as before: Lemma 51. M ∼ CaJMK ret.  Lemma 52. If M ∼ P and AM , then P ⇓.  Lemma 53. If M ∼ P and sM , then P 6⇓.  The second diagram can be proved using much the same strategy as for Lemma 29, but the more complex source and target languages make the calculations heavier. To review, the steps we take to prove the simulation are: 1. Show that, if M 7→N by a reduction in the empty context, then we have CaJMKK 7→?=ad CaJNKK. 2. Allow the reduction to occur in any evaluation context, not only at the top of the term. 3. Let the CPS term be any P =ad CaJKM . Now we begin with step one immediately: Lemma 54. 
If M 7→ N by a reduction in the empty context, then CaJMKK 7→+pr1=ad CaJNKK. Proof. This time we have four reduction rules, so there are four cases. 159 – For a β-reduction: CaJB[λx.M ]NKK , (λ¯k. (CaJB[λx.M ]K(λv. let` x = CaJNK in v(x, k)))K 7→ad CaJB[λx.M ]K(λv. let` x = CaJNK in v(x,K)) 7→?ad BLBM[CaJλx.MK(λv. let` x = CaJNK in v(x,K))] (by Corollary 49) , BLBM[(λ¯k. k(λ¯(x, k′). CaJMKk′))(λv. let` x = CaJNK in v(x,K))] 7→ad BLBM[(λv. let` x = CaJNK in v(x,K))(λ¯(x, k′). CaJMKk′)] 7→pr BLBM[let` x = CaJNK in(λ¯(x, k′). CaJMKk′)(x,K)] 7→ad BLBM[let` x = CaJNK in CaJMKK] ←[ad BLBM[(λ¯k. let` x = CaJNK in CaJMKk)K] ←[?ad CaLBM[let` x = CaJNK in CaJMK]K (by Corollary 49) ←[ad (λ¯k. CaLBM[let` x = CaJNK in CaJMK]k)K , CaJB[lets x = N inM ]KK 160 – For an act-reduction: CaJlets x = M inE[x]KK , (λ¯k. let` x = CaJMK in CaJE[x]K)K 7→ad let` x = CaJMK in CaJE[x]KK 7→?ad let` x = CaJMK inBLEM[CaJxK(KLEM : K)] (by Corollary 47) , let` x = CaJMK inBLEM[(λ¯k. xk)(KLEM : K)] 7→ad let` x = CaJMK inBLEM[x(KLEM : K)] , νx. x :=1 λ¯k′. CaJMK(λv. x := λ¯k′′. k′′v in k′v) in BLEM[x(KLEM : K)] 7→pr νx.BLEM[(λ¯k′. CaJMK(λv. x := λ¯k′′. k′′v in k′v))(KLEM : K)] ← [?ad (λ¯k. νx. CaLEM[λ¯k′. CaJMK(λv. x := λ¯k′′. k′′v in k′v)])K (by Corollary 47) , CaJleta x = M inE[x]KK – For a deact-reduction: CaJleta x = B[V ] inE[x]KK , (λ¯k. νx. CaLEM[λ¯k′. CaJB[V ]K(λv. x := λ¯k′′. k′′v in k′v)])K 7→ad νx. CaLEM[λ¯k′. CaJB[V ]K(λv. x := λ¯k′′. k′′v in k′v)]K 7→?ad νx.BLEM[(λ¯k′. CaJB[V ]K(λv. x := λ¯k′′. k′′v in k′v))(KLEM : K)] (by Proposition 46) 7→ad νx.BLEM[CaJB[V ]K(λv. x := λ¯k′′. k′′v in (KLEM : K)v)] 7→?ad νx.BLEM[BLBM[CaJV K(λv. x := λ¯k′′. k′′v in (KLEM : K)v)]] (by Corollary 49) 161 Letting λ¯k. kW , CaJV K: , νx.BLEM[BLBM[(λ¯k. kW )(λv. x := λ¯k′′. k′′v in (KLEM : K)v)]] 7→ad νx.BLEM[BLBM[(λv. x := λ¯k′′. k′′v in (KLEM : K)v)W ]] 7→pr νx.BLEM[BLBM[x := λ¯k′′. k′′W in (KLEM : K)W ]] →ad BLBM[νx.BLEM[x := λ¯k′′. k′′W in (KLEM : K)W ]] →ad BLBM[νx. x := λ¯k′′. k′′W in BLEM[(KLEM : K)W ]] ←[ad BLBM[νx. x := λ¯k′′. k′′W in BLEM[(λ¯k. kW )(KLEM : K)]] , BLBM[νx. x := CaJV K in BLEM[CaJV K(KLEM : K)]] ←[?ad BLBM[νx. x := CaJV K in CaJE[V ]KK] (by Corollary 47) ←[?ad CaJB[letv x = V inE[V ]]KK (by Corollary 49) – For a deref -reduction: CaJletv x = V inE[x]KK , λ¯k. (νx. x := CaJV K in CaJE[x]Kk)K 7→ad νx. x := CaJV K in CaJE[x]KK 7→?ad νx. x := CaJV K in BLEM[x(KLEM : K)] (by Corollary 47) 7→pr νx. x := CaJV K in BLEM[CaJV K(KLEM : K)] ←[?ad νx. x := CaJV K in CaJE[V ]KK (by Corollary 47) ←[ad (λ¯k. νx. x := CaJV K in CaJE[V ]Kk)K , CaJletv x = V inE[V ]KK  Now we proceed to the second step, which allows the reduction to take place in a larger context: 162 Proposition 55. If M 7→N , then CaJMKK 7→+pr1=ad CaJNKK. Proof. As before, by definition of 7→, we have M ≡ E[M ′] and N ≡ E[N ′], where M ′ 7→N ′ at top level. CaJMKK ≡ CaJE[M ′]KK 7→?ad BLEM[CaJM ′K(KLEM : K)] (by Corollary 47) 7→+pr1=ad CaLEM[CaJN ′K(KLEM : K)] (by Lemma 54) ←[?ad CaJE[N ′]KK (by Corollary 47) ≡ CaJNKK  Finally, we use Lemma 45 to generalize, completing the third step: Lemma 56. If M 7→N and M ∼ P , then there is Q such that P 7→+ Q and N ∼ Q. Proof. We have M ∼ P , so P =ad CaJKM ret. M 7→N , so CaJMK ret 7→+pr1=ad CaJKN ret by Proposition 55. So P =ad 7→+pr1=ad CaJKN ret. Since 7→+pr1 is short for 7→?ad 7→pr 7→?ad , we have P =ad 7→pr=ad CaJKN ret. Then, by Lemma 45, we have P 7→+Q =ad CaJKN ret for some Q.  And now we have the correctness result for the CPS transform on annotated terms: Lemma 57. For any λaneed -term M : 1. 
M ⇓ iff CaJMK ret ⇓. 2. M ⇑ iff CaJMK ret ⇑. 3. M 6⇓ iff CaJMK ret 6⇓. Proof. By Lemmas 51 to 53 and 56, using determinacy.  163 From λneed to λ :=1 cps From these pieces, we assemble the correctness of the call-by-need CPS transform: Theorem 58. For any λneed -term M : 1. M ⇓ iff CJMK ret ⇓. 2. M ⇑ iff CJMK ret ⇑. 3. M 6⇓ iff CJMK ret 6⇓. Proof. Immediate from Propositions 40 and 41 and Lemma 57.  Naming and pi-encoding Since we have introduced some of the naming mechanism into the “unnamed” CPS language λ:=1cps , the simulation proof for the naming transform needs to be more subtle. The readback function N−1 we defined before undoes all assignments; now that the source CPS calculus is only mostly unnamed, this is too blunt an instrument. Instead, we will use a relation that can selectively eliminate sharing. First, we need to restrict our terms so that we can reason about what variables may be assigned to. Definition 59. A λ:=1cps term or value is localized if no variable appearing on the left of an assignment subterm is bound by a λ.11 For example, the value λ(x, y). x := λk. ky in kx is not allowed, since x is bound by the λ. Since free and ν-bound variables are not subject to 11We borrow the term localized from the pi-calculus literature. The localized pi-calculus is a subcalculus that forbids processes from listening on channels they have received from other processes. As in the pi-calculus, this restriction can be made finer using a type system. 164 substitution by β-reduction, localized terms are closed under reduction. In particular, we can say with certainty which variables in a localized term may ever be assigned to, no matter the context—only those assigned to by subterms of the term. Proposition 60. If K is localized, then CaJMKK is localized. In particular, CaJMK ret is localized. Proof. Easy induction on M .  Now, keeping in mind that λ:=1cps,vn is a subset of λ :=1 cps : Definition 61. The relation ≺ is the restriction to λ:=1cps × λ:=1cps,vn of the reflexive, transitive, and congruent closure of the following rules on localized terms: M{λy+. N/x} ≺ νx. x := λy+. N inM (if x not assigned in M) M{λ1y+. N/x} ≺ νx. x :=1 λy+. N inM (if x is affine and not assigned in M or N) M ≺ νx.M (if x not free in M) This suffices to prove the correctness of the naming transform: Lemma 62. For any λ:=1cps-term M and variable k: 1. M ⇓ iff N JMK ⇓. 2. M ⇑ iff N JMK ⇑. 3. M 6⇓ iff N JMK 6⇓. 165 Proof. As before, we show that M ≺ N JMK and that ≺ is a backward bisimulation that preserves outcomes in the backward direction. The result follows as always from these observations and determinism. Initial Condition That M ≺ N JMK can be found by an easy induction, as ≺ simply undoes the manipulations performed by N . Simulation We need that M ≺ P and P 7→ Q imply that there is N with M 7→?N and N ≺ Q. If P 7→Q, then we have a reduction by β, deref , or deref 1. β-reductions are unchanged by ≺. For deref , we have P ≡ E[f := λ(x+). P ′ in E ′[f(y+)] and Q ≡ E[f := λ(x+)P ′ in E ′[(λ(x+). P ′)(y+)]]. Now consider how M might relate to P : Applications of ≺ inside E, M , or E ′ would not interfere with the deref -reduction (we can apply the rules in Q instead). If f was substituted into the body of P , then we can simply take M = N . Otherwise, we can contract M to find N ; either way, N ≺ Q. The case for deref 1 (i.e.for an ephemeral assignment rather than a permanent one) is similar, only to get N ≺ Q at the end, we need to apply the third rule of ≺ to collect the ν as garbage. 
Outcomes As with Corollary 35, A and s are invariant under ≺, and every deref (or deref 1) creates a standard β-redex, so answers, stuck states, and divergence are preserved.  The augmentation of P for ephemeral assignment is easy to prove correct: Lemma 63. For any λ:=1cps,vn-term M , M ⇓ iff PJMK ⇓ret Proof. To the proof of Lemma 36, we need only add consideration of ephemeral assignment, which corresponds just as strongly as permanent assignment.  166 Finally we have our proof12of the correctness of the call-by-need pi- encoding: Theorem 64. For any λneed -term M , M reduces to an answer iff EJMKret ⇓ret. Proof. Immediate from Theorem 58 and Lemmas 62 and 63, since EJMKk , PJN JCJMKkKK.  Abstract Machine Just as we did with the uniform CPS transform, we can derive an abstract machine from the call-by-need transform. First, we represent ephemeral assignment in store-passing style: A thunk assigned ephemerally should be erased from store when it is accessed. We use the symbol ⊥ to denote such a “missing” value; the store will bind ⊥ to a variable that has been allocated (by a ν) but currently has no value. Using this representation, the functional correspondence gives us the abstract machine in Fig. 41. There are different machine states for examining a term, a thunk, a continuation, or a closure, and a halt state returning the final value and store. The store is a map from locations to thunks, and the environment maps local variables to locations in the store. Notably, up to a few transition compressions, this abstract machine is the same as one derived by Ager, Danvy, and Midtgaard (Ager et al., 2004)13, except that when a suspended computation is retrieved from the environment, it is removed. In this way, it resembles the original call-by-need abstract machine by Sestoft (Sestoft, 1997). Without a letrec form in the source, however, 12This is not the first such proof (Brock & Ostheimer, 1995), though previous proofs did not exploit the connection to CPS. 13Specifically, it resembles the first variant mentioned in section 3 of (Ager et al., 2004). 167 Locations: `, . . . Terms: M,N ::= x ∣∣ λx.M ∣∣MN Thunks: t ::= KSuspM, `, ρ ∣∣ KMemof ∣∣ ⊥ Continuations: k ::= KRet ∣∣ KApplyM, ρ, k ∣∣ KUpdate`, k Closures: f ::= KClosx,M, ρ Stores: σ ::=  ∣∣ σ[` = t] Environments: ρ ::=  ∣∣ ρ[x = `] States: S ::= 〈M,σ, ρ, k〉M ∣∣ 〈t, σ, k〉T∣∣ 〈k, f, σ〉K ∣∣ 〈f, `, σ, k〉F∣∣ 〈f, σ〉H M 7→〈M, , ,KRet〉M 〈x, σ, ρ, k〉M 7→〈t, σ, k〉T where ρ(x) ≡ ` and σ(`) ≡ t 〈λx.M, σ, ρ, k〉M 7→〈k,KClosx,M, ρ, σ〉K 〈MN,σ, ρ, k〉M 7→〈M,σ, ρ,KApplyN, ρ, k〉M 〈KSuspM, `, ρ, σ, k〉T 7→〈M,σ[` = ⊥], ρ,KUpdate`, k〉M 〈KMemof, σ, k〉T 7→〈k, f, σ〉K 〈KRet, f, σ〉K 7→〈f, σ〉H 〈KApplyM, ρ, k, f, σ〉K 7→〈f, `, σ[` = KSuspM, `, ρ], k〉F where ` /∈ σ 〈KUpdate`, k, f, σ〉K 7→〈k, f, σ[` = KMemof ]〉K 〈KClosx,M, ρ, `, σ, k〉F 7→〈M,σ, ρ[x = `], k〉M FIGURE 41. The abstract machine derived from C. 168 this difference in behavior cannot be observed, since the symbol binding a computation cannot appear free in the term being computed. It is also quite similar to one derived recently by Danvy and Zerny (Danvy & Zerny, 2013), which they call the lazy Krivine machine. The mechanisms are superficially different in a few ways. For them, a thunk is simply an unevaluated term, whereas we remember whether the thunk has been evaluated before (and a few bookkeeping details). However, this is merely a different choice for the division of responsibility: We hand control to the thunk, and then the thunk determines whether to set up an update or simply return a value. 
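For concreteness, the machine of Fig. 41 can be transcribed into Haskell as follows. The sketch makes its own representation choices (integer locations, a Data.Map store, association-list environments, and the simplest possible fresh-location scheme) and is not part of the formal development. Forcing a suspended thunk blanks its store cell (the ⊥ of the figure) and installs an update frame; when a value reaches that frame, the cell is overwritten with a memo-thunk, mirroring the ephemeral and permanent assignments of the CPS transform.

import qualified Data.Map as Map

type Loc   = Int
type Env   = [(String, Loc)]
type Store = Map.Map Loc Thunk

data Term  = Var String | Lam String Term | App Term Term
data Clos  = Clos String Term Env                    -- KClos<x, M, ρ>
data Thunk = Susp Term Loc Env | Memo Clos | Blank   -- KSusp / KMemo / ⊥
data Kont  = Ret | Apply Term Env Kont | Update Loc Kont

run :: Term -> Maybe (Clos, Store)
run m0 = eval m0 Map.empty [] Ret
  where
    -- <M, σ, ρ, k>_M : examine a term
    eval (Var x)   sto env k = do
      l <- lookup x env
      t <- Map.lookup l sto
      force t sto k
    eval (Lam x m) sto env k = ret k (Clos x m env) sto
    eval (App m n) sto env k = eval m sto env (Apply n env k)

    -- <t, σ, k>_T : hand control to the thunk
    force (Susp m l env) sto k = eval m (Map.insert l Blank sto) env (Update l k)
    force (Memo f)       sto k = ret k f sto
    force Blank          _   _ = Nothing   -- cannot arise without recursive bindings

    -- <k, f, σ>_K : return a closure to the continuation
    ret Ret             f sto = Just (f, sto)                          -- halt state
    ret (Apply n env k) f sto = enter f l (Map.insert l (Susp n l env) sto) k
      where l = if Map.null sto then 0 else 1 + fst (Map.findMax sto)  -- a fresh location
    ret (Update l k)    f sto = ret k f (Map.insert l (Memo f) sto)

    -- <f, ℓ, σ, k>_F : enter a closure, binding its parameter to the location
    enter (Clos x m env) l sto k = eval m sto ((x, l) : env) k

Running the earlier example (λx. x x)((λy. y)(λz. z)) with this sketch evaluates the shared argument only once: the first occurrence of x forces the suspended thunk, and the second finds a memo-thunk in the store.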
The lazy Krivine machine instead inspects the thunk when it is retrieved: If it is a value, it is returned immediately, and otherwise it is evaluated. Hence our thunks are tagged and theirs are not. The tags are largely an artifact of the connection to the pi-calculus translation—whether an argument has been evaluated is evident from the structure of the process representing it. Another difference is that the lazy Krivine machine lacks an environment, relying entirely on the store, but again this is superficial. 169 CHAPTER V Conclusion As we have seen, there are many tradeoffs in designing an intermediate language. Some are clear-cut, such as how much low-level detail to expose—too little, and the optimizer cannot make important decisions about safety and efficiency; too much, and the IL becomes unworkable to implement or to reason about. Other design points are subtle, such as the use of continuations in a functional IL. Convenient approximations, such as functions as join points, may be hazardous. We hope to have found the “sweet spot” by translating from Sequent Core back to the established Core language. The right extension to an intermediate language can enable singnificant new optimizations, and translation offers a powerful method of deriving the properties of the new extension with minimal effort while making the correctness as clear as possible. Reasoning in one language while implementing another lets us exploit the power of theory without overly conceding on matters of engineering. In particular, we were able to make improvements to a mature compiler. Compared to the baseline of System F, FJ is a rather small change; other transformations are barely affected; the new commuting conversions are valuable in practice; and they make the transformation pipeline more robust. Although we have presented FJ as a lazy language, everything in this paper applies equally to a call-by-value language. All one needs to do is to change the evaluation context, the notion of what is substitutable, and a few typing rules. 170 For example, this expression does not match the casefloat axiom: f case g x y ofA→ e1 B → e2  Nor do we want it to be: the function f is presumed lazy, so the case expression—and hence the call to g, which may be expensive—may never happen. If, however, we decide that f  is a valid evaluation context, then we can reduce by casefloat to: case g x y ofA→ f e1 B → f e2 The work on the pi-calculus in Chapter IV suggests another extension to GHC’s Core language. Since Core doesn’t have primitives expressing the updates that cache call-by-need computation, the simplifier and other main optimization passes don’t have any opportunity to move, combine, or eliminate them. It is my hope that these operations would find synergies with other passes in much the same way the new case-of-case transformation enables list fusion to perform better. More generally, the connection between CPS transforms, the pi-calculus, and graph reduction had been considered only in the call-by-value and call- by-name worlds. We have seen that only a modest extension to the target CPS calculus is required in order to put call-by-need on an equal footing. Hopefully, we can find further uses for the constructive update calculus; for instance, given the closeness to pi-calculus channel operations, it is possible that other pi-calculus encodings can be reinterpreted as CPS transforms as well. 
In another direction, since many type systems for the pi-calculus have been proposed (Sangiorgi & Walker, 2003), it seems worth exploring whether we can use the techniques outlined here to consider a typed CPS transform and a typed encoding into the pi-calculus.

References Cited

Accattoli, B. (2013). Evaluating functions as processes. In TERMGRAPH (pp. 41–55).
Ager, M. S., Danvy, O., & Midtgaard, J. (2004). A functional correspondence between call-by-need evaluators and lazy abstract machines. Information Processing Letters, 90 (5), 223–232.
Ahmed, A. & Blume, M. (2008). Typed closure conversion preserves observational equivalence. In J. Hook & P. Thiemann (Eds.), ICFP 2008 [Proceedings of the 13th ACM SIGPLAN international conference on functional programming], September 20–28, 2008 (pp. 157–168). Victoria, BC, Canada. ACM. doi:10.1145/1411204.1411227
Aho, A. V., Sethi, R., & Ullman, J. D. (1986). Compilers: Principles, Techniques, and Tools. Addison-Wesley.
Amadio, R. (2011). A decompilation of the pi-calculus and its application to termination. (Tech. rep. No. UMR CNRS 7126). Université Paris Diderot.
Appel, A. W. (1992). Compiling with Continuations. Cambridge University Press.
Appel, A. W. (1998a). Modern Compiler Implementation in ML. Cambridge University Press.
Appel, A. W. (1998b). SSA is functional programming. SIGPLAN Notices, 33 (4), 17–20. doi:10.1145/278283.278285
Appel, C., van Oostrom, V., & Simonsen, J. G. (2010). Higher-order (non-)modularity. In RTA (pp. 17–32).
Ariola, Z. M. & Klop, J. W. (1996). Lambda calculus with explicit recursion. Information and Computation, 139.
Ariola, Z. M. & Felleisen, M. (1997). The call-by-need lambda calculus. J. Functional Programming, 7 (3), 265–301.
Ariola, Z. M., Felleisen, M., Maraist, J., Odersky, M., & Wadler, P. (1995). A call-by-need lambda calculus. In R. K. Cytron & P. Lee (Eds.), POPL ’95 [Conference record of POPL ’95: 22nd ACM SIGPLAN-SIGACT symposium on principles of programming languages], January 23–25, 1995 (pp. 233–246). San Francisco, California, USA. ACM Press. doi:10.1145/199448.199507
Baker-Finch, C. A., Glynn, K., & Peyton Jones, S. L. (2004). Constructed product result analysis for Haskell. J. Funct. Program. 14 (2), 211–245. doi:10.1017/S0956796803004751
Barendregt, H. P. (1984). The Lambda Calculus: Its Syntax and Semantics. Studies in Logic and the Foundations of Mathematics. Amsterdam: North Holland.
Bove, A., Dybjer, P., & Norell, U. (2009). A brief overview of Agda: A functional language with dependent types. In S. Berghofer, T. Nipkow, C. Urban, & M. Wenzel (Eds.), TPHOLs 2009 [Theorem proving in higher order logics, 22nd international conference], August 17–20, 2009 (Vol. 5674, pp. 73–78). Lecture Notes in Computer Science. Munich, Germany. Springer. doi:10.1007/978-3-642-03359-9_6
Brady, E. (2013). Idris, a general-purpose dependently typed programming language: design and implementation. J. Funct. Program. 23 (5), 552–593. doi:10.1017/S095679681300018X
Brock, S. & Ostheimer, G. (1995). Process semantics of graph reduction. In CONCUR (pp. 471–485).
Chakravarty, M. M. T., Keller, G., & Peyton Jones, S. (2005). Associated type synonyms. In O. Danvy & B. C. Pierce (Eds.), ICFP 2005 [Proceedings of the 10th ACM SIGPLAN international conference on functional programming], September 26–28, 2005 (pp. 241–253). Tallinn, Estonia. ACM. doi:10.1145/1086365.1086397
Chakravarty, M. M. T., Keller, G., & Zadarnowski, P. (2003). A functional perspective on SSA optimisation algorithms. Electr. Notes Theor.
Comput. Sci. 82 (2), 347–361. doi:10.1016/S1571-0661(05)82596-4
Chang, S. & Felleisen, M. (2012). The call-by-need lambda calculus, revisited. In Programming languages and systems (pp. 128–147). Springer.
Church, A. (1932). A set of postulates for the foundation of logic. Annals of Mathematics, 33 (2), 346–366.
Coutts, D., Leshchinskiy, R., & Stewart, D. (2007, October). Stream fusion: From lists to streams to nothing at all. In Proceedings of the 12th ACM SIGPLAN international conference on functional programming (pp. 315–326). ICFP ’07. Freiburg, Germany: ACM. doi:10.1145/1291151.1291199
Cytron, R., Ferrante, J., Rosen, B. K., Wegman, M. N., & Zadeck, F. K. (1991). Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst. 13 (4), 451–490. doi:10.1145/115372.115320
Danvy, O. & Filinski, A. (1989). A Functional Abstraction of Typed Contexts. (Tech. rep. No. 89/12). DIKU, University of Copenhagen, Copenhagen, Denmark.
Danvy, O. & Filinski, A. (1992). Representing control: A study of the CPS transformation. Mathematical Structures in Computer Science, 2 (4), 361–391.
Danvy, O. & Nielsen, L. R. (2003). A first-order one-pass CPS transformation. Theoretical Computer Science, 308 (1), 239–257.
Danvy, O. & Zerny, I. (2013). A synthetic operational account of call-by-need evaluation. In Proceedings of the 15th symposium on principles and practice of declarative programming (pp. 97–108). ACM.
Davidson, J. W. & Fraser, C. W. (1980). The design and application of a retargetable peephole optimizer. ACM Trans. Program. Lang. Syst. 2 (2), 191–202. doi:10.1145/357094.357098
Downen, P., Maurer, L., Ariola, Z. M., & Peyton Jones, S. (2016). Sequent calculus as a compiler intermediate language. In ICFP 2016 [Proceedings of the 21st ACM SIGPLAN international conference on functional programming], September 18–22, 2016 (pp. 74–88). Nara, Japan. doi:10.1145/2951913.2951931
Downen, P., Maurer, L., Ariola, Z. M., & Varacca, D. (2014). Continuations, processes, and sharing. In O. Chitil, A. King, & O. Danvy (Eds.), Proceedings of the 16th international symposium on principles and practice of declarative programming, Kent, Canterbury, United Kingdom, September 8–10, 2014 (pp. 69–80). ACM. doi:10.1145/2643135.2643155
Felleisen, M. & Friedman, D. (1986). Control operators, the SECD-machine, and the λ-calculus. In Formal descriptions of programming concepts (pp. 193–219).
Flanagan, C., Sabry, A., Duba, B. F., & Felleisen, M. (1993). The essence of compiling with continuations. In R. Cartwright (Ed.), PLDI 1993 [Proceedings of the ACM SIGPLAN ’93 conference on programming language design and implementation], June 23–25, 1993 (pp. 237–247). Albuquerque, New Mexico, USA. ACM. doi:10.1145/155090.155113
Fluet, M. & Weeks, S. (2001). Contification using dominators. In B. C. Pierce (Ed.), ICFP ’01 [Proceedings of the sixth ACM SIGPLAN international conference on functional programming], September 3–5, 2001 (pp. 2–13). Firenze (Florence), Italy. ACM. doi:10.1145/507635.507639
Gentzen, G. (1935). Untersuchungen über das logische Schließen. I. Mathematische Zeitschrift, 39 (1), 176–210. doi:10.1007/BF01201353
Gill, A. & Hutton, G. (2009). The worker/wrapper transformation. J. Funct. Program. 19 (2), 227–251. doi:10.1017/S0956796809007175
Girard, J.-Y., Taylor, P., & Lafont, Y. (1989). Proofs and Types. Cambridge University Press.
Hatcliff, J. & Danvy, O. (1997). Thunks and the λ-calculus. J. Funct. Program. 7 (3), 303–319.
Herbelin, H. (1995).
A λ-calculus structure isomorphic to Gentzen-style sequent calculus structure. In L. Pacholski & J. Tiuryn (Eds.), CSL ’94 [Computer science logic, 8th international workshop, selected papers], September 25–30, 1994 (Vol. 933, pp. 61–75). Lecture Notes in Computer Science. Kazimierz, Poland. Springer. doi:10.1007/BFb0022247
Hinze, R. (2005). Church numerals, twice! J. Funct. Program. 15 (1), 1–13. doi:10.1017/S0956796804005313
Howard, W. A. (1980). The formulae-as-types notion of construction. In J. Seldin & J. Hindley (Eds.), To H. B. Curry: Essays on combinatory logic, lambda calculus and formalism. New York: Academic Press.
Keep, A., Hearn, A., & Dybvig, R. (2012). Optimizing closures in O(0) time. In Annual workshop on Scheme and functional programming. ACM.
Kelsey, R. (1995). A correspondence between continuation passing style and static single assignment form. In M. D. Ernst (Ed.), IR ’95 [Proceedings of the ACM SIGPLAN workshop on intermediate representations], January 22, 1995 (pp. 13–23). San Francisco, CA, USA. ACM. doi:10.1145/202529.202532
Kennedy, A. (2007). Compiling with continuations, continued. In R. Hinze & N. Ramsey (Eds.), ICFP 2007 [Proceedings of the 12th ACM SIGPLAN international conference on functional programming], October 1–3, 2007 (pp. 177–190). Freiburg, Germany. ACM. doi:10.1145/1291151.1291179
Krivine, J.-L. (2007). A call-by-name lambda-calculus machine. Higher-Order and Symbolic Computation, 20 (3), 199–207.
Lattner, C. & Adve, V. S. (2004). LLVM: A compilation framework for lifelong program analysis & transformation. In CGO 2004 [Proceedings of the 2nd IEEE/ACM international symposium on code generation and optimization], March 20–24, 2004 (pp. 75–88). San Jose, CA, USA. IEEE Computer Society. doi:10.1109/CGO.2004.1281665
Leroy, X. (2009). A formally verified compiler back-end. Journal of Automated Reasoning, 43 (4), 363–446.
Lindley, S. (2005). Normalisation by evaluation in the compilation of typed functional programming languages (Doctoral dissertation, University of Edinburgh, College of Science and Engineering, School of Informatics).
Lowry, E. S. & Medlock, C. W. (1969). Object code optimization. Commun. ACM, 12 (1), 13–22. doi:10.1145/362835.362838
Marlow, S., Yakushev, A. R., & Peyton Jones, S. (2007). Faster laziness using dynamic pointer tagging. In R. Hinze & N. Ramsey (Eds.), ICFP 2007 [Proceedings of the 12th ACM SIGPLAN international conference on functional programming], October 1–3, 2007 (pp. 277–288). Freiburg, Germany. ACM. doi:10.1145/1291151.1291194
Maurer, L., Downen, P., Ariola, Z. M., & Peyton Jones, S. (2017). Compiling without continuations. In A. Cohen & M. T. Vechev (Eds.), PLDI 2017 [Proceedings of the 38th ACM SIGPLAN conference on programming language design and implementation], June 18–23, 2017 (pp. 482–494). Barcelona, Spain. ACM. doi:10.1145/3062341.3062380
Meyer, A. & Cosmadakis, S. (1988, July). Semantical Paradigms: Notes for an Invited Lecture. MIT Laboratory for Computer Science. 545 Technology Square, Cambridge, MA 02139.
Milner, R. (1992). Functions as processes. Mathematical Structures in Computer Science, 2 (2), 119–141.
Milner, R., Parrow, J., & Walker, D. (1992). A calculus of mobile processes, I. Information and Computation, 100 (1), 1–40. doi:10.1016/0890-5401(92)90008-4
Milner, R. & Sangiorgi, D. (1992). Barbed bisimulation. In Automata, languages and programming (pp. 685–695). Springer.
Minamide, Y., Morrisett, J. G., & Harper, R. (1996). Typed closure conversion. In H. Boehm & G. L. Steele Jr.
(Eds.), POPL ’96 [Conference record of POPL ’96: the 23rd ACM SIGPLAN-SIGACT symposium on principles of programming languages, papers presented at the symposium], January 21–24, 1996 (pp. 271–283). St. Petersburg Beach, Florida, USA. ACM Press. doi:10.1145/237721.237791
Morrisett, J. G., Walker, D., Crary, K., & Glew, N. (1999). From System F to typed assembly language. ACM Trans. Program. Lang. Syst. 21 (3), 527–568. doi:10.1145/319301.319345
Novillo, D. (2003). Tree SSA: A new optimization infrastructure for GCC. In Proceedings of the 2003 GCC developers’ summit (pp. 181–193).
Okasaki, C., Lee, P., & Tarditi, D. (1994). Call-by-need and continuation-passing style. Lisp and Symbolic Computation, 7 (1), 57–82.
LLVM. (2015). LLVM language reference manual.
Parigot, M. (1992). λμ-calculus: An algorithmic interpretation of classical natural deduction. In A. Voronkov (Ed.), Logic programming and automated reasoning (Vol. 624, pp. 190–201). Lecture Notes in Computer Science. Springer Berlin Heidelberg. doi:10.1007/BFb0013061
Peyton Jones, S. L. (1987). The Implementation of Functional Programming Languages. Prentice-Hall.
Peyton Jones, S. L. & Launchbury, J. (1991). Unboxed values as first class citizens in a non-strict functional language. In Proceedings of the 5th ACM conference on functional programming languages and computer architecture (pp. 636–666). London, UK: Springer-Verlag.
Peyton Jones, S. L. & Santos, A. L. M. (1998). A transformation-based optimiser for Haskell. Sci. Comput. Program. 32 (1–3), 3–47. doi:10.1016/S0167-6423(97)00029-4
Peyton Jones, S. & Meijer, E. (1997). Henk: A typed intermediate language. In TIC 1997, June 8, 1997. Amsterdam, The Netherlands.
Peyton Jones, S., Vytiniotis, D., Weirich, S., & Washburn, G. (2006). Simple unification-based type inference for GADTs. In J. H. Reppy & J. L. Lawall (Eds.), ICFP 2006 [Proceedings of the 11th ACM SIGPLAN international conference on functional programming], September 16–21, 2006 (pp. 50–61). Portland, Oregon, USA. ACM. doi:10.1145/1159803.1159811
Peyton Jones, S., Partain, W., & Santos, A. (1996, May). Let-floating: Moving bindings to give faster programs. In ACM SIGPLAN International Conference on Functional Programming (ICFP ’96). Philadelphia: ACM.
Pierce, B. C. (2002). Types and Programming Languages. MIT Press.
Plotkin, G. D. (1975). Call-by-name, call-by-value and the λ-calculus. Theor. Comput. Sci. 1 (2), 125–159. doi:10.1016/0304-3975(75)90017-1
Plotkin, G. D. (1977). LCF considered as a programming language. Theor. Comput. Sci. 5 (3), 223–255. doi:10.1016/0304-3975(77)90044-5
Pop, S. (2006). The SSA representation framework: Semantics, analyses and GCC implementation (Doctoral dissertation, École Nationale Supérieure des Mines de Paris).
Prosser, R. T. (1959). Applications of boolean matrices to the analysis of flow diagrams. In Papers presented at the December 1–3, 1959, Eastern joint IRE-AIEE-ACM computer conference (pp. 133–138). ACM.
Reppy, J. H. (2002). Optimizing nested loops using local CPS conversion. Higher-Order and Symbolic Computation, 15 (2–3), 161–180. doi:10.1023/A:1020839128338
Reynolds, J. C. (1993). The discoveries of continuations. Lisp and Symbolic Computation, 6 (3–4), 233–248.
Reynolds, J. C. (1998). Definitional interpreters for higher-order programming languages. Higher-Order and Symbolic Computation, 11 (4), 363–397.
doi:10.1023/A:1010027404223
Rosen, B. K., Wegman, M. N., & Zadeck, F. K. (1988). Global value numbers and redundant computations. In J. Ferrante & P. Mager (Eds.), POPL ’88 [Proceedings of the fifteenth annual ACM SIGPLAN-SIGACT symposium on principles of programming languages], January 10–13, 1988 (pp. 12–27). San Diego, California, USA. ACM Press. doi:10.1145/73560.73562
Sabry, A. & Felleisen, M. (1993). Reasoning about programs in continuation-passing style. Lisp and Symbolic Computation, 6 (3–4), 289–360.
Sabry, A. & Wadler, P. (1997). A reflection on call-by-value. ACM Transactions on Programming Languages and Systems (TOPLAS), 19 (6), 916–941.
Sangiorgi, D. (1999, July). From λ to π; or, rediscovering continuations. Mathematical Structures in Computer Science, 9 (4), 367–401.
Sangiorgi, D. & Walker, D. (2003). The Pi-Calculus: A Theory of Mobile Processes. Cambridge Univ. Press.
Santos, A. L. M. (1995). Compilation by Transformation in Non-Strict Functional Languages (Doctoral dissertation, University of Glasgow).
Sestoft, P. (1997). Deriving a lazy abstract machine. Journal of Functional Programming, 7 (3), 231–264.
Steele, Jr., G. L. & Sussman, G. J. (1976, March). Lambda: The Ultimate Imperative. (AI Memo No. AIM-353). Massachusetts Institute of Technology, Artificial Intelligence Laboratory.
Steele, G. L., Jr. (1978, May). Rabbit: A Compiler for Scheme. (Technical Report No. AI-TR-474). Artificial Intelligence Laboratory, MIT. Cambridge, MA.
Sulzmann, M., Chakravarty, M. M. T., Jones, S. L. P., & Donnelly, K. (2007). System F with type equality coercions. In F. Pottier & G. C. Necula (Eds.), TLDI ’07 [Proceedings of the 2007 ACM SIGPLAN international workshop on types in language design and implementation], January 16, 2007 (pp. 53–66). Nice, France. ACM. doi:10.1145/1190315.1190324
Sussman, G. J. & Steele, Jr., G. L. (1975, December). SCHEME: An interpreter for untyped lambda-calculus. (AI Memo No. AIM-349). Massachusetts Institute of Technology, Artificial Intelligence Laboratory.
Svenningsson, J. (2002, September). Shortcut fusion for accumulating parameters & zip-like functions. In ICFP ’02 [Proceedings of the seventh ACM SIGPLAN international conference on functional programming], October 4–6, 2002 (pp. 124–132). Pittsburgh, Pennsylvania, USA. ACM. doi:10.1145/581478.581491
Tolmach, A. P. & Oliva, D. (1998). From ML to Ada: Strongly-typed language interoperability via source translation. J. Funct. Program. 8 (4), 367–412.
Xi, H., Chen, C., & Chen, G. (2003). Guarded recursive datatype constructors. In A. Aiken & G. Morrisett (Eds.), POPL 2003 [Conference record of POPL 2003: the 30th SIGPLAN-SIGACT symposium on principles of programming languages], January 15–17, 2003 (pp. 224–235). New Orleans, Louisiana, USA. ACM. doi:10.1145/640128.604150
Xi, H. & Pfenning, F. (1999). Dependent types in practical programming. In A. W. Appel & A. Aiken (Eds.), POPL ’99 [Proceedings of the 26th ACM SIGPLAN-SIGACT symposium on principles of programming languages], January 20–22, 1999 (pp. 214–227). San Antonio, TX, USA. ACM. doi:10.1145/292540.292560