Data Structures in Pure Lambda Calculus

This article explains different strategies of encoding data as pure lambda calculus expressions using abstractions. Due to personal preferences this is done using the syntax and standard library of the bruijn programming language – you might want to read my previous article first if you’re not familiar with it.

Update: More than a year later I’ve had a talk on this topic. You can find the slides here.

States

While probably not a data structure in the traditional sense, encoding different states as expressions in lambda calculus isn’t as trivial as in other languages. I therefore want to give an introduction to several strategies for doing this.

Characteristics

Let’s first talk about some of the characteristics a good encoding of states needs to have. In my opinion, encodings should be

Bits

We start by wanting to encode

\{0,1\}

. This is fairly easy to do by using two abstractions and referencing the respective argument:

0 ⤳ [[0]]
1 ⤳ [[1]]

Keep in mind that the order doesn’t really matter – I do like the symmetry though!

Now let’s verify the characteristics: It obviously seems quite concise as we don’t have any redundant abstractions. The encoding is also easily readable – think about it!

We can use the properties of the encoding to simulate an if condition. Assuming is-true and is-false are some expressions we want to return respectively, this results in the following reduction:

  [[1]] is-true is-false
⤳ [is-true] is-false
⤳ is-true

  [[0]] is-true is-false
⤳ [0] is-false
⤳ is-false

Comparing the two encodings isn’t as obvious though, so let’s formally write down the objective first:

cmp [[0]] [[0]] ⤳ is-true
cmp [[0]] [[1]] ⤳ is-false
cmp [[1]] [[0]] ⤳ is-false
cmp [[1]] [[1]] ⤳ is-true

After inspecting these 4 conditions, we have two cases: If the first argument is [[0]], we basically want to return the inverted second argument. If the second argument is [[1]], we want to return the second argument without inverting it:

# if the first argument is true, return as before
first-true [0 is-true is-false]

# if the first argument is false, return inverted
first-false [0 is-false is-true]

# combine the two functions!
cmp [[1 first-true first-false]]

n

states

We now want to encode

n

different states in a single bruijn expression. My initial idea was to just use

n

abstractions for each state:

1 ⤳ [0]
3 ⤳ [[[0]]]

Unfortunately, this doesn’t fulfill the proposed characteristics at all: It grows linearly (

\mathcal{O}(n)

) and is only readable/comparable in a data-dependent way – you basically need to know

n

before you can fully evaluate the encoding!²

Okay, we clearly need a better approach. How about only using a single abstraction and then referencing its parameter

n

times?

1 ⤳ [0]
3 ⤳ [0 0 0]

Well, let’s see. It still grows linearly, but we can ignore that for now. At least for me, it isn’t instantly clear why such an encoding wouldn’t be suitable for reading. One might think that just applying some kind of increment function to it basically increments a number

n

-times and therefore returns its state – like [0 0 0] inc (+0). However, due to the left-associative precedence, this gets reduced to ((inc inc) inc) (+0) – the incs get overwritten with unapplied versions of themself!

A very similar thing happens with right-associative precedence: [0 (0 0)] inc (+0) ⤳ (inc (inc inc)) (+0). Adding an extra argument fixes this, though:

1 ⤳ [[0 1]]
3 ⤳ [[0 (0 (0 1))]]

  [[0 (0 (0 1))]] (+0) inc
⤳ [0 (0 (0 (+0)))] inc
⤳ inc (inc (inc (+0)))
⤳ (+3)

Comparing states using this encoding is then quite simple: Just compare the numbers!

It would be even better if we were able to compare the encodings directly without first needing to convert them to numbers. It turns out this isn’t actually that easy. The main idea is to calculate some kind of distance between the two state encodings and then checking whether it’s non-empty. As you might have guessed by now, we basically rediscovered the Church-encoding for numbers: The “distance” function is just the sub function, the “non-empty” function is the zero? function! I won’t go into the detailed implementation, you can look it up in std/Number/Unary or countless articles about the Church-encoding.

Logarithmic complexity

One of the main problems of the Church state encoding is its linear growth. This is obviously suboptimal for encoding large states.

How about just adding another abstraction layer? For every “element” in the encoding, we’d now have two different choices. Well, if this doesn’t sound suspiciously similar to binary…

1 ⤳ [[[1 2]]]
2 ⤳ [[[1 (0 2)]]]
3 ⤳ [[[1 (1 2)]]]
4 ⤳ [[[1 (0 (0 2))]]]

Do you see the pattern? Using this encoding, we effectively reduced the space complexity to

\mathcal{O}(\log n)

! This is actually the exact implementation of the binary syntactic sugar in bruijn – we can therefore just use the functions from std/Number/Binary. Using the ternary! function, we could convert the number to ternary and compare the numbers using their respective comparison functions. But even better: This encoding is also suitable for direct comparison without conversion (e.g. eq?, sub or zero?)!

So it seems like we found a perfect state encoding for the given constraints. We could obviously reduce the space complexity even more by adding more abstractions. However, an investigation by Torben Mogensen showed that bigger bases result in much more complicated comparison code (like quadratic to cubic code size increments between bases). The biggest base that bruijn’s standard library currently supports is therefore (balanced) ternary.

Boxes

We now want to construct a box data structure – it should be able to store data (e.g. using the previous state encodings) and read/modify it accordingly. Having a readable full/empty state would also be great.

Such a data structure can be helpful for many problems. For example, a function might return a box that’s either filled or empty – similarly to the Maybe monad in Haskell!

We use a very nifty trick to accomplish this: Let’s define the empty box as [[1]] and the full box as [[0 X]] (where X is the data within the box). We can construct a box using this simple function:

# first argument is the box value
box [[[0 2]]]

Checking whether a box is full or empty can then be done using this state? function:

state? [0 is-empty [is-full]]

  # empty box
  (state? [[1]])
⤳ [0 is-empty [is-full]] [[1]]
⤳ [[1]] is-empty [is-full]
⤳ [is-empty] [is-full]
⤳ is-empty

  # full box
  (state? [[0 X]])
⤳ [0 is-empty [is-full]] [[0 X]]
⤳ [[0 X]] is-empty [is-full]
⤳ [0 X] [is-full]
⤳ [is-full] X
⤳ is-full

Now onto the reading part. We obviously can’t read data from an empty box, so that’s something we should check first. Other than that, a full box can basically be read using [[0 X]] Y [0], where Y represents any other term. This isn’t very elegant though.

Luckily, there’s a way better solution that even returns a default value if the box is empty:

# first argument is the default value
read [[0 1 [0]]]

  # empty box
  (read D [[1]])
⤳ [[0 1 [0]]] D [[1]]
⤳ [0 D [0]] [[1]]
⤳ [[1]] D [0]
⤳ [D] [0]
⤳ D

  # full box
  (read D [[0 X]])
⤳ [[0 1 [0]]] D [[0 X]]
⤳ [0 D [0]] [[0 X]]
⤳ [[0 X]] D [0]
⤳ [0 X] [0]
⤳ [0] X
⤳ X

We’re almost done – the only remaining feature is a write function. As it turns out, that’s almost hilariously simple: We just construct a new box and completely ignore the old one!

# first argument is the box, second argument the new value
write [[box 0]]

  # empty box
  (write [[1]] D)
⤳ [[box 0]] [[1]] D
⤳ [box 0] D
⤳ box D
⤳ [[[0 2]]] D
⤳ [[0 D]]

  # full box
  (write [[0 X]] D)
⤳ [[box 0]] [[0 X]] D
⤳ [box 0] D
⤳ box D
⤳ [[[0 2]]] D
⤳ [[0 D]]

You can find the full implementation in the std/Box or std/Option files of bruijn’s standard library.

Pairs

Another useful data structure is a pair. The typical implementation has some similarities to a full box: We now store two expressions instead of one and need one abstraction less as the extra abstraction of the full box is just for empty/full checking – pairs don’t need that.

A pair with A and B then looks like [0 A B] – easy! Extracting data from it is even easier: We just need to apply either [[1]] or [[0]].

  # extract A using [[1]]
  [0 A B] [[1]]
⤳ [[1]] A B
⤳ [A] B
⤳ A

  # extract B using [[0]]
  [0 A B] [[0]]
⤳ [[0]] A B
⤳ [0] B
⤳ B

We can convert these applications to callable functions by introducing an abstraction and application: fst [0 [[1]]] and snd [0 [[0]]].

There are several other quite elegant operations for pairs in std/Pair – have a look!

n

-tuples

The next step might now be obvious: Extending the pair definition and its selectors to support

n

arguments instead of only two!

Defining a triple data structure for A, B and C results in [0 A B C]. The selectors are then simply [[[0]]], [[[1]]] and [[[2]]]! Similarly, the selectors for a quadruple are [[[[0]]]], [[[[1]]]], [[[[2]]]] and [[[[3]]]].

This has the evident drawback of

\mathcal{O}(n)

-space selectors, so let’s look at a more general way of combining and accessing elements.

Lists

We can construct linked lists using a well-known principle, namely chaining multiple pairs together to get an arbitrary-long combination of elements – this might remind you of some Lisp-like languages.

Constructing a list of A, B and C is then as simple as [0 A [0 B [0 C [0 [[0]]]]]], where [[0]] at the end represents the end of the list. We can use the selectors of pairs to get the head or tail of the list:

  # returns the head of the list using [[1]]
  [0 A [0 B [0 [[0]]]]] [[1]]
⤳ [[1]] A [0 B [0 [[0]]]]
⤳ [A] [0 B [0 [[0]]]]
⤳ A

  # returns the tail of the list using [[0]]
  [0 A [0 B [0 [[0]]]]] [[0]]
⤳ [[0]] A [0 B [0 [[0]]]]
⤳ [0] [0 B [0 [[0]]]]
⤳ [0 B [0 [[0]]]]

Adding an element to a list is done by simply wrapping the old list with a new one. For example, adding A to an empty list: [0 [[0]]] becomes [0 A [0 [[0]]]].

Getting the

n

th element of a list can be accomplished by using repeated application of [[0]] until all unwanted elements are dropped. This list format is perfect for stack-like operations.

In bruijn, lists are one of the most powerful implementations. This is mainly because this data structure is actually great for lazy evaluation (e.g. infinite lists!) – using std/List can therefore be very elegant and efficient.

Tagged Union

If you need to store several values with different “types”, you probably want to use a tagged union (also known as a sum type).

Implementing tagged unions in lambda calculus requires you to know the number

n

of different types of data you want to store. You can then construct the union as follows (the Scott encoding):

# meta notation for encoding type x as union of n types
construct [ⁿ⁺¹ x n]

# example for three types "ternary", "binary", "unary":
construct-ternary [[[[0 3]]]]
construct-binary [[[[1 3]]]]
construct-unary [[[[2 3]]]]

:test (construct-ternary (+3t)) ([[[0 (+3t)]]])
:test (construct-binary (+3b)) ([[[1 (+3b)]]])
:test (construct-unary (+3u)) ([[[2 (+3u)]]])

# meta notation to return true if of type n
type? [0 [false] ... [true] ... [false]]

# example for three types "ternary", "binary", "unary":
ternary? [0 [false] [false] [true]]
binary? [0 [false] [true] [false]]
unary? [0 [true] [false] [false]]

:test (ternary? (construct-ternary (+3t))) (true)
:test (unary? (construct-ternary (+3t))) (false)

# meta notation for n types
extract [0 [0] ... [0]]

# example for three types "ternary", "binary", "unary":
extract-nary [0 [0] [0] [0]]

:test (extract-nary (construct-ternary (+3t))) ((+3t))
:test (extract-nary (construct-binary (+3b))) ((+3b))
:test (extract-nary (construct-unary (+3u))) ((+3u))

The encoding can easily be extended to support multiple values per type, for example:

# example for three types "two ternary", "binary", "unary"
construct-two-ternary [[[[[0 4 3]]]]]
construct-binary [[[[1 3]]]]
construct-unary [[[[2 3]]]]

:test (construct-ternary (+3t) (+4t)) ([[[0 (+3t) (+4t)]]])
:test (construct-binary (+3b)) ([[[1 (+3b)]]])
:test (construct-unary (+3u)) ([[[2 (+3u)]]])

ternary? [0 [false] [false] [[true]]]
binary? [0 [false] [true] [[false]]]
unary? [0 [true] [false] [[false]]]

:test (ternary? (construct-two-ternary (+3t) (+4t))) (true)
:test (unary? (construct-two-ternary (+3t) (+4t))) (false)

# extracting different values of two-ternary
extract-nary1 [0 [0] [0] [[0]]]
extract-nary2 [0 [0] [0] [[1]]]

:test (extract-nary1 (construct-ternary (+3t) (+4t))) ((+4t))
:test (extract-nary2 (construct-ternary (+3t) (+4t))) ((+3t))
:test (extract-nary1 (construct-binary (+3b))) ((+3b))
:test (extract-nary1 (construct-unary (+3u))) ((+3u))

Trees

Let’s finally look at a more advanced data type: Trees. Their idea is essential for fundamental concepts like searches, sorting algorithms and sets.

A straightforward implementation can be achieved using rose trees. In a nutshell, the root of the tree is a list, where each element is in turn another list. Doing this recursively can result in all kinds of tree variations and can be quite powerful. You can find such an implementation in std/Tree/Rose.

Especially searching, inserting and removing elements from rose trees can potentially result in huge time complexities though (depending on the specifics of the implementation and usage, obviously). Instead, researchers came up with a more efficient method: Balanced binary trees. In combination with only two outgoing branches per node, they also automatically rebalance themself on inserts for optimal search and insert performance.

As a proof of concept, I wrote an AVL tree implementation in bruijn. While I certainly wouldn’t call it efficient, it once again proves how powerful lambda calculus really is. You can find the implementation in std/Tree/Balanced.

Thanks for reading. This article might also be interesting. Suggest improvements or fixes via email. Support on Ko-fi.