Space-Efficient Binary
Lambda Calculus Encoding

Marvin Borner

2023-04-24

This article uses the syntax of the bruijn programming language.

There are many ways of encoding data as lambda calculus expressions. Today I want to find the shortest way of doing this, where the length of an expression is the number of bits of its BLC encoding.

Binary data is basically just a (potentially huge) encoding of a single state. As can be seen in my previous article, Data structures in pure lambda calculus, states may be represented using b+1b+1 abstractions, bb being an arbitrary base. For example, for b=2b=2:

1  [[[1 2]]]
2  [[[1 (0 2)]]]
3  [[[1 (1 2)]]]
4  [[[1 (0 (0 2))]]]

This time we don’t even need the additional abstraction, as the comparability and readability isn’t a requirement for this task. We could also omit the right-associativity but this doesn’t have an effect on the length.

1  [[1]]
2  [[1 0]]
3  [[1 1]]
4  [[1 (0 0)]]

Representing these expressions in BLC yields the following results:

[[1]]        0000110         # (length 7)
[[1 0]]      00000111010     # (length 11)
[[1 1]]      000001110110    # (length 12)
[[1 (0 0)]]  000001110011010 # (length 15)

I wrote a small program to find the average length of the encoding for numbers upto 10610^6 (you can find it here). It turns out to be approximately 87.6987.69 bits.

This result can be improved by switching to a higher base. For ternary, the previous examples look like this:

1  [[[1]]]
2  [[[2]]]
3  [[[1 0]]]
4  [[[1 1]]]

[[[1]]]    000000110      # (length 9)
[[[2]]]    0000001110     # (length 10)
[[[1 0]]]  0000000111010  # (length 13)
[[[1 1]]]  00000001110110 # (length 14)

The average length upto 10710^7 is around 75.5975.59 bits. Running the code for different bases should then find some kind of optimum:

base average
2 102.73
3 75.59
4 68.76
5 67.02
6 67.64
7 69.31
8 71.28
9 74.20
10 77.06

It seems like 5 gives the best results!

This only tells us the optimal base for small numbers, though. As I want to encode very big numbers (namely, binary data), we need to test for that too.

I extended my program (again, here) to calculate the average length of N=5N=5 different random numbers with decimal length of 10510^5 digits (around 330k bits).

base average
2 1494999.2
3 1048144.2
4 913395.8
5 858302.0
6 835602.8
7 828721.8
8 831189.6
9 838149.6
10 849605.4

It seems like 7 is a good base for encoding numbers of such length (the numbers keep increasing after 10).

While this experiment is still quite unscientific and I’m sure that there’s a nifty mathematical way of calculating the optimum using logarithms etc., I believe it’s safe to say that a base between 5 and 8 gives good results for most binary data.

Thanks for reading. Suggest improvements or fixes via email. Support on Ko-fi.