Space-Efficient Binary Lambda Calculus Encoding

There are many ways of encoding data as lambda calculus expressions. Today I want to find the shortest way of doing this, where the length of an expression is the number of bits of its BLC encoding.

Binary data is basically just a (potentially huge) encoding of a single state. As can be seen in my previous article, Data structures in pure lambda calculus, states may be represented using

b+1

abstractions,

b

being an arbitrary base. For example, for

b=2

1 ⤳ [[[1 2]]]
2 ⤳ [[[1 (0 2)]]]
3 ⤳ [[[1 (1 2)]]]
4 ⤳ [[[1 (0 (0 2))]]]

This time we don’t even need the additional abstraction, as the comparability and readability isn’t a requirement for this task. We could also omit the right-associativity but this doesn’t have an effect on the length.

1 ⤳ [[1]]
2 ⤳ [[1 0]]
3 ⤳ [[1 1]]
4 ⤳ [[1 (0 0)]]

[[1]]       ≡ 0000110         # (length 7)
[[1 0]]     ≡ 00000111010     # (length 11)
[[1 1]]     ≡ 000001110110    # (length 12)
[[1 (0 0)]] ≡ 000001110011010 # (length 15)

I wrote a small program to find the average length of the encoding for numbers upto

10^6

(you can find it here). It turns out to be approximately

87.69

bits.

This result can be improved by switching to a higher base. For ternary, the previous examples look like this:

1 ⤳ [[[1]]]
2 ⤳ [[[2]]]
3 ⤳ [[[1 0]]]
4 ⤳ [[[1 1]]]

[[[1]]]   ≡ 000000110      # (length 9)
[[[2]]]   ≡ 0000001110     # (length 10)
[[[1 0]]] ≡ 0000000111010  # (length 13)
[[[1 1]]] ≡ 00000001110110 # (length 14)

The average length upto

10^7

is around

75.59

bits. Running the code for different bases should then find some kind of optimum:

base	average
2	102.73
3	75.59
4	68.76
5	67.02
6	67.64
7	69.31
8	71.28
9	74.20
10	77.06

This only tells us the optimal base for small numbers, though. As I want to encode very big numbers (namely, binary data), we need to test for that too.

I extended my program (again, here) to calculate the average length of

N=5

different random numbers with decimal length of

10^5

digits (around 330k bits).

It seems like 7 is a good base for encoding numbers of such length (the numbers keep increasing after 10).

base	average
2	1494999.2
3	1048144.2
4	913395.8
5	858302.0
6	835602.8
7	828721.8
8	831189.6
9	838149.6
10	849605.4

While this experiment is still quite unscientific and I’m sure that there’s a nifty mathematical way of calculating the optimum using logarithms etc., I believe it’s safe to say that a base between 5 and 8 gives good results for most binary data.