Lambda Calculus Encoding

2023-04-24

This article uses the syntax of the bruijn programming language.

There are many ways of encoding data as lambda calculus expressions.
Today I want to find the *shortest* way of doing this, where the
length of an expression is the number of bits of its BLC encoding.

Binary data is basically just a (potentially huge) encoding of a single state. As can be seen in my previous article, Data structures in pure lambda calculus, states may be represented using $b+1$ abstractions, $b$ being an arbitrary base. For example, for $b=2$:

```
1 ⤳ [[[1 2]]]
2 ⤳ [[[1 (0 2)]]]
3 ⤳ [[[1 (1 2)]]]
4 ⤳ [[[1 (0 (0 2))]]]
```

This time we don’t even need the additional abstraction, as the comparability and readability isn’t a requirement for this task. We could also omit the right-associativity but this doesn’t have an effect on the length.

```
1 ⤳ [[1]]
2 ⤳ [[1 0]]
3 ⤳ [[1 1]]
4 ⤳ [[1 (0 0)]]
```

Representing these expressions in BLC yields the following results:

```
[[1]] ≡ 0000110 # (length 7)
[[1 0]] ≡ 00000111010 # (length 11)
[[1 1]] ≡ 000001110110 # (length 12)
[[1 (0 0)]] ≡ 000001110011010 # (length 15)
```

I wrote a small program to find the average length of the encoding for numbers upto $10^6$ (you can find it here). It turns out to be approximately $87.69$ bits.

This result can be improved by switching to a higher base. For ternary, the previous examples look like this:

```
1 ⤳ [[[1]]]
2 ⤳ [[[2]]]
3 ⤳ [[[1 0]]]
4 ⤳ [[[1 1]]]
[[[1]]] ≡ 000000110 # (length 9)
[[[2]]] ≡ 0000001110 # (length 10)
[[[1 0]]] ≡ 0000000111010 # (length 13)
[[[1 1]]] ≡ 00000001110110 # (length 14)
```

The average length upto $10^7$ is around $75.59$ bits. Running the code for different bases should then find some kind of optimum:

base | average |
---|---|

2 | 102.73 |

3 | 75.59 |

4 | 68.76 |

5 | 67.02 |

6 | 67.64 |

7 | 69.31 |

8 | 71.28 |

9 | 74.20 |

10 | 77.06 |

It seems like 5 gives the best results!

This only tells us the optimal base for *small* numbers,
though. As I want to encode *very big* numbers (namely, binary
data), we need to test for that too.

I extended my program (again, here) to calculate the average length of $N=5$ different random numbers with decimal length of $10^5$ digits (around 330k bits).

base | average |
---|---|

2 | 1494999.2 |

3 | 1048144.2 |

4 | 913395.8 |

5 | 858302.0 |

6 | 835602.8 |

7 | 828721.8 |

8 | 831189.6 |

9 | 838149.6 |

10 | 849605.4 |

It seems like 7 is a good base for encoding numbers of such length (the numbers keep increasing after 10).

While this experiment is still quite unscientific and I’m
*sure* that there’s a nifty mathematical way of calculating the
optimum using logarithms etc., I believe it’s safe to say that a base
between 5 and 8 gives good results for most binary data.

$❦$

Thanks for reading. Suggest improvements or fixes via email. Support on Ko-fi.