Space-Efficient Binary
Lambda Calculus Encoding

Marvin Borner

2023-04-24

This article uses the syntax of the bruijn programming language.

There are many ways of encoding data as lambda calculus expressions. Today I want to find the shortest way of doing this, where the length of an expression is the number of bits of its BLC encoding.

Binary data is basically just a (potentially huge) encoding of a single state. As can be seen in my previous article, Data structures in pure lambda calculus, states may be represented using b+1b+1 abstractions, bb being an arbitrary base. For example, for b=2b=2:

1  [[[1 2]]]
2  [[[1 (0 2)]]]
3  [[[1 (1 2)]]]
4  [[[1 (0 (0 2))]]]

This time we don’t even need the additional abstraction, as the comparability and readability isn’t a requirement for this task. We could also omit the right-associativity but this doesn’t have an effect on the length.

1  [[1]]
2  [[1 0]]
3  [[1 1]]
4  [[1 (0 0)]]

Representing these expressions in BLC yields the following results:

[[1]]        0000110         # (length 7)
[[1 0]]      00000111010     # (length 11)
[[1 1]]      000001110110    # (length 12)
[[1 (0 0)]]  000001110011010 # (length 15)

I wrote a small program to find the average length of the encoding for numbers upto 10610^6 (you can find it here). It turns out to be approximately 87.6987.69 bits.

This result can be improved by switching to a higher base. For ternary, the previous examples look like this:

1  [[[1]]]
2  [[[2]]]
3  [[[1 0]]]
4  [[[1 1]]]

[[[1]]]    000000110      # (length 9)
[[[2]]]    0000001110     # (length 10)
[[[1 0]]]  0000000111010  # (length 13)
[[[1 1]]]  00000001110110 # (length 14)

The average length upto 10710^7 is around 75.5975.59 bits. Running the code for different bases should then find some kind of optimum:

base average
2 102.73
3 75.59
4 68.76
5 67.02
6 67.64
7 69.31
8 71.28
9 74.20
10 77.06

It seems like 5 gives the best results!

This only tells us the optimal base for small numbers, though. As I want to encode very big numbers (namely, binary data), we need to test for that too.

I extended my program (again, here) to calculate the average length of N=5N=5 different random numbers with decimal length of 10510^5 digits (around 330k bits).

base average
2 1494999.2
3 1048144.2
4 913395.8
5 858302.0
6 835602.8
7 828721.8
8 831189.6
9 838149.6
10 849605.4

It seems like 7 is a good base for encoding numbers of such length (the numbers keep increasing after 10).

While this experiment is still quite unscientific and I’m sure that there’s a nifty mathematical way of calculating the optimum using logarithms etc., I believe it’s safe to say that a base between 5 and 8 gives good results for most binary data.

Thanks for reading. Suggest improvements or fixes via email. Support on Ko-fi.

动态网自由门 天安門 天安门 法輪功 李洪志 Free Tibet 六四天安門事件 The Tiananmen Square protests of 1989 天安門大屠殺 The Tiananmen Square Massacre 反右派鬥爭 The Anti-Rightist Struggle 大躍進政策 The Great Leap Forward 文化大革命 The Great Proletarian Cultural Revolution 人權 Human Rights 民運 Democratization 自由 Freedom 獨立 Independence 多黨制 Multi-party system 台灣 臺灣 Taiwan Formosa 中華民國 Republic of China 西藏 土伯特 唐古特 Tibet 達賴喇嘛 Dalai Lama 法輪功 Falun Dafa 新疆維吾爾自治區 The Xinjiang Uyghur Autonomous Region 諾貝爾和平獎 Nobel Peace Prize 劉暁波 Liu Xiaobo 民主 言論 思想 反共 反革命 抗議 運動 騷亂 暴亂 騷擾 擾亂 抗暴 平反 維權 示威游行 李洪志 法輪大法 大法弟子 強制斷種 強制堕胎 民族淨化 人體實驗 肅清 胡耀邦 趙紫陽 魏京生 王丹 還政於民 和平演變 激流中國 北京之春 大紀元時報 九評論共産黨 獨裁 專制 壓制 統一 監視 鎮壓 迫害 侵略 掠奪 破壞 拷問 屠殺 活摘器官 誘拐 買賣人口 遊進 走私 毒品 賣淫 春畫 賭博 六合彩 天安門 天安门 法輪功 李洪志 Winnie the Pooh 劉曉波动态网自由门


Imprint · AI Statement · RSS