FORTH? Really!?
Imagine you have to generate the word that succeeds this colon: ___
What would you put in that blank space?
It’s easier when the question comes first.
But what if we structured things so that the blank had to be generated before its constituent parts? LLMs are wonderful, but I see too many people break problems down recursively, top-down, the way humans do. Instead, I posit that FORTH and other concatenative/associative languages may be a better fit for transformer architectures. Concatenate, not integrate. Agree on the stack state.
I set out to test whether this could be true.
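To make "agree on the stack state" concrete, here is a minimal Python sketch (my own illustration, Python only because it is easy to run; the point is the stack discipline, not the language) of a postfix evaluator: every new token is interpreted against the current stack alone, so extending the program is pure concatenation and nothing already emitted ever needs revisiting.

```python
# Minimal postfix evaluator: each token is interpreted against the current
# stack, so extending the program is just appending tokens.
def eval_postfix(tokens):
    stack = []
    ops = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
    for tok in tokens:
        if tok in ops:
            b, a = stack.pop(), stack.pop()
            stack.append(ops[tok](a, b))
        else:
            stack.append(int(tok))
    return stack

# "3 4 + 2 *" computes (3 + 4) * 2; appending more tokens later never
# invalidates what has already been evaluated.
print(eval_postfix("3 4 + 2 *".split()))  # [14]
```

Contrast that with prefix or infix forms, where a token emitted early constrains structure that has not been generated yet.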
Sideways Passing Join.
Imagine you had this program:
```
A SCAN [foo > 5] FILTER
B SCAN [foo < 5] FILTER
BUILD
PROBE
```
This performs a natural join on A and B’s shared identifiers.
Because of the associativity of these languages you can always make local edits. For example, a sed-like transformation could replace any BUILD PROBE sequence, wherever it appears, with the following to perform a sideways-information-passing join:
```
DUP STATS SWAP BUILD
[PUSHDOWN] DIP PROBE
```
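To show how local that edit is, here is a hypothetical peephole pass in Python over the token stream; the operator names simply mirror the snippets above and are not from any real query engine.

```python
# Hypothetical peephole rewrite over the token stream: splice the
# sideways-information-passing sequence in wherever BUILD PROBE appears.
SIP_JOIN = ["DUP", "STATS", "SWAP", "BUILD", "[PUSHDOWN]", "DIP", "PROBE"]

def rewrite_sip(tokens):
    out, i = [], 0
    while i < len(tokens):
        if tokens[i:i + 2] == ["BUILD", "PROBE"]:
            out.extend(SIP_JOIN)  # local splice; everything else is untouched
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

program = "A SCAN [foo > 5] FILTER B SCAN [foo < 5] FILTER BUILD PROBE".split()
print(" ".join(rewrite_sip(program)))
```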
This same associative property lets us divide a program into "what has been generated in-context" and "what remains to be generated." We shuffle tokens across one at a time, extending the context and consuming what remains to be generated.
I have a hunch that transformations of finite automata over subsequences of the text can be used to write optimization passes for the database layer.
A phrase from Manfred von Thun goes, “syntactic concatenation is semantic composition.”
A Benchmark
I set out to benchmark what models can do in this regard. Does the order of terms matter to an attention-based transformer? The experiment is simple: I ask the model to construct a tree over numbers and measure whether the tree conforms to the instructions. I used parity: each interior node records whether the sum of the numbers beneath it is even or odd. Prefix notation therefore has to commit to the overall answer before it generates the sub-answers; postfix notation works bottom-up, producing sub-answers before it answers anything further.
If you think about how you answer, “What is the next token,” you’ll see where I’m going.
Setup
Given: A sequence of numbers. Construct: A prefix or postfix parity tree.
What is a parity tree? An unbalanced, left- or right-skewed binary tree whose leaves are numbers and whose interior nodes record the parity of the sum of their transitive children.
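To make the two target formats concrete, here is a small Python sketch of how such trees and their serializations could be built; the right skew, the "even"/"odd" labels, and the exact token layout are my assumptions about the setup, not the benchmark's actual harness.

```python
# Right-skewed parity tree over a sequence of numbers: each interior node
# carries the parity ("even"/"odd") of the sum of all leaves beneath it.
def parity_tree(nums):
    if len(nums) == 1:
        return nums[0]
    label = "even" if sum(nums) % 2 == 0 else "odd"
    return (label, nums[0], parity_tree(nums[1:]))  # leaf on the left, subtree on the right

def prefix(node):
    if isinstance(node, int):
        return [str(node)]
    label, left, right = node
    return [label] + prefix(left) + prefix(right)    # parity first, then the evidence

def postfix(node):
    if isinstance(node, int):
        return [str(node)]
    label, left, right = node
    return postfix(left) + postfix(right) + [label]  # evidence first, then the parity

tree = parity_tree([3, 1, 4, 1])
print(" ".join(prefix(tree)))   # odd 3 even 1 odd 4 1
print(" ".join(postfix(tree)))  # 3 1 4 1 odd even odd
```

In the prefix string the root parity appears before any number has been written down; in the postfix string it appears last, after all of its evidence.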
Results
I ran four configurations, one per row below, across Opus and Haiku with thinking on and off (Sonnet gave results I need to understand better before I publish them). Thinking consistently outperforms non-thinking, Opus consistently outperforms Haiku, and postfix consistently outperforms prefix.
| Model | Thinking | Postfix Acc | Prefix Acc | Both Correct | Postfix Only | Prefix Only | Both Wrong |
|---|---|---|---|---|---|---|---|
| Haiku | Yes | 88.3% | 36.7% | 110 | 155 | 0 | 35 |
| Haiku | No | 6.7% | 4.3% | 9 | 11 | 4 | 276 |
| Opus | Yes | 98.3% | 81.3% | 243 | 52 | 1 | 4 |
| Opus | No | 50.0% | 9.7% | 28 | 122 | 1 | 149 |
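Reading the table: the count columns sum to 300 per row, so the accuracy columns appear to be derived as (Both Correct + Postfix Only) / 300 and (Both Correct + Prefix Only) / 300. For Haiku with thinking that works out to (110 + 155) / 300 ≈ 88.3% and (110 + 0) / 300 ≈ 36.7%, matching the table.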
All makes sense in the world.