Programming With 5 Bits of RAM
... and what Turing machines can tell us about programming languages.
I remember, as a kid in the 1980s, working on computers with 64 kilobytes of RAM. Changing out floppy disks was a way of life. Today, it isn’t unusual for a programming language or toolchain to use 100 megabytes or more—modern machines can afford it. We have mostly forgotten what it is like to program in a constrained environment.
We know that it is possible to program within 4 kilobytes of RAM, because people used to do it. An interesting question is: how low can one go, and still have a general-purpose computer? 1 kilobyte? 256 bytes? The answer, if one is willing to put up with inconvenience, is... very low. Of course, we no longer have luxuries such as garbage-collected runtimes... or runtimes of any sort.
We shall, in this analysis, consider computational universality mandatory. Our programming environment—our ultra-low-RAM computer—must still be able to perform any computation, and there must be a feasible way to write the code. Universality, in turn, requires that we be able to accept inputs of all sizes—otherwise, we are only talking about a function over a finite domain, which can trivially be implemented with zero RAM, as a static lookup table. What this means is that our working memory—our context, our state register—will be constrained, but we will have an infinite environment in which to work, that being a necessity of the problem. There is no need to invent new formulations for all this, because one already exists: the standard model of computability, the Turing machine.
A Turing machine is a 1930s robot—it could have been built then with contemporary electromechanical technology—that operates on a linear tape of unspecified length. Since the scope of concern is what is and is not computable, as opposed to what is economical to compute, there is no constraint on how long the tape may be—you can assume there will always be enough. Each time step, however, the machine does only a few basic things:
reads one symbol (e.g., letter or character) from the tape.
consults a single register, which is the only internal state the machine has.
may halt; alternatively, may change its state or write a new symbol to the tape.
if it does not halt, moves one cell left or right.
This is about the simplest machine that could be called a computer; and yet, such machines—not each of them, individually, but the class of them—can perform any computation. Models of computation providing far more features have been proposed, and none of the standard ones alter the set of problems that can be computed.
The set of symbols such a machine recognizes is called its alphabet—call its size a. The number of states the machine recognizes is often called s. Machines of this class are called (s, a) Turing machines; one with five states and the alphabet {A, B, C} would be called a (5, 3) machine. Both of these must be preconfigured; so too must be the single program the machine runs, specified as a lookup table (call it T) whose entries indicate: (a) the machine’s new state; (b) the symbol to be written; and (c) the direction (left or right) the tape head will move before it executes the next step. Additionally, there is a symbol in the alphabet that is designated as blank, and there is a state—by convention, not included among the s—called the halting state that corresponds to the machine being turned off. The input to the program is the sequence of symbols—only a finite number may be nonblank—on the tape when the machine starts; the output is what is on the tape if the computation halts. A non-halting computation is considered to be a failure.
For example, here is a specification for a (4, 2) Turing machine that converts a string of n 1s into a string of n+3 1s; its start state is S and its halt state is called H; the blank symbol is 0. Rules for combinations that will not occur during legal execution are left unspecified.
T[S, 1] = (S, 1, R)
T[S, 0] = (A, 1, R)
T[A, 0] = (B, 1, R)
T[B, 0] = (C, 1, R)
T[C, 0] = (H, -, -)
The machine starts at the left of the input; it scans right until it finds the first blank (0) and uses intermediate states to write three more 1s, then halts. This particular machine never moves left, but most of the interesting ones will go back and forth.
More generally, when a Turing machine is turned on, it executes its program against the contents of the tape, like so:
while (state != Halt) {
    (new_state, new_symb, dir) <- T[state, read()]
    write(new_symb)
    state = new_state
    if (dir == 'L') move_left(); else if (dir == 'R') move_right();
}
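To make the loop concrete, here is a minimal simulator sketch in Python (the names run_tm and RULES are mine, not anything standard) that runs the (4, 2) machine above and confirms that three 1s become six:

# A sketch, not standard library code: rules map (state, symbol read)
# to (new state, symbol to write, direction).
RULES = {
    ('S', '1'): ('S', '1', 'R'),
    ('S', '0'): ('A', '1', 'R'),
    ('A', '0'): ('B', '1', 'R'),
    ('B', '0'): ('C', '1', 'R'),
    ('C', '0'): ('H', '0', 'R'),   # halt; the written symbol and direction no longer matter
}

def run_tm(rules, tape, start='S', halt='H', blank='0', max_steps=10_000):
    cells = dict(enumerate(tape))          # sparse tape: position -> symbol
    head, state = 0, start
    for _ in range(max_steps):
        if state == halt:
            # collect the nonblank portion of the tape
            lo, hi = min(cells), max(cells)
            return ''.join(cells.get(i, blank) for i in range(lo, hi + 1)).strip(blank)
        state, symbol, direction = rules[(state, cells.get(head, blank))]
        cells[head] = symbol
        head += 1 if direction == 'R' else -1
    raise RuntimeError("did not halt within max_steps")

print(run_tm(RULES, '111'))   # '111111': three 1s become six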
This, ultimately, is the only thing a Turing machine does: Read a symbol, Evaluate it in context of a single state register, Print a symbol before updating state and moving, in a Loop, until a halting state is reached. The stored transition table is the program, and it’s read-only memory—it never changes.
Although it is sometimes said that a Turing machine’s tape is infinite, it is more accurate to say that it’s unbounded. Theoreticians do not care much about economic distinctions, such as that between programs that can be executed using no more than 1,000,000 tape cells and those that require at least 1,000,001. The role of Turing machines is not to tell us how things will be computed—real machines are far more complicated—but, rather, to describe what can be.
The unlimited length of the tape is an example of a design principle, parameter deletion. An infinite amount of tape will never actually be used, since a finite number of cells will be non-blank when the machine starts, and only one symbol can be written in each step. We just don’t care to set a fixed bound. Having an unbounded amount of slow storage does not, it turns out, give us computing superpowers. It is a shed full of floppy disks, not a God machine.
One might ask whether the principle of parameter deletion could be applied to the alphabet size and state count. Upward, the answer is no. A machine capable of recognizing an infinite alphabet or an infinite number of internal states would be physically unrealizable. In fact, in the 1930s, to build a Turing machine with more than ten states would have required substantial engineering effort, though an average computer today can easily simulate one with millions. What about parameter deletion downward? The answer is: sort-of. Any computation that can be done with 2^k symbols can be done with 2, as we can emulate the former by treating k-long bit vectors as symbols in a higher-order language, the cost of this being state count and performance. So, it can be shown that all Turing machines can be emulated with 2-symbol ones. It is also the case that we can reduce state count, at a cost of (possibly) alphabet size and, again, performance, by using tape space to store the state register of an emulated larger machine. This notion of emulation leads to the question of whether there exist “universal” programs that can take any program, decode and execute it, and the answer is that, yes, there are. As a result, there are values—surprisingly small ones—of s and a for which (s, a) Turing machines exist that are capable of emulating any such machine. Each of these parameters can be driven as low as 2, but not both at the same time: there is no (2, 2) universal Turing machine.
Still, the fact that a Turing machine—as specified in a known interface language—can be input to another one remains a big deal; even though each machine only runs “one program,” there are machines that can, through a bootstrapping technique, be used to run anything. We call these universal Turing machines, or UTMs. Yurii Rogozhin discovered a number of minimal UTMs in 1996; he found one with 24 states on the 2-symbol binary alphabet {0, 1}; there is one with 5 states and 5 symbols; finally, there is one with 2 states but an 18-symbol alphabet. The last of these is, in many ways, the most interesting one, because when embedding a Turing machine into a game, economic model, or other decision process to prove it undecidable— as was done for Magic: the Gathering in 2019 by Alex Churchill, Stella Biderman, and Austin Herrick—it is usually easier to extend the alphabet than to include more states. This (2, 18) machine also answers the question of how little working memory (or RAM) a computer can have and still be universal—the answer is 1 bit.
Arguably, though, the amount of RAM required is not the only concern when discussing minimality in computation. The (2, 18) machine has 36 transition rules, and if we use 7 bits to store each, it requires 252 bits of read-only memory. Rogozhin’s (4, 6) machine, using naive storage (no compression), demands only 144 bits and is, by this metric, the smallest.
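The arithmetic behind those numbers can be reproduced with a few lines of Python. This is my own accounting: it charges each table entry the bits for a next state, a written symbol, and a direction, and ignores halting subtleties.

from math import ceil, log2

def naive_rom_bits(states, symbols):
    # one table entry per (state, symbol) pair; each entry stores a next
    # state, a symbol to write, and a direction bit
    per_rule = ceil(log2(states)) + ceil(log2(symbols)) + 1
    return states * symbols * per_rule

print(naive_rom_bits(2, 18))   # 252
print(naive_rom_bits(4, 6))    # 144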
The drawback of these ultra-small universal Turing machines is that they are difficult to program. Self-modifying code is commonplace. I would argue that the input language of any UTM is, by definition, a general-purpose language, but these do not feel like programming languages in the way that Python, Haskell, and C do. The truly minimal UTMs each require us to learn specific parochial techniques, and they are not efficient. The m-tag model of computation, even older than the Turing machine, feels like esoteric, graduate-level math (because it is) rather than software engineering. You can program on a machine with only 1 bit of random-access runtime state; most people wouldn’t.
At the other extreme, if an interpreted programming language—whether it is natively interpreted, or an intermediate bytecode for a compiled language—can be interpreted by a real computer with m bits of RAM, then a (2^m, 256) Turing machine can be built, in theory, with identical behavior—that is, it can interpret text files written in that same language. Here, the tape stands in for I/O, one byte at a time—we assume that the program itself is agnostic of where these bits come from and go to; the abstraction of the tape stands in for the real sources and destinations. Thus, and it should not be surprising, given the universality of the Turing machine, we can build interpreting machines for whatever languages we want when we have truly massive state counts, e.g., 2^1,000,000. Of course, no real interpreter would be built that way; it becomes impossible to store the transition table.
As our state counts get smaller, interface languages get more simplistic and cumbersome. So how small can a UTM be and have its input language still feel like a “real” programming language? Obviously, this is a subjective question. It is more accurate to say there is a continuum, a frontier of possibilities that is probably mostly unexplored. The more states and symbols a machine is allowed to recognize, the more opportunities there are to make its interface language user-friendly.
To be clear about the rules of analysis, the machine corresponding to a language must interpret it natively. Even the smallest UTMs can handle a translated version of any program in any language; that doesn’t count. For the same reason, additional input to the program is not allowed—otherwise, a larger Turing machine could be passed in to a small UTM. Thus, when we say that a language L can be interpreted by an (s, a) Turing machine, we mean that the code to be interpreted must be stored in plaintext on the tape, as it would sit in a file on a real machine; however, we do not require that it support the I/O needs of a real system. What would otherwise be additional input may be (and must be) hard-coded in the program—that is, stored within it as constants. We represent output strictly with the tape; therefore, we can restrict our attention to programs that do one round of output in bulk—there is no requirement of interaction with a real user. Output is only required at all for programs that terminate; a successful program that does not terminate (e.g., an event loop) does not fit with the Turing model and is ignored—however, the machine may not terminate successfully for an L program that doesn’t do so (since then it would make a false claim about the program’s behavior in L). Last of all, while it is crucial to specify that no translation be required for tape input—what is written on the tape must be precisely equal to the syntax of the language—we are not so stringent about output, and allow reasonable interpretation expectations.
For example, if language L requires support of character (8-bit) output, as Brainfuck does, we would still accept a 10-symbol machine that returns:
[0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
in lieu of outputting "Hi", for which the former is an unambiguous ASCII representation, because it is using all the language it has. I say “10-symbol” (as opposed to 2) because there are 8 additional symbols, in possible input, that the machine must recognize in plain text: the eight characters that Brainfuck uses.
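To see the correspondence, a throwaway line of Python (purely illustrative) produces exactly that list from the string:

bits = [int(b) for ch in "Hi" for b in format(ord(ch), "08b")]
print(bits)   # [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1]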
Finally, and this probably goes without saying, there is no requirement that the language’s interpreting Turing machine be efficient—only correct.
With that out of the way, below is a (25, 13) Turing machine—thus, it operates with 5 bits of RAM—whose interface language (“Stix”) is probably close to as minimal as one can get, while still keeping the feel of a programming language. Like Brainfuck, it is an esoteric language (or “esolang”) that one would not use for real programming; still, it is close enough to real languages that, with a few hours of practice, one could figure out how to use it.
Let’s discuss how to read the diagram above. There are seven black (or ordinary) symbols in this language: 0 ' b < ( ) #, and the blank symbol b is often represented equivalently as a space. The first six of these also have a red—for accessibility, also underlined—form. This Turing machine interprets a language called Stix—I imagine Styx is taken, and this language isn’t very important, so I’ll keep it Stix and spare someone a hash collision.
Not all of the transition rules are drawn. Instead, each state has been given a sign that corresponds to its default; a minus sign means that if no edge applies, the state is unchanged and no symbol is written (or, equivalently, the symbol that was read is written) and the machine moves to the left; a plus sign means the same, but with movement to the right. These movement-only rules are called trivial rules. When a transition involves movement opposite to the state’s labeled direction, it is noted on the transition arrow. For example, M is default-positive—all of its rules are of the form (M, α) ⟶ (M, α, R), except on the read of #, for which the rule is (M, #) ⟶ (N, b, L), which is written on the diagram as “# ⟶ b | -”; the minus sign indicates leftward movement after that transition. The start state for this machine is called A; the halt state is 0.
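The machine’s full transition table is not reproduced here, but the default-plus-exceptions convention is easy to mirror in software. Here is a sketch in Python; the rule set below is a stand-in built from the one rule quoted above, and the default direction for N is an assumption.

# Default directions per state ('+' is right, '-' is left); N's default is assumed.
DEFAULT_DIR = {'M': 'R', 'N': 'L'}
# Nontrivial rules, keyed by (state, symbol read); only the one quoted above is real.
NONTRIVIAL = {('M', '#'): ('N', 'b', 'L')}

def step(state, symbol):
    if (state, symbol) in NONTRIVIAL:
        return NONTRIVIAL[(state, symbol)]
    # trivial rule: keep the state, re-write the symbol read, move in the default direction
    return (state, symbol, DEFAULT_DIR[state])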
As for Stix, it is a stack-based language with four basic instructions:
0 ("null"): push a zero onto the stack.
' ("succ"): increment the entry on the top of the stack (TOS).
# ("halt"): terminate.
< ("rot"): pop TOS → n; then rotate-left the top n+2 entries; so, ROS[n] becomes the new TOS.
That is, the behavior of < is:
X Y 0 -> Y X
X Y Z 1 -> Y Z X
X Y Z A 2 -> Y Z A X
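In list form, with TOS at the right-hand end, rot might be implemented as follows. This is my sketch, not an official specification:

def rot(stack):
    # '<': pop n, then move the entry n+2 deep to the top (a left rotation)
    n = stack.pop()
    stack.append(stack.pop(-(n + 2)))

s = ['X', 'Y', 'Z', 1]
rot(s)
print(s)   # ['Y', 'Z', 'X'], matching the second example above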
Also, the halt instruction is only legal at the textual end of a standalone program. In this regard, it seems redundant—we could use the first blank symbol as our end-of-code marker—but we will see later why we need it. In truth, the requirement exists not for the benefit of the language, but the implementation—it reduces the machine’s state count.
The language also comes with two grouping symbols—(, pronounced "hwee", and ), pronounced "plunk." The behavior of (α) is:
if TOS is 0, pop it.
otherwise, decrement TOS, execute α, then execute (α).
This gives us looping and conditional execution. The usual behavior is like a for-loop—if the same entry is TOS each time, and if it is never otherwise altered, we get a bounded loop with a descending counter. Of course, the language doesn’t require us to use it in this way. We could increment the counter within the loop, or use a different stack entry, etc.
At program start, the stack is empty. So, if we wish to compute f(7, 2) for some function f with code β, we prepend 0'''''''0'' to β. We push a zero (the only thing we can push) onto the stack, increment it seven times, push another zero, and increment it twice, yielding stack contents [7, 2]. Execution of any instruction with inadequate stack contents is erroneous and has undefined behavior. By convention, the return value of any function or program is whatever is on TOS when it is finished executing. That is, a code segment with behavior:
... X -> ... f(X)
is considered equivalent to the function f; the function '', for example, reifies λx.x+2. The standalone program 0''# is said to return 2, since it leaves 2 on TOS.
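Since the semantics are this small, it may help to see them written out in a familiar notation. Below is a sketch of a Stix evaluator in Python, entirely my own construction and intended only to pin down the behavior described above; the real interpreter, of course, is the 25-state Turing machine.

# A sketch of Stix semantics in Python; the stack holds naturals, TOS at the end.
def parse(code):
    # Turn "0'(')" into a nested list; '(' ... ')' become sub-lists.
    block, stash = [], []
    for ch in code:
        if ch == '(':
            stash.append(block)
            block = []
        elif ch == ')':
            inner, block = block, stash.pop()
            block.append(inner)
        elif ch in "0'<#":
            block.append(ch)
    return block

def eval_block(block, stack):
    for op in block:
        if op == '0':
            stack.append(0)              # null: push a zero
        elif op == "'":
            stack[-1] += 1               # succ: increment TOS
        elif op == '<':
            n = stack.pop()              # rot: bring ROS[n] to the top
            stack.append(stack.pop(-(n + 2)))
        elif op == '#':
            break                        # halt
        else:                            # a parenthesized group
            while True:
                if stack[-1] == 0:
                    stack.pop()
                    break
                stack[-1] -= 1
                eval_block(op, stack)

def run_stix(code, stack=None):
    stack = [] if stack is None else stack
    eval_block(parse(code), stack)
    return stack

print(run_stix("0''#"))           # [2]: the standalone program above
print(run_stix("0'''''''0''"))    # [7, 2]: the f(7, 2) prelude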
Note the versatility of the () construction. It provides us with zero-testing, the decrement operator, and both primitive and general recursion. Primitive recursion—the downward-counting index—is the intended usage, but Stix does not limit us to this. We can loop forever with the function 0'('); we can write unbounded searches, and we can use the stack’s unboundedness for complex recursions, if we wish. Therefore, the language has a code segment for any of Gödel’s μ-recursive functions—it’s Turing complete. Although the Turing machine above is nowhere close to being the smallest UTM, it provably is one. And yet its interface language is a legitimate esolang, no worse than (or, at least, not much worse than) Brainfuck.
It is typical that a concatenative language requires no end-of-program instruction: the empty string is a legal program, computing the identity function, and the first blank means “halt.” I would have liked to design Stix this way. From a design perspective, adding this halt marker # is a wart. I did it because the alternative would have increased my state count, in the context of this Turing machine’s runtime behavior. It should be obvious that the stack must live on the tape—it has infinite potential capacity, and therefore cannot fit in our finite (and tiny) state register. Furthermore, when designing Turing machines, one finds that the bulk of a machine’s steps will be spent on move-only transitions—the trivial rules discussed above—and that the most important locations during runtime are (a) the code pointer—this is why the instructions have red (marked) counterparts—and (b) TOS. The machine is often going back and forth between these two places, and we want to use as few states in doing so as we can. We could store the stack in a number of ways, but the easiest and most state-efficient is to have space-separated representations (more on that, later) of the numbers. One way to mark TOS is not to do so: since the stack entries are set apart by single spaces, two consecutive blanks serve as the TOS marker. However, both seeking TOS and maintaining this property suffer from state-count bloat. Instead, once execution begins, the # that marks the end of a program becomes a TOS marker, and we can seek right to it at a cost of only one state. This is a beautifully horrible—or a horribly beautiful—hack that would be morally indefensible if it didn’t work. It is also self-modifying code, but it is the only instance of it that this particular machine uses. Thus, although it seems out-of-character, as a design decision, to staple an ugly end-of-program requirement onto an otherwise minimalistic language, it reduces the number of states we require, at the small cost of one more symbol in our alphabet.
Let us go through some Stix programs; we will be writing functions rather than standalone programs, so we will ignore the # marker for now.
We often have things on the stack that we no longer need. The program () is called “drop.” Each iteration of the loop decrements TOS, until the loop check fails (TOS zero) and the 0 is deleted. This is not an efficient way to remove something from the stack, but it is all Stix has. We say this program’s behavior is:
... X -> ...
Another useful program, “plus,” is this one:
(0<'0<)
Each time we decrement TOS, we increment ROS[0], which will become TOS upon loop exit, due to the final zero deletion. Its behavior is this:
... X Y -> ... X+Y
A tragically inefficient but important program is this one: 000'<(0'<'0'<'0'<). Stix cannot read a stack entry except by iterated decrementation—that is, by destroying it—but this version helpfully creates two copies and throws away the depleted original. Its behavior is:
... X -> ... X X
So we call it “dup.” It allows us to turn destructive functions into nondestructive ones; that is, ... X -> ... f(X) into ... X -> ... X X -> ... X f(X).
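For what it’s worth, the Python sketch from earlier agrees with all three of these behaviors:

print(run_stix("()", [5, 3]))                 # drop: [5]
print(run_stix("(0<'0<)", [5, 3]))            # plus: [8]
print(run_stix("000'<(0'<'0'<'0'<)", [5]))    # dup:  [5, 5]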
Here is multiplication:
00'<(0'<000'<(0'<'0'<'0'<)0''<(0<'0<)0'<)0<()
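Checked against the same Python sketch, this computes 3 times 4:

print(run_stix("00'<(0'<000'<(0'<'0'<'0'<)0''<(0<'0<)0'<)0<()", [3, 4]))   # [12]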
Division-with-remainder looks like this:
00'<0'<0'(()000'<(0'<'0'<'0'<)0'<000'<(0'<'0'<'0'<)0''<00<(0'<000'<(0'<'0'<)0'0<(()()0'<0'<00)(()()()0'00))0'0<(()()()0<()000)(()0<()0'<'0<0'<0'0))
And here’s primality testing:
0''('000'<(0'<'0'<'0'<)000'<(0'<'0'<'0'<)00'<(0'<000'<(0'<'0'<'0'<)0''<(0<'0<)0'<)0<()0'<000'<(0'<'0'<'0'<)0'<00<(0'<000'<(0'<'0'<)0'0<(()()0'<0'<00)(()()()0'00))0'0<(()()()()()0'000)(()()000'<(0'<'0'<'0'<)0'<000'<(0'<'0'<'0'<)0'<0<00'<0'<0'(()000'<(0'<'0'<'0'<)0'<000'<(0'<'0'<'0'<)0''<00<(0'<000'<(0'<'0'<)0'0<(()()0'<0'<00)(()()()0'00))0'0<(()()()0<()000)(()0<()0'<'0<0'<0'0))0'0<(()()()'00)(()()()()000)0))
Finally, let’s do “Hello, World!” How does Stix do I/O? Well, it doesn’t. I/O is imperative and dys-functional and evil and even Silicon Valley’s most notorious sex pests consider it perverse. Stix doesn’t need I/O. It has the stack.
We also note that finite arrays of natural numbers can be mapped uniquely to natural numbers via prime factorization. For example, the array [2, 0, 1] is represented by 2^3 * 3^2 * 5^0 * 7^1 = 504; we make a convention of prepending the length to ensure uniqueness. We represent strings as arrays of Unicode code points. So, the string “Hello, World!” is represented by 2^13 * 3^72 * 5^101 * ... * 43^33. There are lots of ways we can build up this number, but the most reusable one requires a function, enstring, with code:
0'0<(000'<(0'<'0'<'0'<)0''0<(0<0'(()'000'<(0'<'0'<'0'<)000'<(0'<'0'<'0'<)0''00<(0'<000'<(0'<'0'<)0'0<(()()0'<0'<00)(()()()0'00))0'0<(()()()()000)(()()0''('000'<(0'<'0'<'0'<)000'<(0'<'0'<'0'<)00'<(0'<000'<(0'<'0'<'0'<)0''<(0<'0<)0'<)0<()0'<000'<(0'<'0'<'0'<)0'<00<(0'<000'<(0'<'0'<)0'0<(()()0'<0'<00)(()()()0'00))0'0<(()()()()()0'000)(()()000'<(0'<'0'<'0'<)0'<000'<(0'<'0'<'0'<)0'<0<00'<0'<0'(()000'<(0'<'0'<'0'<)0'<000'<(0'<'0'<'0'<)0''<00<(0'<000'<(0'<'0'<)0'0<(()()0'<0'<00)(()()()0'00))0'0<(()()()0<()000)(()0<()0'<'0<0'<0'0))0'0<(()()()'00)(()()()()000)0))0)0'0<(()()00))0<)0''<0'0<(0'<000'<(0'<'0'<'0'<)0''<00'<(0'<000'<(0'<'0'<'0'<)0''<(0<'0<)0'<)0<()0'<)0<()0'<00'<(0'<000'<(0'<'0'<'0'<)0''<(0<'0<)0'<)0<()0<)
... which has the neat behavior of using the full potential of the stack; TOS tells it how many entries to consume. It behaves like so:
... 0 -> 1
... a 1 -> 2^a
... a b 2 -> 2^a 3^b
... a b c 3 -> 2^a 3^b 5^c
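For comparison, the encoding described above (length prepended, one prime per entry) is a few lines of Python; the helper names are mine.

def primes():
    n = 2
    while True:
        if all(n % p for p in range(2, int(n ** 0.5) + 1)):
            yield n
        n += 1

def encode(values):
    # prepend the length, then raise one prime per entry
    exponents = [len(values)] + list(values)
    result, gen = 1, primes()
    for e in exponents:
        result *= next(gen) ** e
    return result

print(encode([2, 0, 1]))                          # 504
print(encode([ord(c) for c in "Hello, World!"]))  # 2^13 * 3^72 * 5^101 * ... * 43^33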
Of course, a “Hello, World” program needs to populate the stack with the requisite characters before calling it. So we also have to build up a prelude that does so. Putting it all together, we have “Hello, World!”
0'''''''''''''0'''''''''0''''''''00'<(0'<000'<(0'<'0'<'0'<)0''<(0<'0<)0'<)0<()000'<(0'<'0'<'0'<)0'''''0''''''00'<(0'<000'<(0'<'0'<'0'<)0''<(0<'0<)0'<)0<()(0<'0<)(0)000'<(0'<'0'<'0'<)'''''''000'<(0'<'0'<'0'<)000'<(0'<'0'<'0'<)'''0'''''''0''''''00'<(0'<000'<(0'<'0'<'0'<)0''<(0<'0<)0'<)0<()''0''''0''''''''00'<(0'<000'<(0'<'0'<'0'<)0''<(0<'0<)0'<)0<()0'''''''''0'''''''''00'<(0'<000'<(0'<'0'<'0'<)0''<(0<'0<)0'<)0<()0''''''(0<'0<)0''<000'<(0'<'0'<'0'<)0'''<0'''<0'''<0'''<000'<(0'<'0'<'0'<)'''000'<(0'<'0'<'0'<)(0)(0)(0)(0)(0)(0)000'<(0'<'0'<'0'<)0''''''''00<(0'<000'<(0'<'0'<)0'0<(()()0'<0'<00)(()()()0'00))()0''''0''''''''00'<(0'<000'<(0'<'0'<'0'<)0''<(0<'0<)0'<)0<()'0''''''''''''''0'0<(000'<(0'<'0'<'0'<)0''0<(0<0'(()'000'<(0'<'0'<'0'<)000'<(0'<'0'<'0'<)0''00<(0'<000'<(0'<'0'<)0'0<(()()0'<0'<00)(()()()0'00))0'0<(()()()()000)(()()0''('000'<(0'<'0'<'0'<)000'<(0'<'0'<'0'<)00'<(0'<000'<(0'<'0'<'0'<)0''<(0<'0<)0'<)0<()0'<000'<(0'<'0'<'0'<)0'<00<(0'<000'<(0'<'0'<)0'0<(()()0'<0'<00)(()()()0'00))0'0<(()()()()()0'000)(()()000'<(0'<'0'<'0'<)0'<000'<(0'<'0'<'0'<)0'<0<00'<0'<0'(()000'<(0'<'0'<'0'<)0'<000'<(0'<'0'<'0'<)0''<00<(0'<000'<(0'<'0'<)0'0<(()()0'<0'<00)(()()()0'00))0'0<(()()()0<()000)(()0<()0'<'0<0'<0'0))0'0<(()()()'00)(()()()()000)0))0)0'0<(()()00))0<)0''<0'0<(0'<000'<(0'<'0'<'0'<)0''<00'<(0'<000'<(0'<'0'<'0'<)0''<(0<'0<)0'<)0<()0'<)0<()0'<00'<(0'<000'<(0'<'0'<'0'<)0''<(0<'0<)0'<)0<()0<)#
At 1,400 characters, it fits anywhere you might need to store it. You could print it out on an index card. You could probably memorize it. You could type it out in a minute or two. That’s the good news. The bad news is that, while most of the low-level languages we actually use are low-level for the purpose of efficiency, Stix is so low-level that it cannot be efficient. We’ve already seen embarrassing “dup” and “drop” implementations, and our mechanism for addition is iterated incrementation. To run this program, as it has been written, would require at least 10^1,000 times the age of the universe, because of the absurdly slow arithmetic. It would say “Hello” to a world that will probably have long ago said goodbye.
We have not even gotten to the worst (or best) part. Stix is so featureless as to be naturally inefficient, but the implementation offered by this small Turing machine adds even more inefficiency—exponentially more. I mentioned that the contents of the stack, during execution, are space-separated integers. Below, we show an intermediate state:
0'''0''0''''(0<'0<)(0<'0<) 0''' 0''''''#
The program is on the left: 0'''0''0''''(0<'0<)(0<'0<). It computes 3 + 2 + 4; the snapshot above is one where the first addition—the first (0<'0<)—has been executed, leaving a six (0'''''') on TOS, and the second addition has not. This is a “base one” integer representation. Is it valid? Nothing in nature or mathematics says it is not. Is it efficient? Absolutely not. Why do I use such a representation? Again, state count.
Data movement, for this machine, is especially wretched: the code segment for multiplication contains 11 < instructions, several of which are in inner loops. This operation is O(n) not in the bit-size of the number, but in the number itself. Our addition is therefore no better than O(n^2) in, again, the number—not the binary size of it. The machine depicted above has been tested, and it did correctly compute that the 65th prime is 313, but doing so required 2,165,906,081,027 Turing steps—on my computer, about an hour and a half.
The slowness of the machine above has several other sources. Every instruction requires transit from the code to the stack. A two-headed Turing machine—one that can transfer control from a “code” head to a “data” head in one step—would avoid these steps and work much faster. This is a capability that real computers, as opposed to simplified Turing machines, do have—caches allow machines to put focus in multiple places at once. Of course, the base-one representation of numbers, as above, is not intrinsic to the Stix language—we could upgrade to far more efficient base-two representations, at a modest cost in state count.
If we had, say, 128 states to work with instead of 25—and, additionally, no interest in keeping our alphabet smaller than, say, 32 symbols—we would likely want to do that. We would also equip our Turing machine with the necessary states to implement efficient duplication, deletion, arithmetic, and logical operators, because it is straightforward to make such a machine do these things in more sound ways. The machine, once upgraded, would still be slow—we are, after all, using a fast electronic computer to simulate a plodding, mechanical one—but we would have a usable machine that executes a function equivalent to “Hello, World!” above in a few minutes instead of after the heat death of the universe. With 512 states—maybe a lot less—we could probably implement a half-decent Lisp. In the tens or hundreds of thousands, we could probably build a machine that would feel like a genuine computer.
Is the state count required to interpret a programming language a perfect measure of where it lands between the extreme esolangs and the ones we actually use? Probably not. Of course, there is the matter of the minimal machine, for any such language, likely being unknowable—an implementation only gives us an upper bound. Furthermore, in measuring a Turing machine’s complexity, state count is hardly the only thing we care about. The amount of read-only memory, as well, gives an indication of how intricate a physical object would have to be to implement it. This metric is still not perfect, but I consider the complexity of a Turing machine, for this purpose, to be (rounding up) the following, measured in bits:
f(s, a, r, t) = t + as * h(r / as) + r * lg(2as + 1)
where s and a are the sizes of the state set and alphabet, t is the number of states that have at least one trivial rule, r is the number of nontrivial rules, and
h(p) = -(p * lg p + (1-p) * lg(1 - p)); h(0) = h(1) = 1.
This index is more useful, in my view, because it is easy to lower the alphabet size or state count using hackish changes that increase other parameters, but the number of nontrivial rules is much more robust, and seems to better represent the innate complexity of what the machine is doing. The h function above suggests an information-theoretic inspiration, and this would be a correct intuition—the idea is to consider that a physical implementation could be made that stores only the executed rules; we add one bit per state for states that have executed trivial (move-only) rules. There are more sophisticated compressions one could include in such a formula, to exploit other potential regularities; the problem is that doing so risks giving advantages to specific machines—I would rather have a measurement that may be less accurate, information-theoretically, but that is simpler and, therefore, more likely to be fair.
By this metric, the Turing machines from Rogozhin’s 1996 paper weigh in between 124 and 270 bits—his smallest is the (5, 5) machine. This surprised me, because the (4, 6) machine is usually considered to be the smallest one, but the (5, 5) machine wins because it has so many trivial rules. His (2, 18) machine, arguably the most interesting because of its minimal state count, which is why it is invoked whenever a Turing machine needs to be embedded in some other system, comes in at 223 bits. The Stix Turing machine is not quite as trim, but it is on the same order of magnitude, with 58 nontrivial rules and 19 trivial ones, for a score of 782 bits.
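For anyone who wants to check the arithmetic, here is the formula in Python. The rounding is my interpretation (each term rounded up), and reproducing the Rogozhin figures would also require his rule counts, which are not given here.

from math import ceil, log2

def h(p):
    if p in (0, 1):
        return 1                  # the convention stated above
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def f(s, a, r, t):
    return t + ceil(a * s * h(r / (a * s))) + ceil(r * log2(2 * a * s + 1))

print(f(25, 13, 58, 19))   # 782, the Stix machine's score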