Monday, December 22, 2025

How complex is human language?

This question has been asked many times. The whole idea of a compiler came not from studying program composition, but from trying to reverse-engineer a language.

Mathematically, a language is defined as a set of strings over an alphabet of symbols. A "grammar" describes which strings belong to that set, and an automaton can accept (validate) it: given a string, the automaton decides whether it is valid for that language or not. We have discovered classes of languages that map onto a hierarchy of computational mechanisms, from simplest to most complex:

- Regular languages, accepted by finite state machines
- Context-free languages, accepted by pushdown automata (a state machine plus a stack)
- Context-sensitive languages, accepted by linear bounded automata
- Recursively enumerable languages, accepted by Turing machines
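To get a feel for the bottom two rungs of that hierarchy, here is a minimal Python sketch (purely illustrative, not from any library or from the original post) contrasting a regular language, which a plain finite state machine can accept, with a context-free one, which needs unbounded memory such as a stack:

# Regular language: strings matching (ab)*; a fixed, finite set of states is enough.
def accepts_ab_star(s):
    state = 0  # 0 = expecting 'a', 1 = expecting 'b'
    for ch in s:
        if state == 0 and ch == 'a':
            state = 1
        elif state == 1 and ch == 'b':
            state = 0
        else:
            return False
    return state == 0

# Context-free language: a^n b^n; no finite state machine can count arbitrarily high,
# so we keep an explicit stack (which here degenerates into a counter).
def accepts_anbn(s):
    stack = []
    i = 0
    while i < len(s) and s[i] == 'a':
        stack.append('a')
        i += 1
    while i < len(s) and s[i] == 'b':
        if not stack:
            return False
        stack.pop()
        i += 1
    return i == len(s) and not stack

accepts_ab_star("abab") and accepts_anbn("aaabbb") both return True, while accepts_anbn("aab") does not; the only structural difference between the two recognizers is the extra memory.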

A Turing machine is a mathematical model for what we call modern computers. It's slightly more abstract and allows for infinite memory, but thinking of it as a computer is a good approximation for readers of this blog. So where does human language fall? Back in the 1950s-1980s the hope was that English is a context-free language. From Wikipedia on context-free languages:

Chomsky initially hoped to overcome the limitations of context-free grammars by adding transformation rules.

Such rules are another standard device in traditional linguistics; e.g. passivization in English. Much of generative grammar has been devoted to finding ways of refining the descriptive mechanisms of phrase-structure grammar and transformation rules such that exactly the kinds of things can be expressed that natural language actually allows. Allowing arbitrary transformations does not meet that goal: they are much too powerful, being Turing complete unless significant restrictions are added (e.g. no transformations that introduce and then rewrite symbols in a context-free fashion).

So what does this tell us? You need a "computer" to "run" human language. Moreover, it means you need memory (hidden state) to reason over a language. This model also tells us that a meaningful collection of words (like a sentence, a paragraph, or this blog post) is a trace through a program execution.
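To make the "sentence as an execution trace" framing concrete, here is a toy sketch (hypothetical grammar, made-up vocabulary, not from the original post) where a grammar is written as ordinary recursive functions: producing a sentence is literally running a program, and the emitted words are its trace.

import random

def noun_phrase():
    return random.choice(["the cat", "a compiler", "my neighbor"])

def verb_phrase():
    # The call stack is the hidden state: while building "chases ..." we must
    # remember we are inside a verb phrase as we generate a nested noun phrase.
    return random.choice(["sleeps", "halts", "chases " + noun_phrase()])

def sentence():
    return noun_phrase() + " " + verb_phrase()

print(sentence())  # e.g. "my neighbor chases a compiler"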

To successfully model a language you need to reconstruct the program from its traces. That sounds doable, but it's far more involved than it seems.

Ok, so traces are just paths through a graph, right? Well, no: a finite graph is just a finite state machine. The most famous way of collapsing language onto a finite state machine is the Markov chain (sketched further below), and it outputs sentences that look like this:

People, having a much larger number of varieties, and are very
different from what one can find in Chinatowns accross the country
(things like pork buns, steamed dumplings, etc.) They can be cheap,
being sold for around 30 to 75 cents apiece (depending on size), are
generally not greasy, can be adequately explained by stupidity.
...
So I will conclude by saying that I can well understand that she might
soon have the time, it makes sense, again, to get the gist of my
argument, I was in that (though it's a Republican administration).

From Google Groups

That's... decently sane for a 1-hop statistical model, but if we pay attention we see that things quickly fall apart: the text starts out talking about people and ends up talking about dumplings. With a more complex automaton it feels like you can pull more knowledge out of big datasets, but even after fixing small-scale coherence we are still struggling to capture the broader kind.
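For reference, the 1-hop model that produces text like the excerpt above can be sketched in a few lines of Python; in the trace/program framing, training the chain is a crude attempt at reconstructing a program (a finite state machine) from its traces. The corpus below is a placeholder.

import random
from collections import defaultdict

def train_bigram_chain(text):
    # Record, for every word, which words follow it in the corpus.
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length=20):
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break
        # Each step conditions on exactly one previous word, which is why local
        # phrases look fine while the topic drifts from people to dumplings.
        word = random.choice(followers)
        output.append(word)
    return " ".join(output)

corpus = "the cat sat on the mat and the dog slept on the rug"  # placeholder corpus
chain = train_bigram_chain(corpus)
print(generate(chain, "the"))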

