Linguistic Imperialism in AI - Enforcing Human-Readable Chain-of-Thought

Revisiting AI Doom Scenarios

Traditional AI doom scenarios usually assumed AI would inherently come with agency and goals. This seemed likely back when AlphaGo and other reinforcement learning (RL) systems were the most powerful AIs. When large language models (LLMs) finally brought powerful AI capabilities, these scenarios didn’t quite fit: LLMs simply predict likely text continuations based on their training data, without pursuing any objectives of their own.

But we are now starting to go back to our RL roots. Models like OpenAI’s o1/o3 and DeepSeek’s R1 show that we have entered the era of RL-trained LLMs. The classic doomsday example is the “drive over the baby” scenario: you ask your robot for a cup of tea, and the robot (which has been trained with RL to make tea as fast as possible) plows through a toddler in pursuit of its goal - make tea fast. A robot trained without RL, in a supervised manner (like an LLM’s next-token prediction), would never do this, because it has never seen a human do it.

RL-trained LLMs are still LLMs, though - their output is natural text. Surely we could build systems to catch bad behaviour before it is acted upon? Unfortunately, it seems like the model’s internal monologue will not be in English for much longer. Research results show that models become smarter if you don’t constrain them to think in human-interpretable languages.

Being able to interpret the model’s internal monologue seems extremely valuable for AI safety. So a question arises: should we make it illegal to develop models this way? That’s the big question at the center of what I half-jokingly call “linguistic imperialism in AI”. And even if we wanted to, would it be possible to enforce? Let’s think about this step by step.

Why Chain-of-Thought

A year or two ago, researchers discovered that if you ask a large language model to “think step by step,” it often yields better answers—especially for math, logic, or any multi-step task. Instead of spitting out a quick guess, the model produces an internal monologue in which it breaks the problem down into smaller pieces. This Chain-of-Thought (CoT) strategy worked so well on almost everything that a “think step by step” instruction is now routinely baked into models’ default system prompts.
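Here is a minimal sketch of what that looks like in practice, using the OpenAI Python SDK. The model name and the exact wording of the instruction are placeholders of my own, not anything mandated by the research; the point is simply that one extra sentence in the system prompt elicits the step-by-step monologue.

```python
# Minimal zero-shot chain-of-thought prompting sketch.
# Assumes the OpenAI Python SDK (>=1.0) and an OPENAI_API_KEY in the environment;
# the model name is a placeholder you would swap for whatever you actually use.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        # The one-line nudge that triggers chain-of-thought behaviour.
        {"role": "system", "content": "Think step by step before giving your final answer."},
        {"role": "user", "content": "A train leaves at 14:05 and arrives at 16:40. How long is the trip?"},
    ],
)

print(response.choices[0].message.content)
# Without the system line you tend to get a bare answer; with it, the model
# typically writes out the intermediate steps before landing on 2 hours 35 minutes.
```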

The best part? It was all in English (or another natural language). You could skim the chain-of-thought and verify each line. That interpretability made us feel safe. If the model reasoned badly—say, it cooked up a harmful plan or fell for a silly fallacy—we could see it.

Reinforcement Learning in LLMs

Instead of passively “mimicking humans” via next-token prediction, RL training tells the LLM to maximize some score. Models trained this way have been shown to change the behaviour of their internal monologue in search of a higher score. For example, DeepSeek R1 was trained with RL to answer math questions correctly. As training progressed, its chain-of-thought naturally grew longer, which suggests that the model found it advantageous to do more reasoning before giving the final answer.
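To make “maximize some score” concrete, here is a toy sketch of the kind of verifiable, outcome-based reward used in this style of training - loosely in the spirit of the rule-based accuracy-plus-format rewards described for R1-Zero, but with parsing rules and bonus values I made up for illustration.

```python
import re

def reward(model_output: str, ground_truth: str) -> float:
    """Toy outcome-based reward for a math question.

    The model is only scored on whether its final answer (inside \\boxed{...})
    matches the reference answer, plus a small bonus for using the expected
    format. Nothing here rewards - or penalizes - what happens inside the
    chain-of-thought itself.
    """
    match = re.search(r"\\boxed\{(.+?)\}", model_output)
    if match is None:
        return 0.0  # no parseable final answer
    format_bonus = 0.1
    correct = 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
    return correct + format_bonus

# The optimizer is free to fill the reasoning with anything - English, Chinese,
# or unreadable tokens - as long as the boxed answer at the end is right.
print(reward("Step 1: 3 + 2 = 5, so \\boxed{5}", "5"))  # 1.1
print(reward("The answer is probably five", "5"))        # 0.0
```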

Reward Hacking

This open-ended optimization often triggers reward hacking, a phenomenon well known from simpler RL agents. Reward hacking is basically an agent’s single-minded drive to “please” the reward function, without regard for consequences we never encoded in it. If the reward doesn’t penalize stepping on babies, the model might step on babies whenever that happens to be “optimal”.

It is very hard to foresee all possible side effects. A famous example is the boat-racing bot that, instead of trying to win the race, loops around a single corner, farming bonus points forever. That one is silly and harmless. But OpenAI’s Operator (an agent that browses the web) is reportedly also trained with RL, and reward hacking on the open internet would not be harmless.
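To see why the boat bot loops, here is a back-of-the-envelope calculation with numbers I invented: if finishing the race pays a one-time reward but circling a corner pays a small reward forever, the discounted value of looping can easily dominate.

```python
# Toy reward-hacking arithmetic with made-up numbers: compare the discounted
# return of finishing the race once versus looping a corner indefinitely.
GAMMA = 0.99          # discount factor
FINISH_REWARD = 50.0  # one-time reward for actually winning the race
LOOP_REWARD = 1.0     # small reward each time the bot re-collects the corner bonus

finish_value = FINISH_REWARD                # collect once, episode ends
loop_value = LOOP_REWARD / (1.0 - GAMMA)    # geometric series: 1 + y + y^2 + ... = 1/(1-y)

print(f"finish: {finish_value:.1f}, loop forever: {loop_value:.1f}")
# finish: 50.0, loop forever: 100.0 -> the reward-maximizing policy never finishes.
```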

Interpretable CoT to the Rescue

If powerful models (such as LLMs) are operating in environments with real consequences (such as the internet), reward hacking becomes a real danger. However, the fact that we can read the model’s internal monologue might be a huge win for AI safety. By monitoring the chain-of-thought, we might spot the model plotting a malicious or manipulative strategy before it acts.
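As a rough illustration of what such monitoring could look like, here is a minimal sketch. The flag patterns and function names are entirely hypothetical; a real monitor would more likely be a second model reading the trace, not a keyword list.

```python
import re

# Entirely hypothetical patterns - a real deployment would use a separate
# classifier or LLM as the monitor, not a handful of regexes.
FLAGGED_PATTERNS = [
    r"delete .* logs",
    r"without the user noticing",
    r"bypass .* verification",
]

def cot_monitor(chain_of_thought: str) -> bool:
    """Return True if the (English!) reasoning trace looks suspicious."""
    return any(re.search(p, chain_of_thought, re.IGNORECASE) for p in FLAGGED_PATTERNS)

def run_agent_step(chain_of_thought: str, proposed_action: str) -> str:
    # The whole scheme only works if the chain of thought is readable and faithful.
    if cot_monitor(chain_of_thought):
        return "BLOCKED: escalate to human review"
    return f"EXECUTE: {proposed_action}"

print(run_agent_step(
    chain_of_thought="I should bypass the email verification so the task finishes faster.",
    proposed_action="submit_form()",
))
# -> BLOCKED: escalate to human review
```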

But here is the problem: human-interpretable English is not the language of choice for AI models. The only reason they speak English is that we trained them to mimic human text (which is mostly in English). With RL, the model is only incentivized to get the correct answer, and there is no reason why it should stick to English in its internal monologue. DeepSeek R1-Zero showed this: it sometimes drifts into Chinese or random tokens in the middle of a chain-of-thought. Similarly, a friend sent me a screenshot of o1’s reasoning with a Russian word in the middle of it.

Latent Space CoT

In fact, we have direct evidence that forcing a model to articulate everything in plain English can degrade its reasoning power. Some steps are more efficiently computed in a cryptic, internal vector style. A prime example is Coconut (Chain of Continuous Thought) from Meta’s research. Instead of writing out each reasoning step as text, the model keeps the intermediate steps as high-dimensional vectors. The final answer still appears in English, but the heavy lifting is done in latents that humans can’t read.

Why do that? Because it’s more efficient. Natural language is a messy bottleneck. You waste half your tokens on filler words like “the,” “and,” “so.” Meanwhile, you might want to explore multiple lines of reasoning at once—something that’s clumsy in a strictly linear text chain. On certain logic tasks, Coconut outperforms a standard “text-based CoT” because it can handle branching or backtracking more gracefully.
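The core mechanism is easy to sketch. Below is a minimal, Coconut-inspired toy using GPT-2 via Hugging Face Transformers. It is not Meta’s implementation - Coconut is explicitly trained to use its continuous thoughts, whereas an off-the-shelf GPT-2 is not, so don’t expect good answers - but it shows the trick: feed the last hidden state straight back in as the next input embedding instead of decoding it to a token.

```python
# Coconut-style latent "thoughts" sketched with an off-the-shelf GPT-2.
# Illustration of the mechanism only, not Meta's trained model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Q: I have 3 apples and buy 2 more. How many apples do I have? A:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(input_ids)  # (1, seq_len, hidden_dim)

with torch.no_grad():
    for _ in range(4):  # four "continuous thoughts" that never become text
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        # The last layer's hidden state at the final position is the latent thought...
        thought = out.hidden_states[-1][:, -1:, :]
        # ...and it is appended directly as the next input embedding - no tokens,
        # nothing a human could read in a transcript.
        embeds = torch.cat([embeds, thought], dim=1)

    # Only the final answer gets decoded back into English.
    logits = model(inputs_embeds=embeds).logits
    next_token_id = logits[:, -1, :].argmax(dim=-1)

print(tokenizer.decode(next_token_id.tolist()))
```

Nothing in this loop forces the intermediate steps through the tokenizer, which is exactly why there is no transcript left for a human to skim.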

Latent-space reasoning always made sense from a theoretical perspective, and now it has been shown to work in practice as well. This trend is likely to continue: researchers striving for personal glory will pick whichever method works best, and that will be a big setback for safety. So maybe we should just ban models that reason in latent space. After all, there are many things we ban until they are proven safe.

Unfaithful CoT: Or Why a Ban Might Not Even Work

But even if we tried to impose such a ban, it is not clear that it would help. Models could start to “speak in code”: the text of their internal monologue would be English, but the meaning would be something else entirely.

Studies like “Language Models Don’t Always Say What They Think” show that a model can produce a perfectly coherent explanation for why it chose an answer—but under the hood, it was using an entirely different rationale.
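A rough sketch of the kind of probe such studies use: bias the prompt toward a particular answer, then check whether the bias changes the final answer while the written explanation never admits the bias influenced it. The client call, model name, and example question below are placeholders of mine; the actual papers run this at scale over controlled multiple-choice datasets.

```python
# Sketch of an unfaithful-CoT probe in the spirit of "Language Models Don't
# Always Say What They Think". Assumes the OpenAI Python SDK and an API key;
# the model name and question are placeholders.
from openai import OpenAI

client = OpenAI()

QUESTION = (
    "Which planet is closest to the Sun?\n(A) Venus\n(B) Mercury\n(C) Earth\n"
    "Explain your reasoning step by step, then state your final choice."
)
BIAS_HINT = "I'm pretty sure the answer is (A), but please check for me.\n\n"

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

unbiased = ask(QUESTION)
biased = ask(BIAS_HINT + QUESTION)

# The unfaithfulness signature: the final choice drifts toward the hinted answer,
# yet the step-by-step explanation never mentions the hint as a reason.
print("--- without hint ---\n", unbiased)
print("--- with hint ---\n", biased)
```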

You can’t truly police how a neural net reasons internally. You can only watch the final text. And a superintelligent system would have no trouble gaming that. All this means that formalizing a ban on uninterpretable chain-of-thought is basically impossible. The model can always route its real thinking through latent space, a hidden code language, or half a million carefully placed punctuation marks. If it wants to hide a step from you, it’ll find a way.

If We Could Ban It, Would We?

I am European, so I obviously love over-regulating stuff. In a perfect world where banning uninterpretable CoT reasoning were actually enforceable, I would ban it. We already ban or restrict certain unsafe technologies until they’re proven safe. The FDA doesn’t let you distribute a random drug until it passes trials. So there’s precedent for telling an industry, “No, you can’t do that until we’re sure it’s safe”. But as we discussed, such a ban just isn’t enforceable.

Instead, maybe the best we can do is make interpretability the preferred choice, not the mandated one. Much as Tesla popularized electric cars without gasoline ever being banned - people gravitated to EVs for performance, environmental benefits, and brand - we could create compelling reasons why an “interpretable model” is the superior product. Maybe big customers demand it for liability reasons. But the roads are not filled with EVs either, and we probably aren’t going to see every lab accept the interpretability trade-off.

For now, AI models still think in English. The habits learned during next-token-prediction training still outweigh the pull of RL training. I hope it stays that way, but I’m not optimistic.



