Researchers have discovered a set of "glitch tokens" that cause ChatGPT and other language models to produce bizarre, erratic, and sometimes inappropriate outputs. These tokens seem to break the models in unpredictable ways, leading to hallucinations, evasions, and other strange behaviors when the AI is asked to repeat them.
(Last revised: January 2026. See changelog at the bottom.)
Part of the “Intro to brain-like-AGI safety” post series.
Thus far in the series, Post #1 set up my big picture motivation: what is “brain-like AGI safety” and why do we care? The subsequent six posts (#2–#7) delved into neuroscience. Of those, Posts #2–#3 presented a way of dividing the brain into a “Learning Subsystem” and a “Steering Subsystem”, differentiated by whether they have a property I call “learning from scratch”. Then Posts #4–#7 presented a big picture of how I think motivation and goals work in the brain, which winds up looking kinda like a weird variant on actor-critic model-based reinforcement learning.
Having established that neuroscience background, now we can finally switch in earnest to thinking more explicitly about...
Worse, on top of the speed and manpower problem, there’s also an interpretability problem. The human would be trying to grade “thoughts”, which are activation patterns in a giant inscrutable world-model (§2.7), including new idiosyncratic concepts that the AGI invented itself by continual learning (§8.2 above). Imagine getting a database dump of the connections in Magnus Carlsen’s brain as he’s playing chess, and trying to judge how he’s doing.
I'm surprised by this paragraph. Magnus Carlsen’s Steering Subsystem doesn't have to understand his thoughts about ...
A few more observations.
The definition of iteration we had before implicitly assumes that the agent can observe the full outcome of previous iterations. We don't have to make this assumption. Instead, we can assume a set of possible observations
I believe that Theorem 4 remains valid.
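One natural way to make the partial-observation version precise (my notation, not necessarily the original's): fix a set of possible observations $\mathcal{O}$ and a map $o : \Omega \to \mathcal{O}$ from outcomes to observations; after $n$ iterations the agent's history is $o(\omega_1), \dots, o(\omega_n)$ rather than $\omega_1, \dots, \omega_n$, and the fully-observed setting is recovered when $o$ is injective.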
As we remarked before, DDT is not invariant under adding a constant to the loss function. It is interesting to consider what happens when we add an increasingly large ...
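To illustrate the general phenomenon with a generic example (not DDT itself): a decision rule that depends only on loss differences, such as $\arg\min_a L(a)$ or a softmax over $-L$, is unchanged by $L \mapsto L + c$; but a rule that depends on the loss's absolute scale, say one that weights action $a$ in proportion to $1/L(a)$ or compares $L(a)$ against a fixed threshold, shifts its behavior as $c \to \infty$, typically washing out the differences between actions.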
Oh, I think of "ending factory farming" as very far from "taking over the world".
If Superman were a skilled political operator, it could be as simple as arranging photoshoots with whichever politicians legislated the end of factory farms.
Or if he were less skilled it could involve doing various kinds of property damage to factory farms (potentially even things which there aren't laws against, like flying around them in a way which blows the buildings over).
This might escalate to the government trying to arrest him, and outright conflict, but honestl...
(Last revised: January 2026. See changelog at the bottom.)
Part of the “Intro to brain-like-AGI safety” post series.
Thus far in the series, Post #1 set out some definitions and motivations (what is “brain-like AGI safety” and why should we care?), and Posts #2 & #3 split the brain into a Learning Subsystem (cortex, striatum, cerebellum, amygdala, etc.) that “learns from scratch” using learning algorithms, and a Steering Subsystem (hypothalamus, brainstem, etc.) that is mostly genetically-hardwired and executes innate species-specific instincts and reactions.
Then in Post #4, I talked about the “short-term predictor”, a circuit which learns, via supervised learning, to predict a signal in advance of its arrival, but only by perhaps a fraction of a second. Post #5 then argued that if we form a closed...
The big picture—The whole post will revolve around this diagram. Note that I’m oversimplifying in various ways, including in the bracketed neuroanatomy labels.
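As a concrete (and deliberately crude) illustration of the “short-term predictor” idea recapped above, here is a toy sketch in Python: an online supervised learner that, from a context signal, predicts a target signal a few timesteps before it arrives, trained by the error that becomes available once the target actually shows up. The setup, names, and numbers are all mine, purely for illustration, not the post's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a "context" signal, and a "target" signal that echoes it after a short lag.
T, lag = 2000, 3                      # timesteps; the predictor tries to fire `lag` steps early
context = rng.standard_normal(T)
target = np.roll(context, lag) * 0.8 + 0.1 * rng.standard_normal(T)

w, b = 0.0, 0.0                        # weight and bias of a one-tap linear predictor
lr = 0.01                              # learning rate
errors = []

for t in range(T - lag):
    pred = w * context[t] + b          # prediction made now, about the target `lag` steps ahead
    actual = target[t + lag]           # the signal eventually arrives...
    err = actual - pred                # ...and the delayed error trains the predictor (supervised learning)
    w += lr * err * context[t]
    b += lr * err
    errors.append(err ** 2)

print("mean squared error, first 100 steps:", np.mean(errors[:100]))
print("mean squared error, last 100 steps: ", np.mean(errors[-100:]))
```

Running this, the error shrinks over time as the predictor learns to anticipate the target from the context signal a few steps in advance, which is the only point the sketch is meant to make.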
I think this picture would be clearer if you drew [predict sensory inputs] as a separate box from Thought Generator.
In the picture in my head, there is a [predict sensory inputs] box that receives and tries to predict the sensory input. This box also sends a [current context] signal to both the Thought Generator and the Thought Assessor. Also, [predict sensory inputs] gets some signal from Thought Ge...
Also available in markdown at theMultiplicity.ai/blog/schelling-goodness.
This post explores a notion I'll call Schelling goodness. Claims of Schelling goodness are not first-order moral verdicts like "X is good" or "X is bad." They are claims about a class of hypothetical coordination games in the sense of Thomas Schelling, where the task being coordinated on is a moral verdict. In each such game, participants aim to give the same response regarding a moral question, by reasoning about what a very diverse population of intelligent beings would converge on, using only broadly shared constraints: common knowledge of the question at hand, and background knowledge from the survival and growth pressures that shape successful civilizations. Unlike many Schelling coordination games, we'll be focused on scenarios with no shared history or knowledge...
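To make the coordination-game framing concrete, here's a deliberately crude toy in Python (my own sketch, not something from the post): each simulated participant privately estimates how focal each candidate verdict is, given only a shared "salience" prior plus idiosyncratic noise, and everyone wins only if all answers coincide. The verdict labels and numbers are made up.

```python
import random
from collections import Counter

random.seed(0)

# Toy Schelling coordination game: everyone wins only if all answers match.
# Candidate verdicts on some fixed moral question, with a shared "salience" score
# (how focal each answer seems, given only common knowledge). Values are invented.
salience = {"permissible": 0.2, "impermissible": 0.7, "depends": 0.1}

def choose_verdict(noise=0.15):
    """One participant's answer: pick the verdict that looks most focal to them,
    where their estimate of salience is the shared score plus idiosyncratic noise."""
    perceived = {v: s + random.gauss(0, noise) for v, s in salience.items()}
    return max(perceived, key=perceived.get)

def play_round(n_players=20):
    answers = [choose_verdict() for _ in range(n_players)]
    return Counter(answers), len(set(answers)) == 1   # did everyone coordinate?

rounds = [play_round() for _ in range(1000)]
coordination_rate = sum(success for _, success in rounds) / len(rounds)
print("full-coordination rate over 1000 rounds:", coordination_rate)
print("example round:", rounds[0][0])
```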
Which universal distribution?
Some universal distributions are full of agents whose choices make that distribution an invalid model of reality once the decisions are made (self-defeating). Other distributions are full of agents whose decisions ratify the distribution (self-fulfilling).
Distributions that aren't fixed points under reflection about what they decide about themselves are not coherent models of reality.
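One way to write down the fixed-point condition being gestured at (my formalization, not the comment's): let $F(\mu)$ denote the distribution over outcomes that actually results when the agents inside the environment believe $\mu$ and act on that belief. Then a self-fulfilling distribution satisfies $F(\mu) = \mu$, a self-defeating one has $F(\mu) \neq \mu$, and the claim is that only fixed points of $F$ are coherent models of reality.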
I think this is a very important hypothesis but I disagree with various parts of the analysis.
I think this is an important observation, and is the main thing I would have cited for why the hypothesis might be true. But I think it's plausible that the AI's capabilities here could be separated from its propensities by instrumentalizing the lear...