My comments on this post winded up into a whole separate post: "H-JEPA might be technically alignable in a modified form".
If the external process is predictable, the LLM will move to parts of the state space that best account for the effects of the environment and its model of the most likely sequences (loosely analogous to a Bayesian posterior).
I think it would be more accurate to say that the dynamics of internal states of LLMs parameterise not just the model of sequences but of the world, including token sequences as the sensory manifestation of it.
I'm sure that LLMs already possess some world models (Actually, Othello-GPT Has A Linear Emergent World Representation), the q...
the alignment story for LLMs seems significantly more straightforward, even given all the shoggoth concerns
Could you please elaborate what do you mean by "alignment story for LLMs" and "shoggoth concerns" here? Do you mean the "we can use nearly value-neutral simulators as we please" story here, or refer to the fact that in a way LLMs are way more understandable to humans than more general RL agents because they use human language, or you refer to something yet different?
Overall, I think the issue of causal confusion and OOD misgeneralisation is much more about capabilities than about alignment, especially if we are talking about the long-term x-risk from superintelligent AI, rather than short/mid-term AI risk.
OOD misgeneralisation is absolutely inevitable, due to Gödel's incompleteness of the universe and the fact that all the systems that evolve on Earth generally climb up in complexity. Whenever there is a new invention, such as money, interne...
My current favourite notion of agency, primarily based on Active Inference, which I refined upon reading "Discovering Agents", is the following:
Agency is a property of a physical system from some observer’s subjective perspective. It stems from the observer’s generative model of the world (including the object in question), specifically whether the observer predicts the agent's future trajectory in the state space by assuming that the agent has its own generative model which the agent uses to act. The agent's own generative model also depends on (adapts to...
Do you think if Anthropic (or another leading AGI lab) unilaterally went out of its way to prevent building agents on top of its API, would this reduce the overall x-risk/p(doom) or not? I'm asking because here you seem to assume a defeatist position that only governments are able to shape the actions of the leading AGI labs (which, by the way, are very very few -- in my und... (read more)