All of Roman Leventov's Comments + Replies

if the next generation of models do pose an x-risk, we've mostly already lost; we just don't yet have anything close to the sort of regulatory regime we'd need to deal with that in place

Do you think that if Anthropic (or another leading AGI lab) unilaterally went out of its way to prevent building agents on top of its API, this would reduce the overall x-risk/p(doom) or not? I'm asking because here you seem to assume the defeatist position that only governments are able to shape the actions of the leading AGI labs (which, by the way, are very, very few -- in my und…

If the external process is predictable, the LLM will move to parts of the state space that best account for the effects of the environment and its model of the most likely sequences (loosely analogous to a Bayesian posterior).

I think it would be more accurate to say that the dynamics of internal states of LLMs parameterise not just a model of token sequences but a model of the world, with token sequences as its sensory manifestation.
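One way to make this concrete (a sketch of my own, not something the post commits to): write the sequence model as a marginal over latent world states z, so that in-context conditioning looks like posterior inference over z rather than mere next-token statistics.

```latex
% Sequence-only view: the model directly parameterises
%   p_\theta(x_{t+1} \mid x_{1:t})
% World-model view (hypothetical framing): tokens x are sensory
% manifestations of a latent world state z, and the same predictive
% distribution arises as a marginal:
\[
  p_\theta(x_{t+1} \mid x_{1:t})
    = \int p_\theta(x_{t+1} \mid z)\, p_\theta(z \mid x_{1:t})\, dz ,
\]
% where conditioning on x_{1:t} plays the role of updating an (approximate)
% posterior p_\theta(z \mid x_{1:t}) over world states -- the "loosely
% analogous to a Bayesian posterior" reading above.
```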

I'm sure that LLMs already possess some world models (Actually, Othello-GPT Has A Linear Emergent World Representation), the q…
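For concreteness, the Othello-GPT result is roughly that a linear probe trained from the network's internal activations recovers the board state, i.e. the world representation is linearly decodable. A minimal sketch of that kind of probing setup (illustrative only; `hidden_states` and `board_labels` are stand-in arrays, not the paper's actual pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: hidden_states[i] is the residual-stream activation at some
# layer for position i; board_labels[i] is the ground-truth state of one board
# square at that position (0 = empty, 1 = mine, 2 = theirs).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(10_000, 512))   # placeholder activations
board_labels = rng.integers(0, 3, size=10_000)   # placeholder square states

# A *linear* probe: if it reaches high held-out accuracy on real activations,
# the board state is linearly decodable from the model's internal states.
split = 8_000
probe = LogisticRegression(max_iter=1_000)
probe.fit(hidden_states[:split], board_labels[:split])
print("probe accuracy:", probe.score(hidden_states[split:], board_labels[split:]))
```

(On the random placeholder data above the accuracy is of course near chance; the point is only the shape of the experiment.)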

the alignment story for LLMs seems significantly more straightforward, even given all the shoggoth concerns

Could you please elaborate on what you mean by "alignment story for LLMs" and "shoggoth concerns" here? Do you mean the "we can use nearly value-neutral simulators as we please" story, are you referring to the fact that LLMs are in a way much more understandable to humans than more general RL agents because they use human language, or do you mean something different still?

Vladimir Nesov:
See this thread (including my reply) for a bit on the shoggoth issue.

OOD misgeneralisation is unlikely to be a direct x-risk from superintelligence

Overall, I think the issue of causal confusion and OOD misgeneralisation is much more about capabilities than about alignment, especially if we are talking about the long-term x-risk from superintelligent AI, rather than short/mid-term AI risk.

OOD misgeneralisation is absolutely inevitable, due to the Gödelian incompleteness of the universe and the fact that the systems that evolve on Earth generally climb up in complexity. Whenever there is a new invention, such as money, interne…

Robert Kirk:
The main argument of the post isn't "ASI/AGI may be causally confused, what are the consequences of that" but rather "scaling up static pretraining may result in causally confused models, which hence probably wouldn't be considered ASI/AGI". I think in practice, if we get AGI/ASI, then almost by definition I'd think it's not causally confused.

In a theoretical sense this may be true (I'm not really familiar with the argument), but in practice OOD misgeneralisation is probably a spectrum, and models can be more or less causally confused about how the world works. We're arguing here that static training, even when scaled up, plausibly doesn't lead to a model that isn't causally confused about a lot of how the world works.

No reason; I'll edit the post to use goal misgeneralisation. Goal misgeneralisation is the standard term, but hasn't been so for very long (see e.g. this tweet: https://twitter.com/DavidSKrueger/status/1540303276800983041).

Given that the model is trained statically, while it could hypothesise about additional variables of the kinds you listed, it can never know which variables or which values for those variables are correct without domain labels or interventional data. Specifically, while "Discovering such hidden confounders doesn't give interventional capacity" is true, to discover these confounders he needed interventional capacity.

We're not saying that P(shorts, icecream) is good for decision making, but P(shorts, do(icecream)) is useful insofar as the goal is to make someone wear shorts and providing icecream is one of the possible actions (as the causal model will demonstrate that providing icecream isn't useful for making someone wear shorts).

They are meant to be referring to the previous parts of the argument, but I've just realised that this hasn't worked as the labels aren't correct. I'll fix that.
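To make the shorts/ice-cream point concrete, here is a toy simulation (illustrative numbers of my own, not from the post): hot weather is a hidden confounder that causes both ice cream and shorts, so the observational P(shorts | icecream) looks predictive, while the interventional P(shorts | do(icecream)) shows that handing out ice cream does nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hidden confounder: hot weather causes both ice cream eating and shorts wearing.
hot = rng.random(n) < 0.5
icecream = rng.random(n) < np.where(hot, 0.8, 0.1)
shorts = rng.random(n) < np.where(hot, 0.9, 0.1)

# Observational: P(shorts | icecream=1) -- ice cream looks "predictive" of shorts.
print("P(shorts | icecream=1):   ", shorts[icecream].mean())   # roughly 0.81

# Interventional: do(icecream=1) -- hand everyone ice cream; shorts are still
# driven only by the weather, so the intervention changes nothing.
shorts_do = rng.random(n) < np.where(hot, 0.9, 0.1)
print("P(shorts | do(icecream=1)):", shorts_do.mean())          # roughly 0.50
```

A model trained only on static observational data can fit the first quantity perfectly and still be useless for the second, which is the causal-confusion worry.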

My current favourite notion of agency, primarily based on Active Inference, which I refined upon reading "Discovering Agents", is the following:

Agency is a property of a physical system from some observer's subjective perspective. It stems from the observer's generative model of the world (including the object in question): specifically, whether the observer predicts the agent's future trajectory in the state space by assuming that the agent has its own generative model which the agent uses to act. The agent's own generative model also depends on (adapts to…
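A rough way to formalise this comparison (my own notation, loosely following the Active Inference literature): the observer entertains two hypotheses about the system's trajectory and ascribes agency when the "has its own generative model" hypothesis predicts better.

```latex
% Observer's two hypotheses about a system with trajectory s_{1:T}:
%   H_phys : plain dynamics, no nested model
%   H_agent: the system selects actions that minimise expected free energy G
%            under its *own* (attributed) generative model q_\phi
\[
  p(s_{1:T} \mid H_{\mathrm{phys}}) = \prod_t p(s_{t+1} \mid s_t),
  \qquad
  p(s_{1:T} \mid H_{\mathrm{agent}}) = \prod_t \sum_{a_t}
     p(s_{t+1} \mid s_t, a_t)\, \sigma\!\bigl(-G_{q_\phi}(a_t \mid s_t)\bigr),
\]
\[
  \text{ascribe agency when } \;
  \log \frac{p(s_{1:T} \mid H_{\mathrm{agent}})}{p(s_{1:T} \mid H_{\mathrm{phys}})} \gg 0 .
\]
% Here \sigma(\cdot) is a softmax over candidate actions, and G_{q_\phi} is the
% expected free energy computed under the agent's own attributed model q_\phi.
```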

Abram Demski:
What it means for the agent to "have a generative model" is that the agent systematically corrects this model based on its experience (to within some tolerable competence!).

A model/belief/representation depends on reference maintenance, but in general, the machinery of reference maintenance can and usually should extend far beyond the representation itself. For example, an important book will tend to get edition updates, but the complex machinery which results in such an update extends far beyond the book's author. A telescope produces a representation of far-away space, but the empty space between the telescope and the stars is also instrumental in maintaining the reference (e.g., it must remain clear of obstacles). A student does a lot of work "within their own boundaries" to maintain their knowledge, but they also use notebooks, computers, etc. The student's teachers are also heavily involved in the reference maintenance.

I'm not a big fan of active inference. It strikes me as, basically, a not-particularly-great scheme for injecting randomness into actions to encourage exploration.
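To spell out the reading being gestured at here, a toy version of the standard active-inference action rule looks roughly like this (toy numbers of my own, not any particular implementation); the softmax over negative expected free energy is where both the exploration bonus and the injected randomness live.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expected-free-energy decomposition for 3 candidate actions:
#   G(a) = risk(a) - info_gain(a)
# risk: divergence between predicted outcomes and preferred outcomes
# info_gain: expected reduction in uncertainty about hidden states
risk = np.array([0.2, 0.5, 0.9])
info_gain = np.array([0.0, 0.4, 0.1])
G = risk - info_gain

# Active-inference-style policy: sample actions from a softmax over -G.
# The info_gain term rewards exploratory actions, and the sampling itself
# injects randomness -- the two features the "exploration scheme" reading
# points at.
temperature = 1.0
p_action = np.exp(-G / temperature)
p_action /= p_action.sum()
action = rng.choice(len(G), p=p_action)
print("action probabilities:", p_action, "sampled action:", action)
```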