In a variety of conversations about AI misalignment risks, I find that it is important to be able to clearly point at different levels of AI capability. My current favorite approach is to talk about how much the AI accelerates AI R&D[1] labor.
I define acceleration of AI R&D labor by Y times as "the level of acceleration which is as useful (for making more powerful AIs) for an AI company as having its employees run Y times faster[2] (when you allow the total inference compute budget for AI assistance to be equal to total salaries)". Importantly, a 5x AI R&D labor acceleration won't necessarily mean that research into making AI systems more powerful happens 5x faster, as this just refers to increasing the labor part of the...
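As a rough illustration of why a 5x labor acceleration need not translate into 5x faster progress, here is a toy calculation. It assumes, purely for illustration and not as anything stated in the post, that research progress follows a Cobb-Douglas form in labor and compute, with a hypothetical `labor_share` exponent. Compute is held fixed, so only the labor term scales.

```python
def overall_speedup(labor_multiplier: float, labor_share: float) -> float:
    """Toy model: progress is proportional to labor**a * compute**(1 - a).

    With compute held fixed, multiplying labor by `labor_multiplier`
    multiplies progress by labor_multiplier**a, where a = labor_share.
    Both the functional form and `labor_share` are illustrative assumptions.
    """
    if labor_multiplier <= 0:
        raise ValueError("labor_multiplier must be positive")
    if not 0.0 <= labor_share <= 1.0:
        raise ValueError("labor_share must be in [0, 1]")
    return labor_multiplier ** labor_share


if __name__ == "__main__":
    # Compare a 5x labor acceleration under different assumed labor shares.
    for share in (0.25, 0.5, 1.0):
        print(f"labor share {share}: {overall_speedup(5.0, share):.2f}x overall")
```

Under this sketch, the overall speedup only equals the labor multiplier when labor is the sole input (`labor_share = 1.0`). Any smaller share gives a smaller overall speedup.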
A more specific argument to expect spikiness and therefore TED-AI to be vastly superhuman in most areas:
I would generally expect faster takeoff speeds in domains that AIs are worse at, or that AI companies aren't prioritizing very highly.
That is, the calendar time between "the AI is pretty good at it (by human standards)" and "the AI is far, far superhuman at it" will be shorter. This is because, as AIs get better at AI R&D and other inputs to broadly improving capabilities (eventually including hardware R&D and automating hardware manufacturing), the pace of progress in all areas will accelerate. So for capabilities that AIs struggle with, ...
We would often like to get a qualitative sense of a target model’s behaviors in important distributions (e.g. deployment, RL training, or evals). For example, we might want to discover novel behaviors, figure out what causes some target behavior to occur, or find surprising correlations between behaviors.
In a recent short exploratory project, we tackled this problem via LLM-Driven Feature Discovery. Our method works as follows:
I did some related work with mentees last year that extends EDW by using an unsupervised optimisation target: the average CE loss of an LLM on items in the dataset when given the featurisation of the item in the prompt (https://arxiv.org/abs/2502.17541). That kind of approach can help you prioritise the features that explain the most about the dataset you're investigating (kind of a natural language PCA), and can be applied to unstructured data. I think stuff like this is pretty interesting and would be excited to see more of it!
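A minimal sketch of that kind of objective, under stated assumptions: `token_logprobs` is a hypothetical interface to whatever LLM is used (returning natural-log probabilities for each token of the item given the prompt), the prompt template is made up, and averaging per token rather than per item is a choice made here, not something taken from the linked paper. The toy scorer below is a stand-in for a real model so the example runs on its own.

```python
import math
from typing import Callable, Iterable, List, Sequence


def avg_ce_loss(
    items: Iterable[str],
    featurise: Callable[[str], List[str]],
    token_logprobs: Callable[[str, str], Sequence[float]],
) -> float:
    """Mean per-token cross-entropy of items given their featurisation.

    `token_logprobs(prompt, item)` is a hypothetical LLM interface.
    """
    total_nll = 0.0
    total_tokens = 0
    for item in items:
        # Hypothetical prompt template: list the item's features up front.
        prompt = "Features: " + ", ".join(featurise(item))
        logprobs = token_logprobs(prompt, item)
        total_nll -= sum(logprobs)
        total_tokens += len(logprobs)
    if total_tokens == 0:
        raise ValueError("no tokens to score")
    return total_nll / total_tokens


def toy_token_logprobs(prompt: str, item: str) -> List[float]:
    """Stand-in for a real LLM: words already named in the prompt are 'easy'."""
    hinted = set(prompt.lower().replace(",", " ").split())
    return [
        math.log(0.5 if word.lower() in hinted else 0.01)
        for word in item.split()
    ]


def no_features(item: str) -> List[str]:
    """Baseline featurisation that says nothing about the item."""
    return []


def animal_features(item: str) -> List[str]:
    """Toy featurisation that names the animal mentioned in the item."""
    return [word for word in item.split() if word in {"cat", "dog"}]


dataset = ["the cat sat", "the dog ran"]
loss_empty = avg_ce_loss(dataset, no_features, toy_token_logprobs)
loss_informative = avg_ce_loss(dataset, animal_features, toy_token_logprobs)
print(f"no features:     {loss_empty:.3f}")
print(f"animal features: {loss_informative:.3f}")
```

A featurisation that lowers the loss more is, in this sense, explaining more of the dataset, so candidate features can be ranked by how much they reduce it.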
I like your suggestion of a pro...
Also available in markdown at theMultiplicity.ai/blog/schelling-goodness.
This post explores a notion I'll call Schelling goodness. Claims of Schelling goodness are not first-order moral verdicts like "X is good" or "X is bad." They are claims about a class of hypothetical coordination games in the sense of Thomas Schelling, where the task being coordinated on is a moral verdict. In each such game, participants aim to give the same response regarding a moral question, by reasoning about what a very diverse population of intelligent beings would converge on, using only broadly shared constraints: common knowledge of the question at hand, and background knowledge from the survival and growth pressures that shape successful civilizations. Unlike many Schelling coordination games, we'll be focused on scenarios with no shared history or knowledge...
...
- Version C: You and I and 10 randomly sampled 2026 humans are all in the same situation, all guessing where the largest subset of the group will show up for the meeting. We have common knowledge of this, and that everyone is trying to guess the same answer.
Pause to reflect on this, and how our intent to converge with additional people affects your confidence level that you will pick the most common answer.
. . . . . . . . .
In Version C, are you more confident, or less confident, that you will guess correctly?
. . . . . . . . .
Probably you're more confident, r
Alignment is often conceptualized as AIs helping humans achieve their goals: AIs that increase people’s agency and empowerment; AIs that are helpful, corrigible, and/or obedient; AIs that avoid manipulating people. But that last one—manipulation—points to a challenge for all these desiderata: a human’s goals are themselves under-determined and manipulable, and it’s awfully hard to pin down a principled distinction between changing people’s goals in a good way (“providing counsel”, “providing information”, “sharing ideas”) versus a bad way (“manipulating”, “brainwashing”).
The manipulability of human desires is hardly a new observation in the alignment literature, but it remains unsolved (see lit review in §3 below).
In this post I will propose an explanation of how we humans intuitively conceptualize the distinction between guidance (good) vs manipulation (bad), in case it...
Thanks! Seems like the upshot is that humans have a complex soup of desires around learning from and epistemically interacting with AIs, and you’re proposing to use the word “legitimate” to describe that soup, which I agree is a helpful way of thinking about it (and probably better than “not-manipulating”), but also not a True Name in the John Wentworth sense.
The proposal in the last paragraph seems unhelpful for me-in-particular, because I’m working on how to install goals in brain-like AGIs, and I have ideas that seem promising but only work for a limite...
ARC has teamed up with AIcrowd to launch the ARC White-Box Estimation Challenge, a contest to improve upon our estimation algorithms for random MLPs. The warm-up round begins this week, and later rounds will have a total prize pool of at least $100,000.
We are very grateful to Sharada Mohanty, Sneha Nanavati, Dipam Chakraborty and everyone else at AIcrowd for working with us to host this contest, as well as to Paul Rosu for testing the contest and to Harshita Khera for operational support.
Our challenge follows the same setup as our recent paper on wide random MLPs: we consider MLPs
where the activation function
To begin with, we are fixing the width
Update: We have now launched Phase 1 of the Challenge with $50,000 in prizes:
For Phase 1, we have increased the depth of the network from 8 to 32 hidden layers. Our existing algorithms scale poorly with depth, and so we expect there to be significant room for improvement. Phase 1 lasts until the end of July, after which Phase 2 begins. For Phase 2, there will be a prize pool of at least $100,000, and we may change the architectural parameters agai...