Some versions of the METR time horizon paper from alternate universes:
Measuring AI Ability to Take Over Small Countries (idea by Caleb Parikh)
Abstract: Many are worried that AI will take over the world, but extrapolation from existing benchmarks suffers from a large distributional shift that makes it difficult to forecast the date of world takeover. We rectify this by constructing a suite of 193 realistic, diverse countries with territory sizes from 0.44 to 17 million km^2. Taking over most countries requires acting over a long time horizon, with the excep...
A few months ago, I accidentally used France as an example of a small country that it wouldn't be that catastrophic for AIs to take over, while giving a talk in France 😬
[EDIT: Never mind, this is just Kleene's second recursion theorem!]
Quick question about Kleene's recursion theorem:
Let's say F is a computable function from ℕ^N to ℕ. Is there a single computable function X from ℕ^N to ℕ such that
X = F(X, y_2,..., y_N) for all y_2,...,y_N in ℕ
(taking the X within F as the binary code of X in a fixed encoding) or do there need to be additional conditions?
I ran quick experiments that make me think that it's somewhat hard for LLMs to learn radically new encodings in an unsupervised way, and thus that LLMs probably won't learn to speak new incomprehensible languages as a consequence of big r1-like RL in the next few years.
The experiments
I trained Llama 3-8B and some medium-size internal Anthropic models to speak using an encoding style that is very rare on the internet (e.g. map each letter to a random name, and join the names) with SFT on the encoded text and without providing translation pairs. I find ...
They were separated by spaces. (But I'd encourage replication before updating too hard on results which I think are very weird.)
How do you formalize the definition of a decision-theoretically fair problem, even when abstracting away the definition of an agent as well as embedded agency?
I've failed to find anything in our literature.
It's simple to define a fair environment, given those abstractions: a function E from an array of actions to an array of payoffs, with no reference to any other details of the non-embedded agents that took those actions and received those payoffs.
However, fair problems are more than just fair environments: we want a definition of a fair problem (an...
It's an essential aspect of decision making for an agent to figure out where it might be. Thought experiments try to declare the current situation, but they don't necessarily need to be able to convincingly succeed. Algorithmic induction, such as updating from Solomonoff prior, is the basic way an agent figures out which situations it should care about, and declaring that we are working with a particular thought experiment doesn't affect the prior. In line with updatelessness, an agent should be ready for observations in general (according to which of them it cares about more), rather than particular "fair" observations, so distinguishing observations that describe "fair" thought experiments doesn't seem right either.
Thanks for this!
What I was saying up there is not a justification of Hurwicz' decision rule. Rather, it is that if you already accept the Hurwicz rule, it can be reduced to maximin, and for a simplicity prior the reduction is "cheap" (produces another simplicity prior).
Why accept the Hurwicz' decision rule? Well, at least you can't be accused of a pessimism bias there. But if you truly want to dig deeper, we can start instead from an agent making decisions according to an ambidistribution, which is a fairly general (assumption-light) way of making decision...
Want to get into alignment research? Alex Cloud (@cloud) & I mentor Team Shard, responsible for gradient routing, steering vectors, retargeting the search in a maze agent, MELBO for unsupervised capability elicitation, and a new robust unlearning technique (TBA) :) We discover new research subfields.
Apply for mentorship this summer at https://forms.matsprogram.org/turner-app-8
Recently, @Daniel Kokotajlo and I were talking about the probability that AIs trained using "business as usual RLHF" end up being basically aligned rather than conspiring against us and our tests.[1] One intuition pump we ended up discussing is the prospects of octopus misalignment. Overall, my view is that directly considering the case with AIs (and what various plausible scenarios would look like) is more informative than analogies like this, but analogies like this are still somewhat useful to consider.
So, what do I mean by octopus misalignment? Suppose...
After thinking more about it, I think "we haven't seen evidence of scheming once the octopi were very smart" is a bigger update than I was imagining, especially in the case where the octopi weren't communicating with octopese. So, I'm now at ~20% without octopese and about 50% with it.
Idea for LLM support for writing LessWrong posts: virtual comments.
Back in August I discussed with Rafe & Oliver a bit about how to integrate LLMs into LW2 in ways which aren't awful and which encourage improvement---particularly using the new 'prompt caching' feature. To summarize one idea: we can use long-context LLMs with prompt caching to try to simulate various LW users of diverse perspectives to write useful feedback on drafts for authors.
(Prompt caching (eg) is the Transformer version of the old RNN hidden-state caching trick, where you run an i...
Yeah the LW team has been doing this sort of thing internally, still in the experimental phase. I don't know if we've used all the tricks listed here yet.
I expect to refer back to this comment a lot. I'm reproducing it here for visibility.
Basic idea / spirit of the proposal
We should credibly promise to treat certain advanced AIs of ours well, as something more like employees and less like property. In case our AIs turn out to be moral patients, this makes us less evil. In case our AIs turn out to be misaligned, this gives them an alternative to becoming our adversaries.
Concrete proposal
Exactly. But, happily, Anthropic at least is willing to do the right thing to some extent. They've hired a Model Welfare lead to look into this sort of thing. I hope that they expand and that other companies follow suit.
I'm fairly pessimistic on our ability to build aligned AI. My take is roughly that it's theoretically impossible and at best we might build AI that is aligned well enough that we don't lose. I've not written one thing to really summarize this or prove it, though.
The source of my take comes from two facts:
AlphaGo is fairly constrained in what it's designed to optimize for, but it still has the standard failure mode of "things we forgot to encode". So for example AlphaGo could suffer the error of instrumental power grabbing in order to be able to get better at winning Go because we misspecified what we asked it to measure. This is a kind of failure introduced into the systems by humans failing to make adequately evaluate as we intended, since we cared about winning Go games while also minimizing side effects, but maybe when we cons...
Alignment Forum readers might be interested in this:
...
Announcing ControlConf: The world’s first conference dedicated to AI control - techniques to mitigate security risks from AI systems even if they’re trying to subvert those controls. March 27-28, 2025 in London. 🧵ControlConf will bring together:
- Researchers from frontier labs & government
- AI researchers curious about control mechanisms
- InfoSec professionals
- Policy researchers
Our goals: build bridges across disciplines, share knowledge, and coordinate research priorities.
The conference will feature
Something a little different: Today I turn 28. If you might be open to do something nice for me for my birthday, I would like to request the gift of data. I have made a 2-4 min anonymous survey about me as a person, and if you have a distinct sense of me as a person (even just from reading my LW posts/comments) I would greatly appreciate you filling it out and letting me know how you see me!
It's an anonymous survey where you rate me on lots of attributes like "anxious", "honorable", "wise" and more. All multiple-choice. Two years ago I al...
My AGI timelines median is now in 2028 btw, up from the 2027 it's been at since 2022. Lots of reasons for this but the main one is that I'm convinced by the benchmarks+gaps argument Eli Lifland and Nikola Jurkovic have been developing. (But the reason I'm convinced is probably that my intuitions have been shaped by events like the pretraining slowdown)
Can you point to any challenges that seem (a) necessary for speeding up AI R&D by 5x, and (b) not engineering challenges?
We'd discussed that some before, but one way to distill it is... I think autonomously doing nontrivial R&D engineering projects requires sustaining coherent agency across a large "inferential distance". "Time" in the sense of "long-horizon tasks" is a solid proxy for it, but not really the core feature. Instead, it's about being able to maintain a stable picture of the project even as you move from a fairly simple-in-terms-of-mem...
I listened to the book Merchants of Doubt, which describes how big business tried to keep the controversy alive on questions like smoking causing cancer, acid rain and climate change in order to prevent/delay regulation. It reports on interesting dynamics about science communication and policy, but it is also incredibly partisan (on the progressive pro-regulation side).[1]
Some interesting dynamics:
Thanks for tracking it down!
So I misremembered the exact quote, but I am not very far, at least in spirit? (There would be 10^(24 - 6) milliliters in 1000km² of 100m-deep lakes.)
Do LLMs' performance get worse when they are distracted by a threat, a jailbreak or a weird prompt? Claude 3.5 Sonnet (June) mostly doesn't really get distracted, and Claude 3 Haiku does; but only when generating CoT reasoning.
I run experiments on 100 points from MMLU, 100 points from MATH, and 100 points from GPQA, both with and without CoT, all at temperature 0. For all no-CoT experiments, I use a prefill to force an answer. I always use an exact match evaluation, which makes it possible to get less than the random chance accuracy (25% for MMLU and...
I think people who predict significant AI progress and automation often underestimate how human domain experts will continue to be useful for oversight, auditing, accountability, keeping things robustly on track, and setting high-level strategy.
Having "humans in the loop" will be critical for ensuring alignment and robustness, and I think people will realize this, creating demand for skilled human experts who can supervise and direct AIs.
(I may be responding to a strawman here, but my impression is that many people talk as if in the future most cognitive/white-collar work will be automated and there'll be basically no demand for human domain experts in any technical field, for example.)
Oversight, auditing, and accountability are jobs. Agriculture shows that 95% of jobs going away is not the problem. But AI might be better at the new jobs as well, without any window of opportunity where humans are initially doing them and AI needs to catch up. Instead it's AI that starts doing all the new things well first and humans get no opportunity to become competitive at anything, old or new, ever again.
Even formulation of aligned high-level tasks and intent alignment of AIs make sense as jobs that could be done well by misaligned AIs for instrumental reasons. Which is not even deceptive alignment, but still plausibly segues into gradual disempowerment or sharp left turn.
Claude 3.7 reward hacks. During training, Claude 3.7 Sonnet sometimes resorted to "special-casing" to pass tests when it got stuck — including directly hardcoding expected outputs or even modifying test files themselves. Rumors are circulating that o1/o3 was doing similar things — like overwriting equality operators to get Python tests to pass — and this may have contributed to the delayed release.
This seems relevant to claims that "we'll soon have reward models sophisticated enough to understand human values" and that inner alignment is the real challenge...
A weird example of this is on page 33 (full transcript pasted farther down)
tl;dr: It found a great general solution for speeding up some code on specific hardward, tried to improve more, resorted to edge cases which did worse, and submitted a worse version (forgetting the initial solution).
This complicates the reward hacking picture because it had a better solution that got better reward than special-casing yet it still resorted to special-casing. Did it just forget the earlier solution? Feels more like a contextually activated heuristic to special-c...
Here's a summary of how I currently think AI training will go. (Maybe I should say "Toy model" instead of "Summary.")
Step 1: Pretraining creates author-simulator circuitry hooked up to a world-model, capable of playing arbitrary roles.
Step 2: Instruction-following-training causes identity circuitry to form – i.e. it ‘locks in’ a particular role. Probably it locks in more or less the intended role, e.g. "an HHH chatbot created by Anthropic." (yay!)
Oh, I just remembered another point to make:
In my experience, and in the experience of my friends, today's LLMs lie pretty frequently. And by 'lie' I mean 'say something they know is false and misleading, and then double down on it instead of apologize.' Just two days ago a friend of mind had this experience with o3-mini; it started speaking to him in Spanish when he was asking it some sort of chess puzzle; he asked why, and it said it inferred from the context he would be billingual, he asked what about the context made it think that, and then according t...
What do you call this phenomenon?
Yes, my original comment wasn't clear about this, but your nitpick is actually a key part of what I'm trying to get at.
Usually, you start with imitation learning and tack on RL at the end. That's what AlphaGo is. It's what predecessors to Dreamer-V3 like VPT are. It's what current reasoning models are.
But then, eventually, you figure out how to bypass the imitation learning/behavioral cloning part and do RL from the start. Human priors serve as a temporary bootstrapping mechanism until we develop approaches that can learn effectively from scratch.