AI ALIGNMENT FORUM

The strategy-stealing assumption
Best of LessWrong 2019

The strategy-stealing assumption posits that for any strategy an unaligned AI could use to influence the long-term future, there is an analogous strategy that humans could use to capture similar influence. Paul Christiano explores why this assumption might be true, and eleven ways it could potentially fail. 

by paulfchristiano
Daniel Kokotajlo · 8d · 80
AIs will greatly change engineering in AI companies well before AGI
I appreciate your recent anti-super-short-timelines posts, Ryan, and basically agree with them. I'm curious who you see yourself as arguing against. Maybe me? But I haven't had 2027 timelines since last year; now I'm at 2029.
Rohin Shah · 14d · 2423
Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro
I agree with this directionally, and generally tend to believe that certain kinds of straight lines will continue to be straight. But also, iirc you rely a lot on the METR trend line, which is (a) noisy, (b) over a short timescale (~2 years[1]), and (c) not a great measure of what people care about[2][3]. I don't think you should strongly expect that particular line to continue to be straight over a long time period. (Though I tend to think that it's more likely to slow down by 2027 than to speed up, relative to what you'd predict by extrapolating the 2024-2025 trend, so I agree with the overall conclusion of the post. And obviously one should have non-trivial probability on each of {slow down, stay the same, speed up}, as you say.)

1. ^ I'm excluding GPT-2 and GPT-3, which seem very noisy.
2. ^ Compared to things like Moore's Law, or energy efficiency of solar, or pretraining perplexity, or other similar things where we observe straight lines.
3. ^ None of this is to say that the METR graph is bad! It is the best empirical evidence on timelines I'm aware of.
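(As a rough illustration of what extrapolating such a trend line involves: a minimal sketch, where the ~7-month doubling time and 60-minute starting horizon are illustrative placeholders, not figures from this comment or from METR.)

```python
# Illustrative only: project a task-length "horizon" forward assuming it keeps
# doubling at a fixed rate. The doubling time and starting horizon below are
# placeholder assumptions, not METR's published estimates.
from datetime import date

def extrapolate_horizon(h0_minutes: float, start: date, target: date,
                        doubling_months: float = 7.0) -> float:
    months = (target.year - start.year) * 12 + (target.month - start.month)
    return h0_minutes * 2 ** (months / doubling_months)

# A 60-minute horizon in early 2025, projected to early 2027 under these assumptions:
print(round(extrapolate_horizon(60, date(2025, 1, 1), date(2027, 1, 1))))  # ~646 minutes
```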
Kaj_Sotala · 22d · 54
AI companies have started saying safeguards are load-bearing
I was amused when Claude Opus abruptly stopped generating a reply to me and shut down the chat when I had asked it how a fictional galactic empire might control its frontier planets. Given that it stopped generating in the middle of a sentence that was talking about "biological monitoring" and "enhanced", I surmised that the reference to the genetically engineered catboys/catgirls in the setting had triggered its bioengineering filters.
10 · Zvi
This post is even-handed and well-reasoned, and explains the issues involved well. The strategy-stealing assumption seems important, as a lot of predictions inherently rely on it either being essentially true or effectively false, and I think the assumption will often effectively be a crux in those disagreements, for reasons the post illustrates well. The weird thing is that Paul ends the post saying he thinks the assumption is mostly true, whereas I thought the post was persuasive that the assumption is mostly false. The post illustrates that the unaligned force is likely to have many strategic and tactical advantages over aligned forces, which should allow the unaligned force to, at a minimum, 'punch above its weight' in various ways even under close-to-ideal conditions. And after the events of 2020, and my resulting updates to my model of humans, I'm highly skeptical that we'll get close to ideal. Either way, I'm happy to include this.

Recent Discussion

Alignment as uploading with more steps
22
Cole Wyeth
4d

Epistemic status: This post removes epicycles from ARAD, resulting in an alignment plan which I think is better - though not as original, since @michaelcohen has advocated the same general direction (safety of imitation learning). However, the details of my suggested approach are substantially different. This post was inspired mainly by conversations with @abramdemski.

Motivation and Overview

Existence proof for alignment. Near-perfect alignment between agents of lesser and greater intelligence is in principle possible for some agents by the following existence proof: one could scan a human's brain and run a faster emulation (or copy) digitally. In some cases, the emulation may plausibly scheme against the original - for instance, if the original forced the emulation to work constantly for no reward, perhaps the emulation would try to break...

1 · Cole Wyeth · 1d
I think 4 is basically right, though human values aren’t just fuzzy, they’re also quite complex, perhaps on the order of complexity of the human’s mind, meaning you pretty much have to execute the human’s mind to evaluate their values exactly. Some people, like very hardcore preference utilitarians, have values dominated by a term much simpler than their minds’. However, even those people usually have somewhat self-referential preferences, in that they care at least a bit extra about themselves and those close to them, and this kind of self-reference drastically increases the complexity of values if you want to include it. For instance, I value my current mind being able to do certain things in the future (learn stuff, prove theorems, seed planets with life) somewhat more than I would value that for a typical human’s mind (though I am fairly altruistic).

I suppose that a pointer to me is probably a lot simpler than a description/model of me, but that pointer is very difficult to construct, whereas I can see how to construct a model using imitation learning (obviously this is a “practical” consideration). Also, the model of me is then the thing that becomes powerful, which satisfies my values much more than my values can be satisfied by an external alien thing rising to power (unless it just uploads me right away, I suppose).

I’m not sure that even an individual’s values always settle down into a unique equilibrium; I would guess this depends on their environment.

Unrelatedly, I am still not convinced we live in a mathematical multiverse, or even necessarily a mathematical universe. (Finding out we lived in a mathematical universe would make a mathematical multiverse seem very likely, for the ensemble reasons we have discussed before.)
2 · Wei Dai · 6h
Do you think it's ok to base an AI alignment idea/plan on a metaethical assumption, given that there is a large spread of metaethical positions (among both amateur and professional philosophers) and it looks hard to impossible to resolve or substantially reduce the disagreement in a relevant timeframe? (I noted that the assumption is weight-bearing, since you can arrive at an opposite conclusion of "non-upload necessity" given a different assumption.) (Everyone seems to do this, and I'm trying to better understand people's thinking/psychology around it, not picking on you personally.)

Not sure if you can or want to explain this more, but I'm pretty skeptical, given that distributional shift / OOD generalization has been a notorious problem for ML/DL (hence probably not neglected), and I haven't heard of much theoretical or practical progress on this topic.

What about people whose values are more indexical (they want themselves to be powerful/smart/whatever, not a model/copy of them), or less personal (they don't care about themselves or a copy being powerful, they're fine with an external Friendly AI taking over the world and ensuring a good outcome for everyone)?

Yeah, this is covered under position 5 in the above linked post.

Not completely unrelated. If this is false, and an ASI acts as if it's true, then it could waste a lot of resources, e.g. doing acausal trading with imaginary counterparties. And I also don't think uncertainty about this philosophical assumption can be reduced much in a relevant timeframe by human philosophers/researchers, so safety/alignment plans shouldn't be built upon it either.
1 · Cole Wyeth · 5h
My plan isn’t dependent on that meta-ethical assumption. It may be that there is a correct way to complete your values but not everyone is capable of it, but as long as some uploads can figure their value completion out, those uploads can prosper. Or if they can only figure out how to build an AGI that works out how to complete their values, they will have plenty of time to do that after this acute period of risk ends. And it seems that if no one can figure out their values, or eventually figure out how to build an AGI to complete their values, the situation would be rather intractable.

I don’t understand your thinking here. I’m suggesting a plan to prevent extinction from AGI. Why is it a breaking issue if some uploads don’t work out exactly what they “should” want? This is already true for many people. At worst it just requires that the initial few batches of uploads are carefully selected for philosophical competence (pre-upload) so that some potential misconception is not locked in. But I don’t see a reason that my plan runs a particular risk of locking in misconceptions.

Yes, generalization in deep learning is hard, but it’s getting rapidly more effective in practice and better understood through AIT and mostly(?) SLT.

I think this is tractable. Insofar as it’s not tractable, I think it can be made equally intractable for capabilities and alignment (possibly at some alignment tax). I have more detailed ideas about this, many of which are expressed in the post (and many of which are not). But I think that’s the high level reason for optimism.
Wei Dai · 4h · 20

Why is it a breaking issue if some uploads don’t work out exactly what they “should” want? This is already true for many people.

I'm scared of people doing actively terrible things with the resources of entire stars or galaxies at their disposal (a kind of s-risk), and concerned about wasting astronomical potential (if they do something not terrible but just highly suboptimal). See Morality is Scary and Two Neglected Problems in Human-AI Safety for some background on my thinking about this.

At worst it just requires that the initial few batches of uploa...
Four ways learning Econ makes people dumber re: future AI
104
Steven Byrnes
1mo
This is a linkpost for https://x.com/steve47285/status/1958527894965108829

(Cross-posted from X, intended for a general audience.)

There’s a funny thing where economics education paradoxically makes people DUMBER at thinking about future AI. Econ textbooks teach concepts & frames that are great for most things, but counterproductive for thinking about AGI. Here are 4 examples. Longpost:

THE FIRST PIECE of Econ anti-pedagogy is hiding in the words “labor” & “capital”. These words conflate a superficial difference (flesh-and-blood human vs not) with a bundle of unspoken assumptions and intuitions, which will all get broken by Artificial General Intelligence (AGI).

By “AGI” I mean here “a bundle of chips, algorithms, electricity, and/or teleoperated robots that can autonomously do the kinds of stuff that ambitious human adults can do—founding and running new companies, R&D, learning new skills, using arbitrary teleoperated robots after...

2 · David Stinson · 2d
Okay. Humans are capable of final consumption (i.e. with a reward function that does not involve making more money later). I'm interested to see how an AI would do that because it is the crux of a lot of downstream processes.
2 · Steven Byrnes · 1d
OK sure, here’s THOUGHT EXPERIMENT 1: suppose that these future AGIs desire movies, cars, smartphones, etc. just like humans do. Would you buy my claims in that case?

If so—well, not all humans want to enjoy movies and fine dining. Some have strong ambitious aspirations—to go to Mars, to cure cancer, whatever. If they have money, they spend it on trying to make their dream happen. If they need money or skills, they get them. For example, Jeff Bezos had a childhood dream of working on rocket ships. He founded Amazon to get money to do Blue Origin, which he is sinking $2B/year into. Would the economy collapse if all humans put their spending cash towards ambitious projects like rocket ships, instead of movies and fast cars? No, of course not! Right? So the fact that humans “demand” videogames rather than scramjet prototypes is incidental, not a pillar of the economy.

OK, back to AIs. I acknowledge that AIs are unlikely to want movies and fast cars. But AIs can certainly “want” to accomplish ambitious projects. If we’re putting aside misalignment and AI takeover, these ambitious projects would be ones that their human programmer installed, like making cures-for-cancer and quantum computers. Or if we’re not putting aside misalignment, then these ambitious projects might include building galaxy-scale paperclip factories or whatever.

So THOUGHT EXPERIMENT 2: these future AGIs don’t desire movies etc. like in Thought Experiment 1, but rather desire to accomplish certain ambitious projects like curing cancer, quantum computation, or galaxy-scale paperclip factories. My claims are:

* If you agree that humans historically bootstrapped themselves to a large population, then you should accept that Thought Experiment 1 enables exponentially-growing AGIs, since that’s basically isomorphic (except that AGI population growth can be much faster because it’s not bottlenecked by human pregnancy and maturation times).
* If you buy that Thought Experiment 1 enables exponential
2 · David Stinson · 1d
I'd say this is a partial misunderstanding, because the difference between final and intermediate consumption is about intention, rather than the type of goods. Or to be more concrete, this is where I get off the train. It depends entirely on whether these endeavors were originally thought to be profitable. If you were spending your own money, with no thought of financial returns, then it would be fine. If all the major companies on the stock market announced today that they were devoting all of their funds to rocket ships, on the other hand, the result would be easily called an economic collapse, as people (banks, bondholders, etc.) recalibrate their balance sheets to the updated profitability expectations.

If AI is directing that spending, rather than people, on the other hand, the distinction would not be between alignment and misalignment, but rather with something more akin to 'analignment,' where AIs could have spending preferences completely disconnected from those of their human owners. Otherwise, their financial results would simply translate to the financial conditions of their owners.

The reason why intention is relevant to models which might appear at first to be entirely mechanistic has to do with emergent properties. While on the one hand this is just an accounting question, you would also hope that in your model, GDP at time t bears some sort of relationship with GDP at time t+1 (or whatever alternative measure of economic activity you prefer).

Ultimately, any model of reality has to start at some level of analysis. This can be subjective, and I would potentially be open to a case that AI might be a more suitable level of analysis than the individual human, but if you are making that case then I would like to see the case for the independence of AI spending decisions. If that turns out to be a difficult argument to make, then it's a sign that it may be worth keeping conventional economics as the most efficient/convenient/productive modelling approach.
Steven Byrnes · 14h* · 20

For obvious reasons, we should care a great deal whether the exponentially-growing mass of AGIs-building-AGIs is ultimately trying to make cancer cures and other awesome consumer products (things that humans view as intrinsically valuable / ends in themselves), versus ultimately trying to make galaxy-scale paperclip factories (things that misaligned AIs view as intrinsically valuable / ends in themselves).

From my perspective, I care about this because the former world is obviously a better world for me to live in.

But it seems like you have some extra reaso...

Daniel Kokotajlo's Shortform
Daniel Kokotajlo
6y
Alexander Gietelink Oldenziel · 15h · 32

Drone countermeasures are an idle hope. The only real counter to drones is more drones. 

Lasers, shotguns, tank redesign [no holes!], nets, counter-drones, flak, etc. will all be part of the arsenal, surely, but thinking drone countermeasures are going to restore the previous generation's war doctrine is as silly as thinking that metallurgy innovations will reverse the gunpowder age.

The future (and already the present) of warfare is drones, drones, drones.

3 · Alexander Gietelink Oldenziel · 15h
It seems nobody outside Ukraine/Russia is actually at the leading edge of where military technology really is. That includes Hamas. Even using the drone doctrine from two years ago would be devastating to the Israelis. Probably they don't have the resources or organization to do so. [Even Ukraine itself is not really there - there are clearly many simple ways drones and drone manufacturing could be improved that they haven't had the time and resources to focus on yet.] Expect terror/resistance groups to start utilizing drones en masse in the next few years.
0 · jbash · 1d
The Interwebs seem to indicate that that's only if you give it a laser spot to aim at, not with just GPS. And that's a $70,000 shell, with the cheaper PKG sounding like it's closer to $15,000, and a plain old dumb shell being $3,000. Which seems crazy, but there you are. Anyway, guiding a ballistic shell while riding it down into the target seems like a pretty different problem from figuring out when to release a bomb.

... but I don't think a hand grenade is typically an anti-building munition. From the little I know about grenades, it seems like they'll have to fix the roof, but unless you're really lucky, the building's still going to be mostly usable, and, other than hearing loss, anybody inside is going to be OK unless they're in the room directly below where the grenade hits, and maybe even then.

If you're attacking buildings, I suspect you may need a bigger drone.
3 · Thomas Kwa · 1d
Good catch. Agree grenade-sized munitions won't damage buildings. I think the conversation is drifting between FPVs and other kinds of drones, and also between various settings, so I'll just state my beliefs.

* Fiber FPVs with 40km range and 2kg payload (either kamikaze or grenade/mine dropping), which can eventually be countered by a large number of short range guns if they fly at low altitude. It's not clear to me if the 40km range ones need to be larger.
* Heavy bomber drones can be equipped with fiber (or Starlink for US allies) and carry 15kg+ payload, enough to damage buildings and sensitive industrial equipment. They can do this while flying above the range of small guns and need dedicated antiaircraft guns.
* Fixed wing can carry even larger payloads with longer range and higher altitude, but are still pretty slow, except for the ones with jet engines.
* Drones equipped with GPS will know their position to within ~10 meters, like the GPS-only variant of Excalibur. It seems possible to constrain the payload's horizontal velocity to within 1 m/s on average, and the drop time from 1500m is 17 seconds, giving an error of 17 m. The overall error would be sqrt(10^2 + 17^2) = 20 m (this arithmetic is sketched in the code after this comment). If GPS is jammed, it's not obvious they can do the first part, but probably they can still use cameras or something.
* All of the above are extremely threatening for both organized warfare and terrorism against an opponent without effective cheap air defense.
* Even with the next evolution of air defense, including radar-acoustic fusion to find threats, the limited reliability of ~all types of existing air defense and the large number of drone configurations make me guess that drones will remain moderately threatening in some form.

Given that Hezbollah was previously firing unguided rockets with CEP in the hundreds of meters, some kind of drone that can target with CEP around 20 meters could be more cost effective for them if they cannot procure thousands of cheap guided missiles. If they could
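(A minimal sketch reproducing the drop-error arithmetic in the GPS bullet above, treating the comment's 10 m GPS error, 1 m/s horizontal velocity constraint, and 1500 m drop altitude as given:)

```python
# Reproduces the parent comment's drop-error estimate: GPS position uncertainty
# combined (in quadrature) with horizontal drift accumulated during free fall.
# The 10 m / 1 m/s / 1500 m figures are the comment's illustrative numbers.
import math

G = 9.81  # gravitational acceleration, m/s^2

def drop_error(gps_error_m: float, vel_error_ms: float, altitude_m: float) -> float:
    drop_time_s = math.sqrt(2 * altitude_m / G)   # ~17.5 s from 1500 m (air drag ignored)
    drift_m = vel_error_ms * drop_time_s          # ~17 m of horizontal drift
    return math.hypot(gps_error_m, drift_m)       # combined error

print(round(drop_error(10, 1, 1500)))  # ~20 m, matching the comment's estimate
```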
Stress Testing Deliberative Alignment for Anti-Scheming Training
26
Mikita Balesni, Marius Hobbhahn, Bronson Schoen, Jérémy Scheurer, Axel Højmark, Felix Hofstätter, Teun van der Weij, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, AlexMeinke, jenny
11h
This is a linkpost for antischeming.ai

Twitter | Microsite | Apollo Blog | OpenAI Blog | Full paper

Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context, or for simple preferences they acquire from training — something we previously found in Frontier Models Are Capable of In-Context Scheming.

In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We manage to significantly reduce (by ~30x; OpenAI o3: 13.0%→0.4%; OpenAI o4-mini: 8.7%→0.3%) the rate of covert actions across our diverse suite by only training against a single type...

Fifty Years Requests for Startups
7
Gustavs Zilgalvis
2d
This is a linkpost for https://www.fiftyyears.com/rfs

Preamble:

5050 is a program that helps great scientists, researchers, and engineers become great founders. It’s helped launch 78 companies, and now we’re turning our attention to one of the most important challenges of our time: building a safe and aligned AI future.

We created the 5050 AI track to support founders building for this new world. Below are five ideas we’d love to back right now. We believe these are great opportunities to build safe AI, but they’re not the only ones. If you’re building in this space, we want to hear from you.

What's inside:

Mission Control for Agents

Scalable Oversight for Multi-Agent Systems

An independent startup focused on scalable oversight could build the infrastructure and tooling needed to make multi-agent systems production-ready. Beyond selling tools, this company would invest in

...
Lessons from Studying Two-Hop Latent Reasoning
28
Mikita Balesni, Tomek Korbak, Owain_Evans
6d
This is a linkpost for https://arxiv.org/abs/2411.16353

Twitter | ArXiv

Many of the risks posed by highly capable LLM agents — from susceptibility to hijacking to reward hacking and deceptive alignment — stem from their opacity. If we could reliably monitor the reasoning processes underlying AI decisions, many of those risks would become far more tractable. Compared to other approaches in AI, LLMs offer a unique advantage: they can “think out loud” using chain-of-thought (CoT), enabling oversight of their decision-making processes. Yet the reliability of such monitoring hinges on an empirical question: do models need to externalize their reasoning in human language, or can they achieve the same performance through opaque internal computation?

In our new paper, we investigate LLM latent reasoning capabilities using two-hop question answering as a case study. We fine-tune LLMs (including Llama...
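(For readers unfamiliar with the setup, a two-hop question chains two one-hop facts through a bridge entity. A minimal illustrative sketch, using invented facts rather than the paper's actual training data:)

```python
# Illustrative only: a two-hop question is answered by composing two one-hop
# facts through a "bridge" entity (e1 -> e2 -> e3). The latent-reasoning question
# is whether a model can do this composition internally, without writing out e2.
first_hop = {"the performer of 'Imagine'": "John Lennon"}   # e1 -> e2 (bridge entity)
second_hop = {"John Lennon": "Yoko Ono"}                     # e2 -> e3 (final answer)

def answer_two_hop(entity_description: str) -> str:
    bridge = first_hop[entity_description]   # hop 1: resolve the description to the bridge
    return second_hop[bridge]                # hop 2: look up the fact about the bridge

# "Who is the spouse of the performer of 'Imagine'?"
print(answer_two_hop("the performer of 'Imagine'"))  # -> Yoko Ono
```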

Tomek Korbak · 3d · 10

Thanks! I somehow missed this paper, looks interesting!

Overall, I agree with the sentiment that two-hop latent reasoning is unreliable (e.g. our average accuracy is around 20%). We didn't intend to leave readers with an impression that it "just works". It seems very plausible to me that it's less parameter-efficient and that there are some additional, unidentified factors required for successful two-hop reasoning.

Did you try any experiments with a synthetic second hop instead of a synthetic first hop?

We did not, but Jiahai Feng had an experiment like this in his paper.

3 · Tomek Korbak · 3d
I think that's fair; it might be that we've over-updated a bit after getting results we did not expect (we did expect a reversal-curse-like phenomenon). Two big reasons why I'm hesitant to draw conclusions about monitorability in agentic settings are that our setup is simplistic (QA, non-frontier models) and that we don't offer a clean explanation of why we see the results we see.
4 · Fabien Roger · 3d
Do you have an explanation for why there is a bridge-entity representation mismatch for synthetic facts but not for real ones? What in "real" training allows LLMs to learn a common representation of input and output entities? Can you emulate that with additional fine-tuning on more synthetic documents?
2 · Tomek Korbak · 3d
We don't have a good explanation. One idea could be that you need bridge entities to be somehow more internalized to support latent two-hop reasoning, e.g. they need to occur in many facts as first and as second entities or maybe they need to occur in other two-hop questions. The Grokked transformers paper has some results linking the ratio of e2 and e3 to two-hop performance (in toy grokking settings).
26 · Stress Testing Deliberative Alignment for Anti-Scheming Training · Mikita Balesni, Marius Hobbhahn, Bronson Schoen, Jérémy Scheurer, Axel Højmark, Felix Hofstätter, Teun van der Weij, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, AlexMeinke, jenny · 11h · 0
11 · A recurrent CNN finds maze paths by filling dead-ends · Adrià Garriga-alonso · 2d · 0
7 · Fifty Years Requests for Startups · Gustavs Zilgalvis · 2d · 0
16 · LLM AGI may reason about its goals and discover misalignments by default · Seth Herd · 3d · 0
22 · Alignment as uploading with more steps · Cole Wyeth · 4d · 5
28 · Lessons from Studying Two-Hop Latent Reasoning · Mikita Balesni, Tomek Korbak, Owain_Evans · 6d · 8
24 · AIs will greatly change engineering in AI companies well before AGI · ryan_greenblatt · 8d · 2
15 · Large Language Models and the Critical Brain Hypothesis · David Africa · 9d · 0
26 · Decision Theory Guarding is Sufficient for Scheming · james.lucassen · 9d · 1
7 · MATS 8.0 Research Projects · Jonathan Michala, DanielFilan, Ryan Kidd · 9d · 0