David Scott Krueger

I'm more active on Twitter than LW/AF these days: https://twitter.com/DavidSKrueger

Bio from https://www.davidscottkrueger.com/:
I am an Assistant Professor at the University of Cambridge and a member of Cambridge's Computational and Biological Learning lab (CBL). My research group focuses on Deep Learning, AI Alignment, and AI safety. I’m broadly interested in work (including in areas outside of Machine Learning, e.g. AI governance) that could reduce the risk of human extinction (“x-risk”) resulting from out-of-control AI systems. Particular interests include:

Reward modeling and reward gaming
Aligning foundation models
Understanding learning and generalization in deep learning and foundation models, especially via “empirical theory” approaches
Preventing the development and deployment of socially harmful AI systems
Elaborating and evaluating speculative concerns about more advanced future AI systems

Posts

Sorted by New

3capybaralet's Shortform

60Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

2mo

17A Sober Look at Steering Vectors for LLMs

4mo

7Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?

7mo

31[Link Post] "Foundational Challenges in Assuring Alignment and Safety of Large Language Models"

10mo

37"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

12What organizations other than Conjecture have (esp. public) info-hazard policies?

21A (EtA: quick) note on terminology: AI Alignment != AI x-safety

8Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk")

16Quick thoughts on "scalable oversight" / "super-human feedback" research

14Mechanistic Interpretability as Reverse Engineering (follow-up to "cars and elephants")

Wikitag Contributions

Consequentialism

10y

(+50/-38)

Comments

Sorted by

Newest

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

David Scott Krueger2mo10

First, RE the role of "solving alignment" in this discussion, I just want to note that:
1) I disagree that alignment solves gradual disempowerment problems.
2) Even if it would that does not imply that gradual disempowerment problems aren't important (since we can't assume alignment will be solved).
3) I'm not sure what you mean by "alignment is solved"; I'm taking it to mean "AI systems can be trivially intent aligned". Such a system may still say things like "Well, I can build you a successor that I think has only a 90% chance of being aligned, but will make you win (e.g. survive) if it is aligned. Is that what you want?" and people can respond with "yes" -- this is the sort of thing that probably still happens IMO.
4) Alternatively, you might say we're in the "alignment basin" -- I'm not sure what that means, precisely, but I would operationalize it as something like "the AI system is playing a roughly optimal CIRL game". It's unclear how good of performance that can yield in practice (e.g. it can't actually be optimal due to compute limitations), but I suspect it still leaves significant room for fuck-ups.
5) I'm more interested in the case where alignment is not "perfectly" "solved", and so there are simply clear and obvious opportunities to trade-off safety and performance; I think this is much more realistic to consider.
6) I expect such trade-off opportunities to persist when it comes to assurance (even if alignment is solved), since I expect high-quality assurance to be extremely costly. And it is irresponsible (because it's subjectively risky) to trust a perfectly aligned AI system absent strong assurances. But of course, people who are willing to YOLO it and just say "seems aligned, let's ship" will win. This is also part of the problem...

My main response, at a high level:
Consider a simple model:

We have 2 human/AI teams in competition with each other, A and B.
A and B both start out with the humans in charge, and then decide whether the humans should stay in charge for the next week.
Whichever group has more power at the end of the week survives.
The humans in A ask their AI to make A as powerful as possible at the end of the week.
The humans in B ask their AI to make B as powerful as possible at the end of the week, subject to the constraint that the humans in B are sure to stay in charge.

I predict that group A survives, but the humans are no longer in power. I think this illustrates the basic dynamic. EtA: Do you understand what I'm getting at? Can you explain what you think it wrong with thinking of it this way?

Responding to some particular points below:

Sure, but these things don't result in non-human entities obtaining power right?

Yes, they do; they result in beaurocracies and automated decision-making systems obtaining power. People were already having to implement and interact with stupid automated decision-making systems before AI came along.

Like usually these are somewhat negative sum, but mostly just involve inefficient transfer of power. I don't see why these mechanisms would on net transfer power from human control of resources to some other control of resources in the long run. To consider the most extreme case, why would these mechanisms result in humans or human appointed successors not having control of what compute is doing in the long run?

My main claim was not that these are mechanisms of human disempowerment (although I think they are), but rather that they are indicators of the overall low level of functionality of the world.

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

David Scott Krueger2mo52

I think we disagree about:
1) The level of "functionality" of the current world/institutions.
2) How strong and decisive competitive pressures are and will be in determining outcomes.

I view the world today as highly dysfunctional in many ways: corruption, coordination failures, preference falsification, coercion, inequality, etc. are rampant. This state of affairs both causes many bad outcomes and many aspects are self-reinforcing. I don't expect AI to fix these problems; I expect it to exacerbate them.

I do believe it has the potential to fix them, however, I think the use of AI for such pro-social ends is not going to be sufficiently incentivized, especially on short time-scales (e.g. a few years), and we will instead see a race-to-the-bottom that encourages highly reckless, negligent, short-sighted, selfish decisions around AI development, deployment, and use. The current AI arms race is a great example -- Companies and nations all view it as more important that they be the ones to develop ASI than to do it carefully or put effort into cooperation/coordination.

Given these views:
1) Asking AI for advice instead of letting it take decisions directly seems unrealistically uncompetitive. When we can plausibly simulate human meetings in seconds it will be organizational suicide to take hours-to-weeks to let the humans make an informed and thoughtful decision.
2) The idea that decision-makers who "think a goverance structure will yield total human disempowerment" will "do something else" also seems quite implausible. Such decision-makers will likely struggle to retain power. Decision-makers who prioritize their own "power" (and feel empowered even as they hand off increasing decision-making to AI) and their immediate political survival above all else will be empowered.

Another features of the future which seems likely and can already be witnessed beginning is the gradual emergence and ascendance of pro-AI-takeover and pro-arms-race ideologies, which endorse the more competitive moves of rapidly handing off power to AI systems in insufficiently cooperative ways.

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

David Scott Krueger2mo10

This thought experiment is described in ARCHES FYI. https://acritch.com/papers/arches.pdf

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

David Scott Krueger2mo1315

I think it's a bit sad that this comment is being so well-received -- it's just some opinions without arguments from someone who hasn't read the paper in detail.

Evaluating the historical value misspecification argument

David Scott Krueger3mo20

There are 2 senses in which I agree that we don't need full on "capital V value alignment":
1. We can build things that aren't utility maximizers (e.g. consider the humble MNIST classifier)
2. There are some utility functions that aren't quite right, but are still safe enough to optimize in practice (e.g. see "Value Alignment Verification", but see also, e.g. "Defining and Characterizing Reward Hacking" for negative results)
But also:
1. Some amount of alignment is probably necessary in order to build safe agenty things (the more agenty, the higher the bar for alignment, since you start to increasingly encounter perverse instatiation-type concerns -- CAVEAT: agency is not a unidimensional quantity, cf: "Harms from Increasingly Agentic Algorithmic Systems").
2. Note that my statement was about the relative requirements for alignment in text domains vs. real-world. I don't really see how your arguments are relevant to this question.

Concretely, in domains with vision, we should probably be significantly more worried that an AI system learns something more like an adversarial "hack" on it's values leading to behavior that significantly diverges from things humans would endorse.

Evaluating the historical value misspecification argument

David Scott Krueger3mo10

OTMH, I think my concern here is less:

"The AI's values don't generalize well outside of the text domain (e.g. to a humanoid robot)"

and more:
"The AI's values must be much more aligned in order to be safe outside the text domain"

I.e. if we model an AI and a human as having fixed utility functions over the same accurate world model, then the same AI might be safe as a chatbot, but not as a robot.

This would be because the richer domain / interface of the robot creates many more opportunities to "exploit" whatever discrepancies exist between AI and human values in ways that actually lead to perverse instantiation.

Evaluating the historical value misspecification argument

David Scott Krueger3mo10

This comment made me reflect on what fragility of values means.

To me this point was always most salient when thinking about embodied agents, which may need to reliably recognize something like "people" in its environment (in order to instantiate human values like "try not to hurt people") even as the world changes radically with the introduction of various forms of transhumanism.

I guess it's not clear to me how much progress we make towards that with a system that can do a very good job with human values when restricted to the text domain. Plausibly we just translate everything into text and are good to go? It makes me wonder where we're at with adversarial robustness of vision-language models, e.g.

Survey: How Do Elite Chinese Students Feel About the Risks of AI?

David Scott Krueger6mo30

OK, so it's not really just your results? You are aggregating across these studies (and presumably ones of "Westerners" as well)? I do wonder how directly comparable things are... Did you make an effort to translate a study or questions from studies, or are the questions just independently conceived and formulated?

Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?

David Scott Krueger6mo30

No, I was only responding to the the first part.

Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?

David Scott Krueger7mo11

Not necessarily fooling it, just keeping it ignorant. I think such schemes can plausibly scale to very high levels of capabilities, perhaps indefinitely, since intelligence doesn't give one the ability to create information from thin air...