All of David Scott Krueger (formerly: capybaralet)'s Comments + Replies

First, RE the role of "solving alignment" in this discussion, I just want to note that:
1) I disagree that alignment solves gradual disempowerment problems.
2) Even if it would, that does not imply that gradual disempowerment problems aren't important (since we can't assume alignment will be solved).
3) I'm not sure what you mean by "alignment is solved"; I'm taking it to mean "AI systems can be trivially intent aligned".  Such a system may still say things like "Well, I can build you a successor that I think has only a 90% chance of being aligned, but wi... (read more)

4Ryan Greenblatt
I think something like this is a reasonable model, but I have a few things I'd change. Why can't both groups survive? Why is it winner-takes-all? Can we just talk about the relative change in power over the week? (As in, how much does the power of B reduce relative to A, and is this going to be an ongoing trend or is it a one-time reduction?)

Probably I'd prefer talking about 2 groups at the start of the singularity. As in, suppose there are two AI companies "A" and "B", where "A" just wants AI systems descended from them to have power and "B" wants to maximize the expected resources under control of humans in B. We'll suppose that the government and other actors do nothing, for simplicity. If they start in the same spot, does "B" end up with substantially less expected power? To make this more realistic (as might be important), we'll say that "B" has a random lead/disadvantage uniformly distributed between (e.g.) -3 and 3 months, so that winner-takes-all dynamics aren't a crux. What about if the humans in group B ask their AI to make them (the humans) as powerful in expectation?

----------------------------------------

Supposing you're fine with these changes, then my claim would be:

* If alignment is solved, then the AI representing B can power-seek in exactly the same way as the AI representing A does, while still deferring to the humans on long-run resource usage and still devoting a tiny fraction of resources toward physically keeping the humans alive (which is very cheap, at least once AIs are very powerful). Thus, the cost for B is negligible and B barely loses any power relative to its initial position. If it is winner-takes-all, B has almost a 50% chance of winning.
* If alignment isn't solved, the strategy for B will involve spending a subset of resources on trying to solve alignment. I think alignment is reasonably likely to be practically feasible such that by spending a month of delay to work specifically on safety/alignment (over what A does for comm

I think we disagree about:
1) The level of "functionality" of the current world/institutions.
2) How strong and decisive competitive pressures are and will be in determining outcomes.

I view the world today as highly dysfunctional in many ways: corruption, coordination failures, preference falsification, coercion, inequality, etc. are rampant.  This state of affairs causes many bad outcomes, and many aspects of it are self-reinforcing.  I don't expect AI to fix these problems; I expect it to exacerbate them.

I do believe it has the potential to fix the... (read more)

I view the world today as highly dysfunctional in many ways: corruption, coordination failures, preference falsification, coercion, inequality, etc. are rampant. This state of affairs causes many bad outcomes, and many aspects of it are self-reinforcing. I don't expect AI to fix these problems; I expect it to exacerbate them.

Sure, but these things don't result in non-human entities obtaining power right? Like usually these are somewhat negative sum, but mostly just involve inefficient transfer of power. I don't see why these mechanisms would on net tran... (read more)

This thought experiment is described in ARCHES FYI.  https://acritch.com/papers/arches.pdf

I think it's a bit sad that this comment is being so well-received -- it's just some opinions without arguments from someone who hasn't read the paper in detail.  

Agreed, downvoted my comment. (You can't strong downvote your own comment, or I would have done that.)

I was mostly just trying to point to prior arguments against similar arguments while expressing my view.

  1. There are 2 senses in which I agree that we don't need full on "capital V value alignment":
    1. We can build things that aren't utility maximizers (e.g. consider the humble MNIST classifier)
    2. There are some utility functions that aren't quite right, but are still safe enough to optimize in practice (e.g. see "Value Alignment Verification", but see also, e.g. "Defining and Characterizing Reward Hacking" for negative results)
  2. But also:
    1. Some amount of alignment is probably necessary in order to build safe agenty things (the more agenty, the higher the bar for alignmen
... (read more)

OTMH, I think my concern here is less:

• "The AI's values don't generalize well outside of the text domain (e.g. to a humanoid robot)"

and more:

• "The AI's values must be much more aligned in order to be safe outside the text domain"

I.e. if we model an AI and a human as having fixed utility functions over the same accurate world model, then the same AI might be safe as a chatbot, but not as a robot.

This would be because the richer domain / interface of the robot creates many more opportunities to "exploit" whatever discrepancies exist between AI and human values in ways that actually lead to perverse instantiation.
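
To make the "more opportunities to exploit discrepancies" point concrete, here is a toy sketch (an illustrative construction of my own, nothing from the thread: the true and proxy utilities are just correlated random vectors):

```python
# Toy sketch: the AI optimizes a noisy proxy of the human's utility. With a
# small action set ("chatbot interface") the AI's argmax is usually close to
# the human's optimum; with a much larger action set ("robot interface") there
# are more chances to land where proxy and true utility disagree, so the
# human's regret from the AI's chosen action tends to grow.
import numpy as np

rng = np.random.default_rng(0)
n_total = 100_000
human_utility = rng.normal(size=n_total)                      # true utility per action
ai_utility = human_utility + 0.3 * rng.normal(size=n_total)   # AI's slightly-off proxy

for k in (10, 1_000, 100_000):            # "chatbot" -> "robot" action space size
    available = np.arange(k)
    ai_choice = available[np.argmax(ai_utility[available])]
    regret = human_utility[available].max() - human_utility[ai_choice]
    print(f"{k:7d} actions available -> human regret of AI's pick: {regret:.3f}")
```

The effect is just proxy-Goodhart: the larger the menu of actions, the more likely the AI's argmax lands somewhere the proxy and the true utility happen to diverge.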

1Noosphere89
Yeah, I think the crux is precisely this: I disagree with the statement below, mostly because I think instruction following/corrigibility is plausibly easy, and also removes most of the need for value alignment.

This comment made me reflect on what fragility of values means.

To me this point was always most salient when thinking about embodied agents, which may need to reliably recognize something like "people" in its environment (in order to instantiate human values like "try not to hurt people") even as the world changes radically with the introduction of various forms of transhumanism.

I guess it's not clear to me how much progress we make towards that with a system that can do a very good job with human values when restricted to the text domain.  Plausibly we just translate everything into text and are good to go?  It makes me wonder where we're at with adversarial robustness of vision-language models, e.g.

1Linch
I think I'm relatively optimistic that the difference between a system that "can (and will) do a very good job with human values when restricted to the text domain" vs a "system that can do a very good job, unrestricted" isn't that high. This is because I'm personally fairly skeptical about arguments along the lines of "words aren't human thinking, words are mere shadows of human thinking" that people put out, at least when it comes to human values.  (It's definitely possible to come up with examples that illustrate the differences between all of human thinking and human-thinking-put-into-words; I agree about their existence, I disagree about their importance).

OK, so it's not really just your results?  You are aggregating across these studies (and presumably ones of "Westerners" as well)?  I do wonder how directly comparable things are... Did you make an effort to translate a study or questions from studies, or are the questions just independently conceived and formulated? 

Not necessarily fooling it, just keeping it ignorant.  I think such schemes can plausibly scale to very high levels of capabilities, perhaps indefinitely, since intelligence doesn't give one the ability to create information from thin air...

2Daniel Kokotajlo
Are you describing something that would fit within my 'Another type of proposal...' category?

This is a super interesting and important problem, IMO.  I believe it already has significant real-world practical consequences, e.g. powerful people find it difficult to avoid being surrounded by sycophants: even if they really don't want to be, that's just an extra constraint for the sycophants to satisfy ("don't come across as sycophantic")!  I am inclined to agree that avoiding power differentials is the only way to really avoid these perverse outcomes in practice, and I think this is a good argument in favor of doing so.

------------------... (read more)

What makes you say this: "However, our results suggest that students are broadly less concerned about the risks of AI than people in the United States and Europe"? 

1Nick Corvino
From most of the survey data collected from Chinese citizens so far (e.g., this survey, The Center for Long-Term Artificial Intelligence, this Ipsos survey, and a few other polls), they're consistently reporting lower concern about AI risks than Westerners. Sometimes this could be due to framing, but our survey explicitly translated many of these Western surveys' questions so we could directly compare results. To be fair, we don't have nearly as much data from China.

This activation function was introduced in one of my papers from 10 years ago ;)

See Figure 2 of https://arxiv.org/abs/1402.3337

Really interesting point!  

I introduced this term in my slides that included "paperweight" as an example of an "AI system" that maximizes safety.  

I sort of still think it's an OK term, but I'm sure I will keep thinking about this going forward and hope we can arrive at an even better term.

I don't disagree... in this case you don't get agents for a long time; someone else does though.

I meant "other training schemes" to encompass things like scaffolding that deliberately engineers agents using LLMs as components, although I acknowledge they are not literally "training" and more like "engineering".

0Ediz Ucar
The thing that we care about is how long it takes to get to agents. If we put lots of effort into making powerful Oracle systems or other non-agentic systems, we must assume that agentic systems will follow shortly. Someone will make them, even if you do not.

I would look at the main FATE conferences as well, which I view as being: FAccT, AIES, EAAMO.

This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.

Are you claiming this would happen even given infinite capacity?
If so, can you perhaps provide a simple+intuitive+concrete example?
 

This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.

Are you claiming this would happen even given infinite capacity?

I think that janus isn't claiming this and I also think it isn't true. I think it's all about capacity constraints. The claim as I understand it is that there are some intermediate computations that are optimized both for predicting the next token and for predicting the 20th token and that therefore have to prioritize between these different predictions.
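
To illustrate what "shared intermediate computations under a capacity constraint" could look like mechanically, here is a toy sketch (my own construction; it makes the two horizons explicit rather than emergent from pure next-token training, and all names and sizes are illustrative):

```python
# Toy sketch of the capacity-constraint point (not anyone's actual training
# setup): a small shared trunk feeds two heads, one predicting the next token
# and one predicting a token ~20 steps ahead. Because both losses backpropagate
# through the same narrow trunk, features that help the long-horizon head can
# crowd out features that would have helped next-token accuracy.
import torch
import torch.nn as nn

VOCAB, EMBED, TRUNK = 100, 32, 8   # tiny trunk width = the capacity constraint
HORIZON = 20                       # how far beyond the next token the second head predicts

class TwoHorizonPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.trunk = nn.GRU(EMBED, TRUNK, batch_first=True)  # shared computation
        self.next_head = nn.Linear(TRUNK, VOCAB)  # predicts token t+1
        self.far_head = nn.Linear(TRUNK, VOCAB)   # predicts token t+1+HORIZON

    def forward(self, tokens):
        hidden, _ = self.trunk(self.embed(tokens))
        return self.next_head(hidden), self.far_head(hidden)

model = TwoHorizonPredictor()
loss_fn = nn.CrossEntropyLoss()
tokens = torch.randint(0, VOCAB, (4, 64))        # fake batch of token sequences

inputs = tokens[:, : -(HORIZON + 1)]             # positions that have both targets
logits_next, logits_far = model(inputs)
loss_next = loss_fn(logits_next.reshape(-1, VOCAB),
                    tokens[:, 1 : -HORIZON].reshape(-1))
loss_far = loss_fn(logits_far.reshape(-1, VOCAB),
                   tokens[:, HORIZON + 1 :].reshape(-1))

# With the far-horizon term included, gradient steps must trade off the two
# objectives inside the same 8-unit trunk; with only loss_next, the trunk
# would be optimized for next-token prediction alone.
(loss_next + loss_far).backward()
```

In a real transformer the trade-off would be implicit rather than spelled out like this: the representation computed at position t is reused (via attention) when predicting later tokens, so a fixed-capacity representation has to serve both the immediate prediction and the downstream ones.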

So let us specify a probability distribution over the space of all possible desires. If we accept the orthogonality thesis, we should not want this probability distribution to build in any bias towards certain kinds of desires over others. So let's spread our probabilities in such a way that we meet the following three conditions. Firstly, we don't expect Sia's desires to be better satisfied in any one world than they are in any other world. Formally, our expectation of the degree to which Sia's desires are satisfied at  is equal to our expectat

... (read more)

I'm not necessarily saying people are subconsciously trying to create a moat.  

I'm saying they are acting in a way that creates a moat, and that enables them to avoid competition, and that more competition would create more motivation for them to write things up for academic audiences (or even just write more clearly for non-academic audiences).

4Daniel Kokotajlo
It sure sounds like you are saying that though! Before you put in the EtA, it sure sounded like you were saying that people were subconsciously motivated to avoid academic publishing because it helped them build and preserve a moat. Now, after the EtA, it still sounds like that but is a bit more unclear since 'indirect' is a bit more ambiguous than 'subconscious.'

Q: "Why is that not enough?"
A: Because they are not being funded to produce the right kinds of outputs.

My point is not specific to machine learning. I'm not as familiar with other academic communities, but I think most of the time it would probably be worth engaging with them if there is somewhere where your work could fit.

Speaking for myself…

I think I do a lot of “engaging with neuroscientists” despite not publishing peer-reviewed neuroscience papers:

  • I write lots of blog posts intended to be read by neuroscientists, i.e. I will attempt to engage with background assumptions that neuroscientists are likely to have, not assume non-neuroscience background knowledge or jargon, etc.
    • [To be clear, I also write even more blog posts that are not in that category.]
  • When one of my blog posts specifically discusses some neuroscientist’s work, I’ll sometimes cold-email them and ask for pr
... (read more)

My point (see footnote) is that motivations are complex.  I do not believe "the real motivations" is a very useful concept here.  

The question becomes why "don't they judge those costs to be worth it"?  Is there motivated reasoning involved?  Almost certainly yes; there always is.

2Daniel Kokotajlo
Here are two hypotheses for why they don't judge those costs to be worth it, each one of which is much more plausible to me than the one you proposed: (1) The costs aren't in fact worth it & they've reacted appropriately to the evidence. (2) The costs are worth it, but thanks to motivated reasoning, they exaggerate the costs, because writing things up in academic style and then dealing with the publication process is boring and frustrating. Seriously, isn't (2) a much better hypothesis than the one you put forth about moats?
  1. A lot of work just isn't made publicly available
  2. When it is, it's often in the form of ~100 page google docs
  3. Academics have a number of good reasons to ignore things that don't meet academic standards of rigor and presentation
2David Scott Krueger
In my experience people also often know their blog posts aren't very good.
1Andrea Miotti
Which one? All of them seem to be working for me.

Yeah this was super unclear to me; I think it's worth updating the OP.

FYI: my understanding is that "data poisoning" refers to deliberately corrupting the training data of somebody else's model, which I understand is not what you are describing.

Oh I see.  I was getting at the "it's not aligned" bit.

Basically, it seems like if I become a cyborg without understanding what I'm doing, the result is either:

  • I'm in control
  • The machine part is in control
  • Something in the middle

Only the first one seems likely to be sufficiently aligned. 

I don't understand the fuss about this; I suspect these phenomena are due to uninteresting, and perhaps even well-understood, effects.  A colleague of mine had this to say:

... (read more)

Indeed.  I think having a clean, well-understood interface for human/AI interaction seems useful here.  I recognize this is a big ask given the current norms and rules around AI development and deployment.

I don't understand what you're getting at RE "personal level".

0rpglover64
Like, I may not want to become a cyborg if I stop being me, but that's a separate concern from whether it's bad for alignment (if the resulting cyborg is still aligned).

I think the most fundamental objection to becoming cyborgs is that we don't know how to say whether a person retains control over the cyborg they become a part of.

I agree that this is important. Are you more concerned about cyborgs than other human-in-the-loop systems? To me the whole point is figuring out how to make systems where the human remains fully in control (unlike, e.g. delegating to agents), and so answering this "how to say whether a person retains control" question seems critical to doing that successfully.

3rpglover64
I think that's an important objection, but I see it applying almost entirely on a personal level. On the strategic level, I actually buy that this kind of augmentation (i.e. with in some sense passive AI) is not an alignment risk (any more than any technology is). My worry is the "dual use technology" section.

FWIW, I didn't mean to kick off a historical debate, which seems like probably not a very valuable use of y'all's time.

Unfortunately, I think even "catastrophic risk" has a high potential to be watered down and applied to situations where dozens, as opposed to millions/billions, die.  Even existential risk has this potential, actually, but I think it's a safer bet.

I don't think we should try and come up with a special term for (1).
The best term might be "AI engineering".  The only thing it needs to be distinguished from is "AI science".

I think ML people overwhelmingly identify as doing one of those 2 things, and find it annoying and ridiculous when people in this community act like we are the only ones who care about building systems that work as intended.

I say it is a rebrand of the "AI (x-)safety" community.
When AI alignment came along we were calling it AI safety, even though what everyone in the community really meant all along was basically AI existential safety.  "AI safety" was (IMO) a somewhat successful bid for more mainstream acceptance, which then led to dilution and confusion, necessitating a new term.

I don't think the history is that important; what's important is having good terminology going forward.
This is also why I stress that I work on AI existential safety.

So I think people shou... (read more)

Hmm... this is a good point.

I think structural risk is often a better description of reality, but I can see a rhetorical argument against framing things that way.  One problem I see with doing that is that I think it leads people to think the solution is just for AI developers to be more careful, rather than observing that there will be structural incentives (etc.) pushing for less caution.

I do think that there’s a pretty solid dichotomy between (A) “the AGI does things specifically intended by its designers” and (B) “the AGI does things that the designers never wanted it to do”.

1) I don't think this dichotomy is as solid as it seems once you start poking at it... e.g. in your war example, it would be odd to say that the designers of the AGI systems that wiped out humans intended for that outcome to occur.  Intentions are perhaps best thought of as incomplete specifications.  

2) From our current position, I think “never ever create... (read more)

2Steve Byrnes
1) Oh, sorry, what I meant was: the generals in Country A want their AGI to help them “win the war”, even if it involves killing people in Country B + innocent bystanders. And vice-versa for Country B. And then, between the efforts of both AGIs, the humans are all dead. But nothing here was either an “AGI accident unintended-by-the-designers behavior” or “AGI misuse” by my definitions.

But anyway, yes I can imagine situations where it’s unclear whether “the AGI does things specifically intended by its designers”. That’s why I said “pretty solid” and not “rock solid” :) I think we probably disagree about whether these situations are the main thing we should be talking about, versus edge-cases we can put aside most of the time. From my perspective, they’re edge-cases. For example, the scenarios where a power-seeking AGI kills everyone are clearly on the “unintended” side of the (imperfect) dichotomy. But I guess it’s fine that other people are focused on different problems from me, and that “intent-alignment is poorly defined” may be a more central consideration for them. ¯\_(ツ)_/¯

3) I like your “note on terminology post”. But I also think of myself as subscribing to “the conventional framing of AI alignment”. I’m kinda confused that you see the former as counter to the latter.

If you’re working on that, then I wish you luck! It does seem maybe feasible to buy some time. It doesn’t seem feasible to put off AGI forever. (But I’m not an expert.) It seems you agree.

* Obviously the manual will not be written by one person, and obviously some parts of the manual will not be written until the endgame, where we know more about AGI than we do today. But we can still try to make as much progress on the manual as we can, right?
* The post you linked says “alignment is not enough”, which I see as obviously true, but that post doesn’t say “alignment is not necessary”. So, we still need that manual, right?
* Delaying AGI forever would obviate the need for a manual, but i

While defining accident as “incident that was not specifically intended & desired by the people who pressed ‘run’ on the AGI code” is extremely broad, it still supposes that there is such a thing as "the AGI code", which significantly restricts the space of possible risks.

There are other reasons I would not be happy with that browser extension.  There is not one specific conversation I can point to; it comes up regularly.  I think this replacement would probably lead to a lot of confusion, since I think when people use the word "accide... (read more)

4Steve Byrnes
Thanks for your reply!

It continues to feel very bizarre to me to interpret the word “accident” as strongly implying “nobody was being negligent, nobody is to blame, nobody could have possibly seen it coming, etc.”. But I don’t want to deny your lived experience. I guess you interpret the word “accident” as having those connotations, and I figure that if you do, there are probably other people who do too. Maybe it’s a regional dialect thing, or different fields use the term in different ways, who knows. So anyway, going forward, I will endeavor to keep that possibility in mind and maybe put in some extra words of clarification where possible to head off misunderstandings. :)

I agree with this point. I do think that there’s a pretty solid dichotomy between (A) “the AGI does things specifically intended by its designers” and (B) “the AGI does things that the designers never wanted it to do”. I want to use the word “accident” universally for all bad outcomes downstream of (B), regardless of how grossly negligent and reckless people were etc., whereas you don’t want to use the word “accident”. OK, let’s put that aside.

I think (?) that we both agree that bad outcomes downstream of (A) are not necessarily related to “misuse” / “bad actors”. E.g., if there’s a war with AGIs on both sides, and humans are wiped out in the crossfire, I don’t necessarily want to say that either side necessarily was “bad actors”, or that either side’s behavior constitutes “misuse”. So yeah, I agree that “accident vs misuse” is not a good dichotomy for AGI x-risk.

Thanks, that’s interesting. I didn’t intend my chart to imply that “everyone follows the manual” doesn’t also require avoiding coordination problems and avoiding bad decisions etc. Obviously it does; or at least, that was obvious to me. Anyway, your feedback is noted. :)

I agree that “never ever create AGI” is an option in principle. (It doesn’t strike me as a feasible option in practice; does it to you? I know this is off-top

I think the construction gives us $C(\pi) \leq C(U) + e$ for a small constant $e$ (representing the wrapper).  It seems like any compression you can apply to the reward function can be translated to the policy via the wrapper.  So then you would never have $C(\pi) \gg C(U)$.  What am I missing/misunderstanding?

2Vanessa Kosoy
For the contrived reward function you suggested, we would never have C(π)≫C(U). But for other reward functions, it is possible that C(π)≫C(U). Which is exactly why this framework rejects the contrived reward function in favor of those other reward functions. And also why this framework considers some policies unintelligent (despite the availability of the contrived reward function) and other policies intelligent.

By "intend" do you mean that they sought that outcome / selected for it?  
Or merely that it was a known or predictable outcome of their behavior?

I think "unintentional" would already probably be a better term in most cases. 

Apologies, I didn't take the time to understand all of this yet, but I have a basic question you might have an answer to...

We know how to map (deterministic) policies to reward functions using the construction at the bottom of page 6 of the reward modelling agenda (https://arxiv.org/abs/1811.07871v1): the agent is rewarded only if it has so far done exactly what the policy would do.  I think of this as a wrapper function (https://en.wikipedia.org/wiki/Wrapper_function).
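
For concreteness, a minimal sketch of that wrapper construction as I understand it (illustrative code, not the paper's exact formalism):

```python
# Minimal sketch of a "wrapper" reward: reward 1 while every action taken so
# far matches what the reference policy would have chosen, and 0 after any
# deviation. Any deterministic policy is therefore trivially optimal for its
# own wrapper reward.
def make_wrapper_reward(policy):
    """`policy` maps a list of observations to an action; the returned reward
    function scores a history of (observation, action) pairs."""
    def reward(history):
        for t, (_, action) in enumerate(history):
            observations_so_far = [obs for obs, _ in history[: t + 1]]
            if action != policy(observations_so_far):
                return 0.0   # deviated from the reference policy at step t
        return 1.0           # matched the policy at every step so far
    return reward

# Usage: a policy that always picks "left" maximizes its wrapper reward.
always_left = lambda observations: "left"
R = make_wrapper_reward(always_left)
print(R([("s0", "left"), ("s1", "left")]))    # 1.0
print(R([("s0", "left"), ("s1", "right")]))   # 0.0
```

Under this reward the given policy is trivially optimal, and the reward function is only a small constant more complex than a description of the policy itself.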

It seems like this means that, for any policy, we can represent it as optimizing re... (read more)

3Vanessa Kosoy
My framework discards such contrived reward functions because it penalizes for the complexity of the reward function. In the construction you describe, we have C(U)≈C(π). This corresponds to g≈0 (no/low intelligence). On the other hand, policies with g≫0 (high intelligence) have the property that C(π)≫C(U) for the U which "justifies" this g. In other words, your "minimal" overhead is very large from my point of view: to be acceptable, the "overhead" should be substantially negative.

"Concrete Problems in AI Safety" used this distinction to make this point, and I think it was likely a useful simplification in that context.  I generally think spelling it out is better, and I think people will pattern match your concerns onto the “the sci-fi scenario where AI spontaneously becomes conscious, goes rogue, and pursues its own goal” or "boring old robustness problems" if you don't invoke structural risk.  I think structural risk plays a crucial role in the arguments, and even if you think things that look more like pure accidents a... (read more)

I agree somewhat, however, I think we need to be careful to distinguish "do unsavory things" from "cause human extinction", and should generally be squarely focused on the latter.  The former easily becomes too political, making coordination harder.

Yes it may be useful in some very limited contexts.  I can't recall a time I have seen it in writing and felt like it was not a counter-productive framing.

AI is highly non-analogous with guns.

I really don't think the distinction is meaningful or useful in almost any situation.  I think if people want to make something like this distinction they should just be more clear about exactly what they are talking about.

How about the distinction between (A) “An AGI kills every human, and the people who turned on the AGI didn’t want that to happen” versus (B) “An AGI kills every human, and the people who turned on the AGI did want that to happen”?

I’m guessing that you’re going to say “That’s not a useful distinction because (B) is stupid. Obviously nobody is talking about (B)”. In which case, my response is “The things that are obvious to you and me are not necessarily obvious to people who are new to thinking carefully about AGI x-risk.”

…And in particular, normal people s... (read more)
