All of Noosphere89's Comments + Replies

My second concern is that “AIs solving specific technical problems that the human wants them to solve” is insufficient to avoid extinction and get to a good future—even if the AI is solving those problems at superhuman level and with the help of (1-3) superpowers.[3]

I won’t go into much detail about why I have this concern, since I want to keep this post focused on technical alignment. But I’ll spend just a few paragraphs to give a hint. For more, see my What does it take to defend the world against out-of-control AGIs? and John Wentworth’s Th

... (read more)
2Steve Byrnes
See also my comment here, about the alleged “Law of Conservation of Wisdom”. Your idea of “using instruction following AIs to implement a campaign of persuasion” relies (I claim) on the assumption that the people using the instruction-following AIs to persuade others are especially wise and foresighted people, and are thus using their AI powers to spread those habits of wisdom and foresight. It’s fine to talk about that scenario, and I hope it comes to pass! But in addition to the question of what those wise people should do, if they exist, we should also be concerned about the possibility that the people with instruction-following AIs will not be spreading wisdom and foresight in the first place. [Above paragraphs are assuming for the sake of argument that we can solve the technical alignment problem to get powerful instruction-following AI.] Yeah I have not forgotten about your related comment from 4 months ago, I’ve been working on replying to it, and now it’s looking like it will be a whole post, hopefully forthcoming! :)

Yeah, I think the crux is precisely this; I disagree with the statement below, mostly because I think instruction following/corrigibility is both plausibly easy and also removes most of the need for value alignment.

"The AI's values must be much more aligned in order to be safe outside the text domain"

2David Scott Krueger
1. There are 2 senses in which I agree that we don't need full-on "capital V value alignment":
   1. We can build things that aren't utility maximizers (e.g. consider the humble MNIST classifier).
   2. There are some utility functions that aren't quite right, but are still safe enough to optimize in practice (e.g. see "Value Alignment Verification", but see also, e.g., "Defining and Characterizing Reward Hacking" for negative results).
2. But also:
   1. Some amount of alignment is probably necessary in order to build safe agenty things (the more agenty, the higher the bar for alignment, since you start to increasingly encounter perverse-instantiation-type concerns -- CAVEAT: agency is not a unidimensional quantity, cf. "Harms from Increasingly Agentic Algorithmic Systems").
   2. Note that my statement was about the relative requirements for alignment in text domains vs. the real world. I don't really see how your arguments are relevant to this question. Concretely, in domains with vision, we should probably be significantly more worried that an AI system learns something more like an adversarial "hack" on its values, leading to behavior that significantly diverges from things humans would endorse.

I think where I get off the train personally probably comes down to the instrumental goals leading to misaligned goals section, combined with me being more skeptical of instrumental goals leading to unbounded power-seeking.

I agree there are definitely zero-sum parts of the science loop, but my worldview is that the parts where the goals are zero sum/competitive receive less weight than the alignment attempts.

I'd say the biggest area where I'm skeptical so far is that I think there's a real difference between the useful idea that power is useful for the sc... (read more)

2Marius Hobbhahn
Thanks. I agree that this is a weak part of the post. After writing it, I think I also updated a bit against very clean unbounded power-seeking. But I have more weight on "chaotic catastrophes", e.g. something like:
1. Things move really fast.
2. We don't really understand how goals work and how models form them.
3. The science loop makes models change their goals meaningfully in all sorts of ways.
4. "What failure looks like"-type loss of control.

Oh, now I understand.

And AIs have already been superhuman at chess for a very long time, yet that domain gives very little incentive for very strong instrumental convergence.

I am claiming that for practical AIs, training them in the real world with goals will give them some instrumental convergence, but without further incentives it will not give them so much instrumental convergence that it leads to power-seeking that disempowers humans by default.

2Abram Demski
Chess is like a bounded, mathematically described universe where all the instrumental convergence stays contained, and only accomplishes a very limited instrumentality in our universe (IE chess programs gain a limited sort of power here by being good playmates). LLMs touch on the real world far more than that, such that MCTS-like skill at navigating "the LLM world" in contrast to chess sounds to me like it may create a concerning level of real-world-relevant instrumental convergence.

To answer the question:

So, as a speculative example, further along in the direction of o1 you could have something like MCTS help train these things to solve very difficult math problems, with the sparse feedback being given for complete formal proofs.

Similarly, playing text-based video games, with the sparse feedback given for winning.

Similarly, training CoT to reason about code, with sparse feedback given for predictions of the code output.

Etc.

You think these sorts of things just won't work well enough to be relevant?

Assuming the goals ar... (read more)

2Abram Demski
Ah, I wasn't thinking "sparse" here meant anywhere near that sparse. I thought your dense-vs-sparse was doing something like contrasting RLHF (very dense, basically no instrumental convergence) with chess (very sparse, plenty of instrumental convergence). I still think o1 is moving towards chess on this spectrum.

I don't get why you think this is true? EG, it seems like almost no insights about how to train faithful CoT would transfer to systems speaking pure neuralese. It seems to me that what little safety/alignment we have for LLMs is largely a byproduct of how language-focused they are (which gives us a weak sort of interpretability, a very useful safety resource which we are at risk of losing soon).

I think the crux is that the important part of LLMs re safety isn't their safety properties specifically, but rather the evidence they give to what a... (read more)

5Abram Demski
[insert standard skepticism about these sorts of generalizations when generalizing to superintelligence] But what lesson do you think you can generalize, and why do you think you can generalize that? So, as a speculative example, further along in the direction of o1 you could have something like MCTS help train these things to solve very difficult math problems, with the sparse feedback being given for complete formal proofs. Similarly, playing text-based video games, with the sparse feedback given for winning. Similarly, training CoT to reason about code, with sparse feedback given for predictions of the code output. Etc. You think these sorts of things just won't work well enough to be relevant?

The good news I'll share is that some of the most important insights from the safety/alignment work done on LLMs do transfer over pretty well to a lot of plausible AGI architectures. So while there's a little safety loss each time you go from 1 to 4, a lot of the theoretical ways to achieve alignment of these new systems remain intact. The danger here is that the implementation difficulty pushes the safety tax too high, which is a pretty real concern.

Specifically, the insights I'm talking about are the controllability of AI with data, combined with the fact that their RL feedback is way denser than the feedback humans got from evolution, which significantly affects how much instrumental convergence we should expect.
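To gesture at what I mean by "denser", here is a minimal toy sketch (a made-up one-dimensional environment of my own, not any real training setup) contrasting per-step feedback with outcome-only feedback:

```python
# Toy illustration of dense vs. sparse feedback (hypothetical environment, not a real setup):
# the agent walks left/right on a line and tries to reach a goal position.

import random

GOAL = 5          # position the agent is trying to reach
EPISODE_LEN = 10  # steps per episode


def run_episode(policy_right_prob, dense):
    """Return (final position, list of feedback signals received)."""
    pos, rewards = 0, []
    for _ in range(EPISODE_LEN):
        step = 1 if random.random() < policy_right_prob else -1
        pos += step
        if dense:
            # Dense feedback: every single action is immediately graded,
            # so the learner is corrected long before any long-horizon plan pays off.
            rewards.append(1.0 if step == 1 else -1.0)
    if not dense:
        # Sparse feedback: one bit of signal per episode ("did you win?"),
        # which reinforces whatever long-horizon strategy reached the goal.
        rewards.append(1.0 if pos >= GOAL else 0.0)
    return pos, rewards


if __name__ == "__main__":
    random.seed(0)
    for dense in (True, False):
        pos, rewards = run_episode(policy_right_prob=0.7, dense=dense)
        kind = "dense (RLHF-like)" if dense else "sparse (win/loss-like)"
        print(f"{kind}: {len(rewards)} feedback signals, final position {pos}")
```

The dense learner is graded ten times per episode and gets corrected long before any long-horizon plan could pay off; the sparse learner gets a single win/loss bit, which is exactly the regime where long-horizon, instrumentally convergent strategies are what get reinforced.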

4Abram Demski
I don't get why you think this is true? EG, it seems like almost no insights about how to train faithful CoT would transfer to systems speaking pure neuralese. It seems to me that what little safety/alignment we have for LLMs is largely a byproduct of how language-focused they are (which gives us a weak sort of interpretability, a very useful safety resource which we are at risk of losing soon). To put it a different way: human imitation is a relatively safe outer-optimization target, so we should be concerned to the extent that the industry is moving away from it. It sounds like you think safety lessons from the human-imitation regime generalize beyond the human-imitation regime. Maybe I agree that we can derive some abstract lessons like "RL with really dense feedback can avoid instrumental convergence", but we're moving away from the regime where such dense feedback is available, so I don't see what lessons transfer.

Yep, that's what I was talking about, Seth Herd.

I agree with the claim that deception could arise without deceptive alignment, and mostly agree with the post, but I do still think it's very important to recognize that if/when deceptive alignment fails to work, it changes a lot of the conversation around alignment.

1Seth Herd
What do you mean by "when deceptive alignment fails to work"? I'm confused.

The authors write “Some people point to the effectiveness of jailbreaks as an argument that AIs are difficult to control. We don’t think this argument makes sense at all, because jailbreaks are themselves an AI control method.” I don’t really understand this point.

The point is that it requires a human to execute the jailbreak; the AI is not the jailbreaker, and the examples show that humans can still retain control of the model.

The AI is not jailbreaking itself, here.

This link explains it better than I can:

https://www.aisnakeoil.com/p/model-alignment-protects-against

I think a lot of this probably comes back to way overestimating the complexity of human values. I think a very deeply held belief of a lot of LWers is that human values are intractably complicated and gene/societal-specific, and I think if this were the case, the argument would actually be a little concerning, as we'd have to rely on massive speed biases to punish deception.

These posts gave me good intuition for why human values are likely to be quite simple; one of them talks about how most of the complexity of the values is inaccessible to the genome, thus ... (read more)

3Alex Turner
(I think you're still playing into an incorrect frame by talking about "simplicity" or "speed biases.")

(2) [For people without the security mindset:] Well, probably you just missed this one thing with circular groups; hotfix that, and then there will be no more vulnerabilities.

I actually do expect this to happen, and importantly I think this result is basically of academic interest, primarily because it is probably known why this adversarial attack can happen at all: it's the large-scale cycles on the game board. This is almost certainly going to be solved by new training, so I find it a curiosity at best.

1Vojtech Kovarik
Yup, this is a very good illustration of the "talking past each other" that I think is happening with this line of research. (I mean with adversarial attacks on NNs in general, not just with Go in particular.) Let me try to hint at the two views that seem relevant here.

1) Hinting at the "curiosity at best" view: I agree that if you hotfix this one vulnerability, then it is possible we will never encounter another vulnerability in current Go systems. But this is because there aren't many incentives to go look for those vulnerabilities. (And it might even be that if Adam Gleave didn't focus his PhD on this general class of failures, we would never have encountered even this vulnerability.) However, whether additional vulnerabilities exist seems like an entirely different question. Sure, there will only be finitely many vulnerabilities. But how confident are we that this cyclic-groups one is the last one? For example, I suspect that you might not be willing to give 1:1000 odds on whether we would encounter new vulnerabilities if we somehow spent 50 researcher-years on this. But I expect that you might say that this does not matter, because vulnerabilities in Go do not matter much, and we can just keep hotfixing them as they come up?

2) And the other view seems to be something like: Yes, Go does not matter. But we were only using Go (and image classifiers, and virtual-environment football) to illustrate a general point, that these failures are an inherent part of deep learning systems. And for many applications, that is fine. But there will be applications where it is very much not fine (eg, aligning strong AIs, cyber-security, economy in the presence of malicious actors). And at this point, some people might disagree and claim something like "this will go away with enough training". This seems fair, but I think that if you hold this view, you should make some testable predictions (and ideally ones that we can test prior to having superintelligent AI). And, final

I strongly downvoted this post, primarily because, contra you, I do actually think reframing/reinventing is valuable, and IMO the case for reframing/reinventing things is strawmanned here.

There is one valuable part of this post, which is that interpretability doesn't have good result-incentives. I agree with this criticism, but given the other points of the post, I would still strongly downvote it.

1Stephen Casper
This seems interesting. I do not know of steelmen for isolation, renaming, reinventing, etc. What is yours?

I disagree with this post for 1 reason:

  1. Amdahl's law limits how well cyborgism will actually work, and IMO is the reason agents are more effective than simulators.

On Amdahl's law, John Wentworth's post on the long tail is very relevant, as it limits the usefulness of cyborgism:

https://www.lesswrong.com/posts/Nbcs5Fe2cxQuzje4K/value-of-the-long-tail
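(For concreteness, here is the textbook statement of Amdahl's law, which is standard background rather than anything from the linked post: if a fraction $p$ of the work can be accelerated by a factor $s$ while the remaining $1-p$ stays serial, then

$$\text{speedup} = \frac{1}{(1-p) + p/s} \le \frac{1}{1-p}.$$

So even if the AI part is 90% of the work and runs arbitrarily fast, a human-in-the-loop serial 10% caps the whole cyborg workflow at a 10x speedup, which is why I expect agents to pull ahead.)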

1Logan Riggs Smith
I’m unsure how alt-history and point (2), "history is hard to change and predictable", relate to cyborgism. Could you elaborate?
1Logan Riggs Smith
For context, Amdahl’s law states that how fast you can speed up a process is bottlenecked by the serial parts. E.g. you can have 100 people help make a cake really quickly, but it still takes ~30 minutes to bake. I’m assuming here that the human component is the serial component we will be bottlenecked on, so human-steered AI will be outcompeted by agents? If so, we should try to build the tools and knowledge to keep humans in the loop as far as we can. I agree it will eventually be outcompeted by full AI agency alone, but it isn’t set in stone how far human-steered AI can go.

I think that the big claim the post relies on is that values are a natural abstraction, and the Natural Abstractions Hypothesis holds. Now this is admittedly very different from the thesis that value is complex and fragile.

It is not that AI would naturally learn human values, but that it's relatively easy for us to point at human values/Do What I Mean/Corrigibility, and that they are natural abstractions.

This is not a claim that is satisfied by default, but is a claim that would be relatively easy to satisfy if true.

The robust values hypothesis from Dragon... (read more)

[This comment is no longer endorsed by its author]
3Thane Ruthenis
I disagree. The fact that some concept is very complicated doesn't necessarily mean it won't be represented in any advanced AGI's ontology. Humans' psychology, or the specific tools necessary to build nanomachines, or the agent foundation theory necessary to design aligned successor agents, are all also "complex and fragile" concepts (in the sense that getting a small detail wrong would result in a grand failure of prediction/planning), but we can expect such concepts to be convergently learned. Not that I necessarily expect "human values" specifically to actually be a natural abstraction — an indirect pointer at "moral philosophy"/DWIM/corrigibility seems much more plausible and much less complex.
0Gordon Seidoh Worley
If this is the case, my concern seems yet more warranted, as this is hoping we won't suffer a false-positive alignment scheme that looks like it could work but won't. Given the high cost of getting things wrong, we should minimize false-positive risks, which means not pursuing some ideas because the risk if they are wrong is too high.

In the human case, the reason is that capability differences are very bounded, rather than that alignment succeeded. If we had capability differentials as wide as 1 order of magnitude, then I think our attempted alignment solutions would fail miserably, leading to mass death or worse.

That's the problem with AI: multiple-orders-of-magnitude differences in capabilities are pretty likely, and all real alignment technologies fail hard once we get anywhere near, say, 3x differences, let alone 10x differentials.

3Rohin Shah
I agree that's a major reason humans don't cause extinction of all the other humans, but power-seeking would still imply that humans would seize opportunities to gain resources and power in cases where they wouldn't be caught / punished, and while I do think that happens, I think there are also lots of cases where humans don't do that, and so I think it would be a mistake to be confident in humans being very power-seeking.

You're welcome, though did you miss a period here or did you want to write more?

See a Twitter thread of some brief explorations I and Alex Silverstein did on this

1Neel Nanda
Missed a period (I'm impressed I didn't miss more tbh, I find it hard to remember that you're supposed to have them at the end of paragraphs)

Further, it’s helped to build out a toolkit of techniques to rigorously reverse engineer models. In the process of understanding this circuit, they refined the technique of activation patching into more sophisticated approaches such as path patching (and later causal scrubbing). And this has helped lay the foundations for developing future techniques! There are many interpretability techniques that are more scalable but less mechanistic, like probing. Having some

See a Twitter thread of some brief explorations I and Alex Silverstein did on this

I think you cut yourself off there both times.

1Neel Nanda
Lol thanks. Fixed

My short answer: violations of the IID assumption are the likeliest problem in trying to generalize your values, and I see this as the key flaw underlying the post.

[This comment is no longer endorsed by its author]
2Alex Turner
What does that mean? Can you give an example to help me follow?

You can make the "some subnetwork just models its training process and cares about getting low loss, and then gets promoted" argument against literally any loss function, even some hypothetical "perfect" one (which, TBC, I think is a mistaken way of thinking). If I buy this argument, it seems like a whole lot of alignment dreams immediately burst into flame. No loss function would be safe. This conclusion, of course, does not decrease in the slightest the credibility of the argument. But I don't perceive you to believe this implication.

This might be the cleanest explanation for why alignment is so hard by default. Loss functions do not work, and reward functions don't work well.

[This comment is no longer endorsed by its author]
2Alex Turner
I also think this argument is bogus, to be clear. 

Quick question, but why do you have that reaction?

  1. Peer review is not a certification of validity, even in more rigorous venues. Not even close.
  2. I am used to seeing questionable claims forwarded under headlines like "new published study says XYZ".
  3. That XYZ was peer reviewed is one of the weaker arguments one could make in its favor, so when someone uses that as a selling point, it indicates to me that there aren't better reasons to believe in XYZ. (Analogously, when I see an ML paper boast that their new method is "competitive with" the SOTA, I immediately think "That means they tried to beat the SOTA, but found their method was at least a little worse. If it was better, they would've said so.")

EtA: I am still more concerned about "not enough samples to learn human preferences" than ELK or inner optimization type failures. This seems to be a fairly unpopular view, and I haven't scrutinized it too much (but would be interested to discuss it cooperatively).

This is a crux for me, as it is why I don't think slow takeoff is good by default. I think deceptive alignment is the default state barring interpretability efforts that are strong enough to actually detect mesa-optimizers or myopia. Yes, Foom is probably not going to happen, but in my view that doesn't change much regarding risk in total.

1David Scott Krueger
TBC, "more concerned" doesn't mean I'm not concerned about the other ones... and I just noticed that I make this mistake all the time when reading people say they are more concerned about present-day issues than x-risk....... hmmm........

We assume AI learning timescales vastly outstrip human learning timescales as a way of keeping our definition tractable. So the only way to structure this problem in our framework would be to imagine a human is playing chess against a superintelligent AI — a highly distorted situation compared to the case of two roughly equal opponents.

I think this is probably true in the long term (the classical-quantum/reversible-computer transition is very large, and humans can't easily modify brains, unlike a virtual human). But this may not be true in the short term.

1Edouard Harris
Agreed. We think our human-AI setting is a useful model of alignment in the limit case, but not really so in the transient case. (For the reason you point out.)

Current SotA systems are very opaque — we more-or-less can't inspect or intervene on their thoughts — and it isn't clear how we could navigate to AI approaches that are far less opaque, and that can carry forward to AGI. (Though it seems very likely such approaches exist somewhere in the space of AI research approaches.)

Yeah, it does seem like interpretability is a bottleneck for a lot of alignment proposals, and in particular, as long as neural networks are essentially black boxes, deceptive alignment/inner alignment issues seem almost impossible to address.

2Rob Bensinger
Seems right to me.

First up, I've strongly upvoted it as an example of advancing the alignment frontier, and I think this is plausibly the easiest solution provided it can actually be put into code.

But unfortunately there's a huge wrecking ball coming for it, and that's deceptive alignment. As we try to solve increasingly complex problems, deceptive alignment becomes the default, and this solution doesn't work. Basically, in Evhub's words, mere compliance is the default, and since a treacherous turn when the AI becomes powerful is possible, this solution alone can't do very w... (read more)

1Xuan (Tan Zhi Xuan)
Hmm, I'm confused --- I don't think I said very much about inner alignment, and I hope to have implied that inner alignment is still important! The talk is primarily a critique of existing approaches to outer alignment (eg. why human preferences alone shouldn't be the alignment target) and is a critique of inner alignment work only insofar as it assumes that defining the right training objective / base objective is not a crucial problem as well. Maybe a more refined version of the disagreement is about how crucial inner alignment is, vs. defining the right target for outer alignment? I happen to think the latter is more crucial to work on, and perhaps that comes through somewhat in the talk (though it's not a claim I wanted to strongly defend), whereas you seem to think inner alignment / preventing deceptive alignment is more crucial. Or perhaps both of them are crucial / necessary, so the question becomes where and how to prioritize resources, and you would prioritize inner alignment? FWIW, I'm less concerned about inner alignment because:
1. I'm more optimistic about model-based planning approaches that actually optimize for the desired objective in the limit of large compute (so methods more like neurally-guided MCTS a.k.a. AlphaGo, and less like offline reinforcement learning).
2. I'm more optimistic about methods for directly learning human-interpretable, modular, (neuro)symbolic world models that we can understand, verify, and edit, and that are still highly capable. This reduces the need for approaches like Eliciting Latent Knowledge, and avoids a number of pathways toward inner misalignment.
I'm aware that these are minority views in the alignment community -- I work a lot more on neurosymbolic and probabilistic programming methods, and think they have a clear path to scaling and providing economic value, which probably explains the difference.

The real question for Habryka is why he thinks it's bad to build WebGPT in order to get truthful AI. Like, isn't solving that problem quite a significant thing already for alignment?

WebGPT is approximately "reinforcement learning on the internet".

There are some very minimal safeguards implemented (search via Bing API, but the AI can click on arbitrary links), but I do indeed think "reinforcement learning on the internet" is approximately the worst direction for modern AI to go in terms of immediate risks.

I don't think connecting GPT-3 to the internet is risky at current capability levels, but pushing AI in the direction of just hooking up language models with reinforcement learning to a browser seems like one of the worst directions f... (read more)

So you can have non-binding recommendations and input, but no actual binding power over the capabilities researchers, right?

Correct. I think that doing internal outreach to build an alignment-aware company culture and building relationships with key decision-makers can go a long way. I don't think it's possible to have complete binding power over capabilities projects anyway, since the people who want to run the project could in principle leave and start their own org.

My viewpoint is that the most dangerous risks come from inner alignment issues, basically because of very bad transparency tools, instrumental convergence toward power and deception, and mesa-optimizers essentially ruining whatever outer alignment you have. If you could figure out a reliable way to detect deceptive models, or make sure they could never be reached in your training process, that would relieve a lot of my fears of X-risk from AI.

I actually think Eliezer is underrating civilizational competence once AGI is released via the MNM effe... (read more)

We don't have the power to shut down projects, but we can make recommendations and provide input into decisions about projects

First, great news on founding an alignment organization on your own. While I give this work a low chance of making progress, if you succeed, the benefits would be vast.

I'll pre-register a prediction. You will fail with 90% probability, but potentially usefully fail. My reasons are as follows:

  1. Inner alignment issues have a good chance of wrecking your plans. Specifically, there are issues like instrumental convergence causing deception and power-seeking by default. I notice an implicit assumption that inner alignment is either not a problem or so easy to s

... (read more)
1Andrew Critch
> First, great news on founding an alignment organization on your own.

Actually I founded it with my cofounder, Nick Hay! https://www.encultured.ai/#team

The important part of his argument is in the second paragraph, and I agree, because by and large, pretty much everything we know about science and causality, at least in the beginning for an AI, rests on trusting the scientific papers and experts. Virtually no knowledge is gained by experimentation; instead it comes from trusting the papers, experts, and books.

[This comment is no longer endorsed by its author]
1David Scott Krueger
I disagree; I think we have intuitive theories of causality (like intuitive physics) that are very helpful for human learning and intelligence.