All of David Johnston's Comments + Replies

Why? The biggest problem in my mind is algorithmic progress. If we’re outside (C), then the “critical path to TAI” right now is algorithmic progress

Given that outside-(C) approaches to AGI are likely to be substantially unlike anything we’re familiar with, and that controllable AGI is desirable, don’t you think that there’s a good chance these unknown algorithms have favourable control properties?

I think LLMs have some nice control properties too. I’m not so much arguing against LLMs being better than unknown algorithms as against the idea that we should confidently expect control to be hard for unknown algorithms.

3Steve Byrnes
I guess you’re referring to my comment “for my part, if I believed that (A)-type systems were sufficient for TAI—which I don’t—then I think I would feel slightly less concerned about AI x-risk than I actually do, all things considered!” I’m not implicitly comparing (A) to “completely unknown mystery algorithm”, instead I’m implicitly comparing (A) to “brain-like AGI, or more broadly model-based RL AGI, or even more broadly some kind of AGI that incorporates RL in a much more central way than LLMs do”. I think, by and large, more RL (as opposed to supervised or self-supervised learning) makes things worse from a safety perspective, and thus I’m unhappy to also believe that more RL is going to be part of TAI.

Would you say Yudkowsky's views are a mischaracterisation of neural network proponents, or that he's mistaken about the power of loose analogies?

3Alex Turner
Neither.
  1. I don’t know what proponents were claiming when proponing neural networks. I do know that neural networks ended up working, big time.
  2. I don’t think loose analogies are powerful. I think they lead to sloppy thinking.

So, if I'm understanding you correctly:

  • if it's possible to build a single AI system that executes a catastrophic takeover (via self-bootstrap or whatever), it's also probably possible to build a single aligned sovereign, and so in this situation winning once is sufficient
  • if it is not possible to build a single aligned sovereign, then it's probably also not possible to build a single system that executes a catastrophic takeover and so the proposition that the model only has to win once is not true in any straightforward way
    • in this case, we might be abl
... (read more)
3Rohin Shah
Yes, that's right, though I'd say "probable" not "possible" (most things are "possible").

Is building an aligned sovereign to end the acute risk period different to a pivotal act in your view?

3Rohin Shah
Depends what the aligned sovereign does! Also depends what you mean by a pivotal act! In practice, during the period of time where biological humans are still doing a meaningful part of alignment work, I don't expect us to build an aligned sovereign, nor do I expect to build a single misaligned AI that takes over: I instead expect there to be a large number of AI systems, that could together obtain a decisive strategic advantage, but could not do so individually.

This is interesting work, but I’m skeptical of the interpretation. For example, I don’t think it is problematic from a safety point of view if a model ever behaves in a shutdown-avoiding manner, but rather if it behaves in a shutdown-avoiding manner against the interests of its operators.

I think your example shutdown request doesn’t reflect this situation well because it is unclear whether the downsides of shutdown (loss of capability) are outweighed by the upsides (speed) and because it asks for a feeling and not a judgement. If I reframe your request wit... (read more)

6Evan Hubinger
In my opinion, I think the most interesting result for all of the evals is not “the model does X,” since I think it's possible to get models to do basically anything you want with the right prompt, but that the model's tendency to do X generally increases with model scale and RLHF steps.

A few questions, if you have time:

  • which situation do you think is easier to analyse, path dependent or path independent?
  • which of your analyses do you think is more robust, path dependent or independent? Is the difference large?
  • I think the path dependent analysis should feature an assumption that, at one extreme, yields the path independent analysis, but I can’t see it. What say you?
2Evan Hubinger
  • The path-independent case is probably overall easier to analyze—we understand things like speed and simplicity pretty well, and in the path-independent case we don't have to keep track of complex sequencing effects.
  • I think my path-independent analysis is more robust, for the same reasons as in the first point.
  • Presumably that assumption is path-dependence? I'm not sure what you mean.

Γ=Σ^R, it's a function from programs to what result they output. It can be thought of as a computational universe, for it specifies what all the functions do.

Should this say "elements are function... They can be thought of as...?"
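For concreteness, here is the reading being suggested, written out as a short sketch (assuming, as in the quoted definition, that R is the set of programs and Σ the set of possible results):

```latex
% Sketch of the suggested reading (notation follows the quoted definition):
% \Gamma is the set of all functions from programs to results, and a single
% element \gamma assigns an output to every program.
\Gamma = \Sigma^{R} = \{\, \gamma : R \to \Sigma \,\}, \qquad
\gamma(r) \in \Sigma \quad \text{for each program } r \in R .
```

On that reading it is each element γ, rather than Γ itself, that plays the role of a “computational universe”, since it specifies what every program outputs.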

Can you make a similar theory/special case with probability theory, or do you really need infra-bayesianism? If the second, is there a simple explanation of where probability theory fails?

2Vanessa Kosoy
Yes, the phrasing was confusing, I fixed it, thanks. We really need infrabayesianism. On bayesian hypotheses, the bridge transform degenerates: it says that, more or less, all programs are always running. And, the counterfactuals degenerate too, because selecting most policies would produce "Nirvana". The idea is, you must have Knightian uncertainty about the result of a program in order to meaningfully speak about whether the universe is running it. (Roughly speaking, if you ask "is the universe running 2+2?" the answer is always yes.) And, you must have Knightian uncertainty about your own future behavior in order for counterfactuals to be meaningful. It is not surprising that you need infrabayesianism in order to do naturalized induction: if you're thinking of the agent as part of the universe then you are by definition in the nonrealizable setting, since the agent cannot possibly have a full description of something "larger" than itself.

Do you run into a distinction between benign and malign tampering at any point? For example, if humans can never tell the difference between the tampered and non-tampered result, and their own sanity has not been compromised, it is not obvious to me that the tampered result is worse than the non-tampered result.

It might be easier to avoid compromising human sanity + use hold-out sensors than to solve ELK in general (though maybe not? I haven't thought about it much).

I'm a bit curious about what job "dimension" is doing here. Given that I can map an arbitrary vector in ℝ^n to some point in ℝ via a bijective measurable map (https://en.wikipedia.org/wiki/Standard_Borel_space#Kuratowski's_theorem), it would seem that the KPD theorem is false. Is there some other notion of "sufficient statistic complexity" hiding behind the idea of dimensionality, or am I missing something?
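To illustrate the kind of map being appealed to here (a sketch added for illustration, not part of the original comment): the classic digit-interleaving construction packs two reals into one, which is why counting dimensions alone doesn’t bound how much information a single real-valued statistic can carry. The function names and the truncation to finitely many digits are mine; the true Kuratowski isomorphism handles the edge cases (e.g. trailing 9s) that this finite version sidesteps.

```python
# Illustrative sketch (assumed example, not from the original comment):
# interleave the decimal digits of two reals in [0, 1) to pack them into
# a single real, and unpack them again. Truncating to a fixed number of
# digits sidesteps the usual measure-zero edge cases (trailing 9s).

def interleave(x: float, y: float, digits: int = 6) -> float:
    """Pack (x, y) in [0, 1)^2 into one real by interleaving decimal digits."""
    xs = f"{x:.{digits}f}"[2:2 + digits]   # digits of x after the decimal point
    ys = f"{y:.{digits}f}"[2:2 + digits]   # digits of y after the decimal point
    mixed = "".join(a + b for a, b in zip(xs, ys))
    return float("0." + mixed)

def deinterleave(z: float, digits: int = 6) -> tuple[float, float]:
    """Recover (x, y) from the packed real, up to the truncation used above."""
    zs = f"{z:.{2 * digits}f}"[2:2 + 2 * digits]
    return float("0." + zs[0::2]), float("0." + zs[1::2])

if __name__ == "__main__":
    x, y = 0.123456, 0.654321
    z = interleave(x, y)          # a single real encoding both coordinates
    print(z, deinterleave(z))     # 0.162534435261 (0.123456, 0.654321)
```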