Introducing the Principles of Intelligent Behaviour in Biological and Social Systems (PIBBSS) Fellowship

adamShimi

AI ALIGNMENT FORUM
AF

All of Ivan Vendrov's Comments + Replies

Thank you for putting numbers on it!

~60%: there will be an existential catastrophe due to deceptive alignment specifically.

Is this an unconditionally prediction of 60% chance of existential catastrophe due to deceptive alignment alone? In contrast to the commonly used 10% chance of existential catastrophe due to all AI sources this century. Or do you mean that, conditional on there being an existential catastrophe due to AI, 60% chance it will be caused by deceptive alignment, and 40% by other problems like misuse or outer alignment?

Paul Christiano3y114

In contrast to the commonly used 10% chance of existential catastrophe due to all AI sources this century

Amongst the LW crowd I'm relatively optimistic, but I'm not that optimistic. I would give maybe 20% total risk of misalignment this century. (I'm generally expecting singularity this century with >75% chance such that most alignment risk ever will be this century.)

The number is lower if you consider "how much alignment risk before AI systems are in the driver's seat," which I think is very often the more relevant question, but I'd still put it ... (read more)

Evan Hubinger3y124

Unconditional. I'm rather more pessimistic than an overall 10% chance. I usually give ~80% chance of existential risk from AI.

Gradient descent doesn't select for inner search

Ivan Vendrov3y*60

Mostly orthogonal:

Evan's post argues that if search is computationally optimal (in the sense of being the minimal circuit) for a task, then we can construct a task where the minimal circuit that solves it is deceptive.
This post argues against (a version of) Evan's premise: search is not in fact computationally optimal in the context of modern tasks and architectures, so we shouldn't expect gradient descent to select for it.

Other relevant differences are

gradient descent doesn't actually select for low time complexity / minimal circuits; it holds time &

... (read more)

Alex Ray's Shortform

Ivan Vendrov3y10

basically every company eventually becomes a moral maze

Agreed, but Silicon Valley wisdom says founder-led and -controlled companies are exceptionally dynamic, which matters here because the company that deploys AGI is reasonably likely to be one of those. For such companies, the personality and ideological commitments of the founder(s) are likely more predictive of external behavior than properties of moral mazes.

Facebook's pivot to the "metaverse", for instance, likely could not have been executed by a moral maze. If we believed that Facebook / Meta was o... (read more)

5A Ray3y

Agree that founders are a bit of an exception. Actually that's a bit in the longer version of this when I talk about it in person. Basically: "The only people who at the very top of large tech companies are either founders or those who were able to climb to the tops of moral mazes". So my strategic corollary to this is that it's probably weakly better for AI alignment for founders to be in charge of companies longer, and to get replaced less often. In the case of facebook, even in the face of all of their history of actions, I think on the margin I'd prefer the founder to the median replacement to be leading the company. (Edit: I don't think founders remaining at the head of a company isn't evidence that the company isn't a moral maze. Also I'm not certain I agree that facebook's pivot couldn't have been done by a moral maze.)

A note about differential technological development

Ivan Vendrov3y10

Also, on a re-read I notice that all the examples given in the post relate to mathematics or theoretical work, which is almost uniquely serial among human activities. By contrast, engineering disciplines are typically much more parallelizable, as evidenced by the speedup in technological progress during war-time.

5Rob Bensinger3y

This isn't a coincidence; the state of alignment knowledge is currently "we have no idea what would be involved in doing it even in principle, given realistic research paths and constraints", very far from being a well-specified engineering problem. Cf. https://intelligence.org/2013/11/04/from-philosophy-to-math-to-engineering/. If you succeed at the framework-inventing "how does one even do this?" stage, then you can probably deploy an enormous amount of engineering talent in parallel to help with implementation, small iterative improvements, building-upon-foundations, targeting-established-metrics, etc. tasks.

A note about differential technological development

Ivan Vendrov3y1-2

I like the distinction between parallelizable and serial research time, and agree that there should be a very high bar for shortening AI timelines and eating up precious serial time.

One caveat to the claim that we should prioritize serial alignment work over parallelizable work, is that this assumes an omniscient and optimal allocator of researcher-hours to problems. Insofar as this assumption doesn't hold (because our institutions fail, or because the knowledge about how to allocate researcher-hours itself depends on the outcomes of parallelizable research) the distinction between parallelizable and serial work breaks down and other considerations dominate.

2Rob Bensinger3y

Why do you think it assumes that?

1Ivan Vendrov3y

A transparency and interpretability tech tree

Ivan Vendrov3y*55

This is very helpful as a roadmap connecting current interpretability techniques to the techniques we need for alignment.

One thing that seems very important but missing is how the tech tree looks if we factor in how SOTA models will change over time, including

large (order-of-magnitude) increases in model size
innovations in model architectures (e.g. the LSTM -> Transformer transition)
innovations in learning algorithms (e.g. gradient descent being replaced by approximate second-order methods or by meta-learning)

For example, if we restricted our attention ... (read more)

Alignment research exercises

Ivan Vendrov3y00

** Explain why cooperative inverse reinforcement learning doesn’t solve the alignment problem.

Feedback: I clicked through to the provided answer and had a great deal of difficulty understanding how it was relevant - it makes a number of assumptions about agents and utility functions and I wasn't able to connect it to why I should expect an agent trained using CIRL to kill me.

FWIW here's my alternative answer:

CIRL agents are bottlenecked on the human overseer's ability to provide them with a learning signal through demonstration or direct communi

... (read more)

Supervise Process, not Outcomes

Ivan Vendrov3y*40

I don't think I buy the argument for why process-based optimization would be an attractor. The proposed mechanism - an evaluator maintaining an "invariant that each component has a clear role that makes sense independent of the global objective" - would definitely achieve this, but why would the system maintainers add such an invariant? In any concrete deployment of a process-based system, they would face strong pressure to optimize end-to-end for the outcome metric.

I think the way process-based systems could actually win the race is something closer... (read more)

3David Schneider-Joseph2y

Optimizing for the outcome metric alone on some training distribution, without any insight into the process producing that outcome, runs the risk that the system won’t behave as desired when out-of-distribution. This is probably a serious concern to the system maintainers, even ignoring (largely externalized) X-risks.