Wiki Contributions

Comments

Thank you for putting numbers on it!

~60%: there will be an existential catastrophe due to deceptive alignment specifically.

Is this an unconditionally prediction of 60% chance of existential catastrophe due to deceptive alignment alone? In contrast to the commonly used 10% chance of existential catastrophe due to all AI sources this century. Or do you mean that, conditional on there being an existential catastrophe due to AI, 60% chance it will be caused by deceptive alignment, and 40% by other problems like misuse or outer alignment?

Mostly orthogonal:

  • Evan's post argues that if search is computationally optimal (in the sense of being the minimal circuit) for a task, then we can construct a task where the minimal circuit that solves it is deceptive.
  • This post argues against (a version of) Evan's premise: search is not in fact computationally optimal in the context of modern tasks and architectures, so we shouldn't expect gradient descent to select for it.

Other relevant differences are

  1. gradient descent doesn't actually select for low time complexity / minimal circuits; it holds time & space complexity fixed, while selecting for low L2 norm. But I think you could probably do a similar reduction for L2 norm as Evan does for minimal circuits. The crux is in the premise.
  2. I think Evan is using a broader definition of search than I am in this post, closer to John Wentworth's definition of search as "general problem solving algorithm".
  3. Evan is doing worst-case analysis (can we completely rule out the possibility of deception by penalizing time complexity?) whereas I'm focusing on the average or default case.

basically every company eventually becomes a moral maze

Agreed, but Silicon Valley wisdom says founder-led and -controlled companies are exceptionally dynamic, which matters here because the company that deploys AGI is reasonably likely to be one of those. For such companies, the personality and ideological commitments of the founder(s) are likely more predictive of external behavior than properties of moral mazes.

Facebook's pivot to the "metaverse", for instance, likely could not have been executed by a moral maze. If we believed that Facebook / Meta was overwhelmingly likely to deploy one of the first AGIs, I expect Mark Zuckerberg's beliefs about AGI safety would be more important to understand than the general dynamics of moral mazes. (Facebook example deliberately chosen to avoid taking stances on the more likely AGI players, but I think it's relatively clear which ones are moral mazes).

Also, on a re-read I notice that all the examples given in the post relate to mathematics or theoretical work, which is almost uniquely serial among human activities. By contrast, engineering disciplines are typically much more parallelizable, as evidenced by the speedup in technological progress during war-time.

I like the distinction between parallelizable and serial research time, and agree that there should be a very high bar for shortening AI timelines and eating up precious serial time.

One caveat to the claim that we should prioritize serial alignment work over parallelizable work, is that this assumes an omniscient and optimal allocator of researcher-hours to problems. Insofar as this assumption doesn't hold (because our institutions fail, or because the knowledge about how to allocate researcher-hours itself depends on the outcomes of parallelizable research) the distinction between parallelizable and serial work breaks down and other considerations dominate.

This is very helpful as a roadmap connecting current interpretability techniques to the techniques we need for alignment.

One thing that seems very important but missing is how the tech tree looks if we factor in how SOTA models will change over time, including

  1. large (order-of-magnitude) increases in model size
  2. innovations in model architectures (e.g. the LSTM -> Transformer transition)
  3. innovations in learning algorithms (e.g. gradient descent being replaced by approximate second-order methods or by meta-learning)

For example, if we restricted our attention to ConvNets trained on MNIST-like datasets we could probably get to tech level (6) very quickly. But would this would help with solving transparency for transformers trained on language? And if we don't expect it to help, why do we expect solving transparency for transformers will transfer over to the architectures that will be dominant 5 years from now?

My tentative answer would be that we don't really know how much transparency generalizes between scales/architectures/learning algorithms, so to be safe we need to invest in enough interpretability work to both keep up with whatever the SOTA models are doing and to get higher and higher in the tech tree. This may be much, much harder than the "single tech tree" metaphor suggests.

** Explain why cooperative inverse reinforcement learning doesn’t solve the alignment problem.

 

Feedback: I clicked through to the provided answer and had a great deal of difficulty understanding how it was relevant - it makes a number of assumptions about agents and utility functions and I wasn't able to connect it to why I should expect an agent trained using CIRL to kill me.

FWIW here's my alternative answer:

CIRL agents are bottlenecked on the human overseer's ability to provide them with a learning signal through demonstration or direct communication.  This is unlikely to scale to superhuman abilities in the agent, so superintelligent agents simply will not be trained using CIRL.

In other words it's only a solution to "Learn from Teacher" in Paul's 2019 decomposition of alignment, not to the whole alignment problem.

I don't think I buy the argument for why process-based optimization would be an attractor.  The proposed mechanism - an evaluator maintaining an "invariant that each component has a clear role that makes sense independent of the global objective" - would definitely achieve this, but why would the system maintainers add such an invariant? In any concrete deployment of a process-based system, they would face strong pressure to optimize end-to-end for the outcome metric.

I think the way process-based systems could actually win the race is something closer to "network effects enabled by specialization and modularity".  Let's say you're building a robotic arm.  You could use a neural network optimized end-to-end to map input images into a vector of desired torques, or you could use a concatenation of a generic vision network and a generic action network, with a common object representation in between.  The latter is likely to be much cheaper because the generic network training costs can be amortized across many applications (at least in an economic regime where training cost dominates inference cost).  We see a version of this in NLP where nobody outside the big players trains models from scratch, though I'm not sure how to think about fine-tuned models: do they have the safety profile of process-based systems or outcome-based systems?