All of Leon Lang's Comments + Replies

I think the main way that proof assistant research feeds into capabilities research is not through the assistants themselves, but through the transfer of proof assistant research into building foundation models with better reasoning capabilities. For this reason, I think researching better proof assistants can shorten timelines.

  • See also Demis Hassabis' recent tweet. Admittedly, it's unclear whether he is referring to AlphaProof itself being accessible from Gemini, or to the research behind AlphaProof feeding into improvements of Gemini.
  • See also an important paragraph in the blogpost for A
... (read more)
5Vanessa Kosoy
I can see that research into proof assistants might lead to better techniques for combining foundation models with RL. Is there anything more specific that you imagine? Outside of math the problems are very different, because there is no easy way to synthetically generate a lot of labeled data (as opposed to formally verifiable proofs). While some AI techniques developed for proof assistants might be transferable to other problems, I can easily imagine a responsible actor[1] producing a net positive. Don't disclose your techniques (except maybe very judiciously), don't open-source your code, maintain information security, maybe only provide access as a service, maybe only provide access to select people/organizations.

[1] To be clear, I don't consider Alphabet to be a responsible actor.

Thanks a lot for these pointers!

Neither of your interpretations is what I was trying to say; it seems I didn't express myself clearly enough.

What I was trying to say is that I think outer alignment itself, as defined by you (and maybe also by everyone else), is a priori impossible: no physically realizable reward function defined solely on observations rewards only those actions that would be chosen by a competent, well-motivated AI. It always also rewards actions that lead to corrupted observations which are consistent with the actions of a benevolent AI. These rewarded action... (read more)

5Rohin Shah
Oh, I see. I'm not interested in "solving outer alignment" if that means "creating a real-world physical process that outputs numbers that reward good things and punish bad things in all possible situations" (because, as you point out, that seems far too stringent a requirement). You could look at ascription universality and ELK. The general mindset is roughly "ensure your reward signal captures everything that the agent knows"; I think the mindset is well captured in mundane solutions to exotic problems.

To classify as specification gaming, there needs to be bad feedback provided on the actual training data. There are many ways to operationalize good/bad feedback. The choice we make here is that the training data feedback is good if it rewards exactly those outputs that would be chosen by a competent, well-motivated AI.

 

I assume you would agree with the following rephrasing of your last sentence:

The training data feedback is good if it rewards outputs if and only if they might be chosen by a competent, well-motivated AI. 
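To make the "if and only if" explicit, here is one minimal way to formalize the rephrasing (my own notation, not from the original post): write $R(x, y)$ for "the training feedback rewards output $y$ on input $x$" and $C(x, y)$ for "a competent, well-motivated AI might choose output $y$ on input $x$". The rephrased condition is then

$$\forall (x, y) \text{ in the training data:} \quad R(x, y) \iff C(x, y),$$

i.e., good feedback not only rewards all such outputs (the "if" direction) but also rewards nothing else (the "only if" direction).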

If so, I would ... (read more)

3Rohin Shah
I'm confused about what you're trying to say in this comment. Are you saying "good feedback as defined here does not solve alignment"? If so, I agree; that's the entire point of goal misgeneralization (see also footnote 1). Perhaps you are saying that in some situations a competent, well-motivated AI would choose some action that it thinks is good but that is actually bad, because, e.g., its observations were faked in order to trick it? If so, I agree, and I see that as a feature of the definition, not a bug (and I'm not sure why you think it is a bug).

Yes, after reflection I think this is correct. I think I had in mind a situation in which the training of the AI system simply stops at deployment, but of course this need not be the case. If training continues, then one either needs to give stronger reasons why the distribution shift leads to a catastrophe (e.g., along the lines you argue) or make the case that the training signal couldn't keep up with the fast pace of development. The latter would be an outer alignment failure, which I tried to avoid talking about in the text.