ThomasCederborg

My research focus is Alignment Target Analysis (ATA). I noticed that the most recently published version of CEV (Parliamentarian CEV, or PCEV) gives a large amount of extra influence to people who intrinsically value hurting other individuals. For Yudkowsky's description of the issue, you can search the CEV Arbital page for ADDED 2023.

The fact that no one noticed this issue for over a decade shows that ATA is difficult. If PCEV had been successfully implemented, the outcome would have been massively worse than extinction. I think that this illustrates that scenarios where someone successfully hits a bad alignment target pose a serious risk. I also think that it illustrates that ATA can reduce these risks (noticing the issue reduced the probability of PCEV getting successfully implemented). The reason that more ATA is needed is that PCEV is not the only bad alignment target that might end up getting implemented. ATA is however very neglected. There does not exist a single research project dedicated to ATA. In other words: the reason that I am doing ATA is that it is a tractable and neglected way of reducing risks.

I am currently looking for collaborators. I am also looking for a grant or a position that would allow me to focus entirely on ATA for an extended period of time. Please don't hesitate to get in touch if you are curious and would like to have a chat, or if you have any feedback, comments, or questions. You can for example PM me here, or PM me on the EA Forum, or email me at thomascederborgsemail@gmail.com (that really is my email address. It's a Gavagai / Word and Object joke from my grad student days)

My background is an undergraduate degree in physics, followed by AI research (no connection to any form of deep learning). Links to some papers: P1  P2  P3  P4  P5  P6  P7  P8.


Thank you for engaging. If this was unclear to you, then I'm sure it was also unclear to others.

The post outlined a scenario where a Corrigibility method works perfectly for one type of AI (an AI that does not imply an identifiable outcome (IO), for example a PAAI). The same Corrigibility method fails completely for another type of AI (an AI that does imply an IO, for example PCEV). So the second AI, which does have an IO, is indeed not corrigible.

This Corrigibility method leads to an outcome that is massively worse than extinction. This bad outcome is the result of two things being true, (i): the fully Corrigible first AI made this outcome possible to reach, and (ii): since the Corrigibility method worked perfectly for the first AI, the designers falsely believed that the Corrigibility method would also work for the second AI.

The second AI wants many things. It wants an outcome as close as possible to IO. The Corrigibility method resulted in the second AI also wanting many additional things (such as wanting all explanations it gives to count as Acceptable Explanations (AE), even if this makes the explanations less efficient, and wanting to avoid implementing anything unless informed designers want that thing to be implemented). But in practice the Corrigibility method does not change the outcome in any way (it just adds an ``explanation step''). So I think it makes sense to say that the second AI has ``zero Corrigibility''. The first AI is completely corrigible. And if the designers had only used the Corrigibility method for the first AI, then the Corrigibility method would have worked perfectly.

This is what I was trying to communicate with the first sentence of the post: ``A Corrigibility method that works for a Pivotal Act AI (PAAI) but fails for a CEV style AI could make things worse.''. I could have used that sentence as a title, but I decided against trying to include everything in the title. I think it is ok to leave information out of the title, as long as the title is accurate. And the title is, in fact, accurate: the first AI is genuinely corrigible, and things would have turned out a lot better in the counterfactual situation where no one had developed any form of Corrigibility.

One possible source of confusion could be that you are interpreting this post as referring to some specific example scenario from your sequence. My post was mostly written before you posted your sequence. It is not meant as a comment on any specific AI in your sequence (which is why I don't link to your sequence). But given that you had just published your sequence, maybe you were expecting my scenario to contain a single, partially corrigible AI (without an IO). That is not the scenario that I was describing in my post.

However, I could actually make the same point using a scenario with a single, partially Corrigible AI (without an IO). (there exists a more general danger here that is not strongly related to the number of AI designs involved). So, here is an attempt to make the same point using such a scenario instead. A possible title for such an alternative post would be: ``a partially corrigible AI could make things worse''. (this is also a standalone scenario, and it is also not meant as a response to anything specific in your sequence)

I think that one could reasonably describe Corrigibility as being context dependent. A given AI could be fully Corrigible in one context (such as preventing competing AI projects), and not Corrigible at all in another context (such as discussing Alignment Target Analysis). I think that one could reasonably refer to such an AI as being partially Corrigible. And, as will be shown below, such an AI could lead to a massively worse than extinction outcome.

Summary: Consider the case where a design team uses a Corrigibility method to build an AI Assistant (AIA). The resulting AIA does not have an IO. When the design team tries to use the AIA to prevent competing AI projects, everything works perfectly. However, when they try to use the AIA to understand Alignment Target Analysis (ATA), the Corrigibility method fails completely. Let's try two very rough analogies. Talking to the AIA about shutting down competing AI projects is very roughly analogous to using a djinn that grants wishes while caring fully about intentions in exactly the right way. But talking to the AIA about ATA is very roughly analogous to using a djinn that grants wishes while not caring about intentions at all (so, not a djinn with any form of preferred outcome, and not any form of ``malicious story djinn'', but also not a safe djinn).

The AIA always wants to interact in ways that count as Acceptable Explanations (AE). When the designers ask it to shut down all hardware that is capable of running a powerful AI, it disobeys the order, and explains that human brains can, in theory, be used to run a powerful AI. When they ask it to shut down all non-biological hardware that is capable of running a powerful AI, it first asks if it should delete itself, or if it should transfer itself to biological hardware. Etc. In short: it is Corrigible in this context. In particular: while talking about this topic, the definition of AE holds up.

When one of the designers asks the AIA to explain PCEV, however, the definition of AE does not hold up. When the AIA was discussing the potential action of shutting down hardware, there were two components that it wanted the designers to understand, (i): the reasons for shutting down this hardware, and (ii): the effects of shutting down this hardware. In the hardware case, the designers already understood (i) reasonably well. So there was no need to explain it. When the AIA is asked about PCEV, there are again two components that the AIA wants the designers to understand, (i): the reasons for building PCEV, and (ii): the effects of building PCEV. PCEV is a two-component thing: it is simultaneously an alignment target and a normative moral theory. It turns out that in this case, the designers do not actually understand (i) at all. They do not understand the normative moral theory behind PCEV. So the AIA explains this normative moral theory to the designers. If the AE definition had been perfect, this would not have been a problem. In other words: if the AIA had been completely Corrigible, then this would not have been a problem.

But it turns out that the people who designed the Corrigibility method did not in fact have a sufficiently good understanding of concepts along the lines of ``normative moral theories'', ``explanations'', ``understanding'', etc. (understanding these concepts sufficiently well was a realistic outcome, but in this scenario the designers failed to do so). As a result, the AE definition is not perfect, and the AIA is only partly Corrigible. So, the AIA ``explains'' the ``normative moral theory of PCEV'' until the designers ``understand'' it (using an explanation that counts as AE). This results in designers that feel a moral obligation to implement PCEV, regardless of what the result is. This new moral framework is robust to learning what happened. So, the result is a set of fully informed designers that are fully committed to implementing PCEV. So, the outcome is massively worse than extinction. (one way to reduce the probability of scenarios along these lines is to make progress on ATA. Thus: even the existence of an AI that seems to be completely corrigible is not an argument against the need to make progress on ATA. This also works as a way of gesturing at the more general point that I was trying to make in the first post)

(Wei Dai has been talking about related things for quite a while)