All of michaelcohen's Comments + Replies

I think [process-based RL] has roughly the same risk profile as imitation learning, while potentially being more competitive.

I agree with this in a sense, although I may be quite a bit more harsh about what counts as "executing an action". For example, if reward is based on an overseer talking about the action with a large group of people/AI assistants, then that counts as "executing the action" in the overseer-conversation environment, even if the action looks like it's for some other environment, like a plan to launch a new product in the market. I do ... (read more)

What is process-based RL?

I think your intuitions about costly international coordination are challenged by a few facts about the world. 1) Advanced RL, like open borders + housing deregulation, guarantees vast economic growth in wealthy countries. Open borders, in a way that seems kinda speculative, but intuitively forceful for most people, has the potential to existentially threaten the integrity of a culture, including especially its norms; AI, in a way that seems kinda speculative, but intuitively forceful for most people, has the pote... (read more)

2Matthew Barnett
I think this comparison is imperfect. Standard economic models predict an acceleration in the growth rate by at least an order of magnitude, and usually more. Over one decade, an increase in economic capacity by 1-4 orders of magnitude seems probable. By contrast, my understanding was that the models of open borders roughly predict a one-time doubling of world GDP over several decades, and for housing, it's something like a 50% increase in GDP over decades. Perhaps a better way to put this is that if AI is developed anywhere, even in a small country, that country could soon (within 10 years) grow to be the world's foremost economic power. Nothing comparable seems true for other policies. There only really needs to be one successful defecting nation for this coordination to fall apart.
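
A rough numerical rendering of this contrast (the horizon lengths below are illustrative assumptions, not figures from the comment):

```python
# Illustrative arithmetic for the growth comparison above.
# Assumed horizons (my assumptions): 10 years for the AI scenario,
# 30 years for the "several decades" open-borders / housing scenarios.

def annual_rate(total_multiplier: float, years: int) -> float:
    """Constant annual growth rate implied by a total multiplier over `years`."""
    return total_multiplier ** (1 / years) - 1

print(f"AI, 1 order of magnitude in 10y:  {annual_rate(10, 10):.1%}/yr")      # ~25.9%/yr
print(f"AI, 4 orders of magnitude in 10y: {annual_rate(10_000, 10):.1%}/yr")  # ~151.2%/yr
print(f"Open borders, 2x in 30y:          {annual_rate(2, 30):.1%}/yr")       # ~2.3%/yr
print(f"Housing, 1.5x in 30y:             {annual_rate(1.5, 30):.1%}/yr")     # ~1.4%/yr
```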

By process-based RL, I mean: the reward for an action doesn't depend on the consequences of executing that action. Instead it depends on some overseer's evaluation of the action, potentially after reading justification or a debate about it or talking with other AI assistants or whatever. I think this has roughly the same risk profile as imitation learning, while potentially being more competitive.
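
A minimal sketch of that distinction (the `environment` and `overseer` interfaces here are hypothetical stand-ins, not any particular system):

```python
# Sketch of the reward-assignment difference described above (interfaces hypothetical).

def outcome_based_reward(action, environment, overseer):
    """Reward depends on what happens after the action is executed."""
    outcome = environment.execute(action)
    return overseer.evaluate_outcome(outcome)

def process_based_reward(action, justification, overseer):
    """Reward depends only on the overseer's evaluation of the proposed action
    (possibly after reading a justification or debate about it); the action's
    real-world consequences never enter the reward computation."""
    return overseer.evaluate_proposal(action, justification)
```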

I'm generally excited and optimistic about coordination. If you are just saying that AI non-proliferation isn't that much harder than nuclear non-proliferation, t... (read more)

I believe that LM agents based on chain of thought and decomposition seem like the most plausible approach to bootstrapping subhuman systems into trusted superhuman systems. For about 7 years using LM agents for RLAIF has seemed like the easiest path to safety,[4] and in my view this is looking more and more plausible over time.

I agree whole-heartedly with the first sentence. I'm not sure why you understand it to support the second sentence; I feel the first sentence supports my disagreement with the second sentence! Long-horizon RL is a differen... (read more)

3Paul Christiano
It would be safest of all to just not build powerful AI for a very long time. But alas, that seems wildly uncompetitive and so would require some kind of strong global coordination (and would create considerable instability and leave significant value on the table for other worldviews). It's possible that "human-level AI with CoT" will be competitive enough, but I would guess not. So to me the obvious approach is to use chain of thought and decomposition to improve performance, and then to distill the result back into the model. You could try to do distillation with imitation learning. This is way more likely to be competitive than with no distillation at all. But it still seems like it has a very good chance of being uncompetitive because the imitation objective significantly impairs performance and creates all kinds of artifacts. Using process-based RL for distillation seems like it has essentially the same safety profile as using imitation learning, while avoiding the obvious pathologies and having a much higher probability of being competitive. (People give various reasons that RL in the distillation step is less safe than imitation learning in the distillation step, but so far I haven't found anything at all persuasive.) I think there's still a good chance that process-based RL in the distillation step still can't be competitive and so you need to talk about how to develop new techniques or prudently incorporate outcomes. But I think it's at least much more likely to be competitive than CoT-only, or imitation learning in the distillation step. (Perhaps it cuts down the probability of deal-breaking uncompetitiveness by 30%, compared to using imitation learning alone for distillation.)

Me: Peer review can definitely issue certificates mistakenly, but validity is what it aims to certify.

You: No it doesn't. They just care about interestingness.

Me: Do you agree reviewers aim to only accept valid papers, and care more about validity than interestingness?

You:  Yes, but...

If you can admit that we agree on this basic point, I'm happy to discuss further about how good they are at what they aim to do.

1: If retractions were common, surely you would have said that was evidence peer review didn't accomplish much! If academics were only equally ... (read more)

Care to bet on the results of a survey of academic computer scientists? If the stakes are high enough, I could try to make it happen.

"As a reviewer, I only recommend for acceptance papers that appear to be both valid and interesting."

Strongly agree - ... - Strongly disagree

"As a reviewer, I would sooner recommend for acceptance a paper that was valid, but not incredibly interesting, than a paper that was interesting, but the conclusions weren't fully supported by the analysis."

Strongly agree - ... - Strongly disagree

1mwacksen
No, no more than I would bet on a survey of <insert religious group here> whether they think <religious group> is more virtuous than <non-religious group>. Academics may claim that peer review is to check validity but their actions tell a different story. This is especially true in "hard" fields like mathematics where reviewers may even struggle to follow an argument, let alone check its validity. Given that most papers are never read by others, this is really not a big deal though. But I'll offer three further arguments for why I don't think peer review ensures validity. Argument 1: a) Humans (including reviewers) make mistakes all the time, but b) Retractions/corrections in papers are very rare. Unless academics are better at spotting mistakes immediately when reviewing than everyone else (they are not), we should expect lots of peer-reviewed articles to therefore have mistakes because invalid papers rarely get retracted. Argument 2: Computer science papers don't always include reproducible software, but checking code would absolutely be required to check validity. Argument 3: It is customary to submit papers that are rejected by one journal to another journal. This means that articles that fail "peer review" at one journal can obtain "peer review" at a different journal. PS: For CS it's harder to check "validity", but here's how papers replicate in other fields: https://fantasticanachronism.com/2021/11/18/how-i-made-10k-predicting-which-papers-will-replicate/

I don't understand. Importantly, these are optimistically biased, and you can't assume my true credences are this high. I assign much less than 90% probability to C. But still, they're perfectly consistent. M doesn't say anything about succeeding--only being allowed. M is basically asking: of the places he'd be willing to live, do they all pass laws which would make building dangerously advanced AI illegal? The only logical connection between C and M is that M (almost definitely) implies C.

Thank you very much for saying that.

I was feeling disappointed about the lack of positive comments, and I realized recently I should probably go around commenting on posts that I think are good, since right now, I mostly only comment on posts when I feel I have an important disagreement. So it's hard to complain when I'm on the receiving end of that dynamic.

On the 2nd point, the whole discussion of mu^prox vs. mu^dist is fundamentally about goal (mis)generalization. My position is that for a very advanced agent, point estimates of the goal (i.e. certainty that some given account of the goal is correct) would probably really limit performance in many contexts. This is captured by Assumptions 2 and 3. An advanced agent is likely to entertain multiple models of what their current understanding of their goal in a familiar context implies about their goal in a novel context. Full conviction in mu^dist does indeed ... (read more)

The assumption says "will do" not "will be able to do".  And the dynamics of the unknown environment includes the way it outputs rewards. So the assumption was not written in a way that clearly flags its entailment of the agent deliberately modeling the origin of reward, and I regret that, but it does entail that. So that was why I engaged with the objection that reward is not the optimization target under this section.

In the video game playing setting you describe, it is perfectly conceivable that the agent deliberately acts to optimize for high in-game

... (read more)

Peer review is not a certification of validity,

Do you think the peer reviewers and the editors thought the argument was valid?

Peer review can definitely issue certificates mistakenly, but validity is what it aims to certify.

0mwacksen
No it doesn't. It's hard to say what the "aims" of peer-review are, but "ensuring validity" is certainly not one of them. As a first approximation, I'd say that peer-review aims to certify that the author is not an obvious crank, and that the argument being made is an interesting one to someone in the field.

Thank you. I've changed the title.

Not trying to be arrogant. Just trying to present readers who have limited time with a quickly digestible bit of evidence about the likelihood that the argument is a shambles.

5Charles Foster
It didn't strike me as arrogant. It struck me as misleading in a way that made me doubt the quality of the enclosed argument.

Thank you for this review! A few comments on the weaknesses of my paper.

In particular, it explicitly says the argument does not apply to supervised learning.

Hardly a weakness if supervised learning is unlikely to be an existential threat!

Strength: Does not make very concrete assumptions about the AGI development model.

Weakness: Does not talk much about how AGI is likely to be developed, unclear which of the assumptions are more/less likely to hold for AGI being developed using the current ML paradigm.

The fact that the argument holds equally well no matter ... (read more)

2Zachary Kenton
Thanks for the comment Michael. Firstly, just wanted to clarify the framing of this literature review - when considering strengths and weaknesses of each threat model, this was done in light of what we were aiming to do: generate and prioritise alignment research projects -- rather than as an all-things-considered direct critique of each work (I think that is best done by commenting directly on those articles etc). I'll add a clarification of that at the top. Now to your comments: To your 1st point: I think the lack of specific assumptions about the AGI development model is both a strength and a weakness. Regarding the weakness, we mention it because it makes it harder to generate and prioritize research projects. It could be more helpful to say more explicitly, or earlier in the article, what kind of systems you're considering, perhaps pointing to the closest current prosaic system, or explaining why current systems are nothing like what you imagine the AGI development model is like. On your 2nd point: What I meant was more "what about goal misgeneralization? Wouldn't that mean the agent is likely to not be wireheading, and pursuing some other goal instead?" - you hint at this at the end of the section on supervised learning but that was in the context of whether a supervised learner would develop a misgeneralized long-term goal, and settled on being agnostic there. On your 3rd point: It could have been interesting to read arguments for why it would need all available energy to secure its computer, rather than satisficing at some level. Or some detail on the steps for how it builds the technology to gather the energy, or how it would convert that into defence.

I assume (6) means that your "anthropic update" scans across possible universes to find those that contain important decisions you might want to influence?

Yes, and then outputs strings from that set with probability proportional to their weight in the universal prior.

By (3) do you mean the same thing as "Simplest output channel that is controllable by advanced civilization with modest resources"?

I would say "successfully controlled" instead of controllable, although that may be what you meant by the term. (I decomposed this as controllable + making good gu... (read more)

Yeah, seems about right.

I think with 4, I've been assuming for the sake of argument that manipulators get free access to the right prior, and I don't have a strong stance on the question, but it's not complicated for a directly programmed anthropic update to be built on that right prior too.

I guess I can give some estimates for how many bits I think are required for each of the rows in the table. I'll give a point estimate, and a range for a 50% confidence interval for what my point estimate would be if I thought about it for an hour by myself and had to w... (read more)

2Paul Christiano
By (3) do you mean the same thing as "Simplest output channel that is controllable by advanced civilization with modest resources"? I assume (6) means that your "anthropic update" scans across possible universes to find those that contain important decisions you might want to influence? If you want to compare most easily to models like that, then instead of using (1)+(2)+(3) you should compare to (6') = "Simplest program that scans across many possible worlds to find those that contain some pattern that can be engineered by consequentialists trying to influence prior." Then the comparison is between specifying "important predictor to influence" and whatever the easiest-to-specify pattern that can be engineered by a consequentialist. It feels extremely likely to me that the second category is easier, indeed it's kind of hard for me to see any version of (6) that doesn't have an obviously simpler analog that could be engineered by a sophisticated civilization. With respect to (4)+(5), I guess you are saying that your point estimate is that only 1/million of consequentialists decide to try to influence the universal prior. I find that surprisingly low but not totally indefensible, and it depends on exactly how expensive this kind of influence is. I also don't really see why you are splitting them apart, shouldn't we just combine them into "wants to influence predictors"? If you're doing that presumably you'd both use the anthropic prior and then the treacherous turn. But it's also worth noting that (6') gets to largely skip (4') if it can search for some feature that is mostly brought about deliberately by consequentialists (who are trying to create a beacon recognizable by some program that scans across possible worlds looking for it, doing the same thing that "predictor that influences the future" is doing in (6)).

Do you have some candidate "directly programmed anthropic update" in mind? (That said, my original claim was just about the universal prior, not about a modified version with an anthropic update)

I’m talking about the weight of an anthropically updated prior within the universal prior. I should have added “+ bits to encode anthropic update directly” to that side of the equation. That is, it takes some number of bits to encode “the universal prior, but conditioned on the strings being important to decision-makers in important worlds”. I don’t know how to enc... (read more)

3Paul Christiano
Here's my current understanding of your position:

1. The easiest way to specify an important prediction problem (in the sense of a prediction that would be valuable for someone to influence) is likely to be by saying "Run the following Turing machine, then pick an important decision from within it." Let's say the complexity of that specification is N bits.
2. You think that if consequentialists dedicate some fraction of their resources to doing something that's easy for the universal prior to output, it will still likely take more than N bits or not much less.
3. [Probably] You think the differences may be small enough that they can be influenced by factors of 1/1000 or 1/billion (i.e. 10-30 bits) of improbability of consequentialists spending significant resources in this task.
4. [Probably] You think the TM-definition update (where the manipulators get to focus on inductors who put high probability on their own universe) or the philosophical sophistication update (where manipulators use the "right" prior over possible worlds rather than choosing some programming language) are small relative to these other considerations.

I think the biggest disagreement is about 1+2. It feels implausible to me that "sample a data stream that is being used by someone to make predictions that would be valuable to manipulate" is simpler than any of the other extraction procedures that consequentialists could manipulate (like sample the sequence that appears the most times, sample the highest energy experiments, sample the weirdest thing on some other axis...)

I think we're probably on the same page now, but I'd say: the consequentialists can also sample from the "important predictions" prior (i.e. the same thing as that fragment of the universal prior). If "sample output channel controlled by consequentialists" has higher probability than "Sample an important prediction," then the consequentialists control every important prediction. If on the other hand "Sample an important p

I'm using some of the terminology I suggested here.

A factoring is a set of questions such that each signature of possible answers identifies a unique element. In 20 questions, you can tailor the questions depending on the answers to previous questions, and ultimately each element will have a bitstring signature depending on the history of yesses and nos. I guess you can define the question to include xors with previous questions, so that it effectively changes depending on the answers to others. But it's sometimes useful that the bitstrings are allowed to ... (read more)

6Scott Garrabrant
I think that the answers to both the concern about 7 elements, and the desire to have questions depend on previous questions come out of thinking about FFS models, rather than FFS. If you want to have 7 elements in Ω, that just means you will probably have more than 7 elements in S. If I want to model a situation where some questions I ask depend on other questions, I can just make a big FFS that asks all the questions, and have the model hide some of the answers.  For example, let's say I flip a biased coin, and then if heads I roll a biased 6 sided die, and if tails I roll a biased 10 sided die. There are 16 outcomes in Ω.  I can build a 3 dimensional factored set 2x6x10, which I will imagine as sitting on my table with height 2. Heads is on the bottom, and tails is on the top. f:S→Ω will then merge together the rows on the bottom, and the columns on the top, so it will look a little like the game Jenga. In this way, I am imagining there is some hidden data about each world in which I get heads and roll the 6 sided die, which is the answer to the question "what would have happened if I rolled the 10 sided die?" Adding in all this counterfactual data gives a latent structure of 120 possible worlds, even though we can only distinguish 16 possible worlds.
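
A small computational rendering of this example (the code and naming are mine): S is the 2×6×10 factored set of 120 latent worlds, Ω is the 16 distinguishable outcomes, and f : S → Ω hides the counterfactual die roll.

```python
from itertools import product

# Scott's example: flip a biased coin; roll a 6-sided die on heads, a 10-sided die on tails.
# The factored set S has 2*6*10 = 120 elements; each element also records what the
# *other* die would have shown (the hidden, counterfactual data).
S = list(product(["heads", "tails"], range(1, 7), range(1, 11)))

def f(s):
    """Map a 'possible world' in S to the distinguishable outcome in Omega."""
    coin, d6, d10 = s
    return (coin, d6) if coin == "heads" else (coin, d10)

Omega = {f(s) for s in S}
print(len(S), len(Omega))  # 120 16
```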

I was thinking of some terminology that might make it easier to thinking about factoring and histories and whatnot.

A partition can be thought of as a (multiple-choice) question. Like for a set of words, you could have the partition corresponding to the question "Which letter does the word start with?" and then the partition groups together elements with the same answer.

Then a factoring is a set of questions, where the set of answers will uniquely identify an element. The word that comes to mind for me is "signature", where an element's signature is the set o... (read more)
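
As a toy rendering of the signature idea (example and check are mine, using the requirement that every signature picks out exactly one element):

```python
from itertools import product

# A factoring is a set of partitions ("questions") such that each possible signature
# of answers identifies exactly one element (every combination occurs, with no collisions).
elements = ["ax", "ay", "az", "bx", "by", "bz"]

questions = [
    lambda e: e[0],  # "what is the first character?"  -> answers {a, b}
    lambda e: e[1],  # "what is the second character?" -> answers {x, y, z}
]

def is_factoring(elements, questions):
    signatures = [tuple(q(e) for q in questions) for e in elements]
    answer_sets = [set(q(e) for e in elements) for q in questions]
    all_combos = set(product(*answer_sets))
    return len(set(signatures)) == len(elements) and set(signatures) == all_combos

print(is_factoring(elements, questions))       # True: 2 x 3 signatures, one per element
print(is_factoring(elements, questions[:1]))   # False: "ax", "ay", "az" share signature ("a",)
```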

3Scott Garrabrant
Yep, this all seems like a good way of thinking about it.

I was thinking about the difficulty of finite factored sets not understanding the uniform distribution over 4 elements, and it makes me feel like something fundamental needs to be recast. An analogy came to mind about eigenvectors vs. eigenspaces.

What we might like to be true about the unit eigenvectors of a matrix is that they are the unique unit vectors for which the linear transformation preserves direction. But if two eigenvectors have the same eigenvalue, the choice of eigenvectors is not unique--we could choose any pair on that plane. So really, it s... (read more)

3Scott Garrabrant
Hmm, I doubt the last paragraph about sets of partitions is going to be valuable, but the eigenspace thinking might be useful.  Note that I gave my thoughts about how to deal with the uniform distribution over 4 elements in the thread responding to cousin_it.

I take your point that we are discussing some output rules which add extra computation states, and so some output rules will add fewer computation states than others.

I'm merging my response to the rest with my comment here.

They are using their highest probability guess about the output channel, which will be higher probability than the output channel exactly matching some camera on old earth (but may still be very low probability). I still don't understand the relevance.

I’m trying to find the simplest setting where we have a disagreement. We don’t need to think about cameras on earth quite yet. I understand the relevance isn’t immediate.

They don't care about "their" Turing machine, indeed they live in an infinite number of Turing machines that (among other things) output bit

... (read more)
3Paul Christiano
I think the relevant number is just "log_2 of the number of predictions that the manipulators want to influence." It seems tricky to think about this (rather small) number as the difference between two (giant) numbers. They are just looking at the earth-like Turing machine, looking for the inductors whose predictions are important,  and then trying to copy those input sequences. This seems mostly unrelated to the complexity of adding states to the Turing machine so that it reads data from a particular location on a particular hard drive. It just rests on them being able to look at the simulation and figure out what's going on. On the other hand, the complexity of adding states to the Turing machine so that it reads data from a particular location on a particular hard drive seems very closely related to the complexity of adding states to the Turing machine so that it outputs data encoded by the sophisticated civilization in the format that they thought was easiest for the Turing machine to output. Do you have some candidate "directly programmed anthropic update" in mind? (That said, my original claim was just about the universal prior, not about a modified version with an anthropic update) I still feel like the quantitative question we're discussing is a blow-out and it's not clear to me where we are diverging on that. My main uncertainty about the broader question is about whether any sophisticated civilizations are motivated to do this kind of thing (which may depend on the nature of the inductor and how much reasoning they have time to do, since that determines whether the inductor's prediction is connected in the decision-theoretically relevant way with the civilization's decisions or commitments).

Okay, now suppose they want the first N bits of the output of their Turing machine to obey predicate P, and they assign that a value of 100, and they assign a value of 0 to any N-bit string that does not obey predicate P. And they don't value anything else. If some actions have a higher value than other actions, what information about the output tape dynamics are they using, and how did they acquire it?

2Paul Christiano
They are using their highest probability guess about the output channel, which will be higher probability than the output channel exactly matching some camera on old earth (but may still be very low probability). I still don't understand the relevance. I'm probably going to give up soon, but there was one hint about a possible miscommunication: They don't care about "their" Turing machine, indeed they live in an infinite number of Turing machines that (among other things) output bits in different ways. They just care about the probability of the bitstring x+1 under the universal prior---they want to make the mass of x+1 larger than the mass of x+0. So they will behave in a way that causes some of the Turing machines containing them to output x+1. And then the question is whether the total mass of Turing machines (i.e. probability of noise strings fed into the UTM) that they are able to get to output x+1 is larger or smaller than the mass of Turing machines that output x for the "intended" reason.

Just look at the prior--for any set of instructions for the work tape heads of the Turing machine, flipping the "write-1" instructions of the output tape with the "write-0" instructions gives an equally probable Turing machine.

Suppose they know the sequence that actually gets fed to the camera.

If you're saying that they know their Turing machine has output x so far, then I 100% agree. What about in the case where they don't know?

2Paul Christiano
I don't think I understand what you mean. Their goal is to increase the probability of the sequence x+1, so that someone who has observed the sequence x will predict 1. What do you mean when you say "What about in the case where they don't know"? I agree that under your prior, someone has no way to increase e.g. the fraction of sequences in the universal prior that start with 1 (or the fraction of 1s in a typical sequence under the universal prior, or any other property that is antisymmetric under exchange of 0 and 1).
1michaelcohen
Just look at the prior--for any set of instructions for the work tape heads of the Turing machine, flipping the "write-1" instructions of the output tape with the "write-0" instructions gives an equally probable Turing machine.

If I flip a coin to randomize between two policies, I don't see how that mixed policy could produce more value for me than the base policies.

(ETA: the logical implications about the fact of my randomization don't have any weird anti-adversarial effects here).

3Paul Christiano
Someone in the basement universe is reasoning about the output of a randomized Turing machine that I'm running on. I care about what they believe about that Turing machine. Namely, I want them to believe that most of the time when the sequence x appears, it is followed by a 1. Their beliefs depend in a linear way on my probabilities of action. (At least if e.g. I committed to that policy at an early enough time for them to reason about it, or if my policy is sufficiently predictable to be correlated with their predictions, or if they are able to actually simulate me in a universe with reflective oracles... If I'm not able to influence their beliefs about me, then of course I can't influence their beliefs about anything and the whole manipulative project doesn't get off the ground.) But my utility is a non-linear function of their beliefs, since P(1|x) is a non-linear function of their beliefs. So my utility is a non-linear function of my policy.

If these consequentialists ascribed a value of 100 to the next output bit being 1, and a value of 0 to the next output bit being 0, and they valued nothing else, would you agree that all actions available to them have identical expected value under the distribution over Turing machines that I have described?

2Paul Christiano
I don't agree, but I may still misunderstand something. Stepping back to the beginning: Suppose they know the sequence that actually gets fed to the camera. It is x = 010...011. They want to make the next bit 1. That is, they want to maximize the probability of the sequence (x+1) = 010...0111. They have developed a plan for controlling an output channel to get it to output (x+1). For concreteness imagine that they did this by somehow encoding x+1 in a sequence of ultra high-energy photons sent in a particular direction. Maybe they encode 1 as a photon with frequency A and a 0 as a photon with frequency B. There is no way this plan results in the next bit being 0. If they are wrong about how the output channel encodes photons (i.e. it decodes A as 0 and B as 1) then that channel isn't going to end up with any probability. You don't try to encode 010...0111 and then accidentally end up encoding 010...0110. You end up encoding something like 101...1000, or something totally different.

With randomization, you reduce the cost and the upside in concert. If a pair of shoes costs $100, and that's more than I'm willing to pay, I could buy the shoes with probability 1%, and it will only cost me $1 in expectation, but I will only get the shoes with probability 1/100.

3Paul Christiano
I agree that randomization reduces the "upside" in the sense of "reducing our weight in the universal prior." But utility is not linear in that weight. I'm saying that the consequentialists completely dominate the universal prior, and they will still completely dominate if you reduce their weight by 2x. So either way they get all the influence. (Quantitatively, suppose the consequentialists currently have probability 1000 times greater than the intended model. Then they have 99.9% of the posterior. If they decreased their probability of acting by two, then they'd have 500 times the probability of the intended model, and so have 99.8% of the posterior. This is almost as good as 99.9%.) That could fail e.g. if there are a bunch of other consequentialists also trying to control the sequence. Or if some other model beyond the intended one has much higher probability. But if you think that the consequentialists are X bits simpler than the intended model, and you are trying to argue that the intended model dominates the posterior, then you need to argue that the consequentialists wouldn't try to grab the universal prior even when doing so only requires acting in 2^-X of worlds.
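
Spelling out the arithmetic in the parenthetical (the 1000:1 and 500:1 ratios are the hypothetical numbers above):

```python
# Posterior mass of the consequentialist models vs. the intended model,
# as a function of their prior-weight ratio (hypothetical numbers from the comment).
def posterior_share(weight_ratio: float) -> float:
    return weight_ratio / (weight_ratio + 1)

print(f"{posterior_share(1000):.4f}")  # 0.9990 -- 1000x the intended model's weight
print(f"{posterior_share(500):.4f}")   # 0.9980 -- after halving their probability of acting
```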

It's definitely not too weird a possibility for me. I'm trying to reason backwards here--the best strategy available to them can't be effective in expectation at achieving whatever their goals are with the output tape, because of information-theoretic impossibilities, and therefore, any given strategy will be that bad or worse, including randomization.

2Paul Christiano
To express my confusion more precisely: I think that's right (other than the fact that they can win simultaneously for many different output rules, but I'm happy ignoring that for now). But I don't see why it contradicts the story at all. In the story the best case is that we know the true distribution of output rules, and then we do the utility-maximizing thing, and that results in our sequence having way more probability than some random camera on old earth. If you want to talk about the information theory, and ignore the fact that we can do multiple things, then we control the single output channel with maximal probability, while the camera is just some random output channel (presumably with some much smaller probability). The information theory isn't very helpful, because actually all of the action is about which output channels are controllable. If you restrict to some subset of "controllable" channels, and believe that any output rule that outputs the camera is controllable, then the conclusion still holds. So the only way it fails is when the camera is higher probability than the best controllable output channels.
2Paul Christiano
I currently don't understand the information-theoretic argument at all (and feels like it must come down to some kind of miscommunication), so it seems easiest to talk about how the impossibility argument applies to the situation being discussed. If we want to instead engage on the abstract argument, I think it would be helpful to me to present it as a series of steps that ends up saying "And that's why the consequentialists can't have any influence." I think the key place I get lost is the connection between the math you are saying and a conclusion about the influence that the consequentialists have.

We can get back to some of these points as needed, but I think our main thread is with your other comment, and I'll resist the urge to start a long tangent about the metaphysics of being "simulated" vs. "imagined".

So we end up with some leading hypotheses about the Turing machine we are running on, the history that gave rise to us, and the output rule used by that Turing machine.

I feel like this story has run aground on an impossibility result. If a random variable's value is unknowable (but its distribution is known) and an intelligent agent wants to act on its value, and they randomize their actions, the expected log probability of them acting on the true value cannot exceed the negative entropy of the distribution, no matter their intelligence. (And if they're wrong about... (read more)
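
One way to state that bound precisely (my restatement, via Gibbs' inequality): if the true value is distributed as p and the agent acts on a guess drawn from q, then

```latex
% Gibbs' inequality: the expected log-probability of acting on the true value
% is maximized by q = p and can never exceed -H(p), however intelligent the agent.
\[
  \mathbb{E}_{X \sim p}\!\left[\log q(X)\right]
  \;\le\;
  \mathbb{E}_{X \sim p}\!\left[\log p(X)\right]
  \;=\;
  -H(p).
\]
```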

4Paul Christiano
I didn't quite get this, so let me try restating what I mean. Let's say the states and rules for manipulating the worktapes are totally fixed and known, and we're just uncertain about the rules for outputting something to the output tape. Zero of these correspond to reading off the bits from a camera (or dataset) embedded in the world. Any output rule that lets you read off precisely the bits from the camera is going to involve adding a bunch of new states to the Turing machine. So let's instead consider the space of all ways that you can augment a given physics to produce outputs. This will involve adding a ton of states to the Turing machine. Most of them produce really garbage-y outputs, but a tiny fraction do something intelligent that produces a coherent-looking output. Some fraction of those involve reading off the bits from a particular camera embedded in the world. Let's arbitrarily say it's 2^-1000? Now consider any given intervention that we can perform to try to manipulate the prior. For example, we can perform a high-energy physics experiment that produces an unprecedented interaction, and control a parameter of how that interaction occurs. We can write the bits in the pattern of giant star-sized clumps of matter. Or we can send the relevant bits out to infinity with unprecedentedly high energies. Or we can build a quadrillion cameras throughout the world. Or we can change the nature of the camera so that more of the possible output rules read off its values. Or we can alter the distribution of matter when the universe comes apart, so that a Turing machine can read it off from that. Or whatever. It seems to me that "random camera on old Earth" is probably less likely to be output by the physics-extension than some of these other ways of encoding data. For example, maybe 2^-600 of all output rules end up reading off data from the highest-energy events in the universe, and we can influence that. I think the only thing that really cuts against this i
3Paul Christiano
I'm imagining that the consequentialists care about something, like e.g. human flourishing. They think that they could use their control over the universal prior to achieve more of what they care about, i.e. by achieving a bunch of human flourishing in some other universe where someone thinks about the universal prior. Randomizing is one strategy available to them to do that. So I'm saying that I expect they will do better---i.e. get more influence over the outside world (per unit of cost paid in their world)---than if they had simply randomized. That's because randomizing is one of the strategies available to them and they are trying to pick the best one. (In fact I think they will do many orders of magnitude better than randomizing since they can simultaneously win for many different output methods, and they can ignore the overwhelming majority of output rules which have no chance of describing something interesting about the world). You seem to be saying that they will get less influence than if they randomized. Something about how this behavior is not sensible "goal-oriented behavior," and instead the sensible goal-oriented behavior is something that doesn't get them any influence? In what sense do you think it is sensible goal-oriented behavior, if it doesn't result in getting any influence? Maybe the key difference is that I'm talking about a scenario where the consequentialists have the goal of influencing the universal prior, and that possibility seems so weird to you that you aren't even engaging with it?

A few quick thoughts, and I'll get back to the other stuff later.

To be clear, people I know spent a lot more time than that thinking hard about the consensus algorithm, before coming to the strong conclusion that it was a fruitless path. I agree that this is worth spending >20 hours thinking about.

That's good to know. To clarify, I was only saying that spending 10 hours on the project of applying it to modern ML would not be enough time to deem it a fruitless path. If after 1 hour, you come up with a theoretical reason why it fails on its own terms--i.e... (read more)

4Abram Demski
Just want to note that although it's been a week this is still in my thoughts, and I intend to get around to continuing this conversation... but possibly not for another two weeks.

I felt I had remained quiet about my disagreement with you for too long

Haha that's fine. If you don't voice your objections, I can't respond to them!

I think let's step back for a second, though. Suppose you were in the epistemic position "yes, this works in theory, with the realizability assumption, with no computational slowdown over MAP, but having spent 2-10 hours trying to figure out how to distill a neural network's epistemic uncertainty/submodel-mismatch, and having come up blank..." what's the conclusion here? I don't think it's "my main guess is th... (read more)

3Abram Demski
A couple of separate points:

* My main worry continues to be the way bad actors have control over an io channel, rather than the slowdown issue.
* I feel like there's something a bit wrong with the 'theory/practice' framing at the moment. My position is that certain theoretical concerns (eg, embeddedness) have a tendency to translate to practical concerns (eg, approximating AIXI misses some important aspects of intelligence). Solving those 'in theory' may or may not translate to solving the practical issues 'in practice'. Some forms of in-theory solution, like setting the computer outside of the universe, are particularly unrelated to solving the practical problems. Your particular in-theory solution to embeddedness strikes me as this kind. I would contest whether it's even an in-theory solution to embeddedness problems; after all, are you theoretically saying that the computer running the imitation learning has no causal influence over the human being imitated? (This relates to my questions about whether the learner specifically requests demonstrations, vs just requiring the human to do demonstrations forever.) I don't really think of something like that as a "theoretical solution" to the realizability problem at all. That's reserved for something like logical induction which has unrealistically high computational complexity, but does avoid a realizability assumption.

Yeah, this is a fair point. To be clear, people I know spent a lot more time than that thinking hard about the consensus algorithm, before coming to the strong conclusion that it was a fruitless path. I agree that this is worth spending >20 hours thinking about. I just perceive it to have hit diminishing returns. (This doesn't mean no one should ever think about it again, but it does seem worth communicating why the direction hasn't borne fruit, at least to the extent that that line of research is happy being public.)

Sounds right to me.

Ah, ok! Basically this is a new way of thinking about it f

To state the least of our problems first: this requires a 100x slowdown in comparison with the state-of-the-art deep learning (or whatever) we're layering the consensus algorithm on top of

I think you’re imagining deep learning as a MAP-type approach—it just identifies a best hypothesis and does inference with that. Comparing the consensus algorithm with (pure, idealized) MAP, 1) it is no slower, and 2) the various corners that can be cut for MAP can be cut for the consensus algorithm too. Starting with 1), the bulk of the work for either the consensus algo... (read more)
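
For concreteness, here is a minimal sketch of the consensus-at-inference step, in the N-model ensemble approximation discussed in the replies (the agreement test and the query fallback are schematic assumptions of mine, not the exact construction from any paper):

```python
# Schematic consensus predictor over an ensemble of trained models: act on a
# prediction only when the hypotheses agree; otherwise defer/query.

def consensus_predict(models, x, tolerance=0.05):
    """Return a prediction if all hypotheses agree to within `tolerance`, else None
    (None meaning: abstain and query the demonstrator / trusted source)."""
    predictions = [m(x) for m in models]
    if max(predictions) - min(predictions) <= tolerance:
        return sum(predictions) / len(predictions)
    return None  # no consensus: at least one hypothesis (possibly a malign one) disagrees

# Usage sketch: a single dissenting model can force a query, but cannot silently
# steer the output while the benign hypotheses are still in the ensemble.
models = [lambda x: 0.70, lambda x: 0.72, lambda x: 0.71]
print(consensus_predict(models, x=None))                     # ~0.71 (agreement)
print(consensus_predict(models + [lambda x: 0.1], x=None))   # None (disagreement -> query)
```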

4Abram Demski
Thanks for the extensive reply, and sorry for not getting around to it as quickly as I replied to some other things! I am sorry for the critical framing, in that it would have been more awesome to get a thought-dumb of ideas for research directions from you, rather than a detailed defense of your existing work. But of course existing work must be judged, and I felt I had remained quiet about my disagreement with you for too long. It's a fair point that it's no slower than idealized MAP. But the most important corner cut by deep learning is that deep learning represents just one hypothesis at a time, searching the space by following a gradient rather than by explicitly comparing options. The question is, how can we cut the same corner for the consensus algorithm, which needs to compare the outputs of many hypotheses? In some settings, this is possible: for sufficiently simple hypothesis spaces, we can check consensus without explicitly computing a bunch of hypotheses. However, for deep learning, it seems rather difficult. So, it seems like the best we can expect to do for deep learning is to train and run 100 hypotheses (or whatever number). This is a huge approximation in terms of MAP (since we have no guarantees that we are finding the 100 most probable, or anything), but we can naturally re-frame the consensus-alg guarantee in terms of frequency-of-malign-results for the NN training (rather than an assumption about at least 1 of the 100 most probable hypotheses being non-malign).  But this still means that, for a consensus of N hypotheses, the consensus algorithm will be N times slower (in terms of both training time and inference time). I expect N to be quite large, for reasons similar to what I said in the post: not only do we have to think N is large enough that at least one of the hypotheses is benign, but also we have to think that the benign hypothesis is at least as capable as any of the malign hypotheses (because otherwise it could get unlucky and be

I think there would still be an inner alignment problem even if deceptive models were in fact always more complicated than non-deceptive models—i.e. if the universal prior wasn't malign—which is just that the neural net prior (or whatever other ML prior we use) might be malign even if the universal prior isn't (and in fact I'm not sure that there's even that much of a connection between the malignity of those two priors)

Agree.

There will be plenty of functions  that have fewer bits in their encoding than the real function  used by the demonstrator.

I don't think this is a problem. There will be plenty of them, but when they're wrong they'll get removed from the posterior.

1Koen Holtman
I have seen you mention a number of times in this comment thread that 'this is not a problem because eventually the bad/wrong policies will disappear from the top set'. You have not qualified this statement with 'but we need a very low α like α < 1/|Π| to make this work in a safe way', so I remain somewhat uncertain about what your views are about how low α needs to go. In any case, I'll now try to convince you that if α > 1/|Π|, your statement that 'when they're wrong they'll get removed from the posterior' will not always mean what you might want it to mean. Is the demonstrator policy π^d to get themselves killed? The interesting thing in developing these counterexamples is that they often show that the provable math in the paper gives you less safety than you would have hoped for. Say that π^p ∈ Π is the policy of producing paperclips in the manner demonstrated by the human demonstrator. Now, take my construction in the counterexample where α > 1/|Π| and where at time step t, we have the likely case that π^p ∉ Π^α_{h<t}. In the world I constructed for the counterexample, the remaining top policies Π^α_{h<t} now perform a synchronized treacherous turn where they kill the demonstrator. In time step t+1 and later, the policies Π^α_{h<t+1} diverge a lot in what actions they will take, so the agent queries the demonstrator, who is now dead. The query will return the null action. This eventually removes all 'wrong' policies from Π^α_{h<t+1+i}, where 'wrong' means that they do not take the null action at all future time steps. The silver lining is perhaps that at least the agent will eventually stop, perform null actions only, after it has killed the demonstrator. Now, the paper proves that the behavior of the agent policy π^i_α will approximate that of the true demonstrator policy π^d closer and closer when time progresses. We therefore have to conclude that in the counterexample world, the true demonstrator policy π^d had nothing to do with producing paperclips, this was a wrong guess all along. T

A policy outputs a distribution over {0, 1} × A, and equations 3 and 4 define what this distribution is for the imitator. If it outputs (0, a), that means the first element is 0 and the action is a, and if it outputs (1, a), that means the first element is 1 and the action is a. When I say

The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator,

that's just describing the difference between equations 3 and 4. Look at equation 4 to see that when the first element is 1 (i.e. the imitator defers to the demonstrator), the distribution over the action is equal to that of the demonstrator... (read more)

1Koen Holtman
I agree with your description above about how it all works. But I guess I was not explaining well enough why I got confused and why the edits of inserting the a and the bold text above would have stopped me getting confused. So I'll try again. I read the sentence fragment below equation (3) as an explanatory claim that the value defined in equation (3) defines the probability that the imitator is picking the action itself instead of deferring to the demonstrator, the probability given the history h_{<t}. However, this is not the value being defined by equation (3), instead it defines the probability the imitator is picking the action itself instead of deferring to the demonstrator when the history is h_{<t} and the next action taken is a. The actual probability of the imitator picking the action itself under h_{<t} is given by ∑_{a∈A} π^i_α(0, a | h_{<t}), which is only mentioned in passing in the lines between equations (3) and (4). So when I was reading the later sections in the paper and I wanted to look back at what the probability was that the imitator would pick the action, my eye landed on equation (3) and the sentence below it. When I read that sentence, it stopped me from looking further to find the equation ∑_{a∈A} π^i_α(0, a | h_{<t}), which is the equation I was really looking for. Instead my mind auto-completed equation (3) by adding an avg_a term to it, which makes for a much more conservative querying policy than the one you defined, and this then got me into wondering how you were dealing with learning nondeterministic policies, if at all, etc. So overall I think you can improve readability by doing some edits to draw attention more strongly to the conditional nature of a, and foregrounding the definition of θ_q more clearly as a single-line equation.
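
A small numerical sketch of the point being made (the numbers and the dictionary representation are illustrative, not from the paper): an entry like π^i_α(0, a | h_{<t}) is a joint probability for a particular action a, and the overall probability of acting (or deferring) comes from summing over actions.

```python
# Illustrative sketch (numbers hypothetical): the imitator's policy is a distribution
# over pairs (flag, action), where flag 0 = "act myself" and flag 1 = "defer/query".
policy = {
    (0, "left"): 0.30, (0, "right"): 0.20,   # equation-(3)-style entries, one per action
    (1, "left"): 0.35, (1, "right"): 0.15,   # equation-(4)-style entries, one per action
}

actions = ["left", "right"]
p_act_self = sum(policy[(0, a)] for a in actions)   # 0.50: probability of acting itself
p_query    = sum(policy[(1, a)] for a in actions)   # 0.50: probability of querying

# A single entry like policy[(0, "left")] is the joint probability of (act-myself AND "left"),
# not the overall probability of acting itself -- the distinction being flagged above.
print(p_act_self, p_query)
```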

What's the distinction between training and deployment when the model can always query for more data?

2Vanessa Kosoy
We're doing meta-learning. During training, the network is not learning about the real world, it's learning how to be a safe predictor. It's interacting with a synthetic environment, so a misprediction doesn't have any catastrophic effects: it only teaches the algorithm that this version of the predictor is unsafe. In other words, the malign subagents have no way to attack during training because they can access little information about what the real universe is like. The training process is designed to select predictors that only make predictions when they can be confident, and the training performance allows us to verify this goal has truly been achieved.

It may be helpful to point to specific sections of such a long paper.

(Also, I agree that a neural network trained with that reward could produce a deceptive model that makes a well-timed error.)

it certainly should have to consider different models of what the next transition will be like.

Yeah I was agreeing with that.

even though the Q learner is trained using its own point estimate of the next state, it isn't, at inference time, given access to that point estimate.

Right, but one thing the Q-network, in its forward pass, is trying to reproduce is the point estimate of the Q-value of the next state (since it doesn't have access to it). What it isn't trying to reproduce, because it isn't trained that way, is multiple models of what the Q-value might be at a given possible next state.

I interpreted this bit as talking about RL

I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.

But taking us back out of RL, in a wide neural network with selective attention that enables many qualitatively different forward passes, gradient descent seems to be training the way different models get proposed (i.e. the way attention is allocated), since this happens in a single forward pass, and what we're left with is a modeling routine th... (read more)

2Evan Hubinger
This seems wrong to me—even though the Q learner is trained using its own point estimate of the next state, it isn't, at inference time, given access to that point estimate. The Q learner has to choose its Q values before it knows anything about what the Q value estimates will be of future states, which means it certainly should have to consider different models of what the next transition will be like.

So, if I understand the paper right, if  was still in the set of top policies at time , the agent would not take the action to kill the demonstrator, but ask the demonstrator to take an action instead, which avoids the bad outcome, and might also demote all of the treacherous turn policies out of the top set.

This is exactly right.

But I am also reading in the abstract of the paper that

[...] queries to the demonstrator rapidly diminish in frequency

The reason this is consistent is that queries will become infrequent, but they will still be w... (read more)

1Koen Holtman
[long comment, my bigger agenda here is to get to a state where discussions on this forum start using much more math and subcase analysis when they talk about analysing and solving inner alignment problems.]

OK, I think you are kind of saved here, in the case of my counter-example, because of the special nature of your equation (2) in the paper that computes the set of policies. To make this more specific: Say I refine the counter-example by adding the prior w(π_i) = 1/|Π|, so each candidate is considered equally likely from the start. Say that |Π| = 10^10. The key thing to note in my example is that, before the time-synchronized treacherous turn happens for all policies π_i ≠ π^d, all policies π_i are equally compatible with events so far, so the posterior w(π_i | h_{<t}) is the same for all of them. When I use α = 10^-4 with these all-the-same-posterior policies, then (if I am reading equation (2) correctly) the set Π^α_{h<t} of top policies will have size N = 10^4. Ties are broken arbitrarily, so the real π^d is unlikely to be among these top N = 10^4 << 10^10 candidates. In that case, the agent will perform the treacherous turn and kill the operator. Correct me if I am wrong here.

Case of α < 1/|Π|: much better than I thought! However, and this is where things get interesting, if we use α = 10^-11, so α < 1/|Π|, then equation (2) will compute Π^α_{h<t} = Π, the top set of policies is all policies. The correct policy is therefore still in the top set, and we get no treacherous turn. So equation (2) is doing a lot of heavy lifting in making the approach work! This also seems to be a major difference with the β-based approach in your earlier paper. Not sure if you made this change specifically to deal with examples like my counter-example above, or if you had a more general reason -- the paper does not say. Pretty interesting! So I would now agree with your statement that your approach provably solves the inner alignment problem, at least for α < 1/|Π| and for certain values of 'the inner alignm

That's possible. But it seems like way less of a convergent instrumental goal for agents living in simulated world-models. Both options--our world optimized by us and our world optimized by a random deceptive model--probably contain very little of value as judged by agents in another random deceptive model.

So yeah, I would say some models would think like this, but I would expect the total weight on models that do to be much lower.

This is very nice and short!

And to state what you left implicit:

If , then in the setting with no malign hypotheses (which you assume to be safe), 0 is definitely the output, since the malign models can only shift the outcome by , so we assume it is safe to output 0. And likewise with outputting 1.

I'm pretty sure removing those is mostly just a technical complication

One general worry I have about assuming that the deterministic case extends easily to the stochastic case is that a sequence of probabilities that tends t... (read more)

I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.

Okay I think we've switched from talking about Q-learning to talking about policy gradient. (Or we were talking about the latter the whole time, and I didn't notice it). The question that I think is relevant is: how are possible world-models being hypothesized and analyzed? That's something I expect to be done with messy heuristics that sometimes have discontinuities in their sequence of o... (read more)

3Evan Hubinger
Hmmm... I don't think I was ever even meaning to talk specifically about RL, but regardless I don't expect nearly as large of a difference between Q-learning and policy gradient algorithms. If we imagine both types of algorithms making use of the same size massive neural network, the only real difference is how the output of that neural network is interpreted, either directly as a policy, or as Q values that are turned into a policy via something like softmax. In both cases, the neural network is capable of implementing any arbitrary policy and should be getting a similar sort of feedback signal from the training process—especially if you're using a policy gradient algorithm that involves something like advantage estimation rather than actual rollouts, since the update rule in that situation is going to look very similar to the Q learning update rule. I do expect some minor differences in the sorts of models you end up with, such as Q learning being more prone to non-myopic behavior across episodes, and I think there are some minor reasons that policy gradient algorithms are favored in real-world settings, since they get to learn their exploration policy rather than having it hard-coded and can handle continuous action domains—but overall I think these sorts of differences are pretty minor and shouldn't affect whether these approaches can reach general intelligence or not.

If you are willing to query only a billion points, and there are a billion competent treacherous models, and if the game is zero-sum between you and the models, then you're going to get at least as screwed as by the billionth-best opportunity for treachery.

The treacherous models make their decisions first, and whether we query at a timestep depends on whether they chose that moment to be treacherous. We don't have to say: this timestep is one of the billion best for treachery--guess we had better query. The decision process a treacherous model faces is:

  1. Fro
... (read more)

Does it make sense to talk about , which is like  in being treacherous, but uses the true model  instead of the honest model ? I guess you would expect  to have a lower posterior than ?

If we ever produce generally intelligent predictors (or "accurate world-models" in the terminology we've been using so far), we will need a process that is much more efficient than evolution.

But also, I certainly don't think that in order to be generally intelligent you need to start with a generally intelligent subroutine. Then you could never get off the ground. I expect good hypothesis-generation / model-proposal to use a mess of learned heuristics which would not be easily directed to solve arbitrary tasks, and I expect the heuristic "look for models n... (read more)

It's hard for me to imagine that an agent that finds an "easiest-to-find model" and then calls it a day could ever do human-level science.

I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.

if local search is this bad, I don't think it is a viable path to AGI

We know that local search processes can produce AGI, so viability is a question of efficiency—and we know that SGD is at least efficient enough to solve a wide variety of prob... (read more)

Specifically, they cooperate in that they perfectly mimic the true model up until the point where...

This thread began by considering deceptive models cooperating with each other in the sense of separating the timing of their treacherous turns in order to be maximally annoying to us. So maybe our discussion on that topic is resolved, and we can move on to this scenario.

up until the point where the deceptive models make up enough of the posterior that the true model is no longer being consulted

if alpha is low enough, this won't ever happen, and if alpha is t... (read more)

So would you say you disagree with the claim

I think that arbitrary limits on heuristic search of the form "the next model I consider must be fairly close to the last one I did" will not help it very much if it's anywhere near smart enough to merit membership in a generally intelligent predictor.

?

3Evan Hubinger
Yeah; I think I would say I disagree with that. Notably, evolution is not a generally intelligent predictor, but is still capable of producing generally intelligent predictors. I expect the same to be true of processes like SGD.

Well, just like we can write down the defectors, we can also write down the cooperators

If it's only the case that we can write them down, but they're not likely to arise naturally as simple consequentialists taking over simple physics, then that extra description length will be seriously costly to them, and we won't need to worry about any role they might play in p(treacherous)/p(truth). Meanwhile, when I was saying we could write down some defectors, I wasn't making a simultaneous claim about their relative prior weight, only that their existence would sp... (read more)

4Evan Hubinger
Here's the setup I'm imagining, but perhaps I'm still misunderstanding something. Suppose you have a bunch of deceptive models that choose to cooperate and have larger weight in the prior than the true model (I think you believe this is very unlikely, though I'm more ambivalent). Specifically, they cooperate in that they perfectly mimic the true model up until the point where the deceptive models make up enough of the posterior that the true model is no longer being consulted, at which point they all simultaneously defect. This allows for arbitrarily bad worst-case behavior.