Oliver Habryka

Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com

(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)

Comments

I think the core argument is "if you want to slow down, or somehow impose restrictions on AI research and deployment, you need some way of defining thresholds. Also, most policymakers' cruxes appear to be that AI will not be a big deal, but if they thought it was going to be a big deal they would totally want to regulate it much more. Therefore, having policy proposals that can use future eval results as a triggering mechanism is politically more feasible, and also epistemically helpful, since it allows people who do think it will be a big deal to establish a track record".

I find these arguments reasonably compelling, FWIW.

Sure, here are some things: 

  • Anthropic should publicly clarify the commitments it made on not pushing the state of the art forward in the early years of the organization.
  • Anthropic should appoint genuinely independent members to the Long Term Benefit Trust, and should ensure the LTBT is active and taking its role of supervision seriously.
  • Anthropic should remove any provisions that allow shareholders to disempower the LTBT.
  • Anthropic should state openly and clearly that the present path to AGI presents an unacceptable existential risk and call for policymakers to stop, delay or hinder the development of AGI.
  • Anthropic should publicly state its opinions on what AGI architectures or training processes it considers more dangerous (like probably long-horizon RL training), and either commit to avoid using those architectures and training processes, or at least very loudly complain that the field at large should not use those architectures.
  • Anthropic should not ask employees or contractors to sign non-disparagement agreements with Anthropic, especially not self-cloaking ones.
  • Anthropic should take a humanist/cosmopolitan stance on risks from AGI in which risks related to different people having different values are very clearly deprioritized compared to risks related to complete human disempowerment or extinction, as worry about the former seems likely to cause much of the latter.
  • Anthropic should do more collaborations like the one you just did with Redwood, where external contractors get access to internal models. I think this is of course infosec-wise hard, but I think you can probably do better than you are doing right now.
  • Anthropic should publicly clarify what the state of its 3:1 equity donation matching program is, which it advertised publicly (and which played a substantial role in many people external to Anthropic supporting it, given that they expected a large fraction of the equity to therefore be committed to charitable purposes). Recent communications suggest any equity matching program at Anthropic does not fit what was advertised.

I can probably think of some more. 

(For what it's worth, it appears to me that people started using the term "scheming" in much more confusing and inconsistent ways after this post was written and tried to give that term a technical meaning. I currently think this was quite bad. I do like a lot of the content of the paper/essay/post. I have like one conversation every two weeks that ends up derailed or confused because the two participants are using "scheming" in different specific ways, assuming the other person has the same meaning in mind)

It's linked from here, which I am assuming is linked from the paper.

How are the triangle numbers not quadratic?

Sure looks quadratic to me.
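
(For reference, a quick sketch using the standard closed form for the n-th triangle number. The symbol T_n is notation introduced here, not taken from the thread. Expanding it gives a degree-2 polynomial in n, and the second difference is constant, which is the usual test for quadratic growth.)

```latex
T_n = \sum_{k=1}^{n} k = \frac{n(n+1)}{2} = \tfrac{1}{2}n^2 + \tfrac{1}{2}n,
\qquad
T_{n+1} - T_n = n + 1,
\qquad
(T_{n+2} - T_{n+1}) - (T_{n+1} - T_n) = 1 .
```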

This essay seems to have lost the plot of where the problems with AI come from. I was historically happy that Conjecture focused on the parts of AI development that are really obviously bad, like having a decent chance of literally killing everyone or permanently disempowering humanity, but instead this seems like it's a random rant against AI-generated art, and name-calling of obviously valuable tools like AI coding assistants.

I am not sure what happened. I hope you find the plot again.

(Edit note: I fixed up some formatting that looked a bit broken or a bit confusing. Mostly replacing some manual empty lines with "*" characters with some of our proper horizontal rule elements, and removing italics from the executive summary, since our font is kind of unreadable if you have whole paragraphs of italicized text. Feel free to revert)

I really think if you want to tell a story of AI Control work being good (especially compared to working on preventing AGI from being built in the first place), the important and difficult part is figuring out how to actually use these more powerful AI systems to either achieve some kind of global moratorium, or make unprecedented progress on the hard parts of the AI Alignment problem. 

When I see most people start thinking about control, I rarely see them interface with either of these two things, and honestly, I mostly see them come up with cool additional commercial applications of AI, which I think will mostly make the present situation worse. To be clear, I think thinking about how to use AI systems for global coordination or making progress on the hard parts of the AI Alignment problem is tractable, but I haven't seen much work on it, and I am worried it is not the default thing that people will work on when they start thinking about control.

I think it is unlikely for control work to buy humanity much time until someone builds a very powerful unaligned AI system, at least at our present levels of coordination tools. I think controlled AI systems might help with coordination here, but I do think someone needs to do the actual hard work of thinking about how it might.

(I go into this and various related things in my dialogue with Ryan on control)

I agree with this in principle, but contrasting "actual" with "plausible", combined with the fact that it talked about this in the context of not having internet access, makes me feel reasonably confident this is pointed at "not an actual link", but I agree that it's not an ironclad case.

I agree that o1 might not be able to tell whether the link is fake, but the chain of thought does say explicitly: 

So, the assistant should [...] provide actual or plausible links.

The "plausible" here suggests that at least in its CoT, it has realized that the task would have probably been considered completed accurately in training as long as the links are plausible, even if they are not actual links.