Running Lightcone Infrastructure, which runs LessWrong. You can reach me at habryka@lesswrong.com. I have signed no contracts or agreements whose existence I cannot mention.
I agree with this in principle, but contrasting "actual" with "plausible", combined with the fact that it talked about this in the context of not having internet access, makes me feel reasonably confident this is pointed at "not an actual link", but I agree that it's not an ironclad case.
I agree that o1 might not be able to tell whether the link is fake, but the chain of thought does say explicitly:
So, the assistant should [...] provide actual or plausible links.
The "plausible" here suggests that at least in its CoT, it has realized that the task would have probably been considered completed accurately in training as long as the links are plausible, even if they are not actual links.
Isn't the basic idea of Constitutional AI just having the AI provide its own training feedback using written instructions? My guess is there was a substantial amount of self-evaluation in the o1 training with complicated written instructions, probably kind of similar to a constitution (though this is just a guess).
with respect to grabbing hold of the whole future: you can try looking at historical cases of people trying to grab hold of the future and seeing how that went, it's a mixed bag with mostly negative reputation, indicating there are downsides as well as upsides, it's not a "safe" conservative view. see also Against Responsibility. I feel like there's a risk of getting Pascal's mugged about "maybe grabbing hold of the future is good, you can't rule it out, so do it", there are downsides to spending effort that way.
I agree with a track-record argument of this, but I think the track record of people trying to broadly ensure that humanity continues to be in control of the future (while explicitly not optimizing for putting themselves personally in charge) seems pretty good to me.
Generally a lot of industrialist and human-empowerment stuff has seemed pretty good to me on track record, and I really feel like all the bad parts of this are screened off by the "try to put yourself and/or your friends in charge" component.
While I believe Scott Garrabrant and/or Abram Demski have discussed such duality, I haven't found a relevant post on the Alignment Forum about this, so I'll present the basic idea in this post.
There is a post on this. It's one of my favorite posts: https://www.lesswrong.com/posts/oheKfWA7SsvpK7SGp/probability-is-real-and-value-is-complex
Commentary by Zvi in one of his AI posts, copied over since it seems nice to have it available for people reading this post:
Sam Bowman of Anthropic asks what is on The Checklist we would need to do to succeed at AI safety if we can create transformative AI (TAI).
Sam Bowman literally outlines the exact plan Eliezer Yudkowsky constantly warns not to use, and which the Underpants Gnomes know well.
- Preparation (You are Here)
- Making the AI Systems Do Our Homework (?????)
- Life after TAI (Profit)
His tasks for chapter 1 start off with ‘not missing the boat on capabilities.’ Then, he says, we must solve near-term alignment of early TAI, render it ‘reliably harmless,’ so we can use it. I am not even convinced that ‘harmless’ intelligence is a thing if you want to be able to use it for anything that requires the intelligence, but here he says the plan is safeguards that would work even if the AIs tried to cause harm. Ok, sure, but obviously that won’t work if they are sufficiently capable and you want to actually use them properly.
I do love what he calls ‘the LeCun test,’ which is to design sufficiently robust safety policies (a Safety and Security Protocol, what Anthropic calls an RSP) that if someone who thinks AGI safety concerns are bullshit is put in charge of that policy at another lab, that would still protect us, at minimum by failing in a highly visible way before it doomed us.
The plan then involves solving interpretability and implementing sufficient cybersecurity, and proper legible evaluations for higher capability levels (what they call ASL-4 and ASL-5), that can also be used by third parties. And doing general good things like improving societal resilience and building adaptive infrastructure and creating well-calibrated forecasts and smoking gun demos of emerging risks. All that certainly helps, I’m not sure it counts as a ‘checklist’ per se. Importantly, the list includes ‘preparing to pause or de-deploy.’
He opens part 2 of the plan (‘chapter 2’) by saying lots of the things in part 1 will still not be complete. Okie dokie. There is more talk of concern about AI welfare, which I continue to be confused about, and a welcome emphasis on true cybersecurity, but beyond that this is simply more ways to say ‘properly and carefully do the safety work.’ What I do not see here is an actual plan for how to do that, or why this checklist would be sufficient?
Then part 3 is basically ‘profit,’ and boils down to making good decisions to the extent the government or AIs are not dictating your decisions. He notes that the most important decisions are likely already made once TAI arrives – if you are still in any position to steer outcomes, that is a sign you did a great job earlier. Or perhaps you did such a great job that step 3 can indeed be ‘profit.’
The worry is that this is essentially saying ‘we do our jobs, solve alignment, it all works out.’ That doesn’t really tell us how to solve alignment, and has the implicit assumption that this is a ‘do your job’ or ‘row the boat’ (or even ‘play like a champion today’) situation. Whereas I see a very different style of problem. You do still have to execute, or you automatically lose. And if we execute on Bowman’s plan, we will be in a vastly better position than if we do not do that. But there is no script.
Update: After seeing a comment by AdamK on Manifold, I dug into the code and can confirm that the way the codebase queries for articles does at least check for meta tags that indicate when an article was last updated (which my guess is aren't reliable, but it does seem like they at least tried). I would be highly surprised if their code addresses all of the myriad data-contamination issues (including very tricky ones, like news articles that predicted things accurately getting more traffic after a forecasted event happened and therefore coming up higher in search results, even if they were written before the resolution time). I am currently taking bets that on prospective forecasts this system will perform worse than advertised (and I also separately think that the advertised performance does not meaningfully make this system "superhuman").
How did you handle issues of data contamination?
In your technical report you say you validated performance for this AI system using retrodiction:
Performance. To evaluate the performance of the model, we perform retrodiction, pioneered in Zou et al. [3]. That is to say, we take questions about past events that resolve after the model’s pretraining data cutoff date. We then compare the accuracy of the crowd with the accuracy of the model, both having access to the same amount of recent information. When we retrieve articles for the forecasting AI, we use the search engine’s date cutoff feature, so as not to leak the answer to the model.
I am quite concerned that search engines are not actually capable of filtering out data about recent events. As an example, I searched for “Israel attack on Iran”, since you mention that as a concrete example in this excerpt of the blog post:
Concretely, we asked the bot whether Israel would carry out an attack on Iran before May 1, 2024.
The first result of searching for "Israel attack on Iran", if you set the date cutoff to October 1st 2023, is this:
As you can see, Google claims a publishing date of "Aug 11, 2022". However, when you click into this article, you will quickly find the following text:
The article actually includes updates from April 19, 2024! This is very common, as many articles get updated after they are published.
The technical report just says:
When we retrieve articles for the forecasting AI, we use the search engine’s date cutoff feature, so as not to leak the answer to the model.
But at least for Google this fails, unless you are using an unknown functionality for Google.
Looking into the source code, it appears that the first priority source you check is some meta tags:
date_meta_tags = ['article:published_time', 'datePublished', 'date', 'pubdate', 'og:published_time', 'publishdate' ]
However, for the article I just linked, those meta tags do indeed say the article was published in 2022:
<meta property="article:published_time" content="2022-08-11T11:15:47-0300">
This means this article, as far as I can tell from parsing the source code, would have its full text end up in the search results, even though it’s been updated in 2024 and includes the events that are supposed to be forecasted (it might be filtered out by something else, but I can’t seem to find any handling of modified articles).
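To make the failure mode concrete, here is a minimal sketch of the kind of check I would expect to see somewhere in the pipeline. To be clear, this is not the project's code: the function and class names are made up, the list of "modified" tags is my assumption about what pages commonly expose, and the modified timestamp in the example is hypothetical (I only know that the article text mentions updates from April 19, 2024). Pages that never report a modification time would still slip through, so this is at best a partial fix.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser

# Published-date tags, as quoted above from the project's source.
PUBLISHED_TAGS = {"article:published_time", "datePublished", "date",
                  "pubdate", "og:published_time", "publishdate"}
# Assumption: commonly used "last modified" tags; not taken from the project.
MODIFIED_TAGS = {"article:modified_time", "og:updated_time", "dateModified"}


def parse_date(value):
    """Parse timestamps like 2022-08-11T11:15:47-0300; return None on failure."""
    try:
        return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")
    except ValueError:
        return None


class MetaDateParser(HTMLParser):
    """Collects published and modified dates from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.published = []
        self.modified = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        key = a.get("property") or a.get("name") or a.get("itemprop")
        parsed = parse_date(a.get("content") or "")
        if parsed is None:
            return
        if key in PUBLISHED_TAGS:
            self.published.append(parsed)
        elif key in MODIFIED_TAGS:
            self.modified.append(parsed)


def is_safe_for_cutoff(html, cutoff):
    """True only if a published date exists and no reported date is after cutoff."""
    parser = MetaDateParser()
    parser.feed(html)
    if not parser.published:
        return False  # no usable date at all: be conservative and drop the page
    return max(parser.published + parser.modified) <= cutoff


cutoff = datetime(2023, 10, 1, tzinfo=timezone.utc)
published_only = '<meta property="article:published_time" content="2022-08-11T11:15:47-0300">'
# Hypothetical modified tag, for illustration only.
with_update = published_only + '<meta property="article:modified_time" content="2024-04-19T09:00:00-0300">'

print(is_safe_for_cutoff(published_only, cutoff))  # True: looks like a 2022 article
print(is_safe_for_cutoff(with_update, cutoff))     # False: updated after the cutoff
```

The first case is exactly the problem: with only the published-time tag to go on, an article updated in 2024 is indistinguishable from one that was never touched after 2022.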
Generally, data contamination is a huge issue for retrodiction, so I am assuming you have done something good here, otherwise it seems very likely your results are inflated because of those data contamination issues, and we should basically dismiss the results of your technical report.
To be clear, I did not do any cherry-picking of data here. The very first search query on any topic that I tried was the search I document above.
You seem to be making a huge number of assumptions of what "scheming" means, so I am now mostly assuming you are talking about something else than what I am talking about (which to be clear, is very common on this topic and I don't think is your fault), but I'll still try to clarify.
As I wrote in my comment to Ryan, I don't think AIs currently acting with reckless disregard for the truth for the pursuit of short-term goals, and AIs in the future acting with reckless disregard for the truth for the pursuit of long-term goals, really have that many structural differences.
In particular, none of this resonates with me:
There is a very significant difference between how systematic Scheming is compared to Role Playing. The former reliably gets humans totally disempowered/killed if it is allowed to play out. The latter causes pretty much no harm whatsoever if it is allowed to play out. Whatever you think about the ultimate ease of distinguishing these, you must admit an extremely large difference in the degree to which the underlying phenomenon is in fact systematic.
No, my whole point is the difference is really messy, and if I have an AI "role-playing" as a superhuman genius who is trying to take over the world, why would the latter cause no harm whatsoever? It would go and take over the world as part of its "roleplay", if it can pull it off (and at least at the present, a huge component of how we are making AI systems more goal-directed is by changing their role-playing targets to be more agentic and long-term oriented, which is mostly how I would describe what we are currently doing with RLHF, though there are also some other things).
I really have very little understanding of what is at present going on inside of cutting-edge AI systems, and the same seems true for anyone else. Because of the whole "our AI systems are predominantly trained on single-forward passes" thing, you are interfacing with a bunch of very confusing interlocking levels of optimization when you are trying to assess "why" an AI system is doing something.
My current best guess is the earliest dangerous AI systems will be dangerous depending on confusing and complicated context cues. I.e. sometimes when you run them they will be pretty agentic and try to pursue long-term objectives, and sometimes they won't. I think commercial incentives will push towards increasing the presence of those context cues, or shifting the optimization process more directly from pure single-forward-pass pre-training to long-run objectives which will then more fundamentally reshape how the AI thinks, but unless we change the "massive pre-training paradigm" I expect you will be able to coax the AI system into non-dangerous (and also non-agentic) behavior patterns for the foreseeable future.
Overall, this leaves me with a prediction where there is no hard line between "role-playing" and "scheming". My mainline expectation is that we will get AI systems to "role-play" harder and harder as long-term agentic optimizers. All throughout they will basically not really care about keeping the humans they are working with well informed (because we have no good pointer towards deep-honesty as a training objective, especially not with remotely comparable amounts of data as we have for other AI objectives). We will struggle against this continuously. This will often involve AIs being sycophantic and deceptive in order to gain more power later and get into a better position to pursue any of their goals.
This means there is no "warning shot" when AI systems are now "being deceptive but for real". The AI systems were never not deceptive. They were always sycophantic, always wantonly made up facts when it was convenient, and always (at least somewhat) interested in taking actions that increase their power. I think some interesting stuff will happen as AIs reliably gain more complete situational awareness, which will happen as they become smarter and more coherent thinkers, but I don't think that will change the playing field very much.
Yeah, I agree that increased stakes will have an effect here.
I also agree that people shouldn't and probably wouldn't really accept "the AI was just role playing" as an excuse. My argument is that I think we really don't know how to distinguish an AI scheming from an AI role-playing in a more systematic way, even if we tried substantially harder (and one of the reasons why no one is really studying the internals of deceptive AI systems today is that we basically have nothing to say about the internals of AI systems that is actually enlightening).
Like, my vibe from Anthropic's papers is that they generally argue that it doesn't matter whether the AI takes deceptive action for "role-playing" reasons, because ultimately all the AI does is "role-play", and so if AI systems of current architectures are dangerous, they will be dangerous in a way that is at least continuous with role-playing.
Other people in the alignment field think there is a more systematic difference between role-playing a scheming character and scheming for more fundamental reasons, and I do think there are some differences here, but I have no idea how we would identify them, even if we had the activations saved, unless something very substantial about the architecture changes.
I think Nate's "Deep Deceptiveness" points more in the direction I am arguing for. There is no marginal "scheming" threshold where AI systems will become deceptive, and they are "honest" before then. Generally AI systems act mostly with reckless disregard for the truth when talking to you, and will continue to do so unless we really make some substantial progress in AI alignment that we have mostly completely failed to make in the last decade.
It is trivially easy to demonstrate that AI systems can speak with reckless disregard for the truth. In order for a powerful system to not be "scheming", i.e. optimizing for long-term non-human objectives with reckless disregard for the truth[1], something specific has to go right from where we are, and in the absence of that, you will just continue observing systems that keep getting more powerful, and continue to not really care about saying true things.
[1] This is my best short summary of what Joe Carlsmith means by "scheming". Also by "truth" here I mean something more like "informing the user about what is going on". I think similar to how we can sometimes get humans to not literally say false things, we might get some very similar rough deontological guidelines into an AI, but this only provides a weak protection against deception.
I really think if you want to tell a story of AI Control work being good (especially compared to working on preventing AGI from being built in the first place), the important and difficult part is figuring out how to actually use these more powerful AI systems to either achieve some kind of global moratorium, or make unprecedented progress on the hard parts of the AI Alignment problem.
When I see most people start thinking about control, I rarely see them interface with either of these two things, and honestly, I mostly see them come up with cool additional commercial applications of AI, which I think will mostly make the present situation worse. To be clear, I think thinking about how to use AI systems for global coordination or making progress on the hard parts of the AI Alignment problem is tractable, but I haven't seen much work on it, and I am worried it is not the default thing that people will work on when they start thinking about control.
I think it is unlikely for control work to buy humanity much time until someone builds a very powerful unaligned AI system, at least at our present levels of coordination tools. I think controlled AI systems might help with coordination here, but I do think someone needs to do the actual hard work of thinking about how it might.
(I go into this and various related things in my dialogue with Ryan on control)