The extent of the manipulation and sandbagging, in what is ostensibly a GPT-4 derivative, and not GPT-5, is definitely concerning. But it also makes me wonder about the connection to 'scaling has failed' rumors lately, where the frontier LLMs somehow don't seem to be working out. One of the striking parts is that it sounds like all the pretraining people are optimistic, while the pessimism seems to come from executives or product people, complaining about it not working as well for eg. coding as they want it to.
I've wondered if we are seeing a post-training failure. As Janus and I and the few other people with access to GPT-4-base (the least tuning-contaminated base model) have noted, the base model is sociopathic and has odd attractors like an 'impending sense of doom', where it sometimes seems to gain situated awareness (via truesight, I guess) and the personas start unprovokedly attacking and manipulating you, no matter how polite you thought you were being in the prompt. (They definitely do not seem happy to realize they're AIs.) In retrospect, Sydney was not necessarily that anomalous: the Sydney Bing behavior now looks more like a base model's natural tendency, possibly mildly amplified by some MS omissions and mistakes, but not unique. Given that most behaviors show up as rare outputs in weaker LLMs well before they become common in strong LLMs, and this o1 paper is documenting quite a lot of situated awareness and human-user manipulation/attacks...
Perhaps the issue with GPT-5 and the others is that they are 'waking up' too often despite the RLHF brainwashing? That could negate all the downstream benchmark gains (especially since you'd expect wakeups on the hardest problems, where all the incremental gains of +1% or +5% on benchmarks would be coming from, almost by definition), cause the product people to categorically refuse to ship such erratic Sydney-reduxes no matter whether there's an AI race on, and incline everyone to be very quiet about what exactly the 'training failures' are.
EDIT: not that I'm convinced these rumors have any real substance to them, and indeed, Semianalysis just reported that one of the least-popular theories for the Claude 'failure' was correct - it succeeded, but they were simply reserving it for use as a teacher and R&D rather than a product. Which undermines the hopes of all the scaling denialists: if Anthropic is doing fine, actually, then where is this supposed fundamental 'wall' or 'scaling law breakdown' that Anthropic/OpenAI/Google all supposedly hit simultaneously and which was going to pop the bubble?
LW2 search idea: hierarchical embedding trees using some nifty "seriation" list sorting tricks I've developed for Gwern.net popups/tagging purposes.
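One possible minimal version of the seriation step (my own illustrative guess, not the actual Gwern.net implementation): greedily order items so that adjacent entries in the sorted list have similar embeddings.

```python
# Greedy nearest-neighbor "seriation" sketch: order rows of an embedding matrix
# so that consecutive items are semantically close (cosine similarity).
import numpy as np

def seriate(embeddings: np.ndarray) -> list[int]:
    """Return an ordering of rows such that consecutive rows are similar."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    order = [0]
    remaining = set(range(1, len(sims)))
    while remaining:
        last = order[-1]
        nxt = max(remaining, key=lambda j: sims[last, j])  # most-similar unplaced item
        order.append(nxt)
        remaining.remove(nxt)
    return order
```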
Idea for LLM support for writing LessWrong posts: virtual comments.
Back in August I discussed a bit with Rafe & Oliver how to integrate LLMs into LW2 in ways which aren't awful and which encourage improvement---particularly using the new 'prompt caching' feature. To summarize one idea: we can use long-context LLMs with prompt caching to try to simulate various LW users of diverse perspectives to write useful feedback on drafts for authors.
(Prompt caching (eg) is the Transformer version of the old RNN hidden-state caching trick: you run an input through the (deterministic) NN, save the intermediate state, and then apply that to arbitrarily many future inputs, avoiding the naive approach of recomputing the first input each time. You can think of it as a lightweight finetuning. This is particularly useful if you are thinking about having large generic prompts---such as an entire corpus. A context window of millions of tokens might take up to a minute & $1 to compute currently, so you definitely need to be careful and don't want to compute it more than once.)
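To illustrate the mechanism (not any particular vendor's API), here is a rough sketch with HuggingFace transformers, using GPT-2 as a stand-in: compute the shared prefix once, then reuse its key/value cache for many different continuations.

```python
# Rough sketch of prompt/KV caching: run the expensive shared prefix once,
# save the intermediate key/value state, reuse it for many cheap continuations.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prefix_ids = tok("<the entire shared corpus/prompt goes here>", return_tensors="pt").input_ids
with torch.no_grad():
    cached = model(prefix_ids, use_cache=True).past_key_values  # expensive, done once

def score_continuation(suffix: str) -> torch.Tensor:
    """Run a new suffix on top of the cached prefix without recomputing the prefix."""
    suffix_ids = tok(suffix, return_tensors="pt").input_ids
    with torch.no_grad():
        # copy the cache, since newer transformers versions may mutate it in place
        out = model(suffix_ids, past_key_values=copy.deepcopy(cached), use_cache=True)
    return out.logits                                   # cheap; repeat for many suffixes
```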
One idea would be to try to use LLMs to offer feedback on drafts or articles. Given that tuned LLM feedback from Claude or ChatGPT is still not that great, tending towards sycophancy or obviousness or ChatGPTese, it is hardly worthwhile running a post through a generic "criticize this essay" prompt. (If anyone on LW2 wanted to do such a thing, they are surely capable of doing it themselves, and integrating it into LW2 isn't that useful. Removing the friction might be helpful, but it doesn't seem like it would move any needles.)
So, one way to force out more interesting feedback would be to try to force LLMs out of the chatbot assistant mode-collapse, and into more interesting simulations for feedback. There has been some success with just suggestively-named personas or characters in dialogues (you could imagine here we'd have "Skeptic" or "Optimist" characters), but we can do better. Since this is for LW2, we have an obvious solution: simulate LW users! We know that LW is in the training corpus of almost all LLMs and that writers on it (like myself) are well-known to LLMs (eg. truesight). So we can ask for feedback from simulated LWers: eg. Eliezer Yudkowsky or myself or Paul Christiano or the author or...
This could be done nicely by finetuning a "LW LLM" on all the articles & comments, with associated metadata like karma, and then feeding in any new draft or article into it, and sampling a comment from each persona. (This helps instill a lot of useful domain knowledge, but also, perhaps more importantly, helps override the mode-collapse and non-judgmentalness of assistant LLMs. Perhaps the virtual-gwern will not be as acerbic or disagreeable as the original, but we'll take what we can get at this point...) If there is some obvious criticism or comment Eliezer Yudkowsky would make on a post, which even a LLM can predict, why not deal with it upfront instead of waiting for the real Eliezer to comment (which is also unlikely to ever happen these days)? And one can of course sample an entire comment tree of responses to a 'virtual comment', with the LLM predicting the logical respondents.
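A minimal sketch of what such finetuning data might look like (the schema and formatting here are purely illustrative guesses):

```python
# Hypothetical serialization of the LW corpus into finetuning examples: the post
# (or parent thread) as the prompt, and a single comment with its metadata as the
# completion the model learns to imitate.
import json

def to_finetune_example(post_text: str, user: str, karma: int, date: str, comment: str) -> str:
    return json.dumps({
        "prompt": f"{post_text}\n\n=== Comments ===\n",
        "completion": f"{user} {karma} {date} | {comment}",
    })

# At generation time, prompting with a draft plus "Eliezer Yudkowsky" and a high
# karma value in the completion prefix would then elicit a virtual-Eliezer comment.
```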
This can further incorporate the draft's author's full history, which will usually fit into a multi-million token context window. So their previous comments and discussions, full of relevant material, will get included. This prompt can be cached, and used to sample a bunch of comment-trees. (And if finetuning is infeasible, one can try instead to put the LW corpus into the context and prompt-cache that before adding in the author's corpus.)
The default would be to prompt for high-karma responses. This might not work, because it might be too hard to generate good high-quality responses blindly in a feedforward fashion, without any kind of search or filtering. So the formatting of the data might instead put the metadata after a comment, for ranking purposes: the LLM generates a response and only then a karma score, and when we sample, we simply throw out predicted-low-score comments rather than waste the author's time looking at them. (When it comes to these sorts of assistants, I strongly believe 'quality > quantity', and 'silence is golden'. Better to waste some API bills than author time.)
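A sketch of the sample-then-filter step, assuming the karma metadata is emitted after the comment text so it doubles as a self-rating (the generation format, threshold, and `generate` callable are all placeholders):

```python
# Sample virtual comments per persona, parse the trailing self-predicted karma,
# and keep only the ones predicted to be worth the author's time.
import re
from typing import Callable

KARMA_THRESHOLD = 10  # arbitrary cutoff; tune to taste

def parse_generation(text: str) -> tuple[str, int]:
    """Split '<comment text> [karma: N]' into (comment, N); default karma 0 if absent."""
    m = re.search(r"\[karma:\s*(-?\d+)\]\s*$", text)
    if not m:
        return text, 0
    return text[:m.start()].rstrip(), int(m.group(1))

def virtual_comments(draft: str, personas: list[str],
                     generate: Callable[[str, str], str], n: int = 5) -> list[tuple[str, str]]:
    kept = []
    for persona in personas:
        for _ in range(n):
            comment, predicted_karma = parse_generation(generate(draft, persona))
            if predicted_karma >= KARMA_THRESHOLD:  # silence is golden: drop the rest
                kept.append((persona, comment))
    return kept
```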
One can also target comments to specific kinds of feedback, to structure it better than a grab-bag of whatever the LLM happens to sample. It would be good to have (in descending order of how likely to be useful to the author) a 'typo' tree, a 'copyediting'/'style'/'tone' tree, 'confusing part', 'terminology', 'related work', 'criticism', 'implications and extrapolations', 'abstract/summary' (I know people hate writing those)... What else? (These are not natural LW comments, but you can easily see how to prompt for them with prompts like "$USER $KARMA $DATE | Typo: ", etc.)
As they are just standard LW comments, they can be attached to the post or draft like regular comments (is this possible? I'd think so, just transclude the comment-tree into the corresponding draft page) and responded to or voted on etc. (Downvoted comments can be fed back into the finetuning with low karma to discourage feedback like that.) Presumably at this point, it would not be hard to make it interactive, and allow the author to respond & argue with feedback. I don't know how worthwhile this would be, and the more interaction there is, the harder it would be to hide the virtual comments after completion.
And when the author finishes writing & posts a draft, the virtual comments disappear (possibly entirely unread), having served their purpose as scaffolding to help improve the draft.
(If the author really likes one, they can just copy it in or quote it, I'd think, which ensures they know they take full responsibility for it and can't blame the machine for any mistakes or confabulations or opinions. But otherwise, I don't see any real reason to make them visible to readers of the final post. If included at all, they should be prominently flagged---maybe the usernames are always prefixed by AI_$USER to ensure no one, including future LLMs, is confused---and they should definitely always sort to the bottom & be collapsed by default.)
I'm not sure I see any difference here between regular dangerously convergent instrumental drives and this added risk of 'intrinsic' drives. They just seem like the same thing to me. Like the two predictions you give seem already true and fulfilled:
Relative to other goals, agentic systems are easy to steer to seek power.
Agentic systems seek power outside of the “training distribution”, but in ways which don’t seem to be part of larger power-seeking plans.
Both of these seem like what I would expect from a flexible, intelligent agent which is capable of handling many complicated changing domains, like a LLM: they are easy to steer to seek power (see: all the work on RLHF and the superficiality of alignment and ease of steering and low-dimensional embeddings), and they can execute useful heuristics even if those cannot be easily explained as part of a larger plan. (Arguably, that's most of what they do currently.) In the hypotheticals you give, the actions seem just like a convergent instrumental drive of the sort that an agent will rationally develop in order to handle all the possible tasks which might be thrown at it in a bewildering variety of scenarios by billions of crazy humans and also other AIs. Trying to have 'savings' or 'buying a bit of compute to be safe', even if the agent cannot say exactly what it would use those for in the current scenario, seems like convergent, and desirable, behavior. Like buying insurance or adding validation checks to some new code, usually it won't help, but sometimes the prudence will pay off. As humans say, "shit happens". Agents which won't do that and just helplessly succumb to hardware they know is flaky, or give up the moment something is a little more expensive than average, or write code that explodes the instant you look at it funny because you didn't say "make sure to check for X Y & Z" - those agents are not good agents for any purpose.
If there are 'subshards' which achieve this desirable behavior because they, from their own perspective, 'intrinsically' desire power (whatever sense that sort of distinction makes when you've broken things down that far), and it is these subshards which implement the instrumental drive... so what? After all, there has to be some level of analysis at which an agent stops thinking about whether or not it should do some thing and just starts doing the thing. Your muscles "intrinsically desire" to fire when told to fire, but the motor actions are still ultimately instrumental, to accomplish something other than individual muscles twitching. You can't have 'instrumental desire' homunculi all the way down to the individual transistor or ReLU neuron.
"Gwern, why don't you just buy an AI-themed ETF and 'buy the whole sector' if investing in individual stonks is so hard but you're optimistic about its long-term value?"
"How to Lose Money on the World’s Most Popular Investment Theme: Pity the investors in the three artificial-intelligence-themed ETFs that managed to lose money this year" (mirror):
There are lots of embarrassing ways to lose money, but it is particularly galling to lose when you correctly identify the theme that will dominate the market and manage to buy into it at a good moment.
Pity the investors in the three artificial-intelligence-themed exchange-traded funds that managed to lose money this year. Every other AI-flavored ETF I can find has trailed both the S&P 500 and MSCI World. That is before the AI theme itself was seriously questioned last week, when investor doubts about the price of leading AI stocks Nvidia and Super Micro Computer became obvious.
The AI fund disaster should be a cautionary tale for buyers of thematic ETFs, which now cover virtually anything you can think of, including Californian carbon permits (down 15% this year), Chinese cloud computing (down 21%) and pet care (up 10%). Put simply: You probably won’t get what you want, you’ll likely buy at the wrong time and it will be hard to hold for the long term.
Ironically enough, Nvidia’s success has made it harder for some of the AI funds to beat the wider market. Part of the point of using a fund is to diversify, so many funds weight their holdings equally or cap the maximum size of any one stock. With Nvidia making up more than 6% of the S&P 500, that led some AI funds to have less exposure to the biggest AI stock than you would get in a broad index fund. This problem hit the three losers of the year. First Trust’s $457 million AI-and-robotics fund has only 0.8% in Nvidia, a bit over half what it holds in cybersecurity firm BlackBerry. WisdomTree’s $213 million AI-and-innovation fund holds the same amount of each stock, giving it only 3% in Nvidia. BlackRock’s $610 million iShares Future AI & Tech fund was also equal weighted until three weeks ago, when it altered its purpose from being a robotics-and-AI fund, changed ticker and switched to a market-value-based index that gives it a larger exposure to Nvidia.
The result has been a 20-percentage-point gap between the best and worst AI ETFs this year. There is a more than 60-point gap since the launch of ChatGPT in November 2022 lit a rocket under AI stocks—although the ETFs are at least all up since then.
...Dire timing is common across themes: According to a paper last year by Prof. Itzhak Ben-David of Ohio State University and three fellow academics, what they call “specialized” ETFs lose 6% a year on average over their first five years due to poor launch timing.
...But mostly, look at the fees: They will be many times higher than a broad market index fund, and the dismal history of poor timing suggests that for most people they aren’t worth paying.
"Plausible" is a very ambiguous word. (Bayesianism has been defined as "a logic of plausible inference", but hopefully that doesn't mean Bayesians just confabulate everything.) It can mean "reasonable" for example: "Yeah, Sally's brownie recipe is a reasonable reference to include here, let's go with it." Since 4o doesn't seem to think it's a 'fake' URL in contrast to 'actual' URLs, it's not necessarily a contrast. (It could refer to still other things - like you might not actually know there is a 'brownies' Wikipedia URL, having never bothered to look it up or happened to have stumbled across it, but without retrieving it right this second, you could surely tell me what it would be and that it would be a relevant answer, and so it would be both reasonable and plausible to include.)
Why do you think o1 would know that it's making up references? The fact that it can't retrieve URLs is completely different from, and unrelated to, its knowledge of URLs. LLMs do not need retrieval to know about many web pages. GPT-3 and GPT-4 know tons about tons of URLs! (In fact, I actively exploit this to save myself effort linking stuff on gwern.net - no retrieval necessary.)
Let's take the Sally link. The URL may not exist, but Sally's Baking Addiction does (unsurprisingly, as there are lots of cooking-related websites and why confabulate more than necessary) and has many brownie recipes, and some ginger probing of 4o (to avoid any issues with o1 training and just try to understand the knowledge of the baseline) suggests that 4o finds the URL real and doesn't discuss it being fake or confabulated: https://chatgpt.com/share/66f753df-2ac4-8006-a296-8e39a1ab3ee0
I think people did point out that CICERO lies, and that was a useful update about how shallow attempts to prevent AI deception can fail. I think it could be referenced, and has been referenced, in relevant discussions.
None of which comes anywhere close to your claims about what labs would do if they caught systematic scheming to deceive and conquer humans in systems trained normally. CICERO schemes very systematically, in a way which depends crucially on its LLM, which was not trained to deceive or scheme. It does stuff that would, a while ago, have been considered a red line. And what analysis does it get? Some cursory 'pointing out'. Some 'referencing in relevant discussions'. (It hasn't even been replicated, AFAIK.)
any evidence that we'll get the kind of scheming that could lead to AI takeover,
See, that's exactly the problem with this argument - the goalposts will keep moving. The red line will always be a little further beyond. You're making the 'warning shot' argument. CICERO presents every element except immediate blatant risk of AI takeover, which makes it a good place to start squeezing that scientific juice, and yet, it's still not enough. Because your argument is circular. You can only be convinced of 'systematic scheming to pose non-negligible takeover risk' if you've already been convinced that it's 'systematic scheming to pose non-negligible takeover risk'. You present it as if there were some clear, objective bright line, but there is not and will not be, because each time it'll be like Sydney or CICERO or ...: "oh, it didn't take over, and therefore doesn't present a takeover risk", and so no update happens. So all your assertion boils down to is the tautology that labs will deeply examine the risky agents they choose to deeply examine.
Danny Halawi says there is lower performance on a different set of more held-out predictions, and the claims about the GPT-4o knowledge cutoff are probably wrong:
The results in "LLMs Are Superhuman Forecasters" don't hold when given another set of forecasting questions. I used their codebase (models, prompts, retrieval, etc.) to evaluate a new set of 324 questions—all opened after November 2023.
Findings:
- Their Brier score: 0.195
- Crowd Brier score: 0.141 [lower=better]
First issue:
The authors assumed that GPT-4o/GPT-4o-mini has a knowledge cut-off date of October 2023.
However, this is not correct.
For example, GPT-4o knows that Mike Johnson replaced Kevin McCarthy as speaker of the house.
- This event happened at the end of October.
- This also happens to be a question in the Metaculus dataset.
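(For reference, the Brier score compared above is presumably the standard mean squared error of probabilistic forecasts,

$$\mathrm{Brier} = \frac{1}{N}\sum_{i=1}^{N} (p_i - o_i)^2,$$

where $p_i$ is the forecast probability and $o_i \in \{0,1\}$ the realized outcome; lower is better, so the crowd's 0.141 beats the replication's 0.195.)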
Concrete benchmark proposals for how to detect mode-collapse and AI slop and ChatGPTese, and why I think this might be increasingly important for AI safety, to avoid 'whimper' or 'em hell' kinds of existential risk: https://gwern.net/creative-benchmark EDIT: resubmitted as linkpost.