Comments

Sorted by
gwern2121

I'm not sure I see any difference here between regular dangerously convergent instrumental drives and this added risk of 'intrinsic' drives. They just seem like the same thing to me. Like the two predictions you give seem already true and fulfilled:

Relative to other goals, agentic systems are easy to steer to seek power.

Agentic systems seek power outside of the “training distribution”, but in ways which don’t seem to be part of larger power-seeking plans.

Both of these seem like I would expect from a flexible, intelligent agent which is capable of handling many complicated changing domains, like a LLM: they are easy to steer to seek power (see: all the work on RLHF and the superficiality of alignment and ease of steering and low-dimensional embeddings), and they can execute useful heuristics even if those cannot be easily explained as part of a larger plan. (Arguably, that's most of what they do currently.) In the hypotheticals you give, the actions seem just like a convergent instrumental drive of the sort that an agent will rationally develop in order to handle all the possible tasks which might be thrown at it in a bewildering variety of scenarios by billions of crazy humans and also other AIs. Trying to have 'savings' or 'buying a bit of compute to be safe', even if the agent cannot say exactly what it would use those for in the current scenario, seems like convergent, and desirable, behavior. Like buying insurance or adding validation checks to some new code, usually it won't help, but sometimes the prudence will pay off. As humans say, "shit happens". Agents which won't do that and just helplessly succumb to hardware they know is flaky or give up the moment something is a little more than expensive than average or write code that explodes the instant you look at it funny because you didn't say "make sure to check for X Y & Z" - those agents are not good agents for any purpose.

If there are 'subshards' which achieve this desirable behavior because they, from their own perspective, 'intrinsically' desire power (whatever that sort of distinction makes when you've broken things down that far), and it is these subshards which implement the instrumental drive... so what? After all, there has to be some level of analysis at which an agent stops thinking about whether or not it should do some thing and just starts doing the thing. Your muscles "intrinsically desire" to fire when told to fire, but the motor actions are still ultimately instrumental, to accomplish something other than individual muscles twitching. You can't have 'instrumental desire' homunculuses all the way down to the individual transistor or ReLU neuron.

gwern41

"Gwern, why don't you just buy an AI-themed ETF and 'buy the whole sector' if investing in individual stonks is so hard but you're optimistic about its long-term value?"

"How to Lose Money on the World’s Most Popular Investment Theme: Pity the investors in the three artificial-intelligence-themed ETFs that managed to lose money this year" (mirror):

There are lots of embarrassing ways to lose money, but it is particularly galling to lose when you correctly identify the theme that will dominate the market and manage to buy into it at a good moment.

Pity the investors in the three artificial-intelligence-themed exchange-traded funds that managed to lose money this year. Every other AI-flavored ETF I can find has trailed both the S&P 500 and MSCI World. That is before the AI theme itself was seriously questioned last week, when investor doubts about the price of leading AI stocks Nvidia and Super Micro Computer became obvious.

The AI fund disaster should be a cautionary tale for buyers of thematic ETFs, which now cover virtually anything you can think of, including Californian carbon permits (down 15% this year), Chinese cloud computing (down 21%) and pet care (up 10%). Put simply: You probably won’t get what you want, you’ll likely buy at the wrong time and it will be hard to hold for the long term.

Ironically enough, Nvidia’s success has made it harder for some of the AI funds to beat the wider market. Part of the point of using a fund is to diversify, so many funds weight their holdings equally or cap the maximum size of any one stock. With Nvidia making up more than 6% of the S&P 500, that led some AI funds to have less exposure to the biggest AI stock than you would get in a broad index fund. This problem hit the three losers of the year. First Trust’s $457 million AI-and-robotics fund has only 0.8% in Nvidia, a bit over half what it holds in cybersecurity firm BlackBerry. WisdomTree’s $213 million AI-and-innovation fund holds the same amount of each stock, giving it only 3% in Nvidia. BlackRock’s $610 million iShares Future AI & Tech fund was also equal weighted until three weeks ago, when it altered its purpose from being a robotics-and-AI fund, changed ticker and switched to a market-value-based index that gives it a larger exposure to Nvidia.

The result has been a 20-percentage-point gap between the best and worst AI ETFs this year. There is a more than 60-point gap since the launch of ChatGPT in November 2022 lit a rocket under AI stocks—although the ETFs are at least all up since then.

...Dire timing is common across themes: According to a paper last year by Prof. Itzhak Ben-David of Ohio State University and three fellow academics, what they call “specialized” ETFs lose 6% a year on average over their first five years due to poor launch timing.

...But mostly, look at the fees: They will be many times higher than a broad market index fund, and the dismal history of poor timing suggests that for most people they aren’t worth paying.

gwern6-1

"Plausible" is a very ambiguous word. (Bayesianism has been defined as "a logic of plausible inference", but hopefully that doesn't mean Bayesians just confabulate everything.) It can mean "reasonable" for example: "Yeah, Sally's brownie recipe is a reasonable reference to include here, let's go with it." Since 4o doesn't seem to think it's a 'fake' URL in contrast to 'actual' URLs, it's not necessarily a contrast. (It could refer to still other things - like you might not actually know there is a 'brownies' Wikipedia URL, having never bothered to look it up or happened to have stumbled across it, but without retrieving it right this second, you could surely tell me what it would be and that it would be a relevant answer, and so it would be both reasonable and plausible to include.)

Answer by gwern60

Why do you think o1 would know that it's making up references? The fact that it can't retrieve URLs is completely different from, and unrelated to, its knowledge of URLs. LLMs do not need retrieval to know about many web pages. GPT-3 and GPT-4 know tons about tons of URLs! (In fact, I actively exploit this to save myself effort linking stuff on gwern.net - no retrieval necessary.)

Let's take the Sally link. The URL may not exist, but Sally's Baking Addiction does (unsurprisingly, as there are lots of cooking-related websites and why confabulate more than necessary) and has many brownie recipes, and some ginger probing of 4o (to avoid any issues with o1 training and just try to understand the knowledge of the baseline) suggests that 4o finds the URL real and doesn't discuss it being fake or confabulated: https://chatgpt.com/share/66f753df-2ac4-8006-a296-8e39a1ab3ee0

gwern915

I think people did point out that CICERO lies, and that was a useful update about how shallow attempts to prevent AI deception can fail. I think it could be referenced, and has been referenced, in relevant discussions

None of which comes anywhere close to your claims about what labs would do if they caught systematic scheming to deceive and conquer humans in systems trained normally. CICERO schemes very systematically, in a way which depends crucially on the LLM which was not trained to deceive or scheme. It does stuff that would have been considered a while ago a redline. And what analysis does it get? Some cursory 'pointing out'. Some 'referencing in relevant discussions'. (Hasn't even been replicated AFAIK.)

any evidence that we'll get the kind of scheming that could lead to AI takeover,

See, that's exactly the problem with this argument - the goalposts will keep moving. The red line will always be a little further beyond. You're making the 'warning shot' argument. CICERO presents every element except immediate blatant risk of AI takeover, which makes it a good place to start squeezing that scientific juice, and yet, it's still not enough. Because your argument is circular. You can only be convinced of 'systematic scheming to pose non-negligible takeover risk' if you've already been convinced that it's 'systematic scheming to pose non-negligible takeover risk'. You present it as if there were some clear, objective brightline, but there is not and will not be, because each time it'll be like Sydney or CICERO or ...: "oh, it didn't take over, and therefore doesn't present a takeover risk" and therefore no update happens. So all your assertion boils down to is the tautology that labs will deeply examine the risky agents they choose to deeply examine.

gwern273

Danny Halawi says there is lower performance on a different set of more heldout predictions, and the claims about GPT-4 knowledge cutoff are probably wrong:

The results in "LLMs Are Superhuman Forecasters" don't hold when given another set of forecasting questions. I used their codebase (models, prompts, retrieval, etc.) to evaluate a new set of 324 questions—all opened after November 2023.

Findings:

  • Their Brier score: 0.195
  • Crowd Brier score: 0.141 [lower=better]

First issue:

The authors assumed that GPT-4o/GPT-4o-mini has a knowledge cut-off date of October 2023.

However, this is not correct.

For example, GPT-4o knows that Mike Johnson replaced Kevin McCarthy as speaker of the house.

  1. This event happened at the end of October.
  2. This also happens to be a question in the Metaculus dataset.
gwern3633

Yes, the Google 'search by date' is basically completely busted and has been for a while (even though Google possesses the capability to date content accurately by using their Internet-wide snapshot archive going back to the 1990s, whose existence was recently publicly confirmed by the ad API leak). For kicks, try searching things like "Xi Jinping" with date-ranges like 2013... It is most regrettable, as it used to be a useful tool for me in digging up old stuff. There also seem to be issues in the other direction, where Google is 'forgetting' old articles which aren't being indexed at all, apparently, in any publicly-searchable fashion (which might be contributing to the former, by a base rates screening-paradox effect - if all the real old articles have been forgotten by the index, then only erroneously timestamped hits will be available). I'm not aware of any search engine whose date cutoff is truly reliable. Even if they were, you would still have to manually check and clean to be sure that things like sidebars or recommendations were not causing data leakage.

I also agree that if this is really the only countermeasure to data leakage OP has taken, then the results seem dead on arrival. ML models infamously ruthlessly exploit far subtler forms of temporal data leakage than this...

It sounds like I'll be waiting for some actually out-of-sample forecasting numbers before I believe any claims about beating Metaculus etc.

(This is a surprising mistake for a benchmarking expert to make. Even if you knew nothing about the specific problems with date-range search, it should be obvious how even with completely unedited, static snapshots from the past, that there would be leakage - like results will rank higher or lower based on future events. If Israel attacked Iran, obviously all articles before arguing that Israel will/should/could attack Iran are going to benefit from being 'right' and ranked higher than articles arguing the opposite, many of which will outright quietly disappear & cease to be mentioned, and a LLM conditioned on those rather than the lower-ranking ones will automatically & correctly 'predict' more accurately. And countless other leakages like that, which are not fixed as easily as "just download a snapshot from the IA".)

EDIT: Metaculus discussion of date-range problems

gwern1417

If there is indeed systematic scheming by the model, and the lab has caught it red handed, the lab should be able to produce highly scientifically credible evidence of that. They could deeply understand the situations in which there’s a treacherous turn, how the models decides whether to openly defect, and publish. ML academics are deeply empirical and open minded, so it seems like the lab could win this empirical debate if they’ve indeed caught a systematic schemer.

How much scientific juice has, say, Facebook gotten out of CICERO? Have they deeply understood the situations in which CICERO begins planning to manipulate possible allies? Have they mechanistically interpretably understood how CICERO decides how long to cooperate and play nice, and when to openly defect and attack an ally? Is not CICERO a deeply empirical system based on observations and logs from many real-world games with actual human players rather than mere theoretical arguments? Has CICERO ended the empirical debate about whether LLMs can systematically scheme? Has it been shown what training techniques lead to scheming or why off-the-shelf normally-trained frozen LLMs were so useful for the planning and psychological manipulation compared to no-press Diplomacy?

Or has everyone pretty much forgotten about CICERO, handwaved away a few excuses about "well maybe it wasn't really deception" and "didn't it just learn to imitate humans why are you surprised", and the entire line of work apparently dead as a doornail as FB pivots to Llama-everything and core authors left for places like OA?

If the incentives for scientific research don't work there where the opposing commercial incentives are so very weak (borderline non-existent, even), why would they be highly likely to work elsewhere in scenarios with vastly more powerful opposing commercial incentives?

gwern2533

Should you write text online now in places that can be scraped? You are exposing yourself to 'truesight' and also to stylometric deanonymization or other analysis, and you may simply have some sort of moral objection to LLM training on your text.

This seems like a bad move to me on net: you are erasing yourself (facts, values, preferences, goals, identity) from the future, by which I mean, LLMs. Much of the value of writing done recently or now is simply to get stuff into LLMs. I would, in fact, pay money to ensure Gwern.net is in training corpuses, and I upload source code to Github, heavy with documentation, rationale, and examples, in order to make LLMs more customized to my use-cases. For the trifling cost of some writing, all the worlds' LLM providers are competing to make their LLMs ever more like, and useful to, me.

And that's just today! Who knows how important it will be to be represented in the initial seed training datasets...? Especially as they bootstrap with synthetic data & self-generated worlds & AI civilizations, and your text can change the trajectory at the start. When you write online under stable nyms, you may be literally "writing yourself into the future". (For example, apparently, aside from LLMs being able to identify my anonymous comments or imitate my writing style, there is a "Gwern" mentor persona in current LLMs which is often summoned when discussion goes meta or the LLMs become situated as LLMs, which Janus traces to my early GPT-3 writings and sympathetic qualitative descriptions of LLM outputs, where I was one of the only people genuinely asking "what is it like to be a LLM?" and thinking about the consequences of eg. seeing in BPEs. On the flip side, you have Sydney/Roose as an example of what careless writing can do now.) Humans don't seem to be too complex, but you can't squeeze blood from a stone... ("Beta uploading" is such an ugly phrase; I prefer "apotheosis".)

This is one of my beliefs: there has never been a more vital hinge-y time to write, it's just that the threats are upfront and the payoff delayed, and so short-sighted or risk-averse people are increasingly opting-out and going dark.

If you write, you should think about what you are writing, and ask yourself, "is this useful for an LLM to learn?" and "if I knew for sure that a LLM could write or do this thing in 4 years, would I still be doing it now?"


...It would be an exaggeration to say that ours is a hostile relationship; I live, let myself go on living, so that Borges may contrive his literature, and this literature justifies me. It is no effort for me to confess that he has achieved some valid pages, but those pages cannot save me, perhaps because what is good belongs to no one, not even to him, but rather to the language and to tradition. Besides, I am destined to perish, definitively, and only some instant of myself can survive in him. Little by little, I am giving over everything to him, though I am quite aware of his perverse custom of falsifying and magnifying things.

...I shall remain in Borges, not in myself (if it is true that I am someone), but I recognize myself less in his books than in many others or in the laborious strumming of a guitar. Years ago I tried to free myself from him and went from the mythologies of the suburbs to the games with time and infinity, but those games belong to Borges now and I shall have to imagine other things. Thus my life is a flight and I lose everything and everything belongs to oblivion, or to him.

Reply72111
Load More