Comments

Sorted by
gwern6-1

"Plausible" is a very ambiguous word. (Bayesianism has been defined as "a logic of plausible inference", but hopefully that doesn't mean Bayesians just confabulate everything.) It can mean "reasonable" for example: "Yeah, Sally's brownie recipe is a reasonable reference to include here, let's go with it." Since 4o doesn't seem to think it's a 'fake' URL in contrast to 'actual' URLs, it's not necessarily a contrast. (It could refer to still other things - like you might not actually know there is a 'brownies' Wikipedia URL, having never bothered to look it up or happened to have stumbled across it, but without retrieving it right this second, you could surely tell me what it would be and that it would be a relevant answer, and so it would be both reasonable and plausible to include.)

Answer by gwern60

Why do you think o1 would know that it's making up references? The fact that it can't retrieve URLs is completely different from, and unrelated to, its knowledge of URLs. LLMs do not need retrieval to know about many web pages. GPT-3 and GPT-4 know tons about tons of URLs! (In fact, I actively exploit this to save myself effort linking stuff on gwern.net - no retrieval necessary.)

Let's take the Sally link. The URL may not exist, but Sally's Baking Addiction does (unsurprisingly, as there are lots of cooking-related websites and why confabulate more than necessary) and has many brownie recipes, and some ginger probing of 4o (to avoid any issues with o1 training and just try to understand the knowledge of the baseline) suggests that 4o finds the URL real and doesn't discuss it being fake or confabulated: https://chatgpt.com/share/66f753df-2ac4-8006-a296-8e39a1ab3ee0

gwern915

I think people did point out that CICERO lies, and that was a useful update about how shallow attempts to prevent AI deception can fail. I think it could be referenced, and has been referenced, in relevant discussions

None of which comes anywhere close to your claims about what labs would do if they caught systematic scheming to deceive and conquer humans in systems trained normally. CICERO schemes very systematically, in a way which depends crucially on the LLM which was not trained to deceive or scheme. It does stuff that would have been considered a while ago a redline. And what analysis does it get? Some cursory 'pointing out'. Some 'referencing in relevant discussions'. (Hasn't even been replicated AFAIK.)

any evidence that we'll get the kind of scheming that could lead to AI takeover,

See, that's exactly the problem with this argument - the goalposts will keep moving. The red line will always be a little further beyond. You're making the 'warning shot' argument. CICERO presents every element except immediate blatant risk of AI takeover, which makes it a good place to start squeezing that scientific juice, and yet, it's still not enough. Because your argument is circular. You can only be convinced of 'systematic scheming to pose non-negligible takeover risk' if you've already been convinced that it's 'systematic scheming to pose non-negligible takeover risk'. You present it as if there were some clear, objective brightline, but there is not and will not be, because each time it'll be like Sydney or CICERO or ...: "oh, it didn't take over, and therefore doesn't present a takeover risk" and therefore no update happens. So all your assertion boils down to is the tautology that labs will deeply examine the risky agents they choose to deeply examine.

gwern273

Danny Halawi says there is lower performance on a different set of more heldout predictions, and the claims about GPT-4 knowledge cutoff are probably wrong:

The results in "LLMs Are Superhuman Forecasters" don't hold when given another set of forecasting questions. I used their codebase (models, prompts, retrieval, etc.) to evaluate a new set of 324 questions—all opened after November 2023.

Findings:

  • Their Brier score: 0.195
  • Crowd Brier score: 0.141 [lower=better]

First issue:

The authors assumed that GPT-4o/GPT-4o-mini has a knowledge cut-off date of October 2023.

However, this is not correct.

For example, GPT-4o knows that Mike Johnson replaced Kevin McCarthy as speaker of the house.

  1. This event happened at the end of October.
  2. This also happens to be a question in the Metaculus dataset.
gwern3532

Yes, the Google 'search by date' is basically completely busted and has been for a while (even though Google possesses the capability to date content accurately by using their Internet-wide snapshot archive going back to the 1990s, whose existence was recently publicly confirmed by the ad API leak). For kicks, try searching things like "Xi Jinping" with date-ranges like 2013... It is most regrettable, as it used to be a useful tool for me in digging up old stuff. There also seem to be issues in the other direction, where Google is 'forgetting' old articles which aren't being indexed at all, apparently, in any publicly-searchable fashion (which might be contributing to the former, by a base rates screening-paradox effect - if all the real old articles have been forgotten by the index, then only erroneously timestamped hits will be available). I'm not aware of any search engine whose date cutoff is truly reliable. Even if they were, you would still have to manually check and clean to be sure that things like sidebars or recommendations were not causing data leakage.

I also agree that if this is really the only countermeasure to data leakage OP has taken, then the results seem dead on arrival. ML models infamously ruthlessly exploit far subtler forms of temporal data leakage than this...

It sounds like I'll be waiting for some actually out-of-sample forecasting numbers before I believe any claims about beating Metaculus etc.

(This is a surprising mistake for a benchmarking expert to make. Even if you knew nothing about the specific problems with date-range search, it should be obvious how even with completely unedited, static snapshots from the past, that there would be leakage - like results will rank higher or lower based on future events. If Israel attacked Iran, obviously all articles before arguing that Israel will/should/could attack Iran are going to benefit from being 'right' and ranked higher than articles arguing the opposite, many of which will outright quietly disappear & cease to be mentioned, and a LLM conditioned on those rather than the lower-ranking ones will automatically & correctly 'predict' more accurately. And countless other leakages like that, which are not fixed as easily as "just download a snapshot from the IA".)

EDIT: Metaculus discussion of date-range problems

gwern1416

If there is indeed systematic scheming by the model, and the lab has caught it red handed, the lab should be able to produce highly scientifically credible evidence of that. They could deeply understand the situations in which there’s a treacherous turn, how the models decides whether to openly defect, and publish. ML academics are deeply empirical and open minded, so it seems like the lab could win this empirical debate if they’ve indeed caught a systematic schemer.

How much scientific juice has, say, Facebook gotten out of CICERO? Have they deeply understood the situations in which CICERO begins planning to manipulate possible allies? Have they mechanistically interpretably understood how CICERO decides how long to cooperate and play nice, and when to openly defect and attack an ally? Is not CICERO a deeply empirical system based on observations and logs from many real-world games with actual human players rather than mere theoretical arguments? Has CICERO ended the empirical debate about whether LLMs can systematically scheme? Has it been shown what training techniques lead to scheming or why off-the-shelf normally-trained frozen LLMs were so useful for the planning and psychological manipulation compared to no-press Diplomacy?

Or has everyone pretty much forgotten about CICERO, handwaved away a few excuses about "well maybe it wasn't really deception" and "didn't it just learn to imitate humans why are you surprised", and the entire line of work apparently dead as a doornail as FB pivots to Llama-everything and core authors left for places like OA?

If the incentives for scientific research don't work there where the opposing commercial incentives are so very weak (borderline non-existent, even), why would they be highly likely to work elsewhere in scenarios with vastly more powerful opposing commercial incentives?

gwern2433

Should you write text online now in places that can be scraped? You are exposing yourself to 'truesight' and also to stylometric deanonymization or other analysis, and you may simply have some sort of moral objection to LLM training on your text.

This seems like a bad move to me on net: you are erasing yourself (facts, values, preferences, goals, identity) from the future, by which I mean, LLMs. Much of the value of writing done recently or now is simply to get stuff into LLMs. I would, in fact, pay money to ensure Gwern.net is in training corpuses, and I upload source code to Github, heavy with documentation, rationale, and examples, in order to make LLMs more customized to my use-cases. For the trifling cost of some writing, all the worlds' LLM providers are competing to make their LLMs ever more like, and useful to, me.

And that's just today! Who knows how important it will be to be represented in the initial seed training datasets...? Especially as they bootstrap with synthetic data & self-generated worlds & AI civilizations, and your text can change the trajectory at the start. When you write online under stable nyms, you may be literally "writing yourself into the future". (For example, apparently, aside from LLMs being able to identify my anonymous comments or imitate my writing style, there is a "Gwern" mentor persona in current LLMs which is often summoned when discussion goes meta or the LLMs become situated as LLMs, which Janus traces to my early GPT-3 writings and sympathetic qualitative descriptions of LLM outputs, where I was one of the only people genuinely asking "what is it like to be a LLM?" and thinking about the consequences of eg. seeing in BPEs. On the flip side, you have Sydney/Roose as an example of what careless writing can do now.) Humans don't seem to be too complex, but you can't squeeze blood from a stone... ("Beta uploading" is such an ugly phrase; I prefer "apotheosis".)

This is one of my beliefs: there has never been a more vital hinge-y time to write, it's just that the threats are upfront and the payoff delayed, and so short-sighted or risk-averse people are increasingly opting-out and going dark.

If you write, you should think about what you are writing, and ask yourself, "is this useful for an LLM to learn?" and "if I knew for sure that a LLM could write or do this thing in 4 years, would I still be doing it now?"


...It would be an exaggeration to say that ours is a hostile relationship; I live, let myself go on living, so that Borges may contrive his literature, and this literature justifies me. It is no effort for me to confess that he has achieved some valid pages, but those pages cannot save me, perhaps because what is good belongs to no one, not even to him, but rather to the language and to tradition. Besides, I am destined to perish, definitively, and only some instant of myself can survive in him. Little by little, I am giving over everything to him, though I am quite aware of his perverse custom of falsifying and magnifying things.

...I shall remain in Borges, not in myself (if it is true that I am someone), but I recognize myself less in his books than in many others or in the laborious strumming of a guitar. Years ago I tried to free myself from him and went from the mythologies of the suburbs to the games with time and infinity, but those games belong to Borges now and I shall have to imagine other things. Thus my life is a flight and I lose everything and everything belongs to oblivion, or to him.

Reply72111
gwern56

early transformatively-powerful models are pretty obviously scheming (though they aren't amazingly good at it), but their developers are deploying them anyway

So... Sydney?

gwern44

What do you think of the features found? They seem to work, given your causal manipulations, but looking over them, they seem... very superficial. Like penalizing URLs per se doesn't seem like a great thing to have in a reward model. (A LLM has typically memorized or can guess lots of useful URLs.) It doesn't match my own 'preferences', as far as I can tell, and so is doing a bad job at 'preference learning'.

Is this an artifact of SAE being able to express only simple linear features by design and there are more complex features or computations which yield more appropriate responses to compensate for the initial simple frugal heuristics, or is this kind of preference learning really just that dumb? (I'm reminded of the mode collapse issues like rhyming poetry. If you did SAE on the reward model for ChatGPT, would you find a feature as simple as 'rhyming = gud'?)

Load More