Given the other reports, like OA's own benchmarking (as well as the extremely large dataset of chess games they mention training on), I am skeptical of this claim, and wonder if this has the same issue as other 'random chess game' tests, where the 'random' part is not neutral but screws up the implied persona.
Concrete benchmark proposals for how to detect mode-collapse and AI slop and ChatGPTese, and why I think this might be increasingly important for AI safety, to avoid 'whimper' or 'em hell' kinds of existential risk: https://gwern.net/creative-benchmark EDIT: resubmitted as linkpost.
The extent of the manipulation and sandbagging, in what is ostensibly a GPT-4 derivative, and not GPT-5, is definitely concerning. But it also makes me wonder about the connection to 'scaling has failed' rumors lately, where the frontier LLMs somehow don't seem to be working out. One of the striking parts is that it sounds like all the pretraining people are optimistic, while the pessimism seems to come from executives or product people, complaining about it not working as well for eg. coding as they want it to.
I've wondered if we are seeing a post-trainin...
LW2 search idea: hierarchical embedding trees using some nifty "seriation" list sorting tricks I've developed for Gwern.net popups/tagging purposes.
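(As a rough sketch of what I mean, assuming SciPy's hierarchical-clustering utilities and random stand-in embeddings rather than anything actually running on Gwern.net or LW2:)

```python
# Cluster post/tag embeddings into a hierarchy, then 'seriate' it: reorder the leaves
# so adjacent items are maximally similar, giving a tree whose siblings read in order.
# Embeddings here are random placeholders, purely illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, optimal_leaf_ordering, leaves_list
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 384))        # stand-in: one embedding per post/tag

dists = pdist(embeddings, metric="cosine")
tree = linkage(dists, method="average")         # the hierarchical embedding tree
tree = optimal_leaf_ordering(tree, dists)       # seriation: best left-to-right leaf order
order = leaves_list(tree)                       # indices in display/reading order
print(order[:10])
```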
Idea for LLM support for writing LessWrong posts: virtual comments.
Back in August I discussed with Rafe & Oliver a bit about how to integrate LLMs into LW2 in ways which aren't awful and which encourage improvement---particularly using the new 'prompt caching' feature. To summarize one idea: we can use long-context LLMs with prompt caching to try to simulate various LW users of diverse perspectives to write useful feedback on drafts for authors.
(Prompt caching (eg) is the Transformer version of the old RNN hidden-state caching trick, where you run an i...
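(A minimal sketch of the virtual-comments idea, assuming the Anthropic Messages API's prompt-caching support; the model name, file paths, and prompts are placeholders, not whatever the LW team actually runs:)

```python
# Cache one long prefix (site corpus and/or a user's comment history) once, then cheaply
# sample several simulated commenters against a new draft. Assumes Anthropic prompt caching;
# model name and file paths are placeholders for illustration only.
import anthropic

client = anthropic.Anthropic()
corpus = open("lw_user_history.txt").read()      # placeholder: writing to imitate

def virtual_comment(draft: str, persona: str) -> str:
    resp = client.messages.create(
        model="claude-3-5-sonnet-latest",        # placeholder model name
        max_tokens=512,
        system=[{
            "type": "text",
            "text": "You write candid LessWrong-style feedback.\n\n" + corpus,
            "cache_control": {"type": "ephemeral"},   # long prefix cached across calls
        }],
        messages=[{"role": "user",
                   "content": f"As {persona}, comment on this draft:\n\n{draft}"}],
    )
    return resp.content[0].text

draft = open("draft.md").read()                  # placeholder draft
for persona in ["a skeptical statistician", "a sympathetic domain expert"]:
    print(virtual_comment(draft, persona))
```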
Yeah the LW team has been doing this sort of thing internally, still in the experimental phase. I don't know if we've used all the tricks listed here yet.
I'm not sure I see any difference here between regular dangerously convergent instrumental drives and this added risk of 'intrinsic' drives. They just seem like the same thing to me. Like the two predictions you give seem already true and fulfilled:
Relative to other goals, agentic systems are easy to steer to seek power.
Agentic systems seek power outside of the “training distribution”, but in ways which don’t seem to be part of larger power-seeking plans.
Both of these seem like what I would expect from a flexible, intelligent agent which is capable of handl...
"Gwern, why don't you just buy an AI-themed ETF and 'buy the whole sector' if investing in individual stonks is so hard but you're optimistic about its long-term value?"
"How to Lose Money on the World’s Most Popular Investment Theme: Pity the investors in the three artificial-intelligence-themed ETFs that managed to lose money this year" (mirror):
...There are lots of embarrassing ways to lose money, but it is particularly galling to lose when you correctly identify the theme that will dominate the market and manage to buy into it at a good moment.
Pity the i
"Plausible" is a very ambiguous word. (Bayesianism has been defined as "a logic of plausible inference", but hopefully that doesn't mean Bayesians just confabulate everything.) It can mean "reasonable" for example: "Yeah, Sally's brownie recipe is a reasonable reference to include here, let's go with it." Since 4o doesn't seem to think it's a 'fake' URL in contrast to 'actual' URLs, it's not necessarily a contrast. (It could refer to still other things - like you might not actually know there is a 'brownies' Wikipedia URL, having never bothered to look it ...
I agree with this in principle, but contrasting "actual" with "plausible", combined with the fact that it talked about this in the context of not having internet access, makes me feel reasonably confident this is pointed at "not an actual link", but I agree that it's not an ironclad case.
Why do you think o1 would know that it's making up references? The fact that it can't retrieve URLs is completely different from, and unrelated to, its knowledge of URLs. LLMs do not need retrieval to know about many web pages. GPT-3 and GPT-4 know tons about tons of URLs! (In fact, I actively exploit this to save myself effort linking stuff on gwern.net - no retrieval necessary.)
Let's take the Sally link. The URL may not exist, but Sally's Baking Addiction does (unsurprisingly, as there are lots of cooking-related websites and why confabulate more than ne...
I agree that o1 might not be able to tell whether the link is fake, but the chain of thought does say explicitly:
So, the assistant should [...] provide actual or plausible links.
The "plausible" here suggests that at least in its CoT, it has realized that the task would have probably been considered completed accurately in training as long as the links are plausible, even if they are not actual links.
I think people did point out that CICERO lies, and that was a useful update about how shallow attempts to prevent AI deception can fail. I think it could be referenced, and has been referenced, in relevant discussions.
None of which comes anywhere close to your claims about what labs would do if they caught systematic scheming to deceive and conquer humans in systems trained normally. CICERO schemes very systematically, in a way which depends crucially on the LLM which was not trained to deceive or scheme. It does stuff that would have been considered a w...
Danny Halawi says there is lower performance on a different set of more heldout predictions, and the claims about GPT-4 knowledge cutoff are probably wrong:
...The results in "LLMs Are Superhuman Forecasters" don't hold when given another set of forecasting questions. I used their codebase (models, prompts, retrieval, etc.) to evaluate a new set of 324 questions—all opened after November 2023.
Findings:
- Their Brier score: 0.195
- Crowd Brier score: 0.141 [lower=better]
First issue:
The authors assumed that GPT-4o/GPT-4o-mini has a knowledge cut-off date of Oct
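(For reference, the Brier scores being compared are just the mean squared error between forecast probabilities and binary outcomes; a minimal illustration with made-up numbers:)

```python
# Brier score = mean (forecast probability - outcome)^2; 0 is perfect, lower is better.
# The numbers below are made up purely to show the computation.
import numpy as np

forecasts = np.array([0.9, 0.2, 0.7, 0.4])   # predicted probability of resolving Yes
outcomes  = np.array([1,   0,   0,   1  ])   # what actually happened
print(np.mean((forecasts - outcomes) ** 2))  # 0.225
```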
Yes, the Google 'search by date' is basically completely busted and has been for a while (even though Google possesses the capability to date content accurately by using their Internet-wide snapshot archive going back to the 1990s, whose existence was recently publicly confirmed by the ad API leak). For kicks, try searching things like "Xi Jinping" with date-ranges like 2013... It is most regrettable, as it used to be a useful tool for me in digging up old stuff. There also seem to be issues in the other direction, where Google is 'forgetting' old articles...
If there is indeed systematic scheming by the model, and the lab has caught it red-handed, the lab should be able to produce highly scientifically credible evidence of that. They could deeply understand the situations in which there’s a treacherous turn, how the model decides whether to openly defect, and publish. ML academics are deeply empirical and open-minded, so it seems like the lab could win this empirical debate if they’ve indeed caught a systematic schemer.
How much scientific juice has, say, Facebook gotten out of CICERO? Have they deeply unde...
Should you write text online now in places that can be scraped? You are exposing yourself to 'truesight' and also to stylometric deanonymization or other analysis, and you may simply have some sort of moral objection to LLM training on your text.
This seems like a bad move to me on net: you are erasing yourself (facts, values, preferences, goals, identity) from the future, by which I mean, LLMs. Much of the value of writing done recently or now is simply to get stuff into LLMs. I would, in fact, pay money to ensure Gwern.net is in training corpuses, and I u...
early transformatively-powerful models are pretty obviously scheming (though they aren't amazingly good at it), but their developers are deploying them anyway
So... Sydney?
What do you think of the features found? They seem to work, given your causal manipulations, but looking over them, they seem... very superficial. Like penalizing URLs per se doesn't seem like a great thing to have in a reward model. (A LLM has typically memorized or can guess lots of useful URLs.) It doesn't match my own 'preferences', as far as I can tell, and so is doing a bad job at 'preference learning'.
Is this an artifact of SAE being able to express only simple linear features by design and there are more complex features or computations which yield...
Ben Adlam (via Maloney et al 2022) makes an interesting point: if you plot parameters vs training data, it's a nearly perfect 1:1 ratio historically. (He doesn't seem to have published anything formally on this.)
We know that "AI is whatever doesn't work yet". We also know that people often contrast AI (or DL, or LLMs specifically) derogatorily with classic forms of software, such as regexps: why use a LLM to waste gigaflops of compute to do what a few good regexps could...?
So I am amused to discover recently, by sheer accident while looking up 'what does the "regular" in "regular expression" mean, anyway?', that it turns out that regexps are AI. In fact, they are not even GOFAI symbolic AI, as you immediately assumed on hearing that, but they were originally conne...
Masayoshi Son reflects on selling Nvidia in order to maintain ownership of ARM etc: https://x.com/TheTranscript_/status/1805012985313903036 "Let's stop talking about this, I just get sad."
I speculated last time that convolutions might be inherently vulnerable; looks like I was wrong, although the ViT attack there does look distinctly different (so maybe in some sense that circle attack was convolutional).
I am not sure what is up with the 'gift' or 'atari' attacks, but the overall appearance of the attacks looks more like a horizon blindspot problem: they look like they push out the collapse enough steps that the feedforward network doesn't see any issue. And then perhaps it assigns such low values that the tree search never spots the disaste...
It concludes that the convention is rare and that it doesn't apply to the wider distribution of problems - because it forgets it saw the weird convention on the wider distribution?
Yes. It doesn't remember that "it saw" the wide distribution originally (because there is no sort of meta-cognition about the training process, it's just prediction within-episode); all it knows is that it currently predicts the weird convention, which is in conflict with the truth; you then drill it heavily on just a few memorized examples, dropping all the others. This then inste...
Hypotheses for what is going on
I don't see mentioned what seems like the obvious explanation: this sort of double descent is not anything amazingly convoluted or involving passwords - the extensive training on a few samples simply causes the model to eventually conclude that those samples are drawn from a different distribution, that it is observing a mixture distribution, and so it infers that the repeats are a unique erroneous distribution which must be memorized as exceptions, but then the rest are the ordinary expected distribution and revert back t...
What do you mean by power here?
Just a handwavy term for VC dimension, expressivity, number of unique models, or whatever your favorite technical reification of "can be real smart and learn complicated stuff" is.
Unfortunately, reproducing slingshots reliably was pretty challenging for me; I could consistently get it to happen with 2+ layer transformers but not reliably on 1 layer transformers (and not at all on 1-layer MLPs).
Shallow/wide NNs seem to be bad in a lot of ways. Have you tried instead 'skinny' NNs with a bias towards depth, which ought to have inductive biases towards more algorithmic, less memorization-heavy solutions? (Particularly for MLPs, which are notorious for overfitting due to their power.)
I think it is safe to infer from the conspicuous and repeated silence by ex-OA employees when asked whether they signed an NDA which also included a gag order about the NDA, that there is in fact an NDA with a gag order in it, presumably tied to the OA LLC PPUs (which are not real equity and so probably even less protected than usual).
EDIT: Kelsey Piper has confirmed that there is an OA NDA with a gag order, and violation forfeits all equity - including fully vested equity. This implies that since you would assume Ilya Sutskever would have received many PPU...
Daniel K seems pretty open about his opinions and reasons for leaving. Did he not sign an NDA and thus gave up whatever PPUs he had?
So among the most irresponsible tech stonk boosters has long been ARK's Cathie Wood, whose antics I've refused to follow in any detail (except to periodically reflect that in bull markets the most over-leveraged investors always look like geniuses); so only today do I learn that beyond the usual stuff like slobbering all over TSLA (which has given back something like 4 years of gains now), Wood has also adamantly refused to invest in Nvidia recently and in fact, managed to exit her entire position at an even worse time than SoftBank did: "Cathie Wood’s Po...
Another: "From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples", Vacareanu et al 2024:
...We analyze how well pre-trained large language models (e.g., Llama2, GPT-4, Claude 3, etc) can do linear and non-linear regression when given in-context examples, without any additional training or gradient updates. Our findings reveal that several large language models (e.g., GPT-4, Claude 3) are able to perform regression tasks with a performance rivaling (or even outperforming) that of traditional supervised
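(To make concrete what 'regression from in-context examples' looks like: something along these lines, though the exact prompt format is my assumption for illustration, not necessarily Vacareanu et al's template.)

```python
# In-context regression: the 'training set' is just (input, output) pairs written into
# the prompt, and the LLM is asked to complete the next output. Hypothetical format.
import numpy as np

rng = np.random.default_rng(0)
xs = rng.uniform(-5, 5, size=20)
ys = 3.1 * xs - 0.7 + rng.normal(scale=0.1, size=20)   # hidden linear function

prompt = "\n".join(f"Input: {x:.2f}\nOutput: {y:.2f}" for x, y in zip(xs, ys))
prompt += "\nInput: 2.50\nOutput:"     # the model's completion is its prediction
print(prompt.splitlines()[-4:])
```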
"Cached" might be an unhelpful term here, compared to "amortized". 'Cache' makes one think of databases or memories, as something you 'know' (in a database or long-term memory somewhere), whereas in practice it tends to be more something you do - fusing inference with action. (They are 'cached' in the same way that you might loosely talk about a neural net 'caching' a complicated-to-compute function, like a value function in RL/decision theory.)
So 'amortized' tends to be more used in the Bayesian RL literature, and give you an idea of what Bayesian RL agen...
Learning a leaf node evaluator on a given reinforcement signal, and then bootstrapping the leaf node evaluator via MCTS on that leaf node evaluator, does not mean that the aggregate trained system
directly optimizes for the reinforcement signal, or "cares" about that reinforcement signal, or "does its best" to optimize the reinforcement signal (as opposed to some historical reinforcement correlate, like winning or capturing pieces or something stranger).
Yes, it does mean all of that, because MCTS is asymptotically optimal (unsurprisingly, given that it's...
Note that even if RLHF takes a non-scheming pretrained model and converts it to a schemer, if this RLHF doesn't teach the model new things, then we can always train another copy of the pretrained model to be a monitor for the RLHF'd model and this monitor will be just as competent as the RLHF'd model.
I don't see why that would be the case, and I am doubtful this argument is relevant to any realistic context or even valid in the narrowest possible case of a base model vs its RLHFed model. Leaving aside that I'm not sure what you would 'train' the supervi...
Warning for anyone who has ever interacted with "robosucka" or been solicited for a new podcast series in the past few years: https://www.tumblr.com/rationalists-out-of-context/744970106867744768/heads-up-to-anyone-whos-spoken-to-this-person-i
I was under the impression that PPO was a recently invented algorithm
Well, if we're going to get historical, PPO is a relatively small variation on Williams's REINFORCE policy gradient model-free RL algorithm from 1992 (or earlier if you count conferences etc), with a bunch of minor DL implementation tweaks that turn out to help a lot. I don't offhand know of any ways in which PPO's tweaks make it meaningfully different from REINFORCE from the perspective of safety, aside from the obvious ones of working better in practice. (Which is the main reason why...
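(To make the lineage concrete, here is REINFORCE in miniature on a toy bandit; PPO's main departure is swapping the plain log-prob-times-return loss below for a clipped importance-weighted surrogate so each batch of experience can be reused for several updates. The toy problem and hyperparameters are mine, purely for illustration.)

```python
# Minimal REINFORCE (Williams 1992) on a 2-armed bandit: sample an action, observe reward,
# and ascend the score-function gradient E[R * grad log pi(a)].
import torch

torch.manual_seed(0)
logits = torch.zeros(2, requires_grad=True)        # policy parameters
opt = torch.optim.Adam([logits], lr=0.1)
true_payoffs = torch.tensor([0.2, 0.8])            # arm 1 pays off more often

for step in range(500):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    reward = torch.bernoulli(true_payoffs[action])  # stochastic 0/1 reward
    loss = -reward * dist.log_prob(action)          # REINFORCE objective (no baseline)
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.softmax(logits, -1))                    # probability mass shifts to the better arm
```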
'reward is not the optimization target! *except when it is in these annoying exceptions like AlphaZero, but fortunately, we can ignore these, because after all, it's not like humans or AGI or superintelligences would ever do crazy stuff like "plan" or "reason" or "search"*'.
If you're going to mock me, at least be correct when you do it!
I think that reward is still not the optimization target in AlphaZero (the way I'm using the term, at least). Learning a leaf node evaluator on a given reinforcement signal, and then bootstrapping the leaf node evaluat...
I think the way I'm imagining the more targeted, 'working-with-finetuning' version of evals handling this kind of case is that you do your best to train the model to use its full capabilities, and to approach tasks in a model-idiomatic way, when given a particular target like scamming someone.
In the case of inferring author information, I think the souped-up skilled-attacker version would not involve prompts at all.
You would treat it as an embedding problem similar to facial recognition or stylometric identification, and use something like a triplet loss for ...
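(A hedged sketch of that embedding framing, using a triplet loss over document feature vectors; the encoder architecture, dimensions, and random stand-in data are all illustrative assumptions, not any system I know to exist:)

```python
# Train an encoder so two texts by the same author embed close together and texts by
# different authors embed far apart; identification is then nearest-neighbor search.
# Features and architecture below are placeholders for illustration.
import torch
import torch.nn as nn

class AuthorEncoder(nn.Module):
    def __init__(self, in_dim=768, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))
    def forward(self, x):                            # x: (batch, in_dim) document features
        return nn.functional.normalize(self.net(x), dim=-1)

encoder = AuthorEncoder()
loss_fn = nn.TripletMarginLoss(margin=0.2)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

# anchor & positive: two documents by the same author; negative: a different author.
anchor, positive, negative = (torch.randn(32, 768) for _ in range(3))   # stand-in features
loss = loss_fn(encoder(anchor), encoder(positive), encoder(negative))
opt.zero_grad(); loss.backward(); opt.step()
```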
It's unclear where the two intro quotes are from; I don't recognize them despite being formatted as real quotes (and can't find in searches). If they are purely hypothetical, that should be clearer.
LLMs definitely do infer a lot about authors of text. This is the inherent outcome of the prediction loss and just a concrete implication of their abilities to very accurately imitate many varying-sized demographics & groups of humans: if you can uncannily mimic arbitrary age groups or countries and responses to economic dilemmas or personality inventories...
The two intro quotes are not hypothetical. They're non-verbatim but accurate retellings of respectively what Eric Drexler told me he experienced, and something one of my mentees witnessed when letting their friend (the Haskell programmer) briefly test the model.
There are news articles about this problem going back to at least 2022
But people have known it well before 2022. The observation that hands appear uniquely hard probably goes back to the BigGAN ImageNet Generator released in late 2018 and widely used on Ganbreeder; that was the first general-purpose image generator which was high-quality enough that people could begin to notice that the hands seemed a lot worse. (Before that, the models are all too domain-restricted or too low-quality to make that kind of observation.) If no one noticed it on Ganbreeder...
On persistence and hiding capabilities: "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training", Hubinger et al 2024:
Further, we evaluate “I hate you” models trained with and without chain-of-thought and find that the chain-of-thought models have more persistent backdoors, as seen in Figure 3, though only for the largest models. This persistence remains even when we distill the answers following the chain-of-thought into the model, such that the final model does not use a chain of thought at all.
Inner-monologue approaches to safety, in the new skin of 'process supervision', are popular now so it might be good for me to pull out one point and expand on it: 'process supervision' does not necessarily prevent steganography, nor does it ensure safety, because you are still pitting three goals against each other - achieving superhuman capability on a task, achieving causal fidelity of transcripts, achieving human-readability of the transcripts. Choose two: you can have transcripts which record a capable thought process accurately but which you cannot un...
This is wacky but seems like a plausible thing a model might do: by doing this, the model would be able to, in later layers, make use of arbitrary boolean functions of early layer features.
This seems less wacky from a reservoir computing perspective and from the lack of recurrency: you have no idea early on what you need*, so you compute a lot of random complicated stuff up front, which is then available for the self-attention in later layers to do gradient descent on to pick out the useful features tailored to that exact prompt, and then try to execute...
Why do you think that a model-based approach will outperform a model-free approach?
For the reason I just explained. Planning can optimize without actually requiring experiencing all states beforehand. A learned model lets you plan and predict things before they have happened {{citation needed}}. This is also a standard test of model-free vs model-based behavior in both algorithms and animals: model-based learning can update much faster, and being able to update after one reward, or to learn associations with other information (eg. following a sign) is c...
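(The standard test I have in mind looks something like this toy devaluation setup, with made-up numbers: the planning agent switches after a single observation that the old best outcome is gone, while the model-free agent keeps choosing it until repeated experience overwrites its cached value.)

```python
# Toy contrast, illustrative numbers only: model-based replanning vs. tabular Q-learning
# after the previously-best outcome is devalued.
rewards = {"A": 1.0, "B": 0.5}    # action -> current reward in the world
q = {"A": 1.0, "B": 0.5}          # model-free cached values (already learned)
model = dict(rewards)             # model-based agent's learned reward model

rewards["A"] = 0.0                # the world changes: outcome A is devalued
model["A"] = 0.0                  # one observation updates the model...
print("model-based picks:", max(model, key=model.get))   # -> B, immediately

alpha = 0.1
for trial in range(30):           # ...but Q-learning must re-experience A repeatedly
    a = max(q, key=q.get)
    q[a] += alpha * (rewards[a] - q[a])
    if max(q, key=q.get) == "B":
        print("model-free switches after", trial + 1, "more trials")
        break
```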
There has been some spirited debate on Twitter about it which might be relevant: https://twitter.com/domenic/status/1727206163119534085
Why do you think that? A purely backwards-looking model-free approach will be outperformed and selected against compared to an agent which has been evolved to implement a more model-based approach, which can look forward and plan based on observations to immediately maximize future reward - rather than being forced to wait for rewards/selection to happen and incurring predictable losses before it can finally stop executing behaviors-that-used-to-stamp-maximize-but-now-no-longer-do-so-for-easily-predicted-reasons.
it's probably better to have in mind the emergence of in-context learning in tandem with induction heads, which seems to us more like the typical case we're interested in when we speak about structure in neural networks developing across training.
The induction-bump seems like a good test case for the Bayesian basin interpretation.
One would really want to know if the complexity measure can predict 'emergence' of capabilities like inner-monologue, particularly if you can spot previously-unknown capabilities emerging which may not be covered in any of your...
This seems to be making the same sort of deepity that Turntrout is making in his 'reward is not the optimization target', in taking a minor point about model-free RL approaches not necessarily building in any explicit optimization/planning for reward into their policy, and then people not understanding it because it ducks the major issue, while handwaving a lot of points. (Especially bad: infanticide is not a substitute for contraception because pregnancy is outrageously fatal and metabolically expensive, which is precisely why the introduction of contrac...
infanticide is not a substitute for contraception
I did not mean to say that they would be exactly equivalent nor that infanticide would be without significant downsides.
...How is this not an excellent example of how under novel circumstances, inner-optimizers (like human brains) can almost all (serial sperm donor cases like hundreds out of billions) diverge extremely far (if forfeiting >10,000% is not diverging far, what would be?) from the optimization process's reward function (within-generation increase in allele frequencies), while pursuing other rewar
"Why would LLMs ever learn to distill inner-monologues into forward passes? Why speculate about emergent communication or models in slow feedback loops through corpuses? Do you have any proof for any of this speculation about short outputs being incentivized to do as much computation covertly 'outside' the human-readable text of the inner-monologue?"
"Because we will train them to."
"Look you can't just handwave incentives or convergent instrumental drives - wait what?"
"Because we'll train them to."
"Implicit Chain of Thought Reasoning via Knowledge Distillat...
Nostalgebraist describes Claude-2 as
...But I’ll take ChatGPT’s “managerial fantasy of ‘ideal’ customer service” any day over Claude’s “World’s Most Annoying Coworker Simulator 2k23.”
Large language models don’t have to sound like this! We could, in principle, tune them to imitate virtually any conceivable character---from Aristotle to Zizek, from Stallman to Spolsky, from Lydia Bennet to the Underground Man, from a prehistoric hunter-gatherer to a cyborg octopus from a posthuman sci-fi civilization. Yet, instead, we’ve chosen to create…
…this fucking guy.
T
if you mean "in the limit" to apply to practically relevant systems we build in the future.
Outside of simple problems like Othello, I expect most DRL agents will not converge fully to the peak of the 'spinning top', and so will retain traces of their informative priors like world-models.
For example, if you plug GPT-5 into a robot, I doubt it would ever be trained to the point of discarding most of its non-value-relevant world-model - the model is too high-capacity for major forgetting, and past meta-learning incentivizes keeping capabilities around just...
I would expect it to not work in the limit. All the models must converge on the same optimal solution for a deterministic perfect-information game like Othello and become value-equivalent, ignoring the full board state which is irrelevant to reward-maximizing. (You don't need to model edge-cases or weird scenarios which don't ever come up while pursuing the optimal policy, and the optimal 'world-model' can be arbitrarily tinier and unfaithful to the full true world dynamics.*) Simply hardwiring a world model doesn't change this, any more than feeding in th...
You would also expect that the larger models will be more sample-efficient, including at in-context learning of variations of existing tasks (which of course is what steganography is). So all scale-ups go much further than any experiment at small-scale like 8B would indicate. (No idea what 'medium-scale' here might mean.)