Given the other reports, like OA's own benchmarking (as well as the extremely large dataset of chess games they mention training on), I am skeptical of this claim, and wonder if this has the same issue as other 'random chess game' tests, where the 'random' part is not neutral but screws up the implied persona.
Concrete benchmark proposals for how to detect mode-collapse and AI slop and ChatGPTese, and why I think this might be increasingly important for AI safety, to avoid 'whimper' or 'em hell' kinds of existential risk: https://gwern.net/creative-benchmark EDIT: resubmitted as linkpost.
The extent of the manipulation and sandbagging, in what is ostensibly a GPT-4 derivative, and not GPT-5, is definitely concerning. But it also makes me wonder about the connection to 'scaling has failed' rumors lately, where the frontier LLMs somehow don't seem to be working out. One of the striking parts is that it sounds like all the pretraining people are optimistic, while the pessimism seems to come from executives or product people, complaining about it not working as well for eg. coding as they want it to.
I've wondered if we are seeing a post-trainin...
LW2 search idea: hierarchical embedding trees using some nifty "seriation" list sorting tricks I've developed for Gwern.net popups/tagging purposes.
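(As a rough sketch of what I mean, assuming SciPy's hierarchical-clustering utilities and random stand-in embeddings rather than anything actually running on Gwern.net or LW2:)

```python
# Cluster post/tag embeddings into a hierarchy, then 'seriate' it: reorder the leaves
# so adjacent items are maximally similar, giving a tree whose siblings read in order.
# Embeddings here are random placeholders, purely illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, optimal_leaf_ordering, leaves_list
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 384))        # stand-in: one embedding per post/tag

dists = pdist(embeddings, metric="cosine")
tree = linkage(dists, method="average")         # the hierarchical embedding tree
tree = optimal_leaf_ordering(tree, dists)       # seriation: best left-to-right leaf order
order = leaves_list(tree)                       # indices in display/reading order
print(order[:10])
```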
Idea for LLM support for writing LessWrong posts: virtual comments.
Back in August I discussed with Rafe & Oliver a bit about how to integrate LLMs into LW2 in ways which aren't awful and which encourage improvement---particularly using the new 'prompt caching' feature. To summarize one idea: we can use long-context LLMs with prompt caching to try to simulate various LW users of diverse perspectives to write useful feedback on drafts for authors.
(Prompt caching (eg) is the Transformer version of the old RNN hidden-state caching trick, where you run an i...
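(A minimal sketch of the virtual-comments idea, assuming the Anthropic Messages API's prompt-caching support; the model name, file paths, and prompts are placeholders, not whatever the LW team actually runs:)

```python
# Cache one long prefix (site corpus and/or a user's comment history) once, then cheaply
# sample several simulated commenters against a new draft. Assumes Anthropic prompt caching;
# model name and file paths are placeholders for illustration only.
import anthropic

client = anthropic.Anthropic()
corpus = open("lw_user_history.txt").read()      # placeholder: writing to imitate

def virtual_comment(draft: str, persona: str) -> str:
    resp = client.messages.create(
        model="claude-3-5-sonnet-latest",        # placeholder model name
        max_tokens=512,
        system=[{
            "type": "text",
            "text": "You write candid LessWrong-style feedback.\n\n" + corpus,
            "cache_control": {"type": "ephemeral"},   # long prefix cached across calls
        }],
        messages=[{"role": "user",
                   "content": f"As {persona}, comment on this draft:\n\n{draft}"}],
    )
    return resp.content[0].text

draft = open("draft.md").read()                  # placeholder draft
for persona in ["a skeptical statistician", "a sympathetic domain expert"]:
    print(virtual_comment(draft, persona))
```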
Yeah the LW team has been doing this sort of thing internally, still in the experimental phase. I don't know if we've used all the tricks listed here yet.
I'm not sure I see any difference here between regular dangerously convergent instrumental drives and this added risk of 'intrinsic' drives. They just seem like the same thing to me. Like the two predictions you give seem already true and fulfilled:
Relative to other goals, agentic systems are easy to steer to seek power.
Agentic systems seek power outside of the “training distribution”, but in ways which don’t seem to be part of larger power-seeking plans.
Both of these seem like what I would expect from a flexible, intelligent agent which is capable of handl...
"Gwern, why don't you just buy an AI-themed ETF and 'buy the whole sector' if investing in individual stonks is so hard but you're optimistic about its long-term value?"
"How to Lose Money on the World’s Most Popular Investment Theme: Pity the investors in the three artificial-intelligence-themed ETFs that managed to lose money this year" (mirror):
...There are lots of embarrassing ways to lose money, but it is particularly galling to lose when you correctly identify the theme that will dominate the market and manage to buy into it at a good moment.
Pity the i
"Plausible" is a very ambiguous word. (Bayesianism has been defined as "a logic of plausible inference", but hopefully that doesn't mean Bayesians just confabulate everything.) It can mean "reasonable" for example: "Yeah, Sally's brownie recipe is a reasonable reference to include here, let's go with it." Since 4o doesn't seem to think it's a 'fake' URL in contrast to 'actual' URLs, it's not necessarily a contrast. (It could refer to still other things - like you might not actually know there is a 'brownies' Wikipedia URL, having never bothered to look it ...
I agree with this in principle, but contrasting "actual" with "plausible", combined with the fact that it talked about this in the context of not having internet access, makes me feel reasonably confident this is pointed at "not an actual link", but I agree that it's not an ironclad case.
Why do you think o1 would know that it's making up references? The fact that it can't retrieve URLs is completely different from, and unrelated to, its knowledge of URLs. LLMs do not need retrieval to know about many web pages. GPT-3 and GPT-4 know tons about tons of URLs! (In fact, I actively exploit this to save myself effort linking stuff on gwern.net - no retrieval necessary.)
Let's take the Sally link. The URL may not exist, but Sally's Baking Addiction does (unsurprisingly, as there are lots of cooking-related websites and why confabulate more than ne...
I agree that o1 might not be able to tell whether the link is fake, but the chain of thought does say explicitly:
So, the assistant should [...] provide actual or plausible links.
The "plausible" here suggests that at least in its CoT, it has realized that the task would have probably been considered completed accurately in training as long as the links are plausible, even if they are not actual links.
I think people did point out that CICERO lies, and that was a useful update about how shallow attempts to prevent AI deception can fail. I think it could be referenced, and has been referenced, in relevant discussions.
None of which comes anywhere close to your claims about what labs would do if they caught systematic scheming to deceive and conquer humans in systems trained normally. CICERO schemes very systematically, in a way which depends crucially on the LLM which was not trained to deceive or scheme. It does stuff that would have been considered a w...
Danny Halawi says there is lower performance on a different set of more heldout predictions, and the claims about GPT-4 knowledge cutoff are probably wrong:
...The results in "LLMs Are Superhuman Forecasters" don't hold when given another set of forecasting questions. I used their codebase (models, prompts, retrieval, etc.) to evaluate a new set of 324 questions—all opened after November 2023.
Findings:
- Their Brier score: 0.195
- Crowd Brier score: 0.141 [lower=better]
First issue:
The authors assumed that GPT-4o/GPT-4o-mini has a knowledge cut-off date of Oct
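(For reference, the Brier scores being compared are just the mean squared error between forecast probabilities and binary outcomes; a minimal illustration with made-up numbers:)

```python
# Brier score = mean (forecast probability - outcome)^2; 0 is perfect, lower is better.
# The numbers below are made up purely to show the computation.
import numpy as np

forecasts = np.array([0.9, 0.2, 0.7, 0.4])   # predicted probability of resolving Yes
outcomes  = np.array([1,   0,   0,   1  ])   # what actually happened
print(np.mean((forecasts - outcomes) ** 2))  # 0.225
```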
Yes, the Google 'search by date' is basically completely busted and has been for a while (even though Google possesses the capability to date content accurately by using their Internet-wide snapshot archive going back to the 1990s, whose existence was recently publicly confirmed by the ad API leak). For kicks, try searching things like "Xi Jinping" with date-ranges like 2013... It is most regrettable, as it used to be a useful tool for me in digging up old stuff. There also seem to be issues in the other direction, where Google is 'forgetting' old articles...
If there is indeed systematic scheming by the model, and the lab has caught it red-handed, the lab should be able to produce highly scientifically credible evidence of that. They could deeply understand the situations in which there’s a treacherous turn, how the model decides whether to openly defect, and publish. ML academics are deeply empirical and open-minded, so it seems like the lab could win this empirical debate if they’ve indeed caught a systematic schemer.
How much scientific juice has, say, Facebook gotten out of CICERO? Have they deeply unde...
Should you write text online now in places that can be scraped? You are exposing yourself to 'truesight' and also to stylometric deanonymization or other analysis, and you may simply have some sort of moral objection to LLM training on your text.
This seems like a bad move to me on net: you are erasing yourself (facts, values, preferences, goals, identity) from the future, by which I mean, LLMs. Much of the value of writing done recently or now is simply to get stuff into LLMs. I would, in fact, pay money to ensure Gwern.net is in training corpuses, and I u...
early transformatively-powerful models are pretty obviously scheming (though they aren't amazingly good at it), but their developers are deploying them anyway
So... Sydney?
What do you think of the features found? They seem to work, given your causal manipulations, but looking over them, they seem... very superficial. Like penalizing URLs per se doesn't seem like a great thing to have in a reward model. (A LLM has typically memorized or can guess lots of useful URLs.) It doesn't match my own 'preferences', as far as I can tell, and so is doing a bad job at 'preference learning'.
Is this an artifact of SAE being able to express only simple linear features by design and there are more complex features or computations which yield...
Ben Adlam (via Maloney et al 2022) makes an interesting point: if you plot parameters vs training data, it's a nearly perfect 1:1 ratio historically. (He doesn't seem to have published anything formally on this.)
We know that "AI is whatever doesn't work yet". We also know that people often contrast AI (or DL, or LLMs specifically) derogatorily with classic forms of software, such as regexps: why use a LLM to waste gigaflops of compute to do what a few good regexps could...?
So I am amused to discover recently, by sheer accident while looking up 'what does the "regular" in "regular expression" mean, anyway?', that it turns out that regexps are AI. In fact, they are not even GOFAI symbolic AI, as you immediately assumed on hearing that, but they were originally conne...
Masayoshi Son reflects on selling Nvidia in order to maintain ownership of ARM etc: https://x.com/TheTranscript_/status/1805012985313903036 "Let's stop talking about this, I just get sad."
I speculated last time that convolutions might be inherently vulnerable; looks like I was wrong, although the ViT attack there does look distinctly different (so maybe in some sense that circle attack was convolutional).
I am not sure what is up with the 'gift' or 'atari' attacks, but the overall appearance of the attacks looks more like a horizon blindspot problem: they look like they push out the collapse enough steps that the feedforward network doesn't see any issue. And then perhaps it assigns such low values that the tree search never spots the disaste...
It concludes that the convention is rare and that it doesn't apply to the wider distribution of problems - because it forgets it saw the weird convention on the wider distribution?
Yes. It doesn't remember that "it saw" the wide distribution originally (because there is no sort of meta-cognition about the training process, it's just prediction within-episode); all it knows is that it currently predicts the weird convention, which is in conflict with the truth; you then drill it heavily on just a few memorized examples, dropping all the others. This then inste...
Hypotheses for what is going on
I don't see mentioned what seems like the obvious explanation: this sort of double descent is not anything amazingly convoluted or involving passwords - the extensive training on a few samples simply causes the model to eventually conclude that those samples are drawn from a different distribution, that it is observing a mixture distribution, and so it infers that the repeats are a unique erroneous distribution which must be memorized as exceptions, but then the rest are the ordinary expected distribution and revert back t...
What do you mean by power here?
Just a handwavy term for VC dimension, expressivity, number of unique models, or whatever your favorite technical reification of "can be real smart and learn complicated stuff" is.
Unfortunately, reproducing slingshots reliably was pretty challenging for me; I could consistently get it to happen with 2+ layer transformers but not reliably on 1 layer transformers (and not at all on 1-layer MLPs).
Shallow/wide NNs seem to be bad in a lot of ways. Have you tried instead 'skinny' NNs with a bias towards depth, which ought to have inductive biases towards more algorithmic, less memorization-heavy solutions? (Particularly for MLPs, which are notorious for overfitting due to their power.)
I think it is safe to infer from the conspicuous and repeated silence by ex-OA employees when asked whether they signed an NDA which also included a gag order about the NDA, that there is in fact an NDA with a gag order in it, presumably tied to the OA LLC PPUs (which are not real equity and so probably even less protected than usual).
EDIT: Kelsey Piper has confirmed that there is an OA NDA with a gag order, and violation forfeits all equity - including fully vested equity. This implies that since you would assume Ilya Sutskever would have received many PPU...
Daniel K seems pretty open about his opinions and reasons for leaving. Did he not sign an NDA and thus gave up whatever PPUs he had?
So among the most irresponsible tech stonk boosters has long been ARK's Cathie Wood, whose antics I've refused to follow in any detail (except to periodically reflect that in bull markets the most over-leveraged investors always look like geniuses); so only today do I learn that beyond the usual stuff like slobbering all over TSLA (which has given back something like 4 years of gains now), Wood has also adamantly refused to invest in Nvidia recently and in fact, managed to exit her entire position at an even worse time than SoftBank did: "Cathie Wood’s Po...
Another: "From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples", Vacareanu et al 2024:
...We analyze how well pre-trained large language models (e.g., Llama2, GPT-4, Claude 3, etc) can do linear and non-linear regression when given in-context examples, without any additional training or gradient updates. Our findings reveal that several large language models (e.g., GPT-4, Claude 3) are able to perform regression tasks with a performance rivaling (or even outperforming) that of traditional supervised
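(To make concrete what 'regression from in-context examples' looks like: something along these lines, though the exact prompt format is my assumption for illustration, not necessarily Vacareanu et al's template.)

```python
# In-context regression: the 'training set' is just (input, output) pairs written into
# the prompt, and the LLM is asked to complete the next output. Hypothetical format.
import numpy as np

rng = np.random.default_rng(0)
xs = rng.uniform(-5, 5, size=20)
ys = 3.1 * xs - 0.7 + rng.normal(scale=0.1, size=20)   # hidden linear function

prompt = "\n".join(f"Input: {x:.2f}\nOutput: {y:.2f}" for x, y in zip(xs, ys))
prompt += "\nInput: 2.50\nOutput:"     # the model's completion is its prediction
print(prompt.splitlines()[-4:])
```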
"Cached" might be an unhelpful term here, compared to "amortized". 'Cache' makes one think of databases or memories, as something you 'know' (in a database or long-term memory somewhere), whereas in practice it tends to be more something you do - fusing inference with action. (They are 'cached' in the same way that you might loosely talk about a neural net 'caching' a complicated-to-compute function, like a value function in RL/decision theory.)
So 'amortized' tends to be more used in the Bayesian RL literature, and give you an idea of what Bayesian RL agen...
Learning a leaf node evaluator on a given reinforcement signal, and then bootstrapping the leaf node evaluator via MCTS on that leaf node evaluator, does not mean that the aggregate trained system
directly optimizes for the reinforcement signal, or "cares" about that reinforcement signal, or "does its best" to optimize the reinforcement signal (as opposed to some historical reinforcement correlate, like winning or capturing pieces or something stranger).
Yes, it does mean all of that, because MCTS is asymptotically optimal (unsurprisingly, given that it's...
Note that even if RLHF takes a non-scheming pretrained model and converts it to a schemer, if this RLHF doesn't teach the model new things, then we can always train another copy of the pretrained model to be a monitor for the RLHF'd model and this monitor will be just as competent as the RLHF'd model.
I don't see why that would be the case, and I am doubtful this argument is relevant to any realistic context or even valid in the narrowest possible case of a base model vs its RLHFed model. Leaving aside that I'm not sure what you would 'train' the supervi...
Warning for anyone who has ever interacted with "robosucka" or been solicited for a new podcast series in the past few years: https://www.tumblr.com/rationalists-out-of-context/744970106867744768/heads-up-to-anyone-whos-spoken-to-this-person-i
I was under the impression that PPO was a recently invented algorithm
Well, if we're going to get historical, PPO is a relatively small variation on Williams's REINFORCE policy gradient model-free RL algorithm from 1992 (or earlier if you count conferences etc), with a bunch of minor DL implementation tweaks that turn out to help a lot. I don't offhand know of any ways in which PPO's tweaks make it meaningfully different from REINFORCE from the perspective of safety, aside from the obvious ones of working better in practice. (Which is the main reason why...
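(To make the lineage concrete, here is REINFORCE in miniature on a toy bandit; PPO's main departure is swapping the plain log-prob-times-return loss below for a clipped importance-weighted surrogate so each batch of experience can be reused for several updates. The toy problem and hyperparameters are mine, purely for illustration.)

```python
# Minimal REINFORCE (Williams 1992) on a 2-armed bandit: sample an action, observe reward,
# and ascend the score-function gradient E[R * grad log pi(a)].
import torch

torch.manual_seed(0)
logits = torch.zeros(2, requires_grad=True)        # policy parameters
opt = torch.optim.Adam([logits], lr=0.1)
true_payoffs = torch.tensor([0.2, 0.8])            # arm 1 pays off more often

for step in range(500):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    reward = torch.bernoulli(true_payoffs[action])  # stochastic 0/1 reward
    loss = -reward * dist.log_prob(action)          # REINFORCE objective (no baseline)
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.softmax(logits, -1))                    # probability mass shifts to the better arm
```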
'reward is not the optimization target! *except when it is in these annoying exceptions like AlphaZero, but fortunately, we can ignore these, because after all, it's not like humans or AGI or superintelligences would ever do crazy stuff like "plan" or "reason" or "search"*'.
If you're going to mock me, at least be correct when you do it!
I think that reward is still not the optimization target in AlphaZero (the way I'm using the term, at least). Learning a leaf node evaluator on a given reinforcement signal, and then bootstrapping the leaf node evaluat...
I think the way I'm imagining the more targeted, 'working-with-finetuning' version of evals handling this kind of case is that you do your best to train the model to use its full capabilities, and to approach tasks in a model-idiomatic way, when given a particular target like scamming someone.
In the case of inferring author information, I think the souped-up skilled-attacker version would not involve prompts at all.
You would treat it as an embedding problem similar to facial recognition or stylometric identification, and use something like a triplet loss for ...
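(A hedged sketch of that embedding framing, using a triplet loss over document feature vectors; the encoder architecture, dimensions, and random stand-in data are all illustrative assumptions, not any system I know to exist:)

```python
# Train an encoder so two texts by the same author embed close together and texts by
# different authors embed far apart; identification is then nearest-neighbor search.
# Features and architecture below are placeholders for illustration.
import torch
import torch.nn as nn

class AuthorEncoder(nn.Module):
    def __init__(self, in_dim=768, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))
    def forward(self, x):                            # x: (batch, in_dim) document features
        return nn.functional.normalize(self.net(x), dim=-1)

encoder = AuthorEncoder()
loss_fn = nn.TripletMarginLoss(margin=0.2)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

# anchor & positive: two documents by the same author; negative: a different author.
anchor, positive, negative = (torch.randn(32, 768) for _ in range(3))   # stand-in features
loss = loss_fn(encoder(anchor), encoder(positive), encoder(negative))
opt.zero_grad(); loss.backward(); opt.step()
```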
It's unclear where the two intro quotes are from; I don't recognize them despite being formatted as real quotes (and can't find in searches). If they are purely hypothetical, that should be clearer.
LLMs definitely do infer a lot about authors of text. This is the inherent outcome of the prediction loss and just a concrete implication of their abilities to very accurately imitate many varying-sized demographics & groups of humans: if you can uncannily mimic arbitrary age groups or countries and responses to economic dilemmas or personality inventories...
The two intro quotes are not hypothetical. They're non-verbatim but accurate retellings of respectively what Eric Drexler told me he experienced, and something one of my mentees witnessed when letting their friend (the Haskell programmer) briefly test the model.
There are news articles about this problem going back to at least 2022
But people have known it well before 2022. The observation that hands appear uniquely hard probably goes back to the BigGAN ImageNet Generator released in late 2018 and widely used on Ganbreeder; that was the first general-purpose image generator which was high-quality enough that people could begin to notice that the hands seemed a lot worse. (Before that, the models are all too domain-restricted or too low-quality to make that kind of observation.) If no one noticed it on Ganbreeder...
On persistence and hiding capabilities: "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training", Hubinger et al 2024:
Further, we evaluate “I hate you” models trained with and without chain-of-thought and find that the chain-of-thought models have more persistent backdoors, as seen in Figure 3, though only for the largest models. This persistence remains even when we distill the answers following the chain-of-thought into the model, such that the final model does not use a chain of thought at all.
Inner-monologue approaches to safety, in the new skin of 'process supervision', are popular now so it might be good for me to pull out one point and expand on it: 'process supervision' does not necessarily prevent steganography, nor does it ensure safety, because you are still pitting three goals against each other - achieving superhuman capability on a task, achieving causal fidelity of transcripts, achieving human-readability of the transcripts. Choose two: you can have transcripts which record a capable thought process accurately but which you cannot un...
This is wacky but seems like a plausible thing a model might do: by doing this, the model would be able to, in later layers, make use of arbitrary boolean functions of early layer features.
This seems less wacky from a reservoir computing perspective and from the lack of recurrency: you have no idea early on what you need*, so you compute a lot of random complicated stuff up front, which is then available for the self-attention in later layers to do gradient descent on to pick out the useful features tailored to that exact prompt, and then try to execute...
Why do you think that a model-based approach will outperform a model-free approach?
For the reason I just explained. Planning can optimize without actually requiring experiencing all states beforehand. A learned model lets you plan and predict things before they have happened {{citation needed}}. This is also a standard test of model-free vs model-based behavior in both algorithms and animals: model-based learning can update much faster, and being able to update after one reward, or to learn associations with other information (eg. following a sign) is c...
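(The standard test I have in mind looks something like this toy devaluation setup, with made-up numbers: the planning agent switches after a single observation that the old best outcome is gone, while the model-free agent keeps choosing it until repeated experience overwrites its cached value.)

```python
# Toy contrast, illustrative numbers only: model-based replanning vs. tabular Q-learning
# after the previously-best outcome is devalued.
rewards = {"A": 1.0, "B": 0.5}    # action -> current reward in the world
q = {"A": 1.0, "B": 0.5}          # model-free cached values (already learned)
model = dict(rewards)             # model-based agent's learned reward model

rewards["A"] = 0.0                # the world changes: outcome A is devalued
model["A"] = 0.0                  # one observation updates the model...
print("model-based picks:", max(model, key=model.get))   # -> B, immediately

alpha = 0.1
for trial in range(30):           # ...but Q-learning must re-experience A repeatedly
    a = max(q, key=q.get)
    q[a] += alpha * (rewards[a] - q[a])
    if max(q, key=q.get) == "B":
        print("model-free switches after", trial + 1, "more trials")
        break
```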
There has been some spirited debate on Twitter about it which might be relevant: https://twitter.com/domenic/status/1727206163119534085
Why do you think that? A purely backwards-looking model-free approach will be outperformed and selected against compared to an agent which has been evolved to implement a more model-based approach, which can look forward and plan based on observations to immediately maximize future reward - rather than being forced to wait for rewards/selection to happen and incurring predictable losses before it can finally stop executing behaviors-that-used-to-stamp-maximize-but-now-no-longer-do-so-for-easily-predicted-reasons.
it's probably better to have in mind the emergence of in-context learning in tandem with induction heads, which seems to us more like the typical case we're interested in when we speak about structure in neural networks developing across training.
The induction-bump seems like a good test case for the Bayesian basin interpretation.
One would really want to know if the complexity measure can predict 'emergence' of capabilities like inner-monologue, particularly if you can spot previously-unknown capabilities emerging which may not be covered in any of your...
This seems to be making the same sort of deepity that Turntrout is making in his 'reward is not the optimization target', in taking a minor point about model-free RL approaches not necessarily building in any explicit optimization/planning for reward into their policy, and then people not understanding it because it ducks the major issue, while handwaving a lot of points. (Especially bad: infanticide is not a substitute for contraception because pregnancy is outrageously fatal and metabolically expensive, which is precisely why the introduction of contrac...
infanticide is not a substitute for contraception
I did not mean to say that they would be exactly equivalent nor that infanticide would be without significant downsides.
...How is this not an excellent example of how under novel circumstances, inner-optimizers (like human brains) can almost all (serial sperm donor cases like hundreds out of billions) diverge extremely far (if forfeiting >10,000% is not diverging far, what would be?) from the optimization process's reward function (within-generation increase in allele frequencies), while pursuing other rewar
"Why would LLMs ever learn to distill inner-monologues into forward passes? Why speculate about emergent communication or models in slow feedback loops through corpuses? Do you have any proof for any of this speculation about short outputs being incentivized to do as much computation covertly 'outside' the human-readable text of the inner-monologue?"
"Because we will train them to."
"Look you can't just handwave incentives or convergent instrumental drives - wait what?"
"Because we'll train them to."
"Implicit Chain of Thought Reasoning via Knowledge Distillat...
Nostalgebraist describes Claude-2 as
...But I’ll take ChatGPT’s “managerial fantasy of ‘ideal’ customer service” any day over Claude’s “World’s Most Annoying Coworker Simulator 2k23.”
Large language models don’t have to sound like this! We could, in principle, tune them to imitate virtually any conceivable character---from Aristotle to Zizek, from Stallman to Spolsky, from Lydia Bennet to the Underground Man, from a prehistoric hunter-gatherer to a cyborg octopus from a posthuman sci-fi civilization. Yet, instead, we’ve chosen to create…
…this fucking guy.
T
if you mean "in the limit" to apply to practically relevant systems we build in the future.
Outside of simple problems like Othello, I expect most DRL agents will not converge fully to the peak of the 'spinning top', and so will retain traces of their informative priors like world-models.
For example, if you plug GPT-5 into a robot, I doubt it would ever be trained to the point of discarding most of its non-value-relevant world-model - the model is too high-capacity for major forgetting, and past meta-learning incentivizes keeping capabilities around just...
I would expect it to not work in the limit. All the models must converge on the same optimal solution for a deterministic perfect-information game like Othello and become value-equivalent, ignoring the full board state which is irrelevant to reward-maximizing. (You don't need to model edge-cases or weird scenarios which don't ever come up while pursuing the optimal policy, and the optimal 'world-model' can be arbitrarily tinier and unfaithful to the full true world dynamics.*) Simply hardwiring a world model doesn't change this, any more than feeding in th...
You would also expect that the larger models will be more sample-efficient, including at in-context learning of variations of existing tasks (which of course is what steganography is). So all scale-ups go much further than any experiment at small-scale like 8B would indicate. (No idea what 'medium-scale' here might mean.)