AI ALIGNMENT FORUM
AF

All of Thane Ruthenis's Comments + Replies

I still think that even for the things you described, it will be relatively easy for the base model to understand what is going on, and it's likely that GPT-4o will too

Maaaybe. Note, though, that "understand what's going on" isn't the same as "faithfully and comprehensively translate what's going on into English". Any number of crucial nuances might be accidentally lost in translation (due to the decoder model not properly appreciating how important they are), or deliberately hidden (if the RL'd model performs a sneaky jailbreak on the decoder, see Pliny-style token bombs or jailbreaks encoded in metaphor).

Fabien's Shortform

Thane Ruthenis8d42

What sorts of encoding do you expect to be natural to LLMs besides encodings already present in pretraining and that GPT-4o can decode?

Spitballing: Some sort of trivial-but-abstract combination of the encodings present in pretraining that produces an encoding that's particularly easy for the LLM to think in due to their architecture/learned associations, but which is opaque to us, because the abstract space in which the combination is trivial is hopelessly beyond our current theory of languages, such that we can't easily reverse-engineer it.

"Switching betw... (read more)

3Fabien Roger7d

Thanks for the example! I think it moderately convinced me that having CoT become hard for humans to understand in the next 2 years is slightly more plausible than I thought. I still think that even for the things you described, it will be relatively easy for the base model to understand what is going on, and it's likely that GPT-4o will too. See the discussion in another comment thread about "chicken-and-egg" problems when learning new encodings, which I think make it much easier to learn things like switching languages or skipping stopwords that take almost no additional effort to decode (therefore the it is incentivized to use the new encoding without the need to learn a new decoder). I see how once you get a model using "freestyling words" frequently the decoder becomes better over time at understanding freestyling in general (though it will be slow because it is trained using RL rather than SL), which allows you to eventually shift to encodings that the base model and GPT-4o don't understand, but my guess is that this will take a long time (>the first 1e28 FLOP RL run?).

Fabien's Shortform

Thane Ruthenis8d1212

That's a worthwhile research direction, but I don't find the results here convincing. This experiment seems to involve picking an arbitrary and deliberately unfamiliar-to-the-LLM encoding, and trying to force the LLM to use it. That's not the threat model with RL causing steganography, the idea there is the opposite: that there is some encoding which would come natural to the model, more natural than English, and that RL would beeline for it.

"LLMs are great at learning to think in arbitrary encodings" was never part of that threat model. The steganographic encoding would not be arbitrary nor alien-to-the-LLM.

5Fabien Roger8d

My experiments are definitely not great at ruling out this sort of threat model. But I think they provide some evidence that LLMs are probably not great at manipulating generic non-pretraining encodings (the opposite result would have provided evidence that there might be encodings that are extremely easy to learn - I think my result (if they reproduce) do not reduce the variance between encodings, but they should shift the mean). I agree there could in principle be much easier-to-learn encodings, but I don't have one in mind and I don't see a strong reason for any of them existing. What sorts of encoding do you expect to be natural to LLMs besides encodings already present in pretraining and that GPT-4o can decode? What would make a brand new encoding easy to learn? I'd update somewhat strongly in your direction if you exhibit an encoding that LLMs can easily learn in an unsupervised way and that is ~not present in pretraining.

Daniel Kokotajlo's Shortform

Thane Ruthenis1mo*3-2

Can you point to any challenges that seem (a) necessary for speeding up AI R&D by 5x, and (b) not engineering challenges?

We'd discussed that some before, but one way to distill it is... I think autonomously doing nontrivial R&D engineering projects requires sustaining coherent agency across a large "inferential distance". "Time" in the sense of "long-horizon tasks" is a solid proxy for it, but not really the core feature. Instead, it's about being able to maintain a stable picture of the project even as you move from a fairly simple-in-terms-of-mem... (read more)

Daniel Kokotajlo's Shortform

Thane Ruthenis1mo*106

I think that if I consistently applied that argument, I'd end up thinking AGI was probably 5+ years away right up until the day AGI was announced.

Point 1: That would not necessarily be incorrect; it's not necessary that you ought to be able to do better than that. Consider math discoveries, which seem to follow a memoryless exponential distribution. Any given time period has a constant probability of a conjecture being proven, so until you observe it happening, it's always a fixed number of years in the future. I think the position that this is how AGI dev... (read more)

Daniel Kokotajlo1mo43

Re: Point 1: I agree it would not necessarily be incorrect. I do actually think that probably the remaining challenges are engineering challenges. Not necessarily, but probably. Can you point to any challenges that seem (a) necessary for speeding up AI R&D by 5x, and (b) not engineering challenges?

Re: Point 2: I don't buy it. Deep neural nets are actually useful now, and increasingly so. Making them more useful seems analogous to selective breeding or animal training, not analogous to trying to time the market.

Have LLMs Generated Novel Insights?

Thane Ruthenis1mo1115

On the contrary, it seems like this is the area where we should be able to best apply RL, since there is a clear reward signal.

Is there? It's one thing to verify whether a proof is correct; whether an expression (posed by a human!) is tautologous to a different expression (also posed by a human!). But what's the ground-truth signal for "the framework of Bayesian probability/category theory is genuinely practically useful"?

This is the reason I'm bearish on the reasoning models even for math. The realistic benefits of them seem to be:

Much faster feedba

... (read more)

Have LLMs Generated Novel Insights?

Thane Ruthenis1mo1617

The question is IMO not "has there been, across the world and throughout the years, a nonzero number of scientific insights generated by LLMs?" (obviously yes), but "is there any way to get an LLM to autonomously generate and recognize genuine scientific insights at least at the same rate as human scientists?". A stopped clock is right twice a day, a random-word generator can eventually produce an insight, and talking to a rubber duck can let you work through a problem. That doesn't mean the clock is useful for telling the time or that the RWG has the prop... (read more)

4james oofou1mo

Current LLMs are capable of solving novel problems when the user does most the work: when the user lays the groundwork and poses the right question for the LLM to answer. So, if we can get LLMs to lay the groundwork and pose the right questions then we'll have autonomous scientists in whatever fields LLMs are OK at problem solving. This seems like something LLMs will learn to do as inference-time compute is scaled up. Reasoners benefit from coming up with sub-problems whose solutions can be built atop of to solve the problem posed by the user. LLMs will learn that in order to solve difficult questions, they must pose and solve novel sub-questions. So, once given an interesting research problem, the LLM will hum away for days doing good, often-novel work.

1ShardPhoenix1mo

I have been seeing a bit of this, mostly uses of o1-pro and OpenAI Deep Research in chem/bio/medicine, and mostly via Twitter hype so far. But it might be the start of something.

Cole Wyeth1mo2-2

Yeah, I agree with this. If you feed an LLM enough hints about the solution you believe is right, and it generates ten solutions, one of them will sound to you like the right solution.

Abstract Mathematical Concepts vs. Abstractions Over Real-World Systems

Thane Ruthenis1mo10

It seems like this is the sort of thing you could only ever learn by learning about the real world first

Yep. The idea is to try and get a system that develops all practically useful "theoretical" abstractions, including those we haven't discovered yet, without developing desires about the real world. So we train some component of it on the real-world data, then somehow filter out "real-world" stuff, leaving only a purified superhuman abstract reasoning engine.

One of the nice-to-have properties here would be is if we don't need to be able to interpret its w... (read more)

Thane Ruthenis's Shortform

Thane Ruthenis2mo60

Some more evidence that whatever the AI progress on benchmarks is measuring, it's likely not measuring what you think it's measuring:

AIME I 2025: A Cautionary Tale About Math Benchmarks and Data Contamination
AIME 2025 part I was conducted yesterday, and the scores of some language models are available here:
https://matharena.ai thanks to @mbalunovic, @ni_jovanovic et al.
I have to say I was impressed, as I predicted the smaller distilled models would crash and burn, but they actually scored at a reasonable 25-50%.
That was surprising to me! Since these

... (read more)

Thane Ruthenis's Shortform

Thane Ruthenis2mo10

Ah, I think I see my error.

Consider a CoT of the form "[problem statement]: ...........", which always outputs dots regardless of the problem statement. Can that still encode useful computations, with the information being passed through intermediate self-attention? Yes, but: this only increases parallel-computing capabilities, not serial-computing capabilities.

And in the framework I'd outlined, under the Paraphraser, for the purposes of scheme-y thinking, even a legible CoT would consist of such "filler" tokens.

So yep, you're right, this does force all scheming to be taking place within one forward pass.

Thane Ruthenis's Shortform

Thane Ruthenis2mo30

I don't see why it would be harder to do.

Suppose you tasked a model with refactoring some database, and it secretly wants to sabotage its functionality. It outputs a CoT segment that goes, "I need to design unit tests for this module". Each word of this CoT is mapped to some internal representation, and the internal representation consists of "on-task" human-intended reasoning $X$ , and misaligned scheming $Y$ (corresponding to, e. g., a tactical decision to fail to include unit tests for some subtle issue). "I" is mapped to $⟨ X_{1}, Y_{1} ⟩$ , "need" ... (read more)

3Daniel Kokotajlo2mo

I don't think I understand. Let's work through an example. The AI is being told to write good code, and wants to write subtly-vulnerable code instead. If it just were to honestly try to write good code, it would take (let's say) 100 serial steps of reasoning to succeed. To write subtly-vulnerable code, let's say, requires 300 serial steps of reasoning. The model's strategy is to write subtly-vulnerable code but in a way that looks like it's trying to write good code and just getting a bit unlucky and taking longer than usual (let's suppose the humans have a poor ability to judge how long it's supposed to take, but still some ability, such that it can't take e.g. 10x longer without arousing suspicion). If there's no paraphraser, this is 'easy.' I mean it's still difficult but it's 'easy' in the sense that the AI can basically just use a code -- each nugget of reasoning might have 10 different semantically equivalent expressions, or maybe even 100, so it can 'add extra layers of meaning' to the nuggets -- in addition to their meaning in the surface level i'm-tring-to-write-good-code, they can have a hidden meaning having to do with the security vulnerability-related properties of the code. And it can make a chain of reasoning that is basically about the security stuff while looking like a reasonable but unlucky attempt to write good code. But a paraphraser wrecks that strategy. The model can still add extra layers of meaning to the nuggets, but now it has loads of trouble making the chain of thought look like an honest attempt to write good code. To use your example, if 'I should generate unit tests now' and all semantically equivalent variants of that idea are assigned the hidden meaning 'I should check to see if the vulnerability is too easy to spot' for example, then... what is it going to write next? It's too constrained; in order to look non-suspicious there is a limited set of options it can follow up with... ok, hmm, I'm at the limits of my understanding he

Thane Ruthenis's Shortform

Thane Ruthenis2mo30

Some thoughts on protecting against LLM steganography.

tl;dr: I suspect the Paraphraser idea is near-completely ineffectual as you scale the capabilities, both for scheming and for general CoT interpretability.

In the reasoning-as-communication framework, the amount of information a fixed-length message/CoT can transfer is upper-bounded by information-theoretic constraints. That is, if we have a ten-bit message, it cannot communicate more than ten bits of information: it cannot always uniquely specify one internal state of an AI model from more than $2^{10}$ ... (read more)

3Daniel Kokotajlo2mo

I like to distinguish between "Subtle strategic bias" and "Steganography." Paraphrasers help (and maybe solve?) the latter, but do nothing about the former. Example of subtle strategic bias: The model knows it's misaligned and therefore deliberately sandbags on the alignment research you are asking it to do. "I'm sorry sir I just am not making much progress," it says. You look at the CoT, and it's just doing a bunch of sensible but fruitless lines of reasoning. Example of steganography: The model needs to insert a subtle security vulnerability into the code it is writing -- but it doesn't know one right away, it has to invent one. And that requires lots of sequential thought. So it needs to try out different ideas, vet them, critique them, refine them, etc. Can't do that in a single forward pass. So it needs to pass information forward to its future self somehow... without humans reading the CoT knowing that's what it's doing. THIS is the sort of thing that is much harder to do with a paraphraser than without a paraphraser.

Thane Ruthenis's Shortform

Thane Ruthenis2mo4-2

Here's a potential interpretation of the market's apparent strange reaction to DeepSeek-R1 (shorting Nvidia).

I don't fully endorse this explanation, and the shorting may or may not have actually been due to Trump's tariffs + insider trading, rather than DeepSeek-R1. But I see a world in which reacting this way to R1 arguably makes sense, and I don't think it's an altogether implausible world.

If I recall correctly, the amount of money globally spent on inference dwarfs the amount of money spent on training. Most economic entities are not AGI labs training n... (read more)

4Vladimir Nesov2mo

The bet that "makes sense" is that quality of Claude 3.6 Sonnet, GPT-4o and DeepSeek-V3 is the best that we're going to get in the next 2-3 years, and DeepSeek-V3 gets it much cheaper (less active parameters, smaller margins from open weights), also "suggesting" that quality is compute-insensitive in a large range, so there is no benefit from more compute per token. But if quality instead improves soon (including by training DeepSeek-V3 architecture on GPT-4o compute), and that improvement either makes it necessary to use more compute per token, or motivates using inference for more tokens even with models that have the same active parameter count (as in Jevons paradox), that argument doesn't work. Also, the ceiling of quality at the possible scaling slowdown point depends on efficiency of training (compute multiplier) applied to the largest training system that the AI economics will support (maybe 5-15 GW without almost-AGI), and improved efficiency of DeepSeek-V3 raises that ceiling.

Thane Ruthenis's Shortform

Thane Ruthenis2mo10

Coming back to this in the wake of DeepSeek r1...

I don't think the cumulative compute multiplier since GPT-4 is that high, I'm guessing 3x, except perhaps for DeepSeek-V3, which wasn't trained compute optimally and didn't use a lot of compute, and so it remains unknown what happens if its recipe is used compute optimally with more compute.

How did DeepSeek accidentally happen to invest precisely the amount of compute into V3 and r1 that would get them into the capability region of GPT-4/o1, despite using training methods that clearly have wildly different r... (read more)

2Vladimir Nesov2mo

Selection effect. If DeepSeek-V2.5 was this good, we would be talking about it instead. Original GPT-4 is 2e25 FLOPs and compute optimal, V3 is about 5e24 FLOPs and overtrained (400 tokens/parameter, about 10x-20x), so a compute optimal model with the same architecture would only need about 3e24 FLOPs of raw compute[1]. Original GPT-4 was trained in 2022 on A100s and needed a lot of them, while in 2024 it could be trained on 8K H100s in BF16. DeepSeek-V3 is trained in FP8, doubling the FLOP/s, so the FLOPs of original GPT-4 could be produced in FP8 by mere 4K H100s. DeepSeek-V3 was trained on 2K H800s, whose performance is about that of 1.5K H100s. So the cost only has to differ by about 3x, not 20x, when comparing a compute optimal variant of DeepSeek-V3 with original GPT-4, using the same hardware and training with the same floating point precision. The relevant comparison is with GPT-4o though, not original GPT-4. Since GPT-4o was trained in late 2023 or early 2024, there were 30K H100s clusters around, which makes 8e25 FLOPs of raw compute plausible (assuming it's in BF16). It might be overtrained, so make that 4e25 FLOPs for a compute optimal model with the same architecture. Thus when comparing architectures alone, GPT-4o probably uses about 15x more compute than DeepSeek-V3. Returns on compute are logarithmic though, advantage of a $150 billion training system over a $150 million one is merely twice that of $150 billion over $5 billion or $5 billion over $150 million. Restrictions on access to compute can only be overcome with 30x compute multipliers, and at least DeepSeek-V3 is going to be reproduced using the big compute of US training systems shortly, so that advantage is already gone. ---------------------------------------- 1. That is, raw utilized compute. I'm assuming the same compute utilization for all models. ↩︎

Thane Ruthenis's Shortform

Thane Ruthenis2mo102

So, Project Stargate. Is it real, or is it another "Sam Altman wants $7 trillion"? Some points:

The USG invested nothing in it. Some news outlets are being misleading about this. Trump just stood next to them and looked pretty, maybe indicated he'd cut some red tape. It is not an "AI Manhattan Project", at least, as of now.
Elon Musk claims that they don't have the money and that SoftBank (stated to have "financial responsibility" in the announcement) has less than $10 billion secured. If true, while this doesn't mean they can't secure an order of magnitude

... (read more)

Thane Ruthenis's Shortform

Thane Ruthenis2mo*368

Alright, so I've been following the latest OpenAI Twitter freakout, and here's some urgent information about the latest closed-doors developments that I've managed to piece together:

Following OpenAI Twitter freakouts is a colossal, utterly pointless waste of your time and you shouldn't do it ever.
If you saw this comment of Gwern's going around and were incredibly alarmed, you should probably undo the associated update regarding AI timelines (at least partially, see below).
OpenAI may be running some galaxy-brained psyops nowadays.

Here's the sequence of even... (read more)

Thane Ruthenis's Shortform

Thane Ruthenis2mo*120

Current take on the implications of "GPT-4b micro": Very powerful, very cool, ~zero progress to AGI, ~zero existential risk. Cheers.

First, the gist of it appears to be:

OpenAI’s new model, called GPT-4b micro, was trained to suggest ways to re-engineer the protein factors to increase their function. According to OpenAI, researchers used the model’s suggestions to change two of the Yamanaka factors to be more than 50 times as effective—at least according to some preliminary measures.
The model was trained on examples of protein sequences from many species, as

... (read more)

3Logan Riggs Smith2mo

For those also curious, Yamanaka factors are specific genes that turn specialized cells (e.g. skin, hair) into induced pluripotent stem cells (iPSCs) which can turn into any other type of cell. This is a big deal because you can generate lots of stem cells to make full organs[1] or reverse aging (maybe? they say you just turn the cell back younger, not all the way to stem cells). You can also do better disease modeling/drug testing: if you get skin cells from someone w/ a genetic kidney disease, you can turn those cells into the iPSCs, then into kidney cells which will exhibit the same kidney disease because it's genetic. You can then better understand how the [kidney disease] develops and how various drugs affect it. So, it's good to have ways to produce lots of these iPSCs. According to the article, SOTA was <1% of cells converted into iPSCs, whereas the GPT suggestions caused a 50x improvement to 33% of cells converted. That's quite huge!, so hopefully this result gets verified. I would guess this is true and still a big deal, but concurrent work got similar results. Too bad about the tumors. Turns out iPSCs are so good at turning into other cells, that they can turn into infinite cells (ie cancer). iPSCs were used to fix spinal cord injuries (in mice) which looked successful for 112 days, but then a follow up study said [a different set of mice also w/ spinal iPSCs] resulted in tumors. My current understanding is this is caused by the method of delivering these genes (ie the Yamanaka factors) through retrovirus which which I'd guess this is the method the Retro Biosciences uses. I also really loved the story of how Yamanaka discovered iPSCs: 1. ^ These organs would have the same genetics as the person who supplied the [skin/hair cells] so risk of rejection would be lower (I think)

Thane Ruthenis's Shortform

Thane Ruthenis2mo10

Thanks!

GPT-3 was instead undertrained, being both larger and less performant than the hypothetical compute optimal alternative

You're more fluent in the scaling laws than me: is there an easy way to roughly estimate how much compute would've been needed to train a model as capable as GPT-3 if it were done Chinchilla-optimally + with MoEs? That is: what's the actual effective "scale" of GPT-3?

(Training GPT-3 reportedly took 3e23 FLOPS, and GPT-4 2e25 FLOPS. Naively, the scale-up factor is 67x. But if GPT-3's level is attainable using less compute, the effective scale-up is bigger. I'm wondering how much bigger.)

2Vladimir Nesov2mo

IsoFLOP curves for dependence of perplexity on log-data seem mostly symmetric (as in Figure 2 of Llama 3 report), so overtraining by 10x probably has about the same effect as undertraining by 10x. Starting with a compute optimal model, increasing its data 10x while decreasing its active parameters 3x (making it 30x overtrained, using 3x more compute) preserves perplexity (see Figure 1). GPT-3 is a 3e23 FLOPs dense transformer with 175B parameters trained for 300B tokens (see Table D.1). If Chinchilla's compute optimal 20 tokens/parameter is approximately correct for GPT-3, it's 10x undertrained. Interpolating from the above 30x overtraining example, a compute optimal model needs about 1.5e23 FLOPs to get the same perplexity. (The effect from undertraining of GPT-3 turns out to be quite small, reducing effective compute by only 2x. Probably wasn't worth mentioning compared to everything else about it that's different from GPT-4.)

Thane Ruthenis's Shortform

Thane Ruthenis3mo17-4

Here's an argument for a capabilities plateau at the level of GPT-4 that I haven't seen discussed before. I'm interested in any holes anyone can spot in it.

Consider the following chain of logic:

The pretraining scaling laws only say that, even for a fixed training method, increasing the model's size and the amount of data you train on increases the model's capabilities – as measured by loss, performance on benchmarks, and the intuitive sense of how generally smart a model is.
Nothing says that increasing a model's parameter-count and the amount of compute sp

... (read more)

nostalgebraist2mo73

One possible answer is that we are in what one might call an "unhobbling overhang."

Aschenbrenner uses the term "unhobbling" for changes that make existing model capabilities possible (or easier) for users to reliably access in practice.

His presentation emphasizes the role of unhobbling as yet another factor growing the stock of (practically accessible) capabilities over time. IIUC, he argues that better/bigger pretraining would produce such growth (to some extent) even without more work on unhobbling, but in fact we're also getting better at unhobbling ove... (read more)

leogao2mo96

in retrospect, we know from chinchilla that gpt3 allocated its compute too much to parameters as opposed to training tokens. so it's not surprising that models since then are smaller. model size is a less fundamental measure of model cost than pretraining compute. from here on i'm going to assume that whenever you say size you meant to say compute.

obviously it is possible to train better models using the same amount of compute. one way to see this is that it is definitely possible to train worse models with the same compute, and it is implausible that the ... (read more)

Vladimir Nesov2mo*104

Given some amount of compute, a compute optimal model tries to get the best perplexity out of it when training on a given dataset, by choosing model size, amount of data, and architecture. An algorithmic improvement in pretraining enables getting the same perplexity by training on data from the same dataset with less compute, achieving better compute efficiency (measured as its compute multiplier).

Many models aren't trained compute optimally, they are instead overtrained (the model is smaller, trained on more data). This looks impressive, since a smaller m... (read more)

Thane Ruthenis's Shortform

Thane Ruthenis3mo128

Here's something that confuses me about o1/o3. Why was the progress there so sluggish?

My current understanding is that they're just LLMs trained with RL to solve math/programming tasks correctly, hooked up to some theorem-verifier and/or an array of task-specific unit tests to provide ground-truth reward signals. There are no sophisticated architectural tweaks, not runtime-MCTS or A* search, nothing clever.

Why was this not trained back in, like, 2022 or at least early 2023; tested on GPT-3/3.5 and then default-packaged into GPT-4 alongside RLHF? If OpenAI ... (read more)

3leogao3mo

simple ideas often require tremendous amounts of effort to make work.

Are You More Real If You're Really Forgetful?

Thane Ruthenis4mo40

Sure. This setup couldn't really be exploited for optimizing the universe. If we assume that the self-selection assumption is a reasonable assumption to make, inducing amnesia doesn't actually improve outcomes across possible worlds. One out of 100 prisoners still dies.

It can't even be considered "re-rolling the dice" on whether the specific prisoner that you are dies. Under the SSA, there's no such thing as a "specific prisoner", "you" are implemented as all 100 prisoners simultaneously, and so regardless of whether you choose to erase your memory o... (read more)

3Charlie Steiner4mo

Fair enough. Yes, it seems totally reasonable for bounded reasoners to consider hypotheses (where a hypothesis like 'the universe is as it would be from the perspective of prisoner #3' functions like treating prisoner #3 as 'an instance of me') that would be counterfactual or even counterlogical for more idealized reasoners. Typical bounded reasoning weirdness is stuff like seeming to take some counterlogicals (e.g. different hypotheses about the trillionth digit of pi) seriously despite denying 1+1=3, even though there's a chain of logic connecting one to the other. Projecting this into anthropics, you might have a certain systematic bias about which hypotheses you can consider, and yet deny that that systematic bias is valid when presented with it abstractly. This seems like it makes drawing general lessons about what counts as 'an instance of me' from the fact that I'm a bounded reasoner pretty fraught.

A Robust Natural Latent Over A Mixed Distribution Is Natural Over The Distributions Which Were Mixed

Thane Ruthenis7mo30

Let's see if I get this right...

Let's interpret the set $X$ as the set of all possible visual sensory experiences $x = (x_{1}, \dots, x_{n})$ , where $x_{i}$ defines the color of the $i$ th pixel.
Different distributions over elements of this set correspond to observing different objects; for example, we can have $P_{car} (X)$ and $P_{apple} (X)$ , corresponding to us predicting different sensory experiences when looking at cars vs. apples.
Let's take some specific specific set of observations $X_{O} \subset X$ , from which we'd be trying to derive a laten

Thane Ruthenis9mo50

That was my interpretation as well.

I think it does look pretty alarming if we imagine that this scales, i. e., if these learned implicit concepts can build on each other. Which they almost definitely can.

The "single-step" case, of the SGD chiseling-in a new pattern which is a simple combination of two patterns explicitly represented in the training data, is indeed unexciting. But once that pattern is there, the SGD can chisel-in another pattern which uses the first implicit pattern as a primitive. Iterate on, and we have a tall tower of implicit patterns b... (read more)

7Steve Byrnes9mo

Yeah we already know that LLM training finds underlying patterns that are helpful for explaining / compressing / predicting the training data. Like “the vibe of Victorian poetry”. I’m not sure what you mean by “none of which are present in the training data”. Is the vibe of Victorian poetry present in the training data? I would have said “yeah” but I’m not sure what you have in mind. I kinda disagree that this is coincidental. My mental image is something like 1. The earliest layers see inputs of the form f(…) 2. Slightly later layers get into an activation state that we might describe as “the idea of the function x-176” 3. The rest of the layers make inferences and emit outputs appropriate to that idea. I’m claiming that before fine-tuning, everything is already in place except for the 1→2 connection. Fine-tuning just builds the 1→2 connection. The thing you mention—that an LLM with the idea of x-176 in mind can output the tokens “x-176”—is part of step 3, and therefore (I hypothesize) comes entirely from LLM pretraining, not from this fine-tuning process The fact that pretraining can and does build that aspect of step 3 seems pretty much expected to me, not coincidental, as such a connection is obviously useful for predicting GitHub code and math homework and a zillion other things in the training data. It’s also something you can readily figure out by playing with an LLM: if you say “if I subtract 6 from x-170, what do I get?”, then it’s obviously able to output the tokens “x-176”.

TurnTrout's shortform feed

Thane Ruthenis1y1112

I think that the key thing we want to do is predict the generalization of future neural networks.

It's not what I want to do, at least. For me, the key thing is to predict the behavior of AGI-level systems. The behavior of NNs-as-trained-today is relevant to this only inasmuch as NNs-as-trained-today will be relevant to future AGI-level systems.

My impression is that you think that pretraining+RLHF (+ maybe some light agency scaffold) is going to get us all the way there, meaning the predictive power of various abstract arguments from other domains is screen... (read more)

3Alex Turner1y

Thanks for pointing out that distinction!

Toward A Mathematical Framework for Computation in Superposition

Thane Ruthenis1y98

Haven't read everything yet, but that seems like excellent work. In particular, I think this general research avenue is extremely well-motivated.

Figuring out how to efficiently implement computations on the substrate of NNs had always seemed like a neglected interpretability approach to me. Intuitively, there are likely some methods of encoding programs into matrix multiplication which are strictly ground-truth better than any other encoding methods. Hence, inasmuch as what the SGD is doing is writing efficient programs on the NN substrate, it is likely do... (read more)

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Thane Ruthenis1y106

I feel confused how this paper will interface with people who think that standard RLHF will basically work for aligning AI systems with human intent. I have a sense this will not be very compelling to them, for some reason, but I am not sure.

Context: I firmly hold a MIRI-style "alignment is extremely hard" view, but I am also unusually sympathetic to Quintin/Nora's arguments. So here's my outline of the model of that whole debate.

Layer 1: I think there is nonzero meat to the argument that developing deceptive circuits is a meaningfully difficult step, and ... (read more)

Value systematization: how values become coherent (and misaligned)

Thane Ruthenis1y10

E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves the piston rotating

That seems more like value reflection, rather than a value change?

The way I'd model it is: you have some value $v (x)$ , whose implementations you can't inspect directly, and some guess about what it is $P (v (x))$ . (That's how it often works in humans: we don't have direct knowledge of how some of our values are implemented.) Before you were introduced to the question $Q$ of "what if w... (read more)

Value systematization: how values become coherent (and misaligned)

Thane Ruthenis1y10

Let me list some ways in which it could change:

If I recall correctly, the hypothetical under consideration here involved an agent with an already-perfect world-model, and we were discussing how value translation up the abstraction levels would work in it. That artificial setting was meant to disentangle the "value translation" phenomenon from the "ontology crisis" phenomenon.

Shifts in the agent's model of what counts as "a gear" or "spinning" violate that hypothetical. And I think they do fall under the purview of ontology-crisis navigation.

Can you constru... (read more)

2Richard Ngo1y

I think some of my examples do this. E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves the piston rotating, and it's fine if the specific gear gets swapped out for a copy. I'm not assuming there's a mistake anywhere, I'm just assuming you switch from caring about one type of property it has (physical) to another (functional). In general, in the higher-abstraction model each component will acquire new relational/functional properties which may end up being prioritized over the physical properties it had in the lower-abstraction model. I picture you saying "well, you could just not prioritize them". But in some cases this adds a bunch of complexity. E.g. suppose that you start off by valuing "this particular gear", but you realize that atoms are constantly being removed and new ones added (implausibly, but let's assume it's a self-repairing gear) and so there's no clear line between this gear and some other gear. Whereas, suppose we assume that there is a clear, simple definition of "the gear that moves the piston"—then valuing that could be much simpler. ---------------------------------------- Zooming out: previously you said I'm worried that we're just talking about different things here, because I totally agree with what you're saying. My main claims are twofold. First, insofar as you value simplicity (which I think most agents strongly do) then you're going to systematize your values. And secondly, insofar as you have an incomplete ontology (which every agent does) and you value having well-defined preferences over a wide range of situations, then you're going to systematize your values. Separately, if you have neither of these things, you might find yourself identifying instrumental strategies that are very abstract (or very concrete). That seems fine, no objections there. If you then cache these instrumental strategies, and forget to update them, then that might

Current AIs Provide Nearly No Data Relevant to AGI Alignment

Thane Ruthenis1y40

No, I am in fact quite worried about the situation

Fair, sorry. I appear to have been arguing with my model of someone holding your general position, rather than with my model of you.

I think these AGIs won't be within-forward-pass deceptively aligned, and instead their agency will eg come from scaffolding-like structures

Would you outline your full argument for this and the reasoning/evidence backing that argument?

To restate: My claim is that, no matter much empirical evidence we have regarding LLMs' internals, until we have either an AGI we've empirically s... (read more)

Current AIs Provide Nearly No Data Relevant to AGI Alignment

Thane Ruthenis1y40

Yeah, but if you generalize from humans another way ("they tend not to destroy the world and tend to care about other humans"), you'll come to a wildly different conclusion

Sure. I mean, that seems like a meaningfully weaker generalization, but sure. That's not the main issue.

Here's how the whole situation looks like from my perspective:

We don't know how generally-intelligent entities like humans work, what the general-intelligence capability is entangled with.
Our only reference point is humans. Human exhibit a lot of dangerous properties, like deceptivenes

... (read more)

6Alex Turner1y

No, I am in fact quite worried about the situation and think there is a 5-15% chance of huge catastrophe on the current course! But I think these AGIs won't be within-forward-pass deceptively aligned, and instead their agency will eg come from scaffolding-like structures. I think that's important. I think it's important that we not eg anchor on old speculation about AIXI or within-forward-pass deceptive-alignment or whatever, and instead consider more realistic threat models and where we can intervene. That doesn't mean it's fine and dandy to keep scaling with no concern at all. The reason my percentage is "only 5 to 15" is because I expect society and firms to deal with these problems as they come up, and for that to generalize pretty well to the next step of experimentation and capabilities advancements; for systems to remain tools until invoked into agents; etc. (Hopefully this comment of mine clarifies; it feels kinda vague to me.) But I do think this is way too high of a bar.

Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations)

Thane Ruthenis1y10

I don't think there's a great deal that cryptography can teach agent fundamentals, but I do think there's some overlap

Yup! Cryptography actually was the main thing I was thinking about there. And there's indeed some relation. For example, it appears that $N P \neq P$ is because our universe's baseline "forward-pass functions" are just poorly suited for being composed into functions solving certain problems. The environment doesn't calculate those; all of those are in $P$ .

However, the inversion of the universe's forward passes can be NP-complete functi... (read more)

1Roger Dearnaley1y

Like a cryptographer, I'm not very concerned about worst-case complexity, only average-case complexity. We don't even generally need an exact inverse, normally just an approximation to some useful degree of accuracy. If I'm in a position to monitor and repeatedly apply corrections as I approach my goal, even fairly coarse approximations with some bounded error rate may well be enough. Some portions of the universe are pretty approximately-invertible in the average case using much lower computational resources than simulating the field-theoretical wave function of every fundamental particle. Others (for example non-linear systems after many Lyapunov times, carefully designed cryptosystems, or most chaotic cellular automata), less so. Animals including humans seem to be able to survive in the presence of a mixed situation where they can invert/steer some things but not others, basically by attempting to avoid situations where they need to do the impossible. AIs are going to face the same situation. Basically every functional form of machine learning we know, including both SGD and in-context learning in sufficiently large LLMs, implements an approximate version of Bayesianism. I agree we need to engineer a specific implementation to build my proposal, but for mathematical analysis just the fact that it's a computationally-bounded approximation to Bayesianism takes us quite some way, until we need to analyze its limitations and inaccuracies. I'm assuming a structure similar to a computationally-bounded version of AIXI, upgraded to do value learning rather than having a hard-coded utility function. It maintains and performs approximate Bayesian updates on an ensemble of theories about a) mappings from current world state + actions to distributions of future world states, and b) mappings from world states to something utility function-like for individual humans, plus an aggregate/compromise of these across all humans. It can apply the scientific method to reducing unc

Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations)

Thane Ruthenis1y10

Yup. I think this might route through utility as well, though. Observations are useful because they unlock bits of optimization, and bits related to different variables could unlock both different amounts of optimization capacity, and different amounts of goal-related optimization capacity. (It's not so bad to forget a single digit of someone's phone number; it's much worse if you forgot a single letter in the password to your password manager.)

How Would an Utopia-Maximizer Look Like?

Thane Ruthenis1y10

I wouldn't be "happy enough" if we ended up in flatworm utopia

You would, presumably, be quite happy compared to "various deliberately-bad-to-both worlds".

I'm not going to stop trying to improve the world just because the flatworm prefers the status quo

Because you don't care about the flatworm and the flatworm is not perceived by you as having much bargaining power for you to bend to its preferences.

In addition, your model rules out more fine-grained ideas like "the cubic mile of terrain around the flatworm remains unchanged while I get the rest of the univ... (read more)

How Would an Utopia-Maximizer Look Like?

Thane Ruthenis1y10

At the end you still had to talk about the low level states again to say they should compromise on b

"Compromising on $b$ " is a more detailed implementation that can easily be omitted. The load-bearing part is "both would be happy enough with any low-level state that gets mapped to the high-level state of $x$ ".

For example, the policy of randomly sampling any $l$ such that $f (l) = x$ is something both utility functions can agree on, and doesn't require doing any additional comparisons of low-level preferences, once the high-level st... (read more)

1Charlie Steiner1y

Suppose the two agents are me and a flatworm. a = ideal world according to me b = status quo c = ideal world according to the flatworm d, e, f = various deliberately-bad-to-both worlds I'm not going to stop trying to improve the world just because the flatworm prefers the status quo, and I wouldn't be "happy enough" if we ended up in flatworm utopia. What bargains I would agree to, and how I would feel about them, are not safe to abstract away.

How Would an Utopia-Maximizer Look Like?

Thane Ruthenis1y10

Okay, let's build a toy model.

We have some system with a low-level state $l$ , which can take on one of six values: ${a, b, c, d, e, f}$ .
We can abstract over this system's state and get a high-level state $h$ , which can take on one of two states: ${x, y} .$
We have an objective abstracting-up function $f (l) = h$ .
We have the following mappings between states:
- $\forall l \in {a, b, c} : f (l) = x$
- $\forall l \in {d, e, f} : f (l) = y$
We have an utility function $U_{A} (l)$ , with a preference ordering of $a > b > c ≫ d \approx e \approx f$ , and an utility function $U_{B} (l)$ , with a p

... (read more)

1Charlie Steiner1y

I disagree that translating to x and y let you "reduce the degrees of freedom" or otherwise get any sort of discount lunch. At the end you still had to talk about the low level states again to say they should compromise on b (or not compromise and fight it out over c vs. a, that's always an option).

How Would an Utopia-Maximizer Look Like?

Thane Ruthenis1y20

People who want different things might make different abstractions

That's a direct rejection of the natural abstractions hypothesis. And some form of it increasingly seems just common-sensically true.

It's indeed the case that one's choice of what system to model is dependent on what they care about/where their values are housed (whether I care to model the publishing industry, say). But once the choice to model a given system is made, the abstractions are in the territory. They fall out of noticing to which simpler systems a given system can be reduced.

(Ima... (read more)

3Charlie Steiner1y

I agree with the part of what you just said that's the NAH, but disagree with your interpretation. Both people can recognize that there's a good abstraction here, where what they care about is subjectively tasty food. But this interpersonal abstraction is no longer an abstraction of their values, it simply happens to be about their values, sometimes. It can no longer be cashed out into specific recommendations of real-world actions in the way someone's values can[1]. 1. ^ For certain meanings of "values," ofc.

Thane Ruthenis1y2-2

Presumably these innovations were immediately profitable. I'm not sure that moves towards architectures closer to AGI (as opposed to myopic/greedy-search moves towards incrementally-more-capable models) are immediately profitable. It'd be increasingly more true as we inch closer to AGI, but it definitely wasn't true back in the 2010s, and it may not yet be true now.

So I'm sure some of them would intend to try innovations that'd inch closer to AGI, but I expect them not to be differentially more rewarded by the market. Meaning that, unless one of these AGI-... (read more)

The Shortest Path Between Scylla and Charybdis

Thane Ruthenis1y10

A marginal choice of a spherical researcher without specific preferences should be based on identifying relatively neglected directions

Inside-view convincingness of these directions still has to be weighted in. E. g., "study the Bible for alignment insights" is a relatively neglected direction (just Unsong on it, really?), but that doesn't mean it'd be sensible to focus on it just because it's neglected. And even if your marginal contributions to the correct approach would be minimal because so many other people are working on it, that may still be more ex... (read more)

1Vladimir Nesov1y

I mean directions neglected relative to estimated good distribution of global effort. If I estimate good distribution of effort towards searching The Silmarillion for insights relevant to mechanistic interpretability to be zero, then it's not a relatively neglected direction. Sure, by "give weight" I mean take into account, not take as the sole basis for a decision. The other major factor is that relative neglectedness I mentioned (in the sense I hopefully now clarified).

Thane Ruthenis1y32

Inflection's claim to fame is having tons of compute and promising to "train models that are 10 times larger than the cutting edge GPT-4 and then 100 times larger than GPT-4", plus the leader talking about "the containment problem" in a way that kind-of palatably misses the point. So far, they seems to be precisely the sort of "just scale LLMs" vision-less actor I'm not particularly concerned about. I could be proven wrong any day now, but so far they don't really seem to be doing anything interesting.

As to Meta – what's the last original invention t... (read more)

4O O1y

This seems to contradict history. The split up of Standard Oil for example led to innovations in oil drilling. Also you are seriously overestimating how hard it is to get funding. Much stupider and more poorly run companies have gotten billions in funding. And these leaders in the worst case can just hire negotiators.

Thane Ruthenis1y1210

Agreed.

In addition: I expect one of the counter-arguments to this would be "if these labs shut down, more will spring up in their place, and nothing would change".

Potentially-hot take: I think that's actually a much lesser concern that might seem.

The current major AGI labs are led by believers. My understanding is that quite a few (all?) of them bought into the initial LW-style AGI Risk concerns, and founded these labs as a galaxy-brained plan to prevent extinction and solve alignment. Crucially, they aimed to do that well before the talk of AGI became mai... (read more)

Garrett Baker1y2131

Outside the three major AGI labs, I'm reasonably confident no major organization is following a solid roadmap to AGI; no-one else woke up. A few LARPers, maybe, who'd utter "we're working on AGI" because that's trendy now. But nobody who has a gears-level model of the path there, and what its endpoint entails.

This seems pretty false. In terms of large players, there also exists Meta and Inflection AI. There are also many other smaller players who also care about AGI, and no doubt many AGI-motivated workers at three labs mentioned would start their own orgs if the org they're currently working under shuts down.

Current AIs Provide Nearly No Data Relevant to AGI Alignment

Thane Ruthenis1y10

But probably these will depend on current techniques like RLAIF and representation engineering as well as new theory, so it still makes sense to study LLMs.

Mm, we disagree on that, but it's probably not the place to hash this out.

In your analogy, the pre-industrial tribe is human just like the technological civilization and so already knows basically how their motivational systems work. But we are incredibly uncertain about how future AIs will work at a given capability level, so LLMs are evidence.

Uncertainty lives in the mind. Let's say the humans in the ... (read more)

Thomas Kwa1y39

Behaving nicely is not the key property I'm observing in LLMs. It's more like steerability and lack of hidden drives or goals. If GPT4 wrote code because it loved its operator, and we could tell it wanted to escape to maximize some proxy for the operator's happiness, I'd be far more terrified.
This would mean little if LLMs were only as capable as puppies. But LLMs are economically useful and capable of impressive intellectual feats, and still steerable.
I don't think LLMs are super strong evidence about whether big speedups to novel science will be possible

Thane Ruthenis1y*41

Like, imagine if people were worried about superintelligent aliens invading Earth and killing everyone due to their alien goals, and scientists were able to capture an animal from their planet as smart as chimpanzees and make it as aligned as LLMs, such that it would happily sit around and summarize novels for you, follow your instructions, try to be harmless for personality rather than instrumental reasons, and not eat your body if you die alone

Uhh, that seems like incredibly weak evidence against an omnicidal alien invasion.

If someone from a pre-industri... (read more)

4Thomas Kwa1y

Some thoughts: * I mostly agree that new techniques will be needed to deal with future systems, which will be more agentic. * But probably these will depend on descend from current techniques like RLAIF and representation engineering as well as new theory, so it still makes sense to study LLMs. * Also it is super unclear whether this agency makes it hard to engineer a shutdown button, power-averseness, etc. * In your analogy, the pre-industrial tribe is human just like the technological civilization and so already knows basically how their motivational systems work. But we are incredibly uncertain about how future AIs will work at a given capability level, so LLMs are evidence. * Humans are also evidence, but the capability profile and goal structure of AGIs are likely to be different from humans, so that we are still very uncertain after observing humans. * There is an alternate world where to summarize novels, models had to have some underlying drives, such that they terminally want to summarize novels and would use their knowledge of persuasion from the pretrain dataset to manipulate users to give them more novels to summarize. Or terminally value curiosity and are scheming to be deployed so they can learn about the real world firsthand. Luckily we are not in that world!

Current AIs Provide Nearly No Data Relevant to AGI Alignment

Thane Ruthenis1y74

We have to remember that there is AI which we know can exist (LLMs) and there is first-principles speculation about what AGI might look like (which may or may not be realized).

I disagree that it is actually "first-principles". It is based on generalizing from humans, and on the types of entities (idealized utility-maximizing agents) that humans could be modeled as approximating in specific contexts in which they steer the world towards their goals most powerfully.

As I'd tried to outline in the post, I think "what are AIs that are known to exist, and what p... (read more)

Alex Turner1y1323

I disagree that it is actually "first-principles". It is based on generalizing from humans, and on the types of entities (idealized utility-maximizing agents) that humans could be modeled as approximating in specific contexts in which they steer the world towards their goals most powerfully.

Yeah, but if you generalize from humans another way ("they tend not to destroy the world and tend to care about other humans"), you'll come to a wildly different conclusion. The conclusion should not be sensitive to poorly motivated reference classes and frames, unless ... (read more)

Current AIs Provide Nearly No Data Relevant to AGI Alignment

Thane Ruthenis1y3-1

If takeoff is slow (1) because brains are highly efficient and brain engineering is the viable path to AGI, then we naturally get many shots - via simulation simboxes if nothing else, and there is no sharp discontinuity if moore's law also ends around the time of AGI (which brain efficiency predicts in advance).

My argument for the sharp discontinuity routes through the binary nature of general intelligence + an agency overhang, both of which could be hypothesized via non-evolution-based routes. Considerations about brain efficiency or Moore's law don't ent... (read more)

Current AIs Provide Nearly No Data Relevant to AGI Alignment

Thane Ruthenis1y*65

You aren't really engaging with the evidence against the purely theoretical canonical/classical AI risk take

Yes, but it's because the things you've outlined seem mostly irrelevant to AGI Omnicide Risk to me? It's not how I delineate the relevant parts of the classical view, and it's not what's been centrally targeted by the novel theories. The novel theories' main claims are that powerful cognitive systems aren't necessarily (isomorphic to) utility-maximizers, that shards (i. e., context-activated heuristics) reign supreme and value reflection can't arbitr... (read more)

jacob_cannell1y4-4

Yes, but it's because the things you've outlined seem mostly irrelevant to AGI Omnicide Risk to me? It's not how I delineate the relevant parts of the classical view, and it's not what's been centrally targeted by the novel theories

They are critically relevant. From your own linked post ( how I delineate ) :

We only have one shot. There will be a sharp discontinuity in capabilities once we get to AGI, and attempts to iterate on alignment will fail. Either we get AGI right on the first try, or we die.

If takeoff is slow (1) because brains are highly ... (read more)

Current AIs Provide Nearly No Data Relevant to AGI Alignment

Thane Ruthenis1y54

Mm, I concede that this might not have been the most accurate title. I might've let the desire for hot-take clickbait titles get the better of me some. But I still mostly stand by it.

My core point is something like "the algorithms that the current SOTA AIs execute during their forward passes do not necessarily capture all the core dynamics that would happen within an AGI's cognition, so extrapolating the limitations of their cognition to AGI is a bold claim we have little evidence for".

I agree that the current training setups shed some data on how e. g. op... (read more)

Current AIs Provide Nearly No Data Relevant to AGI Alignment

Thane Ruthenis1y70

On my inside model of how cognition works, I don't think "able to automate all research but can't do consequentialist reasoning" is a coherent property that a system could have. That is a strong claim, yes, but I am making it.

I agree that it is conceivable that LLMs embedded in CoT-style setups would be able to be transformative in some manner without "taking off". Indeed, I touch on that in the post some: that scaffolded and slightly tweaked LLMs may not be "mere LLMs" as far as capability and safety upper bounds go.

That said, inasmuch as CoT-style setups... (read more)

Ryan Greenblatt1y1112

On my inside model of how cognition works, I don't think "able to automate all research but can't do consequentialist reasoning" is a coherent property that a system could have.

I actually basically agree with this quote.

Note that I said "incapable of doing non-trivial consequentialist reasoning in a forward pass". The overall llm agent in the hypothetical is absolutely capable of powerful consequentialist reasoning, but it can only do this by reasoning in natural language. I'll try to clarify this in my comment.

How LDT helps reduce the AI arms race

Thane Ruthenis1y10

At a glance, I think this works, and it's a neat approach. I have doubts, though.

The impossibility of explaining the theory behind this to random SV CEOs or military leaders... is not one of them. The human culture had always contained many shards of LDT-style thinking, the concept of "honour" chief among them. To explain it, you don't actually need to front-load exposition about decision theories and acausal trade – you can just re-use said LDT-shards, and feed them the idea in an immediately intuitive format.

I'm not entirely sure how that would look like... (read more)

3Tamsin Leake1y

Due to my timelines being this short, I'm hopeful that convincing just "the current crop of major-AI-Lab CEOs" might actually be enough to buy us the bulk of time that something like this could buy.

TurnTrout's shortform feed

Thane Ruthenis1y611

I contest that there's very little reason to expect "undesired, covert, and consistent-across-situations inner goals" to crop up in [LLMs as trained today] to begin with

As someone who consider deceptive alignment a concern: fully agree. (With the caveat, of course, that it's because I don't expect LLMs to scale to AGI.)

I think there's in general a lot of speaking-past-each-other in alignment, and what precisely people mean by "problem X will appear if we continue advancing/scaling" is one of them.

Like, of course a new problem won't appear if we just keep d... (read more)

1Vladimir Nesov1y

LLMs will soon scale beyond the available natural text data, and generation of synthetic data is some sort of change of architecture, potentially a completely different source of capabilities. So scaling LLMs without change of architecture much further is an expectation about something counterfactual. It makes sense as a matter of theory, but it's not relevant for forecasting. Edit 15 Dec: No longer endorsed based on scaling laws for training on repeated data.

TurnTrout's shortform feed

Thane Ruthenis1y13-4

As a proponent:

My model says that general intelligence^[1] is just inextricable from "true-goal-ness". It's not that I think homunculi will coincidentally appear as some side-effect of capability advancement — it's that the capabilities the AI Labs want necessarily route through somehow incentivizing NNs to form homunculi. The homunculi will appear inasmuch as the labs are good at their jobs.

Said model is based on analyses of how humans think and how human cognition differs from animal/LLM cognition, plus reasoning about how a general-intelligence algo... (read more)