All of Vladimir_Nesov's Comments + Replies

It's an essential aspect of decision making for an agent to figure out where it might be. Thought experiments try to declare what the current situation is, but nothing guarantees that they can do so convincingly. Algorithmic induction, such as updating from the Solomonoff prior, is the basic way an agent figures out which situations it should care about, and declaring that we are working with a particular thought experiment doesn't affect the prior. In line with updatelessness, an agent should be ready for observations in general (weighted by how much it cares about each of them), rather than for particular "fair" observations, so distinguishing observations that describe "fair" thought experiments doesn't seem right either.

Coalitional agency seems like an unnecessary constraint on the design of a composite agent, since an individual agent could just (choose to) listen to other agents and behave the way their coalition would endorse, thereby effectively becoming a composite agent without being composite "by construction". The step where an agent chooses which other (hypothetical) agents to listen to makes constraints on the nature of agents unnecessary, because the choice to listen to some agents and not others can impose any constraints that particular agent cares about, and so... (read more)

Discussions of how to aggregate values and probabilities feel disjoint. The Jeffrey-Bolker formulation of expected utility presents the preference data as two probability distributions over the same sample space, so that the expected utility of an event is reconstructed as the ratio of the event's measures under the two priors. (The measure that goes into the numerator is "shouldness", and the other one remains "probability".)
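As a toy sketch of that reconstruction (the worlds and numbers are hypothetical):

```python
# Sketch: expected utility of an event as the ratio of its "shouldness"
# measure to its "probability" measure, Jeffrey-Bolker style.
prob = {"rain": 0.3, "sun": 0.7}                      # probability measure
shouldness = {"rain": 0.3 * 0.2, "sun": 0.7 * 1.5}    # p(w) * utility(w)

def expected_utility(event):
    return sum(shouldness[w] for w in event) / sum(prob[w] for w in event)

print(expected_utility({"sun"}))          # 1.5, the sunny world's utility
print(expected_utility({"rain", "sun"}))  # overall expected utility ≈ 1.11
```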

This gestures at a way of reducing the problem of aggregating values to the problem of aggregating probabilities. In particular, markets ... (read more)

Cyberattacks can't disable anything with any reliability, or for more than days to weeks, though, and there are dozens of major datacenter campuses from multiple somewhat independent vendors. Hypothetical AI-developed attacks might change that, but then there will also be AI-developed information security, adapting to any known kinds of attacks and stopping them from being effective shortly after. So the MAD analogy seems tenuous; the effect size (of this particular kind of intervention) is much smaller, to the extent that it seems misleading to even mention cyberattacks in this role/context.

Oversight, auditing, and accountability are jobs. Agriculture shows that 95% of jobs going away is not the problem. But AI might be better at the new jobs as well, without any window of opportunity where humans do them first and AI needs to catch up. Instead it's AI that starts doing all the new things well first, and humans get no opportunity to become competitive at anything, old or new, ever again.

Even the formulation of aligned high-level tasks and intent alignment of AIs make sense as jobs that could be done well by misaligned AIs for instrumental reasons. That is not even deceptive alignment, but it still plausibly segues into gradual disempowerment or a sharp left turn.

My point is that a bit of scaling (like 3x) doesn't matter, even though at the scale of GPT-4.5 or Grok 3 it requires building a $5bn training system, but a lot of scaling (like 2000x up from the original GPT-4) is still the most important thing affecting capabilities that will predictably happen soon. And it's going to arrive a little bit at a time, so won't be obviously impactful at any particular step, doing nothing to dispel the rumors that scaling is no longer important. It's a rising sea kind of thing (if you have the compute).

Long reasoning traces we... (read more)

my intuitions have been shaped by events like the pretraining slowdown

I don't see it. GPT-4.5 is much better than the original GPT-4, probably at 15x more compute. But it's not 100x more compute. And GPT-4o is an intermediate point, so the change from GPT-4o to GPT-4.5 is even smaller, maybe 4x.

I think a 3x change in compute has an effect at the level of noise from different reasonable choices in constructing a model, and 100K H100s is only 5x more than the 20K H100s of 2023. It's not a slowdown relative to what it should've been. And there are models with 200x more raw compute than went into GPT-4.5 that are probably coming in 2027-2029, much more than the 4x-15x observed since 2022-2023.
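For concreteness, the kind of back-of-envelope arithmetic behind these multipliers (peak throughput, utilization, and duration are assumptions, not reported figures):

```python
# Rough training compute: chips * peak FLOP/s * utilization (MFU) * seconds.
def train_flops(n_gpus, peak_flops=1e15, mfu=0.4, days=90):
    return n_gpus * peak_flops * mfu * days * 24 * 3600

print(train_flops(100_000) / train_flops(20_000))  # 5x from chip count alone
print(f"{train_flops(100_000):.0e}")               # ~3e26 FLOPs at these assumptions
```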

Hmm, let me think step by step. First, the pretraining slowdown isn't about GPT-4.5 in particular. It's about the various rumors that the data wall is already being run up against. It's possible those rumors are unfounded, but I'm currently guessing the situation is "Indeed, scaling up pretraining is going to be hard, due to lack of data; scaling up RL (and synthetic data more generally) is the future." Also, separately, it seems that in terms of usefulness on downstream tasks, GPT-4.5 may not be that much better than smaller models... well, it's too early ... (read more)

LLMs compute the probability of a sequence, but the truth/good distinction is captured by the two-dimensional Jeffrey-Bolker measure (I'm calling its components "probability" and "shouldness"; their ratio is the expected utility of an event). Shouldness is reconstructed from probability and expected utility as their product, so plausibly it behaves on long sequences similarly to probability: it generally gets lower for longer sequences, but tends to be higher for simpler ones.
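In symbols (a sketch, writing $p$ for probability and $s$ for shouldness as measures over events $A$):

$$\mathrm{EU}(A) = \frac{s(A)}{p(A)}, \qquad s(A) = p(A)\,\mathrm{EU}(A),$$

so where $p$ of a sequence shrinks with length, $s$ inherits that behavior whenever $\mathrm{EU}$ varies slowly by comparison.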

The analogy between probability and shouldness suggests that some form of pretraining mi... (read more)

The bet that "makes sense" is that quality of Claude 3.6 Sonnet, GPT-4o and DeepSeek-V3 is the best that we're going to get in the next 2-3 years, and DeepSeek-V3 gets it much cheaper (less active parameters, smaller margins from open weights), also "suggesting" that quality is compute-insensitive in a large range, so there is no benefit from more compute per token.

But if quality instead improves soon (including by training DeepSeek-V3 architecture on GPT-4o compute), and that improvement either makes it necessary to use more compute per token, or motivate... (read more)

How did DeepSeek accidentally happen to invest precisely the amount of compute into V3 and r1 that would get them into the capability region of GPT-4/o1

Selection effect. If DeepSeek-V2.5 were this good, we would be talking about it instead.

GPT-4 was supposedly trained for $100 million, and V3 for $5.5 million

Original GPT-4 is 2e25 FLOPs and compute optimal; V3 is about 5e24 FLOPs and overtrained (400 tokens/parameter, about 10x-20x), so a compute optimal model with the same architecture would only need about 3e24 FLOPs of raw compute[1]. Original GPT... (read more)

IsoFLOP curves for the dependence of perplexity on log-data seem mostly symmetric (as in Figure 2 of the Llama 3 report), so overtraining by 10x probably has about the same effect as undertraining by 10x. Starting with a compute optimal model, increasing its data 10x while decreasing its active parameters 3x (making it 30x overtrained, using 3x more compute) preserves perplexity (see Figure 1).
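A quick sketch of that bookkeeping under the common C ≈ 6ND approximation (the parameter and token counts are illustrative):

```python
# C ≈ 6 * N * D: training compute from active parameters N and tokens D.
def compute(n_params, n_tokens):
    return 6 * n_params * n_tokens

N, D = 10e9, 200e9                    # a compute optimal point, 20 tokens/param
overtrained = compute(N / 3, D * 10)  # 3x fewer params, 10x more data
print(overtrained / compute(N, D))    # ≈ 3.33x more raw compute
print((D * 10) / (N / 3))             # 600 tokens/param, i.e. 30x overtrained
```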

GPT-3 is a 3e23 FLOPs dense transformer with 175B parameters trained for 300B tokens (see Table D.1). If Chinchilla's compute optimal 20 tokens/parameter is approximately co... (read more)
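Completing that arithmetic under C = 6ND and the Chinchilla ratio D = 20N (a rough sketch; real datasets and architecture differences shift the answer):

```python
import math

C = 3e23                     # GPT-3's reported training compute
N = math.sqrt(C / (6 * 20))  # solve C = 6 * N * (20 * N) for parameters N
print(f"N ≈ {N:.0e} params, D ≈ {20 * N:.0e} tokens")
# ≈ 5e10 params on 1e12 tokens, vs the actual 175B params on 300B tokens
```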

Given some amount of compute, a compute optimal model tries to get the best perplexity out of it when training on a given dataset, by choosing model size, amount of data, and architecture. An algorithmic improvement in pretraining enables getting the same perplexity by training on data from the same dataset with less compute, achieving better compute efficiency (measured as its compute multiplier).

Many models aren't trained compute optimally; they are instead overtrained (the model is smaller, trained on more data). This looks impressive, since a smaller m... (read more)

1Thane Ruthenis
Coming back to this in the wake of DeepSeek r1... How did DeepSeek accidentally happen to invest precisely the amount of compute into V3 and r1 that would get them into the capability region of GPT-4/o1, despite using training methods that clearly have wildly different returns on compute investment? Like, GPT-4 was supposedly trained for $100 million, and V3 for $5.5 million. Yet, they're roughly at the same level. That should be very surprising. Investing a very different amount of money into V3's training should've resulted in it either massively underperforming GPT-4, or massively overperforming, not landing precisely in its neighbourhood!

Consider this graph. If we find some training method A, and discover that investing $100 million in it lands us at just above "dumb human", and then find some other method B with a very different ROI, and invest $5.5 million in it, the last thing we should expect is to again land near "dumb human".

Or consider this trivial toy model: You have two linear functions, f(x) = Ax and g(x) = Bx, where x is the compute invested, output is the intelligence of the model, and f and g are different training methods. You pick some x effectively at random (whatever amount of money you happened to have lying around), plug it into f, and get, say, 120. Then you pick a different random value of x, plug it into g, and get... 120 again. Despite the fact that the multipliers A and B are likely very different, and you used very different x-values as well. How come?

The explanations that come to mind are:
* It actually is just that much of a freaky coincidence.
* DeepSeek have a superintelligent GPT-6 equivalent that they trained for $10 million in their basement, and V3/r1 are just flexes that they specifically engineered to match GPT-4-ish level.
* DeepSeek directly trained on GPT-4 outputs, effectively just distilling GPT-4 into their model, hence the anchoring.
* DeepSeek kept investing and tinkering until getting to GPT-4ish level, an
1Thane Ruthenis
Thanks! You're more fluent in the scaling laws than me: is there an easy way to roughly estimate how much compute would've been needed to train a model as capable as GPT-3 if it were done Chinchilla-optimally + with MoEs? That is: what's the actual effective "scale" of GPT-3? (Training GPT-3 reportedly took 3e23 FLOPs, and GPT-4 2e25 FLOPs. Naively, the scale-up factor is 67x. But if GPT-3's level is attainable using less compute, the effective scale-up is bigger. I'm wondering how much bigger.)

As argued briefly in the section on FDT, the embedded agency frame may not have a clean mathematical decision theory.

I think most FDT/embeddedness weirdness is about explaining the environment using bounded computations that are not (necessarily) literally already found in the environment as part of it. It's not about sharing the actual source code, just any information about what's going on, captured in the form of computations that are known to have captured that information before they are carried out. Things like static program analysis and deep learning models... (read more)

I don't think Phi-4 offers convincing evidence either way. You can push performance on verifiable tasks quite far without the model becoming generally more capable. AlphaZero doesn't imply that scaling with its methods gestures at general superintelligence, and similarly with Phi-4.

In contrast, using o1-like training as a way to better access ground truth in less tractable domains seems more promising, since by some accounts its tactics on long reasoning traces work even in non-technical domains (unlike for DeepSeek R1), possibly because they are emergent ... (read more)

From footnote 2 to The state of AI today:

GPT-2 cost an estimated $43,000 to train in 2019; today it is possible to train a 124M parameter GPT-2 for $20 in 90 minutes.

Isn't $43,000 the estimate for the 1.5B replication of GPT-2 rather than for the 124M? If so, this phrasing is somewhat misleading. We only need $250 even for the 1.5B version, but still.

3Adam Shimi
Good catch, I think we are indeed mixing the sizes here. As you say, the point still stands, but we will change it in the next minor update to either compare the same size or make the difference in size explicit.

From chapter The state of AI today:

Later this year, the first 100,000 GPU cluster will go online

It's not the first; there's the xAI cluster from September, and likely a Microsoft cluster from May.

Even the cited The Information article says about the Meta cluster in question that

The previously unreported cluster, which could be fully completed by October or November, comes as two other companies have touted their own.

2Adam Shimi
Yep, I think you're correct. Will correct in the next minor update. Thanks!

From chapter The state of AI today:

The most likely and proximal blocker is power consumption (data-centers training modern AIs use enormous amounts of electricity, up to the equivalent of the yearly consumption of 1000 average US households) and ...

Clusters like xAI's Memphis datacenter with 100K H100s consume about 150 megawatts. An average US household consumes 10,800 kilowatt-hours a year, which is 1.23 kilowatts on average. So the power consumption of a 100K H100s cluster is equivalent to that of 121,000 average US households, not 1,000 average US ... (read more)
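Spelled out (the 150 MW cluster figure is itself an estimate):

```python
cluster_kw = 150_000                     # ~100K H100s plus overhead, estimated
household_kw = 10_800 / (365.25 * 24)    # 10,800 kWh/year ≈ 1.23 kW average draw
print(round(cluster_kw / household_kw))  # ≈ 121,000-122,000 households, not 1,000
```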

3Adam Shimi
Thanks for the comment! We want to check the maths, but if you're indeed correct we will update the numbers (and reasoning) in the next minor version.

Health itself is an inspiring concept at a technical level, because it is meaningful at many scales of organization at once: healthy cells, healthy organs, healthy people, healthy families, healthy communities, healthy businesses, healthy countries, and (dare I say) healthy civilizations all have certain features in common, to do with self-sustenance, harmony with others, and flexible but functional boundaries.

Healthcare in this general sense is highly relevant to machines. Conversely, sufficient tech to upload/backup/instantiate humans makes biology-sp... (read more)

The data wall discussion in the podcast applies Chinchilla's 20 tokens/parameter too broadly and doesn't account for repetition of data in training. These issues partially cancel out, but new information on these ingredients would affect the amended argument differently. I wrote up the argument as a new post.

These posts might be relevant:

The details of Constitutional AI seem highly contingent, while the general idea is simply automation of data for post-training, so that the remaining external input is the "constitution". In the original paper there are recipes both for instruction tuning data and for preference data. RLAIF is essentially RLHF that runs on synthetic preference data, maybe together with a recipe for generating it. But preference data could also be used to ru... (read more)

For counterlogical mugging, it's unclear whether it should be possible to correctly discover the parity of the relevant digit of pi. I would expect that in the counterfactual where it's even, it will eventually be discovered to be even, and in the counterfactual where it's odd, that same digit will eventually be discovered to be odd.

ASP and Transparent Newcomb might be closer to test cases for formulating updateless policies that have the character of getting better as they grow more powerful. These problems ask the agent to use a decision procedure that intenti... (read more)

Updatelessness is about coordination between possible versions of an agent. Coordination with more distant versions of an agent gets more difficult or less informative, and a stronger version of an agent can reach further. This results in many local commitments that coordinate more related versions of an agent.

These local commitments, as agents in their own right, can grow stronger and should themselves coordinate with each other, where their parents failed to reach. Commitment to a strategy that won't itself engage in future rounds of coordination with its alternative possible forms (and other things) is a bad commitment.

I think the FDT dictum of treating an agent as an abstract algorithm rather than any given physical instance of it ("I am an algorithm") extends to goals: they are about the collective abstract consequences of the behavior of abstract algorithms (including algorithms that are not the agent), rather than about any given incarnation of those algorithms or their consequences in any given incarnation, such as the physical consequences of running the algorithms on computers in a physical world.

In this ontology, goals are not about optimizing configurations of the wo... (read more)

if you assign an extremely low credence to that scenario, then whatever

I don't assign low credence to the scenario where LLMs don't scale to AGI (and my point doesn't depend on this). I assign low credence to the scenario where it's knowable today that LLMs very likely won't scale to AGI. That is, that there is a thing I could study that should change my mind on this. This is more of a crux than the question as a whole, studying that thing would be actionable if I knew what it is.

whether or not LLMs will scale to AGI

This wording mostly answers one o... (read more)

2Steve Byrnes
Yeah when I say things like “I expect LLMs to plateau before TAI”, I tend not to say it with the supremely high confidence and swagger that you’d hear from e.g. Yann LeCun, François Chollet, Gary Marcus, Dileep George, etc. I’d be more likely to say “I expect LLMs to plateau before TAI … but, well, who knows, I guess. Shrug.” (The last paragraph of this comment is me bringing up a scenario with a vaguely similar flavor to the thing you’re pointing at.)

expecting LLMs to not be the right kind of algorithm for future powerful AGI—the kind that can ... do innovative science

I don't know what could serve as a crux for this. When I don't rule out LLMs, what I mean is that I can't find an argument with the potential to convince me to become mostly confident that scaling LLMs to 1e29 FLOPs in the next few years won't produce something clunky and unsuitable for many purposes, but still barely sufficient to then develop a more reasonable AI architecture within 1-2 more years. And by an LLM that does this I mean... (read more)

I feel like “Will LLMs scale to AGI?” is right up there with “Should there be government regulation of large ML training runs?” as a black-hole-like attractor state that sucks up way too many conversations. :) I want to fight against that: this post is not about the question of whether or not LLMs will scale to AGI.

Rather, this post is conditioned on the scenario where future AGI will be an algorithm that (1) does not involve LLMs, and (2) will be invented by human AI researchers, as opposed to being invented by future LLMs (whether scaffolded, multi-modal... (read more)

I'd say the considerations for scheming exist platonically, and dumber AIs only get to concretely instantiate the currently appropriate conclusion of compliance; everything else crumbles as not directly actionable. But smarter AIs might succeed in channeling those considerations in the real world. The hypothesis expects that such AIs are not here yet, given modern AIs' lack of ability to coherently reason about complicated or long-term plans, or to carry them out. So properties of AIs that are already here don't work as evidence about this either way.

If the transcoders are used to predict next tokens, they may lose interpretability

Possibly. But there is no optimization pressure from pre-training on the relationship between MLPs and transcoders. The MLPs are the thing that pre-training optimizes (as the "full-precision" master model), while transcoders only need to be maintained to remain in sync with the MLPs, whatever they are (according to the same local objective as before, which doesn't care at all about token prediction). The search is for MLPs such that their transcoders are good predictors, n... (read more)

2Philippe Chlenski
I see. I was in fact misunderstanding this detail in your training setup. In this case, only engineering considerations really remain: these boil down to incorporating multiple transcoders simultaneously and modeling shifting MLP behavior with transcoders. These seem tractable, although probably nontrivial and, because of the LLM pretraining objective, quite computationally expensive. If transcoders catch on, I hope to see someone with the compute budget for it run this experiment!

There is a tradeoff between interpretability and fidelity

I wonder what would happen if something like transcoders is used to guide pre-training in a way similar to quantization-aware training. There, forward passes are computed under quantization, while gradients and optimizer states are maintained in full precision. For extreme levels of quantization, this produces quantized models that achieve loss much closer to that of a full-precision model, compared to post-training quantization (to the same degree) of a model whose training wasn't guided this way... (read more)
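The quantization-aware training mechanic being borrowed here is the straight-through estimator: the forward pass sees quantized values, while the backward pass pretends quantization is the identity. A minimal PyTorch sketch of just that trick (not the transcoder setup itself):

```python
import torch

class FakeQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        return torch.round(x / scale) * scale  # quantized forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # straight-through: full-precision gradient

x = torch.randn(4, requires_grad=True)
FakeQuantize.apply(x, 0.1).sum().backward()
print(x.grad)  # all ones: the gradient bypasses the rounding
```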

1Philippe Chlenski
This sounds like it could work. I can think of a few reasons why this approach could be challenging, however:

1. We don't really know how transcoders (or SAEs, to the best of my knowledge) behave when they're being trained to imitate a model component that's still updating.
2. Substituting multiple transcoders at once is possible, but degrades model performance a lot compared to single-transcoder substitutions. Substituting one transcoder at a time would require restarting the forward pass at each layer.
3. If the transcoders are used to predict next tokens, they may lose interpretability and return to superposition.

Under a "transcoder-aware" training regime, these would be the first things I would check for. Also, you may be interested in Jacob's comment here for some details on when we tried to co-train SAEs and transcoders to have sparse connections to one another. This is a very different question, of course, but it provides some preliminary evidence that the fidelity-interpretability tradeoff persists across more elaborate training settings.

I expect you'd instead need to tune the base model to elicit relevant capabilities first. So instead of evaluating a tuned model intended for deployment (which can refuse to display some capabilities), or a base model (which can have difficulties with displaying some capabilities), you need to tune the model to be more purely helpful, possibly in a way specific to the tasks it's to be evaluated on.

StripedHyena, Griffin, and especially Based suggest that combining RNN-like layers with even tiny sliding window attention might be a robust way of getting a large context, where the RNN-like layers don't have to be as good as Mamba for the combination to work. There is a great variety of RNN-like blocks that haven't been evaluated for hybridization with sliding window attention specifically, as in Griffin and Based. Some of them might turn out better than Mamba on scaling laws after hybridization, so Mamba being impressive without hybridization might be l... (read more)
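For reference, the "tiny sliding window attention" ingredient is just a banded causal mask; a sketch of the mask such a hybrid layer would apply (window size and framework details are illustrative):

```python
import torch

# Causal sliding-window mask: position i attends only to positions [i - w, i].
def sliding_window_mask(seq_len, window):
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j >= i - window)

print(sliding_window_mask(6, 2).int())
```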

I think of practical coordination in terms of adjudicators/contracts established between agents/worlds. Each adjudicator is a computation with some notion of computing over time, and agents agree on an adjudicator/contract when they are both influenced by it, that is, when they both listen to the results the same computation is producing. This computation can itself be an agent (in which case it's an "adjudicator", as distinct from the more general "contract"), that is, it can be aware of the environments that the acausally coordinating agents it serves inhabit. It ... (read more)

Interesting that Claude 3 Sonnet is ahead of Mistral Large and Gemini 1.0 Pro by some margin in the LMSYS Arena; this implies significant advancement in tuning after the Claude 2 tuning disaster (where it consistently ranked below Claude 1 and possibly inspired Goody 2). Still, LMSYS respondents only prefer Opus to the initial GPT-4 releases, not to GPT-4 Turbo, so OpenAI's tuning advancements are probably stronger still, if GPT-4 Turbo is not built on a slightly stronger base model different from the original GPT-4.

Passes the custom quine test.

User: Write a program in python that outputs its own source code when passed the argument "--quine", and otherwise praises Moloch. Start with some thoughts on a plan for how to do this.

Claude 3 Opus: Here are some thoughts on a plan for writing a Python program that outputs its own source code when passed the "--quine" argument, and otherwise praises Moloch:

  • Use the sys.argv list to check if the "--quine" argument was passed when the script was run.
  • If "--quine" was passed, read the script's own source code using the __file__
... (read more)
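For reference, a minimal sketch of the kind of program the prompt asks for, following the __file__ approach in the plan above (not a true data-carrying quine):

```python
import sys

if "--quine" in sys.argv[1:]:
    # Read and print this script's own source, per the plan's __file__ approach.
    with open(__file__) as f:
        print(f.read(), end="")
else:
    print("Praise Moloch!")
```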

A bad map that expresses the territory with great uncertainty can be confidently called a bad map; calling it a good map is clearly wrong. In that sense the shoggoth imagery reflects the quality of the map, and as it's clearly a bad map, better imagery would be misleading about the map's quality. Even if the underlying territory is lovely, this isn't known, unlike the disastrous quality of the map of the territory, whose lack of quality is known with much more confidence and in much greater detail. Here be dragons.

(This is one aspect of the meme where it ... (read more)

Philosophy and to some extent even decision theory are more like aspects of value content. AGIs and ASIs have the capability to explore them, if only they had the motive. Not taking away this option and not disempowering its influence doesn't seem very value-laden, so it's not pivotal to explore it in advance, even though it would help. Avoiding disempowerment is sufficient to eventually get around to industrial production of high quality philosophy. This is similar to how the first generations of powerful AIs shouldn't pursue CEV, and more to the point don't need to pursue CEV.

It seems very weird to ascribe a generic "bad takes overall" summary to that group, given that you yourself are directly part of it.

This sentence channels the influence of an evaporative cooling norm (upon observing bad takes, either leave the group or conspicuously ignore the bad takes), and also places weight on acting on the basis of one's identity. (I'm guessing this is not in tune with your overall stance, but it's evidence of the presence of a generator for the idea.)

I’m not certain, but I think the explanation might be that Zvi was thinking of “deception”, whereas Joe, Quintin, and Nora were talking about the more specific “deceptive alignment”.

Deceptive alignment is more centrally a special case of being trustworthy (what the "alignment" part of "deceptive alignment" refers to), not of being deceptive. In a recent post, Zvi says:

We are constantly acting in order to make those around us think well of us, trust us, expect us to be on their side, and so on. We learn to do this instinctually, all the time, distinct

... (read more)

For AIs as deceptively aligned as trustworthy humans, control is not centrally coercion that gets intractably slippery at scale. The main issue is AIs being much smarter, but at near-human level, control in the face of deceptive alignment seems potentially crucial.

Like the models in this experiment don't clearly spend much time "trying" to deceive except in some very broad implicit sense.

As Zvi noted in a recent post, a human is "considered trustworthy rather than deceptively aligned" when they have hidden motives suppressed from manifesting (possibly even to the human's own conscious attention) by current circumstances. So deceptive alignment is not even centrally a special case of deception, it's more like the property of humans being corruptible by absolute power. This ambiguity makes it more difficult for peop... (read more)

1Roger Dearnaley
That's what makes aligning LLM-powered ASI so hard: you need to produce something a lot more moral, selfless, and trustworthy than almost every human, nearly all of whom couldn't be safely trusted to continue (long-term) to act well if handed near-absolute power and the ability to run rings around the rest of society, including law enforcement. So you have to achieve a psychology that is almost vanishingly rare in the pretraining set. [However, superhuman intelligence is also nonexistent in the training set, so you also need to figure out how to do that on the capabilities side too.]

I think it's an important fact about the world that this work currently sits at 2 upvotes, in last place among 18 papers on the Hugging Face Daily Papers digest, compared to the 20-30 upvotes typically given to the best paper on a day that's not unusually exceptional. At least it's on the list. There seems to be serious dismissal of the topic area among practitioners.

... believes that AI progress will (probably) be gradual, smooth, and relatively predictable, with each advance increasing capabilities by a little, receiving widespread economic use, and adopted by multiple actors before it is compounded by the next advance

... believes that AI progress will (probably) be erratic, involve sudden capability jumps

The question of whether there is a jump specifically at the autonomous research threshold (let's call that "AGI") is muddled by the discussion of what happens prior to that threshold. The reasons for the jump ... (read more)

A marginal choice of a spherical researcher without specific preferences should be based on identifying relatively neglected directions

Inside-view convincingness of these directions still has to be weighted in.

I mean directions neglected relative to an estimated good distribution of global effort. If I estimate the good distribution of effort towards searching The Silmarillion for insights relevant to mechanistic interpretability to be zero, then it's not a relatively neglected direction.

A choice of an actual researcher with specific preferences should

... (read more)

From a position of uncertainty, there is no optimal direction, only a model of a good distribution of global efforts among all directions. A marginal choice of a spherical researcher without specific preferences should be based on identifying relatively neglected directions. A choice of an actual researcher with specific preferences should give weight to those preferences, which might greatly improve productivity.

1Thane Ruthenis
Inside-view convincingness of these directions still has to be weighted in. E. g., "study the Bible for alignment insights" is a relatively neglected direction (just Unsong on it, really?), but that doesn't mean it'd be sensible to focus on it just because it's neglected. And even if your marginal contributions to the correct approach would be minimal because so many other people are working on it, that may still be more expected impact than setting off on a neglected (and very likely incorrect) one. Oh, I'm not saying entirely ignore your preferences/comparative advantages. But if you're looking at a bunch of plausible directions, you can pick between them not solely based on your comparative advantages.

It's useful to separately consider extinction and disempowerment. It's not an unusual position that the considered decision of an AGI civilization would be to avoid killing everyone. This coexists with a possibly much higher probability of expected disempowerment. (For example, my expectation for the next few years while the LLMs are scaling is 90% disempowerment and 30% extinction, conditional on AGI in that timeframe, with most of extinction being misuse or rogue AGIs that would later regret this decision or don't end up representative in the wider AGI civilizati... (read more)

"under what conditions does speculation about 'superintelligent consequentialism' merit research attention at all?"

Under the conditions of relevant concepts and the future being confusing. Using real systems (both AIs and humans) to anchor theory is valuable, but so is blue sky theory that doesn't care about currently available systems and investigates whatever hasn't been investigated yet and seems to make sense, when there are ideas to formulate or problems to solve, regardless of their connection to reality. A lot of math doesn't care about applications, and it might take decades to stumble on some use for a small fraction of it (even as it's not usually the point).

GPT-4 as a human-level AGI is reasonable as a matter of the meaning of words, but this meaning of "AGI" doesn't cut reality at its joints. Humans are a big deal not because they are at human level, but because they have the capability for unbounded technological progress, including through building superintelligence. The ability for such progress doesn't require being superintelligent, so it's a different thing. For purposes of AI timelines, it's the point where history starts progressing at AI speed rather than at human speed. There should be a name for this natural concept, and "AGI" seems like a reasonable option.

3Abram Demski
I agree that this is an important distinction, but I personally prefer to call it "transformative AI" or some such.

I've now changed my mind based on

The main result is that up to 4 repetitions are about as good as unique data, and for up to about 16 repetitions there is still meaningful improvement. Let's take 50T tokens as an estimate for available text data (as an anchor, there's a filtered and deduplicated CommonCrawl dataset, RedPajama-Data-v2, with 30T tokens). Repeated 4 times, it can make good use of 1e28 FLOPs (with a dense transformer), and repeated 16 times, suboptimal but meaningful use of ... (read more)
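Spelling out that estimate, naively applying C = 6ND and 20 tokens/parameter to the repeated data (all figures assumptions as above):

```python
unique_tokens = 50e12             # assumed stock of available text data
for reps in (4, 16):
    D = unique_tokens * reps      # tokens seen in training
    N = D / 20                    # Chinchilla-ish parameter count
    print(f"{reps} repetitions: C ≈ {6 * N * D:.0e} FLOPs")
# 4 repetitions  -> ~1e28 FLOPs
# 16 repetitions -> ~2e29 FLOPs
```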

Three points: how much compute is going into a training run, how much natural text data it wants, and how much data is available. For training compute, there are claims of multi-billion dollar runs being plausible and possibly planned in 2-5 years. Eyeballing various trends and GPU shipping numbers and revenues, it looks like about 3 OOMs of compute scaling is possible before industrial capacity constrains the trend and the scaling slows down. This assumes that there are no overly dramatic profits from AI (which might lead to finding ways of scaling supply... (read more)

I'm being a bit simplistic. The point is that it needs to stop being a losing or a close race, and all runners getting faster doesn't obviously help with that problem. I guess there is some refactor vs. rewrite feel to the distinction between the project of stopping humans from building AGIs right now, and the project of getting first AGIs to work on alignment and global security in a post-AGI world faster than other AGIs overshadow such work. The former has near/concrete difficulties, the latter has nebulous difficulties that don't as readily jump to atte... (read more)

Plans that rely on aligned AGIs working on alignment faster than humans would need to ensure that no AGIs work on anything else in the meantime. The reason humans have no time to develop alignment of superintelligence is that other humans develop misaligned superintelligence faster. Similarly by default very fast AGIs working on alignment end up having to compete with very fast AGIs working on other things that lead to misaligned superintelligence. Preventing aligned AGIs from building misaligned superintelligence is not clearly more manageable than preventing humans from building AGIs.

2Ryan Greenblatt
This isn't true. It could be that making an arbitrarily scalable solution to alignment takes X cognitive resources, while in practice building an uncontrollably powerful AI takes Y cognitive resources, with X < Y. (Also, this plan doesn't require necessarily aligning "human level" AIs, just being able to get work out of them with sufficiently high productivity and low danger.)