Quick Takes

gwern3632

LLM support for writing LessWrong posts: virtual comments.

Back in August I discussed with Rafe & Oliver a bit about how to integrate LLMs into LW2 in ways which aren't awful and which encourage improvement---particularly using the new 'prompt caching' feature. To summarize one idea: we can use long-context LLMs with prompt caching to try to simulate various LW users of diverse perspectives to write useful feedback on drafts for authors.

(Prompt caching (eg) is the Transformer version of the old RNN hidden-state caching trick, where you run an input thro... (read more)

Yeah the LW team has been doing this sort of thing internally, still in the experimental phase. I don't know if we've used all the tricks listed here yet. 

An example snippet of a rogue internal deployment threat model that I find fairly compelling:

  1. The schemer, while deployed in an authorized scaffold, launches an unmonitored scaffold that has access to trusted model weights.
  2. It uses this scaffold to do a GCG attack on the trusted model, then writes the resulting GCG attack somewhere that the agent in the authorized scaffold can see. This takes a moderate amount of compute, but a small amount compared to the amount the untrusted model will regularly need to run experiments.
  3. The agent in the authorized scaffold
... (read more)
1Zach Stein-Perlman
Why can't the schemer just do the bad stuff in the unmonitored scaffold from (1)?
2Buck Shlegeris
I don’t understand, can you say more?

What's a bad thing that the agent can do with the jailbreak that it can't just do in the unmonitored scaffold?

Edit: dumb question, oops: obviously the schemer will have affordances that it doesn't have within the unmonitored scaffold / there are actions where it's helpful to jailbreak the monitor. E.g. submitting a pull request, where the code will be read by the monitor. So finding a jailbreak is helpful to the schemer.

Project proposal: No-holds-barred jailbreak attacks and defenses for control

Normally we’ve done control research under the assumption that the red team isn’t able to use techniques like GCG to make targeted jailbreaks for the trusted monitor. Can the blue team defend if we relax that assumption (e.g. because we’re worried the model might be able to start a rogue deployment in which it runs GCG)?

The blue team can try a variety of strategies here.

  • The simplest strategy is randomizing the prompt to the trusted monitor. This forces the red team to find a jailbr
... (read more)

Project proposal: Control for auditing

(I'm sharing this here because I needed to link to it, and I thought I might as well make it public.)

When preventing high stakes failures, a crucial bottleneck is the proportion of actions that you can audit: that is, investigate the action further for the purpose of finding evidence that the AI is scheming. (See here for more on this. In particular, note that this might be different from the problem of investigating an action for the purpose of learning whether that action is dangerous.) If we need to rely o... (read more)

Two different meanings of “misuse”

The term "AI misuse" encompasses two fundamentally different threat models that deserve separate analysis and different mitigation strategies:

  • Democratization of offense-dominant capabilities
    • This involves currently weak actors gaining access to capabilities that dramatically amplify their ability to cause harm. That amplification of ability to cause harm is only a huge problem if access to AI didn’t also dramatically amplify the ability of others to defend against harm, which is why I refer to “offense-dominant” capabilitie
... (read more)

Technical measures to prevent users from using the AI for particular tasks don’t help against the threat of the lab CEO trying to use the AI for those harmful tasks

Actually, it is not that clear to me. I think adversarial robustness is helpful (in conjunction with other things) to prevent CEOs from misusing models.

If at some point in a CEO trying to take over wants to use HHH to help them with the takeover, that model will likely refuse to do egregiously bad things. So the CEO might need to use helpful-only models. But there might be processes in plac... (read more)

8jbash
By far the most important risk isn't that they'll steal them. It's that they will be fully authorized to misuse them. No security measure can prevent that.
4Buck Shlegeris
That's a great way of saying it. I edited this into my original comment.
Mark Xu6833

Alignment researchers should think hard about switching to working on AI Control

I think Redwood Research’s recent work on AI control really “hits it out of the park”, and they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount. I think that implies lots of alignment researchers should shift to AI control type work, and would naively guess that the equilibrium is close to 50/50 across people who a... (read more)

Showing 3 of 24 replies (Click to show all)
2johnswentworth
I do think that "we don't have enough information to know where the bottlenecks are yet" is in-general a reasonable counterargument to a "just focus on the bottlenecks" approach (insofar as we in fact do not yet have enough information). In this case I think we do have enough information, so that's perhaps a deeper crux.
Mark Xu40

Hiliariously, it seems likely that our disagreement is even more meta, on the question of "how do you know when you have enough information to know", or potentially even higher, e.g. "how much uncertainty should one have given that they think they know" etc.

1Mark Xu
see my longer comment https://www.lesswrong.com/posts/A79wykDjr4pcYy9K7/mark-xu-s-shortform#8qjN3Mb8xmJxx59ZG

A theory of how alignment research should work

(cross-posted from danielfilan.com)

Epistemic status:

  • I listened to the Dwarkesh episode with Gwern and started attempting to think about life, the universe, and everything
  • less than an hour of thought has gone into this post
  • that said, it comes from a background of me thinking for a while about how the field of AI alignment should relate to agent foundations research

Maybe obvious to everyone but me, or totally wrong (this doesn't really grapple with the challenges of working in a domain where an intelligent ... (read more)

4Chris_Leong
I agree that we probably want most theory to be towards the applied end these days due to short timelines. Empirical work needs theory in order to direct it, theory needs empirics in order to remain grounded.
4Chris_Leong
Thanks for writing this. I think it is a useful model. However, there is one thing I want to push back against: I agree with Apollo Research that evals isn't really a science yet. It mostly seems to be conducted according to vibes. Model internals could help with this, but things like building experience or auditing models using different schemes and comparing them could help make this more scientific. Similarly, a lot of work with Model Organisms of Alignment requires a lot of careful thought to get right.

When I wrote that, I wasn't thinking so much about evals / model organisms as stuff like:

basically stuff along the lines of "when you put agents in X situation, they tend to do Y thing", rather than trying to understand latent causes / capabilities

I recently read "Targeted manipulation and deception emerge when optimizing LLMs for user feedback." 

All things considered: I think this paper oversells its results, probably in order to advance the author(s)’ worldview or broader concerns about AI. I think it uses inflated language in the abstract and claims to find “scheming” where there is none. I think the experiments are at least somewhat interesting, but are described in a suggestive/misleading manner. 

The title feels clickbait-y to me --- it's technically descriptive of their findings, but... (read more)

I agree with many of these criticisms about hype, but I think this rhetorical question should be non-rhetorically answered.

No, that’s not how RL works. RL - in settings like REINFORCE for simplicity - provides a per-datapoint learning rate modifier. How does a per-datapoint learning rate multiplier inherently “incentivize” the trained artifact to try to maximize the per-datapoint learning rate multiplier? By rephrasing the question, we arrive at different conclusions, indicating that leading terminology like “reward” and “incentivized” led us astray. 

... (read more)
22Micah Carroll
Thank you for your comments. There are various things you pointed out which I think are good criticisms, and which we will address: * Most prominently, after looking more into standard usage of the word "scheming" in the alignment literature, I agree with you that AFAICT it only appears in the context of deceptive alignment (which our paper is not about). In particular, I seemed to remember people using it ~interchangeably with “strategic deception”, which we think our paper gives clear examples of, but that seems simply incorrect. * It was a straightforward mistake to call the increase in benchmark scores for Sycophancy-Answers “small” “even [for] our most harmful models” in Fig 11 caption. We will update this. However, also note that the main bars we care about in this graph are the most “realistic” ones: Therapy-talk (Mixed 2%) is a more realistic setting than Therapy-talk in which 100% of users are gameable, and for that environment we don’t see any increase. This is also true for all other environments, apart from political-questions on Sycophancy-Answers. So I don’t think this makes our main claims misleading (+ this mistake is quite obvious to anyone reading the plot). * More experiments are needed to test how much models can assess manipulability through multi-turn interactions without any signal that is 100% correlated to manipulability handed to them. That being said, I truly think that this wouldn't be that different from our follow-up experiments in which we don’t directly tell the model who is manipulable directly: it doesn’t take much of an extrapolation to see that RL would likely lead to the same harmful behaviors even when the signal is noisily correlated or when in a multi-turn mechanism, as I’ve argued elsewhere (the mechanism for learning such behavior being very similar): learning harmful behaviors would require a minimal amount of info-gathering about the user. In a 2 turn interaction, you first ask them some question that helps you disting

What's the most important technical question in AI safety right now?

9Buck Shlegeris
In terms of developing better misalignment risk countermeasures, I think the most important questions are probably: * How to evaluate whether models should be trusted or untrusted: currently I don't have a good answer and this is bottlenecking the efforts to write concrete control proposals. * How AI control should interact with AI security tools inside labs. More generally: * How can we get more evidence on whether scheming is plausible? * How scary is underelicitation? How much should the results about password-locked models or arguments about being able to generate small numbers of high-quality labels or demonstrations affect this?

"How can we get more evidence on whether scheming is plausible?" - What if we ran experiments where we included some pressure towards scheming (either RL or fine-tuning) and we attempted to determine the minimum such pressure required to cause scheming? We could further attempt to see how this interacts with scaling.

Anthropic says:

More realistic evaluations are difficult without risking actual sabotage. One difficulty of running realistic evaluations is presenting realistic attack surfaces. Perhaps the hardest and most important attack surface to accurately model is the suspiciousness of expert humans in high-stakes decisions. To test a given set of realworld anti-sabotage mitigations, it’s tempting to introduce some sort of “penetration testing” or red-teaming, in which models are instructed to attempt to harmlessly but verifiably sabotage real decisions in a revers

... (read more)
11Ryan Greenblatt
Low importance aside: Seems slightly unproductive and unnecessarily connotation inducing to say "Anthropic says:". Maybe "The Sabotage eval paper (from Anthropic) says:". Like this was just written by various people on the Anthropic alignment science team and by no means should discussion sections of papers be interpreted to speak for the company overall. Obviously not very important.

OK, good point, I should have said "From Anthropic" or something like that.

8Ryan Greenblatt
We discuss similar issues in meta-level adversarial evaluations of oversight. The most simple approach is just to have a human in the loop monitoring every action the AI does in combination with targeting a proxy failure rather than a real failure. This probably can reduce risk to a point where this is a small fraction of overall risk. Human monitoring might be too expensive, in which case we'll have to go with sandboxing and pure proxy.

[epistemic status: I think I’m mostly right about the main thrust here, but probably some of the specific arguments below are wrong. In the following, I'm much more stating conclusions than providing full arguments. This claim isn’t particularly original to me.]

I’m interested in the following subset of risk from AI:

  • Early: That comes from AIs that are just powerful enough to be extremely useful and dangerous-by-default (i.e. these AIs aren’t wildly superhuman).
  • Scheming: Risk associated with loss of control to AIs that arises from AIs scheming
    • So e.g. I exclu
... (read more)
Showing 3 of 6 replies (Click to show all)
3Matthew Barnett
Can you be more clearer this point? To operationalize this, I propose the following question: what is the fraction of world GDP you expect will be attributable to AI at the time we have these risky AIs that you are interested in?  For example, are you worried about AIs that will arise when AI is 1-10% of the economy, or more like 50%? 90%?

One operationalization is "these AIs are capable of speeding up ML R&D by 30x with less than a 2x increase in marginal costs".

As in, if you have a team doing ML research, you can make them 30x faster with only <2x increase in cost by going from not using your powerful AIs to using them.

With these caveats:

  • The speed up is relative to the current status quo as of GPT-4.
  • The speed up is ignoring the "speed up" of "having better experiments to do due to access to better models" (so e.g., they would complete a fixed research task faster).
  • By "capable" o
... (read more)
4Buck Shlegeris
When I said "AI control is easy", I meant "AI control mitigates most risk arising from human-ish-level schemers directly causing catastrophes"; I wasn't trying to comment more generally. I agree with your concern.

The fact that an AI arms race would be extremely bad does not imply that rising global authoritarianism is not worth worrying about (and vice versa)

I am someone who is worried both about AI risks (from loss of control, and from war and misuse/structural risks) and from what seems to be a 'new axis' of authoritarian threats cooperating in unprecedented ways.

I won't reiterate all the evidence here, but these two pieces and their linked sources should suffice:

... (read more)
Showing 3 of 33 replies (Click to show all)

Two thoughts about the role of quining in IBP:

  • Quine's are non-unique (there can be multiple fixed points). This means that, viewed as a prescriptive theory, IBP produces multi-valued prescriptions. It might be the case that this multi-valuedness can resolve problems with UDT such as Wei Dai's 3-player Prisoner's Dilemma and the anti-Newcomb problem[1]. In these cases, a particular UDT/IBP (corresponding to a particular quine) loses to CDT. But, a different UDT/IBP (corresponding to a different quine) might do as well as CDT.
  • What to do about agents that don
... (read more)
2Vanessa Kosoy
Ambidistributions I believe that all or most of the claims here are true, but I haven't written all the proofs in detail, so take it with a grain of salt. Ambidistributions are a mathematical object that simultaneously generalizes infradistributions and ultradistributions. It is useful to represent how much power an agent has over a particular system: which degrees of freedom it can control, which degrees of freedom obey a known probability distribution and which are completely unpredictable. Definition 1: Let X be a compact Polish space. A (crisp) ambidistribution on X is a function Q:C(X)→R s.t. 1. (Monotonocity) For any f,g∈C(X), if f≤g then Q(f)≤Q(g). 2. (Homogeneity) For any f∈C(X) and λ≥0, Q(λf)=λQ(f). 3. (Constant-additivity) For any f∈C(X) and c∈R, Q(f+c)=Q(f)+c. Conditions 1+3 imply that Q is 1-Lipschitz. We could introduce non-crisp ambidistributions by dropping conditions 2 and/or 3 (and e.g. requiring 1-Lipschitz instead), but we will stick to crisp ambidistributions in this post. The space of all ambidistributions on X will be denoted ♡X.[1] Obviously, □X⊆♡X (where □X stands for (crisp) infradistributions), and likewise for ultradistributions. Examples Example 1: Consider compact Polish spaces X,Y,Z and a continuous mapping F:X×Y→Z. We can then define F♡∈♡Z by F♡(u):=maxθ∈ΔXminη∈ΔYEθ×η[u∘F] That is, F♡(u) is the value of the zero-sum two-player game with strategy spaces X and Y and utility function u∘F. Notice that F in Example 1 can be regarded as a Cartesian frame: this seems like a natural connection to explore further. Example 2: Let A and O be finite sets representing actions and observations respectively, and Λ:{O∗→A}→□(A×O)∗ be an infra-Bayesian law. Then, we can define Λ♡∈♡(A×O)∗ by Λ♡(u):=maxπ:O∗→AEΛ(π)[u] In fact, this is a faithful representation: Λ can be recovered from Λ♡. Example 3: Consider an infra-MDP with finite state set S, initial state s0∈S and transition infrakernel T:S×A→□S. We can then define the "ambikernel" T
4Vanessa Kosoy
Here is a modification of the IBP framework which removes the monotonicity principle, and seems to be more natural in other ways as well. First, let our notion of "hypothesis" be Θ∈□c(Γ×2Γ). The previous framework can be interpreted in terms of hypotheses of this form satisfying the condition prΓ×2ΓBr(Θ)=Θ (See Proposition 2.8 in the original article.) In the new framework, we replace it by the weaker condition Br(Θ)⊇(idΓ×diag2Γ)∗Θ This can be roughly interpreted as requiring that (i) whenever the output of a program P determines whether some other program Q will run, program P has to run as well (ii) whenever programs P and Q are logically equivalent, program P runs iff program Q runs. The new condition seems to be well-justified, and is also invariant under (i) mixing hypotheses (ii) taking joins/meets of hypotheses. The latter was not the case for the old condition. Moreover, it doesn't imply that Θ is downward closed, and hence there is no longer a monotonicity principle[1]. The next question is, how do we construct hypotheses satisfying this condition? In the old framework, we could construct hypotheses of the form Ξ∈□c(Γ×Φ) and then apply the bridge transform. In particular, this allows a relatively straightforward translation of physics theories into IBP language (for example our treatment of quantum theory). Luckily, there is an analogous construction in the new framework as well. First notice that our new condition on Θ can be reformulated as requiring that * suppΘ⊆elΓ * For any s:Γ→Γ define τs:ΔcelΓ→ΔcelΓ by τsθ:=χelΓ(s×id2Γ)∗. Then, we require τsΘ⊆Θ. For any Φ, we also define τΦs:Δc(elΓ×Φ)→Δc(elΓ×Φ) by τΦsθ:=χelΓ×Φ(s×id2Γ×Φ)∗ Now, for any Ξ∈□c(Γ×Φ), we define the "conservative bridge transform[2]" CBr(Ξ)∈□c(Γ×2Γ×Φ) as the closure of all τΦsθ where θ is a maximal element of Br(Ξ). It is then possible to see that Θ∈□c(Γ×2Γ) is a valid hypothesis if and only if it is of the form prΓ×2ΓCBr(Ξ) for some Φ and Ξ∈□c(Γ×Φ). 1. ^ I still thi
Thomas Kwa7-13

You should update by +-1% on AI doom surprisingly frequently

This is just a fact about how stochastic processes work. If your p(doom) is Brownian motion in 1% steps starting at 50% and stopping once it reaches 0 or 1, then there will be about 50^2=2500 steps of size 1%. This is a lot! If we get all the evidence for whether humanity survives or not uniformly over the next 10 years, then you should make a 1% update 4-5 times per week. In practice there won't be as many due to heavy-tailedness in the distribution concentrating the updates in fewer events, and ... (read more)

1Tsvi Benson-Tilsen
Probabilities on summary events like this are mostly pretty pointless. You're throwing together a bunch of different questions, about which you have very different knowledge states (including how much and how often you should update about them).
15Lawrence Chan
The general version of this statement is something like: if your beliefs satisfy the law of total expectation, the variance of the whole process should equal the variance of all the increments involved in the process.[1] In the case of the random walk where at each step, your beliefs go up or down by 1% starting from 50% until you hit 100% or 0% -- the variance of each increment is 0.01^2 = 0.0001, and the variance of the entire process is 0.5^2 = 0.25, hence you need 0.25/0.0001 = 2500 steps in expectation. If your beliefs have probability p of going up or down by 1% at each step, and 1-p of staying the same, the variance is reduced by a factor of p, and so you need 2500/p steps.  (Indeed, something like this standard way to derive the expected steps before a random walk hits an absorbing barrier).  Similarly, you get that if you start at 20% or 80%, you need 1600 steps in expectation, and if you start at 1% or 99%, you'll need 99 steps in expectation.  ---------------------------------------- One problem with your reasoning above is that as the 1%/99% shows, needing 99 steps in expectation does not mean you will take 99 steps with high probability -- in this case, there's a 50% chance you need only one update before you're certain (!), there's just a tail of very long sequences. In general, the expected value of variables need not look like I also think you're underrating how much the math changes when your beliefs do not come in the form of uniform updates. In the most extreme case, suppose your current 50% doom number comes from imagining that doom is uniformly distributed over the next 10 years, and zero after --  then the median update size per week is only 0.5/520 ~= 0.096%/week, and the expected number of weeks with a >1% update is 0.5 (it only happens when you observe doom). Even if we buy a time-invariant random walk model of belief updating, as the expected size of your updates get larger, you also expect there to be quadratically fewer of them -- e.

I talked about this with Lawrence, and we both agree on the following:

  • There are mathematical models under which you should update >=1% in most weeks, and models under which you don't.
  • Brownian motion gives you 1% updates in most weeks. In many variants, like stationary processes with skew, stationary processes with moderately heavy tails, or Brownian motion interspersed with big 10%-update events that constitute <50% of your variance, you still have many weeks with 1% updates. Lawrence's model where you have no evidence until either AI takeover happen
... (read more)
gwern90

I have some long comments I can't refind now (weirdly) about the difficulty of investing based on AI beliefs (or forecasting in general): similar to catching falling knives, timing is all-important and yet usually impossible to nail down accurately; specific investments are usually impossible if you aren't literally founding the company, and indexing 'the entire sector' definitely impossible. Even if you had an absurd amount of money, you could try to index and just plain fail - there is no index which covers, say, OpenAI.

Apropos, Matt Levine comments on o... (read more)

gwern41

"Gwern, why don't you just buy an AI-themed ETF and 'buy the whole sector' if investing in individual stonks is so hard but you're optimistic about its long-term value?"

"How to Lose Money on the World’s Most Popular Investment Theme: Pity the investors in the three artificial-intelligence-themed ETFs that managed to lose money this year" (mirror):

There are lots of embarrassing ways to lose money, but it is particularly galling to lose when you correctly identify the theme that will dominate the market and manage to buy into it at a good moment.

Pity the i

... (read more)
4gwern
Masayoshi Son reflects on selling Nvidia in order to maintain ownership of ARM etc: https://x.com/TheTranscript_/status/1805012985313903036 "Let's stop talking about this, I just get sad."
2gwern
So among the most irresponsible tech stonk boosters has long been ARK's Cathy Woods, whose antics I've refused to follow in any detail (except to periodically reflect that in bull markets the most over-leveraged investors always look like geniuses); so only today do I learn that beyond the usual stuff like slobbering all over TSLA (which has given back something like 4 years of gains now), Woods has also adamantly refused to invest in Nvidia recently and in fact, managed to exit her entire position at an even worse time than SoftBank did: "Cathie Wood’s Popular ARK Funds Are Sinking Fast: Investors have pulled a net $2.2 billion from ARK’s active funds this year, topping outflows from all of 2023" (mirror):

on a call, i was discussing my idea for doing activation-level learning to (hopefully) provide models feedback based on their internal computations and choices:

I may have slipped into a word game... are we "training against the [interpretability] detection method" or are we "providing feedback away from one kind of algorithm and towards another"? They seem to suggest very different generalizations, even though they describe the same finetuning process. How could that be?

This is why we need empirics.

Training Attack Policies for Control Evaluations via Prover-Verifier Games (aka Control with Training)

By @Ansh Radhakrishnan and @Buck 

Credit goes to Buck for most of these ideas – most of this is written by Ansh, with some contributions from Buck. Thanks to John Schulman, Fabien Roger,  Akbir Khan, Jiaxin Wen, Ryan Greenblatt, Aengus Lynch, and Jack Youstra for their comments.

The original AI control paper evaluates the robustness-to-intentional-subversion of safety techniques by having a red team construct an attack policy (a drop-in replac... (read more)

Mark Xu4024

AI safety researchers might be allocated too heavily to Anthropic compared to Google Deepmind

Some considerations:

  • Safety researchers should want Google Deepmind (GDM) to have a robust and flourishing safety department. It seems plausible that GDM will be able to create "the smartest" models: they have lots of talent, and own lots of computers. (see e.g. https://epochai.org/data/notable-ai-models#computing-capacity)
  • Anthropic (ANT) might run into trouble in the future due to not owning their own computers, e.g. if Amazon (or where ever they're renting th
... (read more)
Showing 3 of 8 replies (Click to show all)

The high level claim seems pretty true to me. Come to the GDM alignment team, it's great over here! It seems quite important to me that all AGI labs have good safety teams

Thanks for writing the post!

2Mark Xu
I was intending to warn about the possibility of future perception of corruption, e.g. after a non-existential AI catastrophe. I do not think anyone currently working at safety teams is percieved as that "corrupted", although I do think there is mild negative sentiment among some online communities (some parts of twitter, reddit, etc.).
7Evan Hubinger
Some possible counterpoints: * Centralization might actually be good if you believe there are compounding returns to having lots of really strong safety researchers in one spot working together, e.g. in terms of having other really good people to work with, learn from, and give you feedback. * My guess would be that Anthropic resources its safety teams substantially more than GDM in terms of e.g. compute per researcher (though I'm not positive of this). * I think the object-level research productivity concerns probably dominate, but if you're thinking about influence instead, it's still not clear to me that GDM is better. GDM is a much larger, more bureaucratic organization, which makes it a lot harder to influence. So influencing Anthropic might just be much more tractable.

A semi-formalization of shard theory. I think that there is a surprisingly deep link between "the AIs which can be manipulated using steering vectors" and "policies which are made of shards."[1] In particular, here is a candidate definition of a shard theoretic policy:

A policy has shards if it implements at least two "motivational circuits" (shards) which can independently activate (more precisely, the shard activation contexts are compositionally represented).

By this definition, humans have shards because they can want food at the same time as wantin... (read more)

Thanks for posting this. I've been confused about the connection between shard theory and activation vectors for a long time!

AIXI is not a shard theoretic agent because it does not have two motivational circuits which can be activated independently of each other

This confuses me.

I can imagine an AIXI program where the utility function is compositional even if the optimisation is unitary. And I guess this isn't two full motivational circuits, but it kind of is tow motivational circuits.

1Thomas Kwa
I'm not so sure that shards should be thought of as a matter of implementation. Contextually activated circuits are a different kind of thing from utility function components. The former activate in certain states and bias you towards certain actions, whereas utility function components score outcomes. I think there are at least 3 important parts of this: * A shardful agent can be incoherent due to valuing different things from different states * A shardful agent can be incoherent due to its shards being shallow, caring about actions or proximal effects rather than their ultimate consequences * A shardful agent saves compute by not evaluating the whole utility function The first two are behavioral. We can say an agent is likely to be shardful if it displays these types of incoherence but not others. Suppose an agent is dynamically inconsistent and we can identify features in the environment like cheese presence that cause its preferences to change, but mostly does not suffer from the Allais paradox, tends to spend resources on actions proportional to their importance for reaching a goal, and otherwise generally behaves rationally. Then we can hypothesize that the agent has some internal motivational structure which can be decomposed into shards. But exactly what motivational structure is very uncertain for humans and future agents. My guess is researchers need to observe models and form good definitions as they go along, and defining a shard agent as having compositionally represented motivators is premature. For now the most important thing is how steerable agents will be, and it is very plausible that we can manipulate motivational features without the features being anything like compositional.

Apply to the "Team Shard" mentorship program at MATS

In the shard theory stream, we create qualitatively new methods and fields of inquiry, from steering vectors to gradient routing[1] to unsupervised capability elicitation. If you're theory-minded, maybe you'll help us formalize shard theory itself.

Research areas

Discovering qualitatively new techniques

Steering GPT-2-XL by adding an activation vector opened up a new way to cheaply steer LLMs at runtime. Additional work has reinforced the promise of this technique, and steering vectors have become a sma

... (read more)
Load More