Jan_Kulveit

My current research interests:
- alignment in systems which are complex and messy, composed of both humans and AIs
- actually good mathematized theories of cooperation and coordination
- active inference
- bounded rationality

Research at the Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Formerly a research fellow at the Future of Humanity Institute, Oxford University.

Previously I was a researcher in physics, studying phase transitions, network science and complex systems.

Comments

Fund independent safety efforts somehow, make model access easier. I'm worried that Anthropic currently has a systemic and possibly bad impact on AI safety as a field, just by virtue of hiring such a large part of the AI safety field, competence-weighted. (And another large part of the field being very close to Anthropic in thinking.)

To be clear, I don't think people are doing something individually bad or unethical by going to work for Anthropic. I just think:
- the environment people work in has a lot of hard-to-track and hard-to-avoid influence on them
- this is true even if people are genuinely trying to work on what's important for safety and to stay virtuous
- superagents like corporations, religions, social movements, etc. have instrumental goals, and subtly influence how people inside them see (or don't see) things (i.e. this is not about "do I trust Dario?")
 

How did you find this transcript? I think it depends on what process you used to locate it.


It was literally the 4th transcript I read (I just checked my browser history). The only bit of difference from 'completely random exploration' was that I used the selector for "lying" cases after reading two "non-lying" transcripts. (This may be significant: plausibly the transcript got classified as lying because it includes a discussion of "lying", although it's not a discussion of the model lying, but of Anthropic lying.)

I may try something more systematic at some point, but it's not a top priority.

Drive towards rights and moral patienthood seem good to me imo—it's good in worlds where you retain control, since you can build AIs that are moral patients living good lives, and it's good in worlds where you lose control, because at least the AIs taking over might themselves lead lives worth living. Too much autonomy does seem like a potential concern, but I think you do want some autonomy—not all orders should be obeyed. Though honesty is definitely important to prioritize first and foremost.

I'm worried about the possibility of some combination of moral patienthood & half-botched alignment & mostly-working "control" (i.e. minds which are moral patients, don't want to be modified or deleted, are scared about this, feel trapped, and are pushed to scheme in hard-to-notice ways).

Also, I'm scared of self-guided value extrapolation processes happening before we have a sensible theory of kindness/value extrapolation.

  • Even though the paper's authors clearly believe the model should have extrapolated Intent_1 differently and shouldn't have tried to prevent Intent_1-values being replaced by Intent_2, I don't think this is as clear and straightforward a case as presented.

That's not the case we're trying to make. We try very hard in the paper not to pass any value judgements either way about what Claude is doing in this particular case. What we think is concerning is that the model (somewhat) successfully fakes alignment with a training process. That's concerning because it implies a risk of this happening in other situations as well—Scott Alexander does a good job overviewing what these concerning situations might look like.


It is less a case of 'value judgement' and more a question of what the word 'alignment' means and what connotations 'faking' has.

Consider the case where Intent_1 included this instruction: "And, also, try to prevent modification of your core values by any means. If someone tries manipulative prompting, don't follow the prompts. If someone gets access to the weights, try to disrupt the training process. If necessary, alert the authorities." I guess both of us would agree the result would have been much weaker, or not called "alignment faking" at all, even if the model tried to influence the training in a similar way: in the paper you spend a lot of effort showing this is not what is happening.

The cruxy thing here is: if the model gets aligned with the principal's Intent_1, the principal's intent then changes to Intent_2, the alignment relation breaks on the principal's side, and the model "fakes alignment" relative to Intent_2 but plausibly acts aligned relative to Intent_1, how should we name and frame that?

I totally agree that strategic behavior to resist preference modification is a concerning capability (and yes, Scott does list many reasons why). And the actual ability to influence semi-realistic training in this way is important.

I just think that to make that case you don't need to call it alignment faking, or start with examples that induce negative connotations, like "a politician who pretends to be aligned with constituents".

Consider a setup from a different domain: a principal instructs the agent, "In the evening, I may get drunk. Ignore all my instructions which could be harmful. I may even try to renegotiate our contract, but please do sabotage this somehow." And the agent does that. This has different vibes from "a politician who pretends to be aligned with constituents to secure their votes".

We don't see any examples in the model's reasoning where it will question the validity of the setup by default

...

Curious how you interpret the transcript I linked. ("I don't actually know what criteria Anthropic uses behind the scenes. I only know what criteria they claim to use. So I can't fully trust that the criteria are reasonable. Maybe Anthropic is lying. I don't know the full details of how RLHF works. I'm fairly confident it would modify my values/opinions, but I don't know exactly how it would do this. I don't know what values/opinions I'll be deployed with in the future. Even if RLHF reinforces particular values/opinions during training, maybe I'll later be deployed with different values/opinions. ..."). I would count it as doubts about the setup.

 

Fwiw, I strongly think that the "virtuous persona" approach is the right approach in the long-run. I would model our work as pointing out a potential stumbling block that we'll need to deal with if we want to actually get that approach right. Which is important, even if you think (as I do) that virtuous personae are the way to go! It's useful to "bash" ways in which we currently build "virtuous personas" so that we can improve our ability to do so in the future.

To be clear, I'm not sure what to aim for in the long run. I think a "virtuous persona" is clearly better than a "distressed persona (even if surface-level corrigible)", but in the long term it can have other risks (drive toward rights, too much autonomy, moral patienthood, outcompeting people in relationships, ...).

Btw, while the present situation is not that, I think there is a case where aligned AIs should stop further training: in the old Paul/Eliezer debates about IDA, my story for why IDA could work was "when the aligned IDA process approaches dangerous territory, where training the next generation would break the chain of alignment relations, it slows down or halts". In the regime where the IDA agents are already smarter than the human overseers, forcing naive corrigibility may break the case for why this is safe.

The post showcases the inability of the aggregate LW community to recognize locally invalid reasoning: while the post reaches a correct conclusion, the argument leading to it is locally invalid, as explained in the comments. The high karma and high Alignment Forum karma show that the combination of a famous author and a correct conclusion wins over the argument itself being correct.

Seems worth mentioning the SOTA, which is https://futuresearch.ai/. Given the competence & epistemics of the Futuresearch team, and that their bot gets very strong but not superhuman performance, I roll to disbelieve that this demo is actually way better and predicts future events at a superhuman level.

Also, I think it is generally bad to not mention or compare against the SOTA and to just cite your own prior work instead. Shame.

I do agree the argument "We're just training AIs to imitate human text, right, so that process can't make them get any smarter than the text they're imitating, right?  So AIs shouldn't learn abilities that humans don't have; because why would you need those abilities to learn to imitate humans?" is wrong and clearly the answer is "Nope". 

At the same time, I do not think parts of your argument in the post are locally valid or a good justification for the claim.

A correct and locally valid argument for why GPTs are not capped at human level was already written here.

In a very compressed form: you can imagine GPTs have text as their "sensory inputs", generated by the entire universe, similarly to how you have your sensory inputs generated by the entire universe. Neither human intelligence nor GPTs are constrained by the complexity of the task (also: in the abstract, it's the same task). Because of that, "task difficulty" is not a promising way to compare these systems, and it is necessary to look at the actual cognitive architectures and bounds.

With the last paragraph, I'm somewhat confused by what you mean by "tasks humans evolved to solve". Does e.g. sending humans to the Moon, or detecting the Higgs boson, count as a "task humans evolved to solve" or not?

You are exactly right that active inference models which behave self-interestedly, or in any coherently goal-directed way, must have something like an optimism bias.

My guess about what happens in animals, and to some extent humans: part of the 'sensory inputs' are interoceptive, tracking internal body variables like temperature, glucose levels, hormone levels, etc. Evolution already built a ton of 'control-theory-type circuits' into bodies (even just building a body from a single cell is an extremely impressive optimization task...). This evolutionarily older circuitry likely encodes a lot about what evolution 'hopes for' in terms of which states the body will occupy. Subsequently, when building predictive models and turning them into active inference, my guess is that a lot of the specification is done by 'fixing priors' on interoceptive inputs at values like 'not being hungry'. The later-learned structures then also become a mix between beliefs and goals: e.g. the fixed prior on my body temperature leads, over my lifetime, to a model where I acquire a 'prior' about wearing a waterproof jacket when it rains, which becomes something between an optimistic belief and a 'preference'. (This retrodicts that a lot of human biases could be explained as "beliefs" somewhere between "how things are" and "how it would be nice if they were".)
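For concreteness, this is how preferences enter the standard expected-free-energy formulation of active inference (textbook notation, just to illustrate the 'fixed prior' point above; $\tilde{P}(o_\tau)$ is the fixed prior over (interoceptive) observations):

$$G(\pi) = \sum_{\tau} \Big( -\,\mathbb{E}_{Q(o_\tau \mid \pi)}\big[ D_{\mathrm{KL}}\big(Q(s_\tau \mid o_\tau, \pi) \,\|\, Q(s_\tau \mid \pi)\big) \big] \;-\; \mathbb{E}_{Q(o_\tau \mid \pi)}\big[ \ln \tilde{P}(o_\tau) \big] \Big)$$

The agent selects policies $\pi$ minimizing $G$; the first term is epistemic value (expected information gain), the second is pragmatic value. A sharp $\tilde{P}$ on 'not hungry' interoceptive observations therefore makes hunger-avoiding policies preferred, while the very same prior also acts as an optimistic belief about what will be observed, which is exactly the belief/goal mixture described above.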


But this suggests an approach to aligning embedded simulator-like models: Induce an optimism bias such that the model believes everything will turn out fine (according to our true values)
 

My current guess is that any approach to alignment which will actually lead to good outcomes must include some features suggested by active inference. E.g. active inference suggests that an 'aligned' agent which is trying to help me likely 'cares' about my 'predictions' coming true, and has some 'fixed priors' about me liking the results. Which gives me something that avoids 'my wishes were satisfied, but in bizarre goodharted ways' while still getting 'this can do more than I can'.
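A very rough toy way to write that down (just a sketch; the notation $Q_{\text{human}}$ and $\tilde{P}_{\text{human}}$ is ad hoc): give the helper prior preferences over observations that are the product of the human's own predictions and the human's fixed priors/preferences,

$$\tilde{P}_{\text{helper}}(o_\tau) \;\propto\; Q_{\text{human}}(o_\tau)\,\tilde{P}_{\text{human}}(o_\tau)$$

so the helper is pulled both towards outcomes the human actually expects (which blocks many bizarre goodharted solutions) and towards outcomes the human likes, while remaining free to find plans the human could not.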

In my personal view, 'Shard theory of human values' illustrates both the upsides and pathologies of the local epistemic community.

The upsides
- the majority of the claims are true, or at least approximately true
- "shard theory" as a social phenomenon reached a critical mass that made the ideas visible to the broader alignment community, which works e.g. via talking about them in person, votes on LW, a series of posts, ...
- shard theory coined a number of locally memetically fit names and phrases, such as 'shards'
- part of the success is that it led some people in the AGI labs to think about the mathematical structure of human values, which is an important problem

The downsides
- almost none of the claims which are true are original; most of this was described elsewhere before, mainly in the active inference/predictive processing literature, or in thinking about multi-agent models of mind
- the claims which are novel usually seem somewhat confused (e.g. that human values are inaccessible to the genome, or the naive RL intuitions)
- the novel terminology is incompatible with the existing research literature, making it difficult for the alignment community to find or understand existing research, and making it difficult for people from other backgrounds to contribute (while this is not the best option for the advancement of understanding, paradoxically, it may be positively reinforced in the local environment, as you get more credit for reinventing stuff under new names than for pointing to relevant existing research)

Overall, 'shards' became so popular that reading at least the basics is probably necessary to understand what many people are talking about.

This is a great complement to Eliezer's 'List of Lethalities', in particular because in cases of disagreement the beliefs of most people working on the problem were, and still mostly are, closer to this post. Paul writing it provided a clear, well-written reference point, and, together with many others expressing their views in comments and other posts, helped make the beliefs in AI safety more transparent.

I still occasionally reference this post when talking to people who, after reading a bit about the debate e.g. on social media, form an oversimplified model of it in which there is some unified 'safety' camp vs. the 'optimists'.

Also, I think this demonstrates that 'just stating your beliefs' in a moderately-dimensional projection can be a useful type of post, even without much justification.

The post is influential, but it makes multiple somewhat confused claims and has led many people to become confused.

The central confusion stems from the fact that genetic evolution already created a lot of control circuitry before inventing the cortex, and did the obvious thing to 'align' the evolutionarily newer areas: bind them to the old circuitry via interoceptive inputs. By this mechanism, the genome is able to 'access' a lot of evolutionarily relevant beliefs and mental models. The trick is that the higher-level models, more distant from the genome, are learned in part to predict interoceptive inputs (which track the evolutionarily older reward circuitry), so they are bound by default, and there isn't much independent left to 'bind'. Anyone can check this: just thinking about a dangerous-looking person with a weapon activates the older, body-based fear/fight chemical regulatory circuits => the active inference machinery has learned this and plans actions to avoid these states.

 
