How can we combine behavioural experiments with mechanistic interpretability to infer an agent’s subjective causal model? The next post will say more about this.
There is no next post. Can I read about it somewhere anyway?
I think it's inappropriate to call evolution a "hill-climbing process" in this context, since those words seem optimized to sneak in parallels to SGD. Separately, I think that evolution is a bad analogy for AGI training.