Charbel-Raphaël argues that interpretability research has poor theories of impact. It's not good for predicting future AI systems, can't actually audit for deception, lacks a clear end goal, and may be more harmful than helpful. He suggests other technical agendas that could be more impactful for reducing AI risk.
I generally like your breakdown and way of thinking about this, thanks. Some thoughts:
Also available in markdown at theMultiplicity.ai/blog/schelling-goodness.
This post explores a notion I'll call Schelling goodness. Claims of Schelling goodness are not first-order moral verdicts like "X is good" or "X is bad." They are claims about a class of hypothetical coordination games in the sense of Thomas Schelling, where the task being coordinated on is a moral verdict. In each such game, participants aim to give the same response regarding a moral question, by reasoning about what a very diverse population of intelligent beings would converge on, using only broadly shared constraints: common knowledge of the question at hand, and background knowledge from the survival and growth pressures that shape successful civilizations. Unlike many Schelling coordination games, we'll be focused on scenarios with no shared history or knowledge...
Thanks, that's helpful.
What seems clear to me is that our world is the result of fairly simple laws of physics, and our creators wanted to know how those simple laws would play out. They're saying "if there were a universe with these laws, what would happen?" (This is what I'd meant by "simulation".)
I agree it's less clear that they're doing this because they think those laws also describe a real-world process (somewhere in the multiverse) and they want to predict the outcome of that process. (This is what you meant by "simulation", and I think your definition is better....
This is a hackathon project write-up. This does not represent the views of my employer. Thanks to Tianyi Alex Qiu and Aidan Ewart for helpful discussions.
Sometimes, people will search for and find a situation where an LLM would do something bad, and then argue that the situation was natural enough that it is concerning. When is this a valid reason to be concerned? What can be learned from such evidence?
This post:
You can directly argue that the scary demos are representative of a broader class of situations where the model might actually do something bad.
For example, in the case of an...
What happens if you merge the bash and the audit tool, just giving the AI a single bash tool from which it can fragment?
This post is an attempt to better operationalize FDT (functional decision theory). It answers the following questions:
Consider Parfit's hitchhiker:
An FDT agent is supposed to reason as follows:
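Not from the post, but here is a toy Python sketch of the standard FDT-versus-CDT comparison on Parfit's hitchhiker; the specific payoffs (a $1,000,000 value on surviving and a $1,000 payment in the city) are assumed placeholders, not numbers from the post:

```python
# Toy model of Parfit's hitchhiker (illustrative only; payoffs are assumed).
# The driver rescues you iff they predict you will pay once in the city.
# FDT evaluates policies (the prediction co-varies with the policy);
# CDT evaluates the action causally, after the rescue has already happened.

VALUE_OF_SURVIVING = 1_000_000  # assumed utility of not dying in the desert
COST_OF_PAYING = 1_000          # assumed payment demanded in the city

def policy_outcome(policy_pays: bool) -> int:
    """Utility when the driver's (accurate) prediction matches the policy."""
    if policy_pays:
        return VALUE_OF_SURVIVING - COST_OF_PAYING  # rescued, then pay
    return 0                                        # left in the desert

# FDT-style evaluation: pick the policy with the better outcome.
fdt_choice = max([True, False], key=policy_outcome)
print("FDT policy: pay =", fdt_choice)  # pay = True

# CDT-style evaluation once already rescued: paying only subtracts the cost,
# so the causal comparison says don't pay -- which is why a (predictable)
# CDT agent never gets picked up in the first place.
cdt_choice = max([True, False], key=lambda pay: -COST_OF_PAYING if pay else 0)
print("CDT action once rescued: pay =", cdt_choice)  # pay = False
```

The point of the sketch is only the structure of the comparison: FDT scores whole policies (which the predictor's behavior depends on), while CDT scores the action in isolation after the prediction has already been made.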
I can't check today, but whoops, sorry if I typoed the equation at some step.
(Fictional) Optimist: So you expect future artificial superintelligence (ASI) “by default”, i.e. in the absence of yet-to-be-invented techniques, to be a ruthless sociopath, happy to lie, cheat, and steal, whenever doing so is selfishly beneficial, and with callous indifference to whether anyone (including its own programmers and users) lives or dies?
Me: Yup! (Alas.)
Optimist: …Despite all the evidence right in front of our eyes from humans and LLMs.
Me: Yup!
Optimist: OK, well, I’m here to tell you: that is a very specific and strange thing to expect, especially in the absence of any concrete evidence whatsoever. There’s no reason to expect it. If you think that ruthless sociopathy is the “true core nature of intelligence” or whatever, then you should really look at yourself in a mirror and...
Thanks!
I’m interested in why you think consequentialism is necessarily maximising. An AGI might have multiple mutually incompatible goals it is solving for, and choose some balance of those, not maximising on any.
For one thing, my headline claim is “ruthless sociopath”, not “maximizing”. “Ruthless sociopath” is pointing to something that’s missing (intrinsic concern for the welfare of other people), not something that’s present (behaviors that maximize something in the world).
For another thing, strictly speaking, perfect maximization is impossible without ...
This is a nuanced “I was wrong” post.
Something I really like about AI safety and EA/rationalist circles is the ease and positivity with which people approach being criticised.[1] For all the blowups and stories of representative people in the communities not living up to the stated values, my experience so far has been that the desire to be truth-seeking and to stress-test your cherished beliefs is a real, deeply respected and communally cultivated value. This in particular explains my ability to keep getting jobs and coming to conferences in this community, despite being very eager to criticise and call bullshit on people’s theoretical agendas.
One such agenda that I’ve been a somewhat vocal critic of (and which received my criticism amazingly well) is the “heuristic arguments” picture and the ARC research...
>>> Basically, a very optimistic hope for understanding “why circuits behave differently from functions” is to make statements of the form: conditional on some “surprising” property P holding for a circuit (here “surprising” means that it occurs much more often for circuits than for functions), we can deduce that, with high probability, the circuit's internals have some property A(C).
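Read literally (my formalization, not the quoted author's; the circuit distribution $\mathcal{D}$, the function distribution $\mathcal{F}$, and the tolerance $\varepsilon$ are placeholders), the quoted hope has the shape

$$\Pr_{C \sim \mathcal{D}}\!\big[A(C)\,\big|\,P(C)\big] \;\ge\; 1-\varepsilon, \qquad \text{where } P \text{ is “surprising”: } \Pr_{C \sim \mathcal{D}}[P(C)] \;\gg\; \Pr_{f \sim \mathcal{F}}[P(f)].$$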
My understanding of the NCP-conjecture is different [and I claim more interesting].
If we want, we can go from a surprising property Q [your P] holding for the function to some property ...