Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

Strongly agree.

Related: It's disheartening to recognize, but it seems the ML community might not even get past the first crucial step in reducing risks, which is understanding them. We appear to live in a world where most people, including key decision-makers, still don't grasp the gravity of the situation. For instance, in France, we still hear influential figures like Arthur Mensch, CEO of Mistral, saying things like, "When you write this kind of software, you always control what's going to happen, all the outputs the software can have." As long as such individuals are leading AGI labs, the situation will remain quite dire.

+1 for the conflationary alliances point. It is especially frustrating when I hear junior people interchange "AI Safety" and "AI Alignment." These are two completely different concepts, and one can exist without the other. (The fact that the main forum for AI Safety is the "Alignment Forum" does not help with this confusion). I'm not convinced the goal of the AI Safety community should be to align AIs at this point.

However, I want to make a small amendment to Myth 1: I believe that technical work which enhances safety culture is generally very positive. Examples of such work include scary demos like "BadLlama," which I cite at least once a week, or benchmarks such as Evaluating Frontier Models for Dangerous Capabilities, which tries to monitor particularly concerning capabilities. More "technical" works like these seem overwhelmingly positive, and I think that we need more competent people doing this.

[-]kave2y917

It is especially frustrating when I hear junior people interchange "AI Safety" and "AI Alignment." These are two completely different concepts, and one can exist without the other. (The fact that the main forum for AI Safety is the "Alignment Forum" does not help with this confusion)

One issue is there's also a difference between "AI X-Safety" and "AI Safety". It's very natural for people working on all kinds of safety from and with AI systems to call their field "AI safety", so it seems a bit doomed to try and have that term refer to x-safety.

[-]cousin_it2y910

Yeah, this agrees with my thinking so far. However, I think if you could research how to align AIs specifically to human flourishing (as opposed to things like obedience/interpretability/truthfulness, which defer to the user's values), that kind of work could be more helpful than most.

[-]Andrew_Critch2y414

I very much agree with human flourishing as the main value I most want AI technologies to pursue and be used to pursue.

In that framing, my key claim is that in practice no area of purely technical AI research — including "safety" and/or "alignment" research — can be adequately checked for whether it will help or hinder human flourishing, without a social model of how the resulting techologies will be used by individuals / businesses / governments / etc..

[-]Rafael Harth2y56

I don't find this framing compelling. Particularly wrt to this part:

Obedience — AI that obeys the intention of a human user can be asked to help build unsafe AGI, such as by serving as a coding assistant. (Note: this used to be considered extremely sci-fi, and now it's standard practice.)

I grant the point that an AI that does what the user wants can still be dangerous (in fact it could outright destroy the world). But I'd describe that situation as "we successfully aligned AI and things went wrong anyway" rather than "we failed to align AI". I grant that this isn't obvious; it depends on how exactly AI alignment is defined. But the post frames its conclusions as definitive rather than definition-dependent, which I don't think is correct.

Is the-definition-of-alignment-which-makes-alignment-in-isolation-a-coherent-concept obviously not useful? Again, I don't think so. If you believe that "AI destroying the world because it's very hard to specify a utility function that doesn't destroy the world" is a much larger problem than "AI destroying the world because it obeys the wrong group of people", then alignement (and obedience in particular) is a concept useful in isolation. In particular, it's... well, it's not definitely helpful, so your introductory sentence remains literally true, but it's very likely helpful. The important thing is does make sense to work on obedience without worrying about how it's going to be applied because increasing obedience is helpful in expectation. It could remain helpful in expectation even if it accelerates timelines. And note that this remains true even if you do define Alignment in a more ambitious way.

I'm aware that you don't have such a view, but again, that's my point; I think this post is articulating the consequences of a particular set of beliefs about AI, rather than pointing out a logical error that other people make, which is what its framing suggests.

[-]Donald Hobson2y32

Physics Myths vs reality.

Myth: Ball bearings are perfect spheres.

Reality: The ball bearings have slight lumps and imperfections due to manufacturing processes.

Myth: Gravity pulls things straight down at 9.8 m/s/s.

Reality: Gravitational force varies depending on local geology.

You can do this for any topic. Everything is approximations. The only question is if they are good approximations.

[-]RogerDearnaley2y20

I'm not certain that Myth #1 is a necessarily myth for all approaches to AI Safety. Specifically, if the Value Learning approach to AI safety turned out to be the most effective one, then the AI will be acting as an alignment researcher and doing research (in the social sciences) to converge its views on human values to the truth, and then using that as an alignment target. If in addition to that, you also believe that human values are a matter of objective fact (e.g. that if they are mostly determined by a set of evolved Evolutionary Psychology adaptations to the environmental niche that humans evolved in), and are independent of background/cilture/upbringing, then the target that this process converges to might be nearly independent of the human social context in which this work started, and of the desires/views/interests of the specific humans involved at the beginning of the process.

However, that is a rather strong and specific set of assumptions required for Myth #1 not to be a myth: I certainly agree that in general and by default, for most ideas in Alignment, human context matters, and that the long-term outcome of a specific Alignment technique being applied in, say, North Korea, might differ significantly from it being applied in North America.

[-]Raemon1y10

Curated.

The overall point here seems true and important to me.

I think I either disagree, or am agnostic about, some of the specific examples given in the Myth vs Reality section. I don't think they're loadbearing for the overall point. I may try to write those up in more detail later.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

114

Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

114

Myths vs reality

What to make of all this

Recap of key points