AI ALIGNMENT FORUM
All of Signer's Comments + Replies

Counting arguments provide no evidence for AI doom

Once we understand that relationship, it should become pretty clear why the overfitting argument doesn’t work: the overfit model is essentially the 2n model, where it takes more bits to specify the core logic, and then tries to “win” on the simplicity by having m unspecified bits of extra information. But that doesn’t really matter: what matters is the size of the core logic, and if there are simple patterns that can fit the data in n bits rather than 2n bits, you’ll learn those.
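One way to make the counting in this quote concrete is the rough sketch below. It assumes, purely for illustration, a fixed budget of `total_bits` parameter bits: a model whose core logic takes `core_bits` leaves the rest unspecified, and every setting of the unspecified bits counts as a separate implementation. The function names and the 64-bit budget are hypothetical and do not come from the post.

```python
def implementation_count(total_bits: int, core_bits: int) -> int:
    """Number of parameter settings implementing a model whose core logic
    needs core_bits, assuming all remaining bits are completely free."""
    if not 0 <= core_bits <= total_bits:
        raise ValueError("core_bits must lie between 0 and total_bits")
    return 2 ** (total_bits - core_bits)


def simple_to_overfit_ratio(total_bits: int, n: int) -> int:
    """How many times more implementations the n-bit model has than the 2n-bit one."""
    simple = implementation_count(total_bits, n)
    overfit = implementation_count(total_bits, 2 * n)
    return simple // overfit


if __name__ == "__main__":
    # Hypothetical numbers: a 64-bit budget and n = 10 bits of core logic.
    print(simple_to_overfit_ratio(total_bits=64, n=10))
```

Under these assumptions the ratio is 2^n whatever the budget is, which is one reading of why the unspecified bits of the 2n model do not let it "win".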

Under this picture, or any other simplicity bias, why do NNs with more parameters generalize better?

Evan Hubinger1y159

Paradoxically, I think larger neural networks are more simplicity-biased.

The idea is that when you make your network larger, you increase the size of the search space and thus the number of algorithms that you're considering to include algorithms which take more computation. That reduces the relative importance of the speed prior, but increases the relative importance of the simplicity prior, because your inductive biases are still selecting from among those algorithms according to the simplest pattern that fits the data, such that you get good generalizat... (read more)

New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?"

Signer1y30

Wait, where? I think the objection to "Doing that is quite hard" is not an objection to "it’s not obviously true that such algorithms are actually “achievable” for SGD" - it's an objection to concluding, from a weak statement about loss decreasing during training, that the model would try hard enough to justify arguments about deception.

2Alex Turner1y

This is... roughly one point I was making, yes.

OpenAI's Alignment Plans

Signer3y10

"Safe" as in "safe enough for it to be on net better to run it" or "safe enough it wouldn't definitely kill everyone". It's not that I don't have the popular intuition that GPT wouldn't kill anyone. It's just that I don't think it's a good habit to run progressively more capable systems while relying on informal intuitions about their safety. And then maybe I will see an explanation for why future safety tools would outpace capability progress, when we are already at the point where current safety tools are not applicable to current AI systems.

2Neel Nanda3y

I'm pretty unconvinced by this. I do not think that any substantial fraction of AI x-risk comes from an alignment researcher who thinks carefully about x-risk deciding that a GPT-3 level system isn't scary enough to take significant precautions with re boxing. I think taking frivolous risks is bad, but that risk aversion to the point of not being able to pursue otherwise promising research directions seems pretty costly, while the benefits of averting risks >1e-9 are pretty negligible in comparison. (To be clear, this argument does not apply to more powerful systems! As systems get smarter we should be more capable, and try to be very conservative! But ultimately everything is a trade-off - letting GPT-3 talk to human contractors giving feedback is a way of letting it out of the box!)

OpenAI's Alignment Plans

Signer3y-40

So, what's the technical (the one ending with "therefore the probability of disaster is < 1e-9") reason why training InstructGPT was safe?

1Neel Nanda3y

Who is claiming that it is safe? I didn't get that implication from the post

The Speed + Simplicity Prior is probably anti-deceptive

Signer3y10

I think you might be conflating two different scenarios?

I present an alternative strategy for a mesaoptimizer that, yes, wasn't in the post, but I don't see why?

Is your claim that that’s basically nothing?

Yes, or at least it approaches relatively nothing as we get a more competent optimizer.

If it’s always super easy to identify the base objective and then optimize it, then $H_{fh}$ shouldn’t need to pay the penalty of storing c(get_base_obj), since it could also near-instantly derive the base objective.

If it doesn't store it, it isn't $H_{fh}$ - it would be abl... (read more)

1[anonymous]3y

What would it want to optimize, then, according to you? I’m claiming that “I want to figure out what the environment wants and then do it” is a simpler goal than “I want X, to get that I’m going to figure out what the environment wants and then do it” Re using both, you’re right, if you make your other assumptions then using both could work. (Though it does still have higher simplicity cost.)

The Speed + Simplicity Prior is probably anti-deceptive

Signer3y10

I think speed and complexity trade off against the model's precision? Like, for simple objectives you can hardcode it, but for a complex one you wouldn't have space for the base objective that you couldn't spend on a table of chemical elements or something. So I would expect t(mesaoptimize_env) for the base objective to be only slightly greater than t(is_deployed_in_world) + t(find(get_base_obj(env))) + t(mesaoptimize_env) for a mesaoptimizer that hardcodes a check, but computes the base objective, because I expect t(find(get_base_obj(env))) to be mostly contained in t(me... (read more)

0[anonymous]3y

I think you might be conflating two different scenarios? t(is_deployed_in_world) and t(find(get_base_obj)) are not simultaneously present in either case. That said, I'll still do my best to respond. By definition, t(find(get_base_obj(env))) is whatever part of t(mesaoptimize_env) that $H_{fh}$ doesn't have to do. Is your claim that that's basically nothing? If it's always super easy to identify the base objective and then optimize it, then $H_{fh}$ shouldn't need to pay the penalty of storing c(get_base_obj), since it could also near-instantly derive the base objective. In that case, both $H_{fd}$ and $H_{sd}$ still lose due to needing to pay for c(simple_objective) and, respectively, c(is_deployed_in_world) or t(mesaoptimize_outer_env). I'm not saying these costs are exorbitant, but they are strictly greater than not paying them - so the prior would never incentivize paying them, and always slightly disincentivize doing so. (I.e. they'd never arise naturally from optimization pressure alone.)
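The comparison in this reply can be sketched as simple bookkeeping. The sketch assumes the case described above, where the base objective can be derived almost for free, so all three models share some common cost and the other two only add terms on top. The `shared` term, the numeric values, and the function name are hypothetical placeholders; only the cost labels come from the comment.

```python
from typing import Dict


def total_costs(shared: float, extra: Dict[str, float]) -> Dict[str, float]:
    """Total prior penalty per model: a shared cost plus model-specific extras."""
    return {
        # Assumed case: Hfh derives the base objective for free, so no extra term.
        "Hfh": shared,
        # Hfd pays c(simple_objective) and c(is_deployed_in_world).
        "Hfd": shared + extra["c(simple_objective)"] + extra["c(is_deployed_in_world)"],
        # Hsd pays c(simple_objective) and t(mesaoptimize_outer_env).
        "Hsd": shared + extra["c(simple_objective)"] + extra["t(mesaoptimize_outer_env)"],
    }


if __name__ == "__main__":
    # Hypothetical magnitudes; the argument only needs them to be positive.
    extras = {
        "c(simple_objective)": 1.0,
        "c(is_deployed_in_world)": 0.5,
        "t(mesaoptimize_outer_env)": 0.25,
    }
    costs = total_costs(shared=10.0, extra=extras)
    print(min(costs, key=costs.get))
```

As long as every extra term is positive, the model with no extra terms has the lowest total, however small the extras are. That is the "strictly greater than not paying them" point in this toy form.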

Late 2021 MIRI Conversations: AMA / Discussion

Signer3y30

It was all very interesting, but what was the goal of these discussions? I mean I had an impression that pretty much everyone assigned >5% probability to "if we scale we all die" so it's already enough reason to work on global coordination on safety. Is the reasoning that the same mental process that assigned too low a probability would not be able to come up with an actual solution? Or something like "at the time they think their solution reduced the probability of failure from 5% to 0.1%, it would still be much higher"? That seems possible only if people don't understand the arguments about inner optimizers or whatnot, as opposed to disagreeing with them.

Rob Bensinger3y30

I mean I had an impression that pretty much everyone assigned >5% probability to "if we scale we all die" so it's already enough reason to work on global coordination on safety.

What specific actions do you have in mind when you say "global coordination on safety", and how much of the problem do you think these actions solve?

My own view is that 'caring about AI x-risk at all' is a pretty small (albeit indispensable) step. There are lots of decisions that hinge on things other than 'is AGI risky at all'.

I agree with Rohin that the useful thing is trying t... (read more)

Response to "What does the universal prior actually look like?"

Signer4y00

To clarify, sufficient observations would still falsify all "simulate simple physics, start reading from simple location" programs and eventually promote "simulate true physics, start reading from camera location"?

2Paul Christiano4y

In this story, I'm imagining that hypotheses like "simulate simple physics, start reading from simple location" lose, but similar hypotheses like "simulate simple physics, start reading from simple location after a long delay" (or after seeing pattern X, or whatever) could be among the output channels that we consider manipulating. Those would also eventually get falsified (if we wanted to deliberately make bad predictions in order to influence the basement world where someone is thinking about the universal prior) but not until a critical prediction that we wanted to influence.
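A toy version of this falsification dynamic is sketched below. It is heavily simplified and rests on several assumptions that are not in the thread: each hypothesis is a deterministic bit predictor, its prior weight is 2^-bits for a made-up description length, and the "delayed" hypothesis is modelled as agreeing with the true stream until one critical step. The stream, the bit counts, and all names are hypothetical.

```python
from fractions import Fraction

TRUE_STREAM = [0, 1, 1, 0, 1, 0, 0, 1]  # hypothetical observations
CRITICAL_STEP = 6                        # where the delayed hypothesis deviates


def camera(step: int) -> int:
    """Stands in for 'simulate true physics, start reading from camera location'."""
    return TRUE_STREAM[step]


def simple_location(step: int) -> int:
    """Stands in for a simple read-out that ignores the real data."""
    return 0


def delayed(step: int) -> int:
    """Agrees with the true stream except at the critical step."""
    bit = TRUE_STREAM[step]
    return 1 - bit if step == CRITICAL_STEP else bit


# name -> (assumed description length in bits, predictor)
HYPOTHESES = {
    "camera": (40, camera),
    "simple_location": (10, simple_location),
    "delayed": (20, delayed),
}


def posterior(num_observed: int) -> dict:
    """Posterior after the first num_observed bits: prior 2^-bits, 0/1 likelihood."""
    weights = {}
    for name, (bits, predict) in HYPOTHESES.items():
        consistent = all(predict(s) == TRUE_STREAM[s] for s in range(num_observed))
        weights[name] = Fraction(1, 2 ** bits) if consistent else Fraction(0)
    total = sum(weights.values())  # never zero: "camera" is always consistent
    return {name: w / total for name, w in weights.items()}


if __name__ == "__main__":
    for k in (2, CRITICAL_STEP, CRITICAL_STEP + 1):
        print(k, {name: float(p) for name, p in posterior(k).items()})
```

In this toy setup the plain simple hypothesis is eliminated after two observations, the delayed one holds almost all the posterior mass right up to the critical step, and only the observation at that step removes it.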