I think this post was and remains important and spot-on. Especially this part, which is proving more clearly true (but still contested):
It does not matter that those organizations have "AI safety" teams, if their AI safety teams do not have the power to take the one action that has been the obviously correct one this whole time: Shut down progress on capabilities. If their safety teams have not done this so far when it is the one thing that needs done, there is no reason to think they'll have the chance to take whatever would be the second-best or third-best actions either.
By "gag order" do you mean just as a matter of private agreement, or something heavier-handed, with e.g. potential criminal consequences?
I have trouble understanding the absolute silence we're seeing. There seem to be very few leaks, and all of them are very mild-mannered and are failing to build any consensus narrative that challenges OA's press in the public sphere.
Are people not able to share info over Signal or otherwise tolerate some risk here? It doesn't add up to me if the risk is just some chance of OA trying to then sue you to bankruptcy, ...
Awesome, thanks for writing this up!
I very much like how you are giving a clear account of a mechanism like "negative reinforcement suppresses text by adding contextual information to the model, and this has more consequences than just suppressing text".
(In particular, the model isn't learning "just don't say that", it's learning "these are the things to avoid saying", which can make it easier to point at the whole cluster?)
(Edit: others have made this point already, but anyhow)
My main objection to this angle: self-improvements do not necessarily look like "design a successor AI to be in charge". They can look more like "acquire better world models", "spin up more copies", "build better processors", "train lots of narrow AI to act as fingers", etc.
I don't expect an AI mind to have trouble finding lots of pathways like these (that tractably improve abilities without risking a misalignment catastrophe) that take it well above human level, given the chance.
Is the following an accurate summary?
The agent is built to have a "utility function" input that the humans can change over time, plus a probability distribution over what the humans will ask for at different time steps, and it maximizes according to a combination of the utility functions it anticipates across time steps?
If that's correct, here are some places this conflicts with my intuition about how things should be done:
I feel awkward about the randomness being treated as essential. I'd rather be able to do something other than randomness in order to get my mild optimization, and something feels unstable/non-compositional about needing randomness in place for your evaluations... (Not that I have an alternative that springs to mind!)
I also feel like "worst case" is perhaps problematic, since it's bringing maximization in, and you're then needing to rely on your convex s...
Can I check that I follow how you recover quantilization?
Are you evaluating distributions over actions, and caring about the worst-case expectation of that distribution?
If so, proposing a particular action is evaluated badly? (Since there's a utility function in your set that spikes downward at that action.)
But proposing a range of actions to randomize amongst can be assessed to have decent worst-case expected utility, since particular downward spikes get smoothed over, and you can rely on your knowledge of "in-distribution" behaviour?
Edited to add: ...
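To make my reading concrete, here's a toy numerical sketch (my own construction, not from the post), assuming an adversarial set where each utility function agrees with a base utility except for one sharp downward spike at a single action:

```python
# Toy illustration (my construction, not from the post). Each utility in the
# adversarial set agrees with a base utility except for one sharp downward
# spike; we compare worst-case expected utility of a point distribution vs. a
# top-10% "quantilizer" distribution.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 100
base_utility = np.sort(rng.normal(size=n_actions))[::-1]  # action 0 is "best"

def spiked_utility(spike_at):
    u = base_utility.copy()
    u[spike_at] = -100.0  # one adversarial downward spike
    return u

utility_set = [spiked_utility(i) for i in range(n_actions)]

def worst_case_expected_utility(action_distribution):
    return min(float(action_distribution @ u) for u in utility_set)

point = np.zeros(n_actions)
point[0] = 1.0  # deterministically take the "best" action
quantilizer = np.zeros(n_actions)
quantilizer[:10] = 0.1  # uniform over the top 10% of actions

print(worst_case_expected_utility(point))        # ~ -100: eats the spike fully
print(worst_case_expected_utility(quantilizer))  # much better: each spike is diluted 10x
```

So a single proposed action gets dragged down by "its" spike, while the spread-out distribution only exposes a tenth of its mass to any one spike, which is how I'm reading the worst-case evaluation as favoring quantilizer-like proposals.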
I want to say that I agree the transformer circuits work is great, and that I like it, and am glad I had the opportunity to read it! I still expect it was pretty harmful to publish.
Nerdsniping goes both ways: you also inspire things like the Hyena work trying to improve architectures based on components of what transformers can do.
I think indiscriminate hype, and trying to do work that will be broadly attention-grabbing, fall on the wrong side, likely doing net harm. Because capabilities improvements seem empirically easier than understanding them, and ther...
I'm perhaps misusing "publish" here, to refer to "putting stuff on the internet", "raising awareness of the work through company Twitter", etc.
I mostly meant to say that, as I see it, too many things that shouldn't be published are being published, and the net effect looks plausibly terrible with little upside (though not much has happened yet in either direction).
The transformer circuits work strikes me this way, so does a bunch of others.
Also, I'm grateful to know your read! I'm broadly interested to hear this and other raw viewpoints, to get a sense of how things look to other people.
I mostly do just mean "keeping it within a single research group" in the absence of better ideas. And I don't have a better answer, especially not for independent folk or small orgs.
I wonder if we need an arxiv or LessWrong clone where you whitelist who you want to discuss your work with. And some scheme for helping independents find each other, or find existing groups they trust. Maybe with some "I won't use this for capabilities work without the permission of the authors" legal docs as well.
This isn't something I can visualize working, but maybe it has components of an answer.
To throw in my two cents, I think it's clear that whole classes of "mechanistic interpretability" work are about better understanding architectures in ways that, if the research is successful, make it easier to improve their capabilities.
And I think this points strongly against publishing this stuff, especially if the goal is to "make this whole field more prestigious real quick". Insofar as the prestige is coming from folks who work on AI capabilities, that's drinking from a poisoned well (since they'll grant the most prestige to the work that helps them ...
No on Q4? I think Alex's counterexample applies to Q4 as well.
(EDIT: Scott points out I'm wrong here, Alex's counterexample doesn't apply, and mine violates A5.)
In particular I think A4 and A5 don't imply anything about the rate of change as we move between lotteries, so we can have movements too sharp to be concave. We only have quasi-concavity.
My version of the counterexample: you have two outcomes and , we prefer anything with equally, and we otherwise prefer higher .
If you give me a corresponding , it must satisfy ...
[Edit: yeah nevermind I have the inequality backwards]
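As a generic illustration of quasi-concavity without concavity (my own example, separate from the counterexample above): any monotone function of the mixture weight is quasi-concave, but it needn't be concave.

```latex
\[
  u(p) = p^2 \text{ on } [0,1]:\qquad
  u\!\left(tp+(1-t)q\right) \ge \min\{u(p),u(q)\} \text{ for all } t\in[0,1],
  \quad\text{yet}\quad
  u\!\left(\tfrac{1}{2}\right) = \tfrac{1}{4} < \tfrac{1}{2} = \tfrac{u(0)+u(1)}{2}.
\]
```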
A5 seems too strong?
Consider lotteries and , and a mixture in between. Applying A5 twice gives:
So if and then ?
Either I'm confused or A5 is a stricter condition than concavity.
Something I'm now realizing, having written all these down: the core mechanism really does echo Löb's theorem! Gah, maybe these are more like Löb than I thought.
(My whole hope was to generalize to things that Löb's theorem doesn't! And maybe these ideas still do, but my story for why has broken, and I'm now confused.)
As something to ponder on, let me show you how we can prove Löb's theorem following the method of ideas #3 and #5:
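For reference, here is the standard fixed-point derivation of Löb's theorem (the textbook route, not necessarily the same one as ideas #3 and #5): assume $\vdash \Box P \to P$ and take $\psi$ with $\vdash \psi \leftrightarrow (\Box\psi \to P)$.

```latex
\begin{align*}
&\vdash \psi \to (\Box\psi \to P) && \text{fixed point, left to right}\\
&\vdash \Box\psi \to \Box(\Box\psi \to P) && \text{necessitation and distribution}\\
&\vdash \Box\psi \to (\Box\Box\psi \to \Box P) && \text{distribution}\\
&\vdash \Box\psi \to \Box\Box\psi && \text{internal necessitation}\\
&\vdash \Box\psi \to \Box P && \text{chaining the above}\\
&\vdash \Box\psi \to P && \text{assumption } \Box P \to P\\
&\vdash \psi && \text{fixed point, right to left}\\
&\vdash \Box\psi && \text{necessitation}\\
&\vdash P && \text{modus ponens}
\end{align*}
```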
My troll example is a fully connected network with all zero weights and biases, no skip connections.
This isn't something that you'd reach in regular training, since networks are initialized away from zero to avoid this. But it does exhibit a basic ingredient in controlling the gradient flow.
To look for a true hacker I'd try to reconfigure the way the downstream computation works (by modifying attention weights, saturating relus, or similar) based on some model of the inputs, in a way that pushes around where the gradients go.
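To make the troll example concrete, here's a minimal sketch (assuming PyTorch; the tiny two-layer MLP is just a stand-in): with every weight and bias zeroed and no skip connections, the loss is nonzero but nothing upstream of the final bias receives any gradient.

```python
# Minimal sketch of the troll example (assuming PyTorch). With all weights and
# biases zero and no skip connections, the loss is nonzero but no gradient
# flows back past the zeroed layers.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
with torch.no_grad():
    for p in net.parameters():
        p.zero_()

x = torch.randn(16, 4)
target = torch.randn(16, 1)
loss = nn.functional.mse_loss(net(x), target)
loss.backward()

for name, p in net.named_parameters():
    print(name, "grad norm:", p.grad.norm().item())
# Only the final bias gets a nonzero gradient; the hidden layer's weights and
# bias, and the output weights, all sit at an exactly-zero gradient.
```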
It looks like you're investigating an angle that I can't follow, but here's my two cents re bounded agents:
My main idea to port this to the bounded setting is to have a bot that searches for increasingly long proofs, knowing that if it takes longer to find a proof then it is itself a bit harder to reason about.
We can instantiate this like:
The idea is that if there is a short way to prove that the opponent would cooperate back, then it takes just a constant number of steps more to prove that we cooperate. So it doesn't open us up to...
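Here's a minimal sketch of the sort of instantiation I have in mind (my own rendering: provable(statement, max_len) stands in for an assumed bounded-provability oracle, and Cooperates(...) for an assumed encoding of the claim; neither is a real API):

```python
# Sketch only, under the assumptions above.

def provable(statement, max_len):
    # Assumed oracle: "our theory proves `statement` with a proof of length <= max_len".
    raise NotImplementedError("bounded-provability oracle, not implemented here")

K = 100  # assumed constant overhead: extra proof length to get from
         # "opponent cooperates back" to "therefore this bot cooperates"

def bounded_bot(opponent_source, max_search_length):
    # Search for increasingly long proofs that the opponent cooperates back.
    # If such a proof exists at length n, our own cooperation should be provable
    # in about n + K steps, so searching longer doesn't open us up to exploitation.
    for n in range(1, max_search_length + 1):
        if provable(f"Cooperates({opponent_source!r}, against=bounded_bot)", max_len=n):
            return "C"
    return "D"
```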
A point of confusion: is it enough to prove that ? What about ? I'm not sure I can say this well, but here goes:
We might not be able to prove in the theory, or even that (which would mean "there are no proofs of inconsistency"). But if we believe our theory is sound, we believe that it can't show that a copy of itself proves something false.
So tells us that if is true, the theory would show that a copy of itself proves is true. And this is enough to convince us that we can't simultaneously have true and false.
Also, here's a proof that a bot is never exploited. It only cooperates when its partner provably cooperates.
First, note that , i.e. if cooperates it provably cooperates. (Proof sketch: .)
Now we show that (i.e. if chooses to cooperate, its partner is provably cooperating):
(PS: we can strengthen this to , by noticing that .)
If I follow what you mean, we can derive:
So there's a Löbian proof, in which the provability is self-fulfilling. But this isn't sufficient to avoid this kind of proof.
(Aside on why I don't like the Löbian method: I moreso want the agents to be doing "correct" counterfactual reasoning about how their actions affect their opponent, and to cooperate because they see that mutual cooperation is possible and then choose it. The Löbian proof style isn't a good model of that, imo.)
In case this helps folks' intuition, my generating idea was something like: "look at what my opponent is thinking, but assume that whenever they check if I cooperate, they think the answer is yes". This is how we break the loop.
This results in a proof of the lemma like thus:
(EDIT: This is basically the same proof as in the post, but less simple. Maybe the interesting part is the "unroll once...
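For reference, the proof from the post as I remember it (assuming the lemma in question is: if $\vdash \Box(\Box x \to x) \to x$ then $\vdash x$):

```latex
\begin{align*}
&\vdash x \to (\Box x \to x) && \text{tautology}\\
&\vdash \Box x \to \Box(\Box x \to x) && \text{necessitation and distribution}\\
&\vdash \Box(\Box x \to x) \to x && \text{assumption}\\
&\vdash \Box x \to x && \text{chaining the previous two}\\
&\vdash \Box(\Box x \to x) && \text{necessitation}\\
&\vdash x && \text{modus ponens with the assumption}
\end{align*}
```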
Eliezer's post here is doing work left undone by the writing you cite. It is a much clearer account of how our mainline looks doomed than you'd see elsewhere, and it's frank on this point.
I think Eliezer wishes that these sorts of artifacts, like this post and "There is no fire alarm", were not just things that he wrote.
Also, re your excerpts for (14), (15), and (32), I see Eliezer as saying something meaningfully different in each case. I might elaborate under this comment.
Re (14), I guess the ideas are very similar, where the mesaoptimizer scenario is like a sharp example of the more general concept Eliezer points at, that different classes of difficulties may appear at different capability levels.
Re (15), "Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously", which is about how we may have reasons to expect aligned output that are brittle under rapid capability gain: your quote from Richard is just about "fast capability gain seems possible and likely", and isn't a...
maybe a reasonable path forward is to try to wring as much productivity as we can out of the passive, superhuman, quasi-oracular just-dumb-data-predictors. And avoid as much as we can ever creating closed-loop, open-ended, free-rein agents.
I should say that I do see this as a reasonable path forward! But we don't seem to be coordinating to do this, and AI researchers seem to love doing work on open-ended agents, which sucks.
Hm, regardless it doesn't really move the needle, so long as people are publishing all of their work. Developing overpowered patter...
Can you visualize an agent that is not "open-ended" in the relevant ways, but is capable of, say, building nanotech and melting all the GPUs?
In my picture most of the extra sauce you'd need on top of GPT-3 looks very agenty. It seems tricky to name "virtual worlds" in which AIs manipulate just "virtual resources" and still manage to do something like melting the GPUs.
I agree with your point that blobs of bayes net nodes aren't very legible, but I still think neural nets are relevantly a lot less interpretable than that! I think basically all structure that limits how your AI does its thinking is helpful for alignment, and that neural nets are pessimal on this axis.
In particular, an AI system based on a big bayes net can generate its outputs in a fairly constrained and structured way, using some sort of inference algorithm that tries to synthesize all the local constraints. A neural net lacks this structure, and is ther...
A longer reply on the points about heuristic mesaobjectives and the switch:
I will first note here that I'm not a huge fan of the concepts/story from the mesaoptimizers paper as a way of factoring reality. I struggle to map the concepts onto my own model of what's going to happen as we fumble toward AGI.
But putting that aside, and noting that my language is imprecise and confused, here is how I think about the "switch" from directly to deceptively pursuing your training objective:
Two quick things to say:
(1) I think the traditional story is more that your agent pursues mostly-X while it's dumb, but then gradient descent summons something intelligent with some weird pseudo-goal Y, because this can be selected for when you reward the agent for looking like it pursues X.
(2) I'm mainly arguing that your post isn't correctly examining the effect of a speed prior. Though I also think that one or both of us are confused about what a mesaoptimizer found by gradient-descent would actually look like, which matters lots for what theoretical models apply in reality.
I think a contentious assumption you're making with this model is the value-neutral core of mesaoptimizer cognition, namely your mesaoptimize in the pseudocode. I think that our whole problem in practice is roughly that we don't know how to gradient-descend our way toward general cognitive primitives that have goals factored out.
A different way to point at my perceived issue: the mesaoptimizers are built out of a mesaoptimize primitive, which is itself a mesaoptimizer that has to be learnt. This seems to me to be not well-founded, and not actually factoring a mesaoptimizer into smaller parts.
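To make the circularity concrete, here's a hypothetical rendering of the kind of factoring I'm objecting to (all names here are mine, not from your post):

```python
# Hypothetical rendering (my names, not from the post) of the factoring at issue:
# a learned agent written as "goal parameter + value-neutral optimization core".

ACTIONS = ["a", "b", "c"]  # placeholder action space

def mesaoptimize(score, search_space):
    # The supposedly value-neutral core: generic search for a high-scoring element.
    # But this is itself a general-purpose optimizer, i.e. the very thing that
    # gradient descent was supposed to have found; the factoring isn't well-founded.
    return max(search_space, key=score)

def mesaoptimizer(objective, world_model, observation):
    # The learned agent, expressed as the goal plus the shared optimization core.
    return mesaoptimize(
        score=lambda action: objective(world_model(observation, action)),
        search_space=ACTIONS,
    )
```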
If the field of ML shifts towards having a better understanding of models ...
I think this would be a negative outcome, and not a positive one.
Specifically, I think it means faster capabilities progress, since ML folks might run better experiments. Or worse yet, they might better identify and remove bottlenecks on model performance.
It wasn't meant as a reply to a particular thing - mainly I'm flagging this as an AI-risk analogy I like.
On that theme, one thing "we don't know if the nukes will ignite the atmosphere" has in common with AI-risk is that the risk is from reaching new configurations (e.g. temperatures of the sort you get out of a nuclear bomb inside the Earth's atmosphere) that we don't have experience with. Which is an entirely different question than "what happens with the nukes after we don't ignite the atmosphere in a test explosion".
I like thinking about coordination from this viewpoint.
There is a nuclear analog for accident risk. A quote from Richard Hamming:
Shortly before the first field test (you realize that no small scale experiment can be done—either you have a critical mass or you do not), a man asked me to check some arithmetic he had done, and I agreed, thinking to fob it off on some subordinate. When I asked what it was, he said, "It is the probability that the test bomb will ignite the whole atmosphere." I decided I would check it myself! The next day when he came for the answers I remarked to him, "The arithmeti...
First problem with this argument: there are no coherence theorems saying that an agent needs to maintain the same utility function over time.
This seems pretty false to me. If you can predict in advance that some future you will be optimizing for something else, you could trade with future "you" and merge utility functions, which seems strictly better than not. (Side note: I'm pretty annoyed with all the use of "there's no coherence theorem for X" in this post.)
As a separate note, the "further out" your goal is and th...
Q7 (Python):
Y = lambda s: eval(s)(s)
Y('lambda s: print(f"Y = lambda s: eval(s)(s)\\nY({s!r})")')
Q8 (Python):
Not sure about the interpretation of this one. Here's a way to have it work for any fixed (python function) f:
f = 'lambda s: "\\n".join(s.splitlines()[::-1])'
go = 'lambda s: print(eval(f)(eval(s)(s)))'
eval(go)('lambda src: f"f = {f!r}\\ngo = {go!r}\\neval(go)({src!r})"')
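If I've traced the evaluation correctly, running the Q8 snippet prints its own source with f applied, i.e. its three lines in reverse order:

```
eval(go)('lambda src: f"f = {f!r}\\ngo = {go!r}\\neval(go)({src!r})"')
go = 'lambda s: print(eval(f)(eval(s)(s)))'
f = 'lambda s: "\\n".join(s.splitlines()[::-1])'
```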
I continue to think there's something important in here!
I haven't had much success articulating why. I think it's neat that the loop-breaking/choosing can be internalized, and doesn't need to pass through Löb. And it informs my sense of how to distinguish real-world high-integrity vs low-integrity situations.