Yeah, I think the crux is precisely this, in which I disagree with this statement below, mostly because I think instruction following/corrigibility is both plausibly easy in my view, and also removes most of the need for value alignment.
"The AI's values must be much more aligned in order to be safe outside the text domain"
I think where I get off the train personally probably comes down to the instrumental goals leading to misaligned goals section, combined with me being more skeptical of instrumental goals leading to unbounded power-seeking.
I agree there are definitely zero-sum parts of the science loop, but my worldview is that the parts where the goals are zero sum/competitive receive less weight than the alignment attempts.
I'd say the biggest area of how I'm skeptical so far is that I think there's a real difference between the useful idea that power is useful for the sc...
Oh, now I understand.
And AIs have already been superhuman at chess for very long, yet that domain gives very little incentive for very strong instrumental convergence.
I am claiming that for practical AIs, the results of training them in the real world with goals will give them instrumental convergence, but without further incentives, will not give them so much instrumental convergence that it leads to power-seeking to disempower humans by default.
To answer the question:
So, as a speculative example, further along in the direction of o1 you could have something like MCTS help train these things to solve very difficult math problems, with the sparse feedback being given for complete formal proofs.
Similarly, playing text-based video games, with the sparse feedback given for winning.
Similarly, training CoT to reason about code, with sparse feedback given for predictions of the code output.
Etc.
You think these sorts of things just won't work well enough to be relevant?
Assuming the goals ar...
I don't get why you think this is true? EG, it seems like almost no insights about how to train faithful CoT would transfer to systems speaking pure neuralese. It seems to me that what little safety/alignment we have for LLMs is largely a byproduct of how language-focused they are (which gives us a weak sort of interpretability, a very useful safety resource which we are at risk of losing soon).
I think the crux is I think that the important parts of of LLMs re safety isn't their safety properties specifically, but rather the evidence they give to what a...
The good news I'll share is that some of the most important insights about the safety/alignment work done on LLMs do transfer over pretty well to a lot of plausible AGI architectures, so while there's a little safety loss each time you go from 1 to 4, a lot of the theoretical ways to achieve alignment of these new systems remain intact, though the danger here is that the implementation difficulty pushes the safety tax too high, which is a pretty real concern.
Specifically, the insights I'm talking about are the controllability of AI with data, combined with their feedback on RL being way denser than human RL from evolution, meaning that instrumental convergence is affected significantly.
I agree with the claim that deception could arise without deceptive alignment, and mostly agree with the post, but I do still think it's very important to recognize if/when deceptive alignment fails to work, it changes a lot of the conversation around alignment.
The authors write “Some people point to the effectiveness of jailbreaks as an argument that AIs are difficult to control. We don’t think this argument makes sense at all, because jailbreaks are themselves an AI control method.” I don’t really understand this point.
The point is that it requires a human to execute the jailbreak, the AI is not the jailbreaker, and the examples show that humans can still retain control of the model.
The AI is not jailbreaking itself, here.
This link explains it better than I can, here:
https://www.aisnakeoil.com/p/model-alignment-protects-against
I think a lot of this probably comes back to way overestimating the complexity of human values. I think a very deeply held belief of a lot of LWers is that human values are intractably complicated and gene/societal-specific, and I think if this was the case, the argument would actually be a little concerning, as we'd have to rely on massive speed biases to punish deception.
These posts gave me good intuition for why human value is likely to be quite simple, one of them talks about how most of the complexity of the values is inaccessible to the genome, thus ...
(2) [For people without the security mindset:] Well, probably you just missed this one thing with circular groups; hotfix that, and then there will be no more vulnerabilities.
i actually do expect this to happen, and importantly I think this result is basically of academic interest, primarily because it is probably known why this adversarial attack can have at all, and it's the large scale cycles of a game board. This is almost certainly going to be solved, due to new training, so I find it a curiosity at best.
I strongly downvoted with this post, primarily because contra you, I do actually think reframing/reinventing is valuable, and IMO I think that the case for reframing/reinventing things is strawmanned here.
There is one valuable part of this post, and that interpretability doesn't have good result-incentives. I agree with this criticism, but given the other points of the post, I would strongly downvote it.
I disagree with this post for 1 reason:
On Amdahl's law, John Wentworth's post on the long tail is very relevant here, as it limits the use of cyborgism here:
https://www.lesswrong.com/posts/Nbcs5Fe2cxQuzje4K/value-of-the-long-tail
I think that the big claim the post relies on is that values are a natural abstraction, and the Natural Abstractions Hypothesis holds. Now this is admittedly very different from the thesis that value is complex and fragile.
It is not that AI would naturally learn human values, but that it's relatively easy for us to point at human values/Do What I Mean/Corrigibility, and that they are natural abstractions.
This is not a claim that is satisfied by default, but is a claim that would be relatively easy to satisfy if true.
The robust values hypothesis from Dragon...
In the human case, it's that capabilities differences are very bounded, rather than alignment successes. If we had capabilities differentials as wide as 1 order of magnitude, then I think our attempted alignment solutions would fail miserably, leading to mass death or worse.
That's the problem with AI: Multiple orders of magnitude differences in capabilities are pretty likely, and all real alignment technologies fail hard once we get anywhere near say 3x differences, let alone 10x differentials.
Further, it’s helped to build out a toolkit of techniques to rigorously reverse engineer models. In the process of understanding this circuit, they refined the technique of activation patching into more sophisticated approaches such as path patching (and later causal scrubbing). And this has helped lay the foundations for developing future techniques! There are many interpretability techniques that are more scalable but less mechanistic, like probing. Having some
See a Twitter thread of some brief explorations I and Alex Silverstein did on this
I think you cut yourself off there both times.
You can make the "some subnetwork just models its training process and cares about getting low loss, and then gets promoted" argument against literally any loss function, even some hypothetical "perfect" one (which, TBC, I think is a mistaken way of thinking). If I buy this argument, it seems like a whole lot of alignment dreams immediately burst into flame. No loss function would be safe. This conclusion, of course, does not decrease in the slightest the credibility of the argument. But I don't perceive you to believe this implication.
This might be the cleanest explanation for why alignment is so hard by default. Loss functions do not work, and reward functions don't work well.
EtA: I am still more concerned about "not enough samples to learn human preferences" than ELK or inner optimization type failures. This seems to be a fairly unpopular view, and I haven't scrutinized it too much (but would be interested to discuss it cooperatively).
This is a crux for me, as it is why I don't think slow takeoff is good by default. I think deceptive alignment is the default state barring interpretability efforts that are strong enough to actually detect mesa-optimizers or myopia. Yes, Foom is probably not going to happen, but in my view that doesn't change much regarding risk in total.
We assume AI learning timescales vastly outstrip human learning timescales as a way of keeping our definition tractable. So the only way to structure this problem in our framework would be to imagine a human is playing chess against a superintelligent AI — a highly distorted situation compared to the case of two roughly equal opponents.
I think this is probably true in the long term (the classical-quantum/reversible computer transition is very large, and humans can't easily modify brains, unlike a virtual human.) But this may not be true in the short-term.
Current SotA systems are very opaque — we more-or-less can't inspect or intervene on their thoughts — and it isn't clear how we could navigate to AI approaches that are far less opaque, and that can carry forward to AGI. (Though it seems very likely such approaches exist somewhere in the space of AI research approaches.)
Yeah, it does seem like interpreterability is a bottleneck for a lot of alignment proposals, and in particular as long as neutral networks are essentially black boxes, deceptive alignment/inner alignment issues seem almost impossible to address.
First up, I've strongly upvoted it as an example of advancing the alignment frontier, and I think this is plausibly the easiest solution provided it can actually be put into code.
But unfortunately there's a huge wrecking ball into it, and that's deceptive alignment. As we try to solve increasingly complex problems, deceptive alignment becomes the default, and this solution doesn't work. Basically in Evhub's words, mere compliance is the default, and since a treacherous turn when the AI becomes powerful is possible, that this solution alone can't do very w...
WebGPT is approximately "reinforcement learning on the internet".
There are some very minimal safeguards implemented (search via Bing API, but the AI can click on arbitrary links), but I do indeed think "reinforcement learning on the internet" is approximately the worst direction for modern AI to go in terms of immediate risks.
I don't think connecting GPT-3 to the internet is risky at current capability levels, but pushing AI in the direction of just hooking up language models with reinforcement learning to a browser seems like one of the worst directions f...
Correct. I think that doing internal outreach to build an alignment-aware company culture and building relationships with key decision-makers can go a long way. I don't think it's possible to have complete binding power over capabilities projects anyway, since the people who want to run the project could in principle leave and start their own org.
My viewpoint is that the most dangerous risks rely on inner alignment issues, and that is basically because of very bad transparency tools, instrumental convergence issues toward power and deception, and mesa-optimizers essentially ruining what outer alignment you have. If you could figure out a reliable way to detect or make sure that deceptive models could never be reached in your training process, that would relieve a lot of my fears of X-risk from AI.
I actually think Eliezer is underrating civilizational competence once AGI is released via the MNM effe...
First, great news on founding an alignment organization on your own. While I give this work a low chance of making progress, if you succeed the benefits would be vast.
I'll pre-register a prediction. You will fail with 90% probability, but potentially usefully fail. My reasons are as follows:
Inner alignment issues have a good chance of wrecking your plans. Specifically there are issues like instrumental convergence causing deception and power-seeking by default. I notice an implicit assumption where inner alignment is either not a problem or so easy to s
The important part of his argument is in the second paragraph, and I agree because by and large, pretty much everything we know about science and casuality, at least in the beginning for AI is on trusting the scientific papers and experts. Virtually no knowledge is given by experimentation, but instead by trusting the papers, experts and books.