I think I need more practice talking with people in real time (about intellectual topics). (I've gotten much more used to text chat/comments, which I like because it puts less time pressure on me to think and respond quickly, but I feel like I now incur a large cost due to excessively shying away from talking to people, hence the desire for practice.) If anyone wants to have a voice chat with me about a topic that I'm interested in (see my recent post/comment history to get a sense), please contact me via PM.
Another example I want to consider is a captured Communist revolutionary choosing to be tortured to death instead of revealing some secret. (Here "reward hacking" seems analogous to revealing the secret to avoid torture / negative reward.)
My (partial) explanation is that it seems like evolution hard-wired a part of our motivational system in some way, to be kind of like a utility maximizer with a utility function over world states or histories. The "utility function" itself is "learned" somehow (maybe partly via RL through social pressure and other rewards, partly through "reasoning"), but sometimes it gets locked-in or resistant to further change.
At a higher level, I think evolution had a lot of time to tinker, and could have made our neural architecture (e.g. connections) and learning algorithms (e.g. how/which weights are updated in response to training signals) quite complex, in order to avoid types of "reward hacking" that had the highest negative impacts in our EEA. I suspect understanding exactly what evolution did could be helpful, but it would probably be a mishmash of things that only partially solve the problem, which would still leave us confused as to how to move forward with AI design.
Edited to add: Though even when the utility function is explicit, it seems like the benefits of lying about your source code could outweigh the cost of changing your utility function. For example, suppose A and B are bargaining, and A says "you should give me more cake because I get very angry if I don't get cake". Even if this starts off as a lie, it might then be in A's interests to use your mechanism above to self-modify into A' that does get very angry if it doesn't get cake, and which therefore has a better bargaining position (because, under your protocol, it has "proved" that it was A' all along).
To disincentivize such lies, it seems that the merger can't be based on each agent's reported utility function, or even correctly observed current utility function, but instead the two sides have to negotiate some way of finding out each side's real utility function, perhaps based on historical records/retrodictions of how each AI was trained. Another way of looking at this is, a superintelligent AI probably has a pretty good guess of the other AI's real utility function based on its own historical knowledge, simulations, etc., and this makes the lying problem a lot less serious than it otherwise might be.
It seems great that someone is working on this, but I wonder how optimistic you are, and what your reasons are. My general intuition (in part from the kinds of examples you give) is that the form of an agent and/or its goals probably matters quite a bit for how easy it is to merge or to build/join a coalition (and for the cost-benefit of doing so). Once we're able to build agents of different forms, humans' form of agency/goals isn't likely to be optimal for building coalitions (maybe EUMs aren't optimal either, but something non-human will be), and we'll face strong incentives to self-modify (or simplify our goals, etc.) before we're ready. (I guess we see this in companies/countries already, but the problem will get worse with AIs that can explore a larger space of forms of agency/goals.)
Again it's great that someone is trying to solve this, in case there is a solution, but do you have an argument for being optimistic about this?
I've argued previously that EUMs being able to merge easily creates an incentive for other kinds of agents (including humans or human-aligned AIs) to self-modify into EUMs (in order to merge into the winning coalition that takes over the world, or just to defend against other such coalitions), and this seems bad because they're likely to do it before they fully understand what their own utility functions should be.
Can I interpret you as trying to solve this problem, i.e., find ways for non-EUMs to build coalitions that can compete with such merged EUMs?
This answer makes me think you might not be aware of an idea I called secure joint construction (originally from Tim Freeman):
Entity A could prove to entity B that it has source code S by consenting to be replaced by a new entity A' that was constructed by a manufacturing process jointly monitored by A and B. During this process, both A and B observe that A' is constructed to run source code S. After A' is constructed, A shuts down and gives all of its resources to A'.
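To make the protocol concrete, here's a minimal Python sketch of how I picture it; the Agent class, the function names, and treating "monitoring" as a simple assertion are all my own illustrative simplifications, not part of the original idea:

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    source_code: str  # what the agent actually runs; not directly verifiable by others
    resources: int = 0

def jointly_construct(claimed_source: str, monitors: list) -> Agent:
    """Build a fresh agent running claimed_source while every monitor watches."""
    new_agent = Agent(name="A'", source_code=claimed_source)
    for monitor in monitors:
        # Stand-in for each party physically observing the manufacturing process
        # and checking that the code loaded into A' is exactly the claimed source.
        assert new_agent.source_code == claimed_source, f"{monitor.name} observed a mismatch"
    return new_agent

def secure_joint_construction(a: Agent, b: Agent, claimed_source: str) -> Agent:
    """A 'proves' it runs claimed_source by consenting to be replaced by A'."""
    a_prime = jointly_construct(claimed_source, monitors=[a, b])
    a_prime.resources += a.resources  # A gives all of its resources to A'...
    a.resources = 0                   # ...and shuts down.
    return a_prime

# After the protocol, B can trust that the agent it now faces runs "S",
# regardless of what A was actually running before.
A = Agent("A", source_code="S", resources=100)
B = Agent("B", source_code="T", resources=100)
A_prime = secure_joint_construction(A, B, claimed_source="S")
print(A_prime)  # Agent(name="A'", source_code='S', resources=100)
```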
Since I wrote this post, AI generation of hands has gotten a lot better, but the top multimodal models still can't count fingers from an existing image. Gemini 2.5 Pro, Grok 3, and Claude 3.7 Sonnet all say this picture (which actually contains 8 fingers in total) contains 10 fingers, while ChatGPT 4o says it contains 12 fingers!
My objection to this argument is that it not only assumes that Predictoria accepts it is plausibly being simulated by Adversaria, which seems like a pure complexity penalty over the baseline physics it would infer otherwise unless that helps to explain observations,
Let's assume for simplicity that both Predictoria and Adversaria are deterministic and nonbranching universes with the same laws of physics but potentially different starting conditions. Adversaria has colonized its universe and can run a trillion simulations of Predictoria in parallel. Again for simplicity, let's assume that each of these simulations is done as something like a full-scale physical reconstruction of Predictoria, but with hidden nanobots capable of influencing crucial events. Then each of these simulations should carry roughly the same weight in M as the real Predictoria, and does not carry a significant complexity penalty over it. That's because the complexity (length of the shortest program) for the real Predictoria, which consists of its laws of physics (P) and starting conditions (ICs_P) plus a pointer to Predictoria the planet (Ptr_P), is K(P) + K(ICs_P|P) + K(Ptr_P|...). The shortest program for one of the simulations consists of the same laws of physics (P), Adversaria's starting conditions (ICs_A), plus a pointer to the simulation within its universe (Ptr_Sim), with length K(P) + K(ICs_A|P) + K(Ptr_Sim|...). Crucially, this near-equal complexity relies on the idea that the intricate setup of Adversaria (including its simulation technology and intervention capabilities) arises naturally from evolving ICs_A forward using P, rather than needing explicit description.
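Laying the two description lengths side by side (same notation as above; this just restates the near-equality, it doesn't add anything new):

```latex
\begin{align*}
K(\text{real Predictoria})      &\approx K(P) + K(ICs_P \mid P) + K(Ptr_P \mid \dots) \\
K(\text{simulated Predictoria}) &\approx K(P) + K(ICs_A \mid P) + K(Ptr_{Sim} \mid \dots)
\end{align*}
```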
(To address a potential objection, we also need the combined weight (algorithmic probability) of Adversaria-like civilizations to be not much less than the combined weight of Predictoria-like civilizations, which requires assuming that advanced civilizations running such simulations is a convergent outcome. That is, it assumes that once a civilization reaches a Predictoria-like stage of development, it is fairly likely to subsequently become Adversaria-like, developing such simulation technology and wanting to use it in this way. There can be a complexity penalty from some civilizations choosing not to, or being forced not to, go down this path, but that would be more than made up for by the sheer number of simulations each Adversaria-like civilization can produce.)
If you agree with the above, then at any given moment, simulations of Predictoria overwhelm the actual Predictoria as far as their relative weights for making predictions based on M. Predictoria should be predicting constant departures from its baseline physics, perhaps in many different directions due to different simulators, but Predictoria would be highly motivated to reason about the distribution of these vectors of change instead of assuming that they cancel each other out. One important (perhaps novel?) consideration here is that Adversaria and other simulators can stop each simulation after the point of departure/intervention has passed for a while, and reuse the computational resources on a new simulation rebased on the actual Predictoria that has observed no intervention (or rather rebased on an untouched simulation of it), so the combined weight of simulations does not decrease relative to actual Predictoria in M even as time goes on and Predictoria makes more and more observations that do not depart from baseline physics.
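Here's a toy numerical sketch of that last point (entirely my own, with made-up numbers: the number of simulations, the per-step intervention probability, and the equal weights are all assumptions, not anything derived above). With rebasing, the pool of simulations consistent with Predictoria's intervention-free observations never shrinks, so the predicted probability of an imminent intervention doesn't decay over time, whereas without rebasing it would:

```python
# Made-up parameters, purely for illustration.
N_SIMS = 1_000       # simulations Adversaria runs in parallel
P_INTERVENE = 0.1    # per-step chance that a given live simulation intervenes
W_REAL = 1.0         # weight of the real Predictoria in M (units arbitrary)
W_SIM = 1.0          # per-simulation weight (roughly equal complexity, per the argument above)

live_no_rebase = float(N_SIMS)
for t in range(101):
    live_rebase = N_SIMS  # rebasing: diverged simulations are recycled into fresh, consistent ones
    p_rebase = live_rebase * W_SIM * P_INTERVENE / (live_rebase * W_SIM + W_REAL)
    p_no_rebase = live_no_rebase * W_SIM * P_INTERVENE / (live_no_rebase * W_SIM + W_REAL)
    if t % 20 == 0:
        print(f"t={t:3d}  P(intervention next | rebasing)={p_rebase:.3f}  "
              f"P(intervention next | no rebasing)={p_no_rebase:.3f}")
    # Without rebasing, simulations that already intervened have diverged from
    # Predictoria's observations and drop out of the consistent pool.
    live_no_rebase *= (1 - P_INTERVENE)
```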
Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.
When we have controlled unaligned AIs doing economically useful work, they probably won't be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs. Such AIs can help solve some parts of the alignment problem, the parts that are easy to verify, but alignment as a whole will still be bottlenecked on the philosophically confusing, hard to verify parts.
Such AIs will probably be used to solve control problems for more powerful AIs, so the basic situation will continue and just become more fragile, with humans trying to control increasingly intelligent unaligned AIs. This seems unlikely to turn out well. They may also persuade some of us to trust their alignment work, even though we really shouldn't.
So to go down this road is to bet that alignment has no philosophically confusing or hard to verify parts. I see some people saying this explicitly in the comments here, but why do they think that? How do they know? (I'm afraid that some people just don't feel philosophically confused about much of anything, and will push forward on that basis.) But you do seem to worry about philosophical problems, which makes me confused about the position you take here.
BTW I have similar objections to working on relatively easy (i.e., unscalable) forms of alignment solutions, and using the resulting aligned AIs to solve alignment for more powerful AIs. But at least there, one might gain some insights into the harder alignment problems from working on the easy problems, potentially producing some useful strategic information or making it easier to verify future proposed alignment solutions. So while I don't think that's a good plan, this plan seems even worse.
As a tangent to my question, I wonder how many AI companies are already using RLAIF and not even aware of it. From a recent WSJ story:
Early last year, Meta Platforms asked the startup to create 27,000 question-and-answer pairs to help train its AI chatbots on Instagram and Facebook.
When Meta researchers received the data, they spotted something odd. Many answers sounded the same, or began with the phrase “as an AI language model…” It turns out the contractors had used ChatGPT to write up their responses—a complete violation of Scale’s raison d’être.
So they detected the cheating that time, but in RLHF how would they know if contractors used AI to select which of two AI responses is more preferred?
BTW here's a poem(?) I wrote for Twitter, actually before coming across the above story:
The people try to align the board. The board tries to align the CEO. The CEO tries to align the managers. The managers try to align the employees. The employees try to align the contractors. The contractors sneak the work off to the AI. The AI tries to align the AI.
My intuition says reward hacking seems harder to solve than this (even in EEA), but I'm pretty unsure. One example is, under your theory, what prevents reward hacking through forming a group and then just directly maxing out on mutually liking/admiring each other?
When applying these ideas to AI, how do you plan to deal with the potential problem of distributional shifts happening faster than we can edit the reward function?