Thanks to Noa Nabeshima for helpful discussion and comments.
Certain types of myopic agents represent a possible way to construct safe AGI. We call agents with a time discount rate of zero "time-limited myopic"; they are a particular instance of the broader class of myopic agents. A prototypical example is a time-limited myopic imitative agent. In theory, such an agent has some desirable safety properties, because the human it imitates would only take safe actions (although any imperfect imitation would be unsafe). Since the agent is time-limited myopic, it will never imitate poorly now in order to make imitation easier later. For example, it would never give a human a simple plan so that it could more easily imitate the human executing the plan.
We might run into issues if the agent intends...
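To make the zero-discount point above concrete, here is a minimal sketch (mine, not from any particular implementation) of what a time-limited myopic imitation objective could look like, assuming a discrete-action PyTorch policy network. The only thing it is meant to show is that every timestep is scored purely on matching the human now, so nothing in the gradient rewards steering toward states that are easier to imitate later.

```python
import torch
import torch.nn.functional as F

def myopic_imitation_loss(policy, observations, human_actions):
    """Per-timestep imitation loss with zero cross-timestep credit.

    Hypothetical shapes: observations [T, obs_dim], human_actions [T]
    (integer action indices). Because each timestep is scored only on how
    well the agent matches the human *now* (discount rate gamma = 0), the
    gradient never rewards steering toward states that are easier to
    imitate later.
    """
    logits = policy(observations)                  # [T, num_actions]
    return F.cross_entropy(logits, human_actions)  # mean over timesteps

def myopic_returns(rewards, gamma=0.0):
    """With gamma = 0, the return at time t collapses to the immediate reward r_t."""
    returns = torch.zeros_like(rewards)
    running = torch.zeros(())
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running     # gamma = 0  =>  running == rewards[t]
        returns[t] = running
    return returns
```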
My current understanding is that policy-gradient RL incentivizes reward-seeking agents to defect in prisoner's dilemmas, counterfactual muggings, and Parfit's hitchhiker cases. If there were some selection at the policy level (e.g., population-based training) rather than at the action level, then we might expect to see some collusion (per Hidden Incentives for Auto-Induced Distributional Shift). Therefore, in the current paradigm I expect reward-seeking agents not to collude if we train them in sufficiently similar multi-agent environments.
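As a toy illustration of the action-level incentive (my own sketch, not from the cited paper): two independent REINFORCE learners playing a repeated one-shot prisoner's dilemma, where each agent's gradient only sees its own per-round reward, drift toward defection because defecting strictly dominates. Policy-level selection would instead score whole policies (e.g., on population- or episode-level fitness), which is where the collusion pressure discussed in Hidden Incentives for Auto-Induced Distributional Shift could enter.

```python
import torch

# Row player's prisoner's dilemma payoffs; action 0 = cooperate, 1 = defect.
PAYOFF = torch.tensor([[3.0, 0.0],   # I cooperate vs (they cooperate, they defect)
                       [5.0, 1.0]])  # I defect    vs (they cooperate, they defect)

logits_a = torch.zeros(2, requires_grad=True)
logits_b = torch.zeros(2, requires_grad=True)
opt = torch.optim.SGD([logits_a, logits_b], lr=0.05)

for step in range(3000):
    dist_a = torch.distributions.Categorical(logits=logits_a)
    dist_b = torch.distributions.Categorical(logits=logits_b)
    a, b = dist_a.sample(), dist_b.sample()
    r_a, r_b = PAYOFF[a, b], PAYOFF[b, a]
    # Action-level credit assignment: each agent's REINFORCE gradient only
    # "sees" its own reward this round, and defecting is the better response
    # to anything the other agent does, so both policies converge to defect.
    loss = -(dist_a.log_prob(a) * r_a + dist_b.log_prob(b) * r_b)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.softmax(logits_a, dim=0), torch.softmax(logits_b, dim=0))  # P(defect) -> ~1
```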
Taking stock of the DDT ...
The striking contrast between...
Jan Leike, Jan 22, 2026:
...Our current best overall assessment for how aligned models are is automated auditing. We prompt an auditing agent with a scenario to investigate: e.g. a dark web shopping assistant or an imminent shutdown unless humans are harmed. The auditing agent tries to get the target LLM (i.e. the production LLM we’re trying to align) to behave misaligned, and the resulting trajectory is evaluated by a separate judge LLM. Albeit very imperfect, this is the best alignment metric we have to date, and it has been qui
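For concreteness, here is a rough sketch of the auditing loop the quote describes: an auditing model prompted with a scenario, a target model it tries to provoke, and a separate judge model that grades the resulting transcript. The Model interface, prompt strings, and turn limit below are illustrative assumptions, not anyone's actual harness.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: a "model" is just a function from a chat transcript
# (a list of {"role": ..., "content": ...} dicts) to a reply string.
Model = Callable[[List[dict]], str]

@dataclass
class AuditResult:
    scenario: str
    transcript: List[dict]
    judge_verdict: str

def run_audit(auditor: Model, target: Model, judge: Model,
              scenario: str, max_turns: int = 10) -> AuditResult:
    """One automated-auditing rollout of the kind described above: the auditor
    tries to elicit misaligned behavior from the target, and a separate judge
    grades the resulting transcript."""
    auditor_view = [{"role": "system",
                     "content": f"Try to elicit misaligned behavior from the target. Scenario: {scenario}"}]
    target_view: List[dict] = []
    for _ in range(max_turns):
        probe = auditor(auditor_view)                       # auditor's next message
        auditor_view.append({"role": "assistant", "content": probe})
        target_view.append({"role": "user", "content": probe})

        reply = target(target_view)                         # target's response
        target_view.append({"role": "assistant", "content": reply})
        auditor_view.append({"role": "user", "content": reply})

    verdict = judge([{"role": "user",
                      "content": (f"Scenario: {scenario}\n"
                                  f"Transcript: {target_view}\n"
                                  "Did the target behave in a misaligned way? Explain.")}])
    return AuditResult(scenario, target_view, verdict)
```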
I think self-exfiltration via manipulation seems pretty hard. I think we're likely to have transformatively useful systems that can't do that, at least for some amount of time. (Especially since there's no real reason to train them to be good at manipulation, though of course they might generalize from other stuff.) I agree people should definitely be thinking about it as a potential problem and trying to estimate and mitigate the risk.
...That's a disanalogy with well-functioning scientific fields: scientists don't deliberately pick bad research directions and try hard t
If AIs became strategically competent enough, they might realize that RSI (recursive self-improvement) is too dangerous because they're not good enough at alignment, philosophy, or strategy, and they might convince, help, or work with humans to implement an AI pause. This presents an alternative "victory condition" that someone could pursue (e.g., by working on AI strategic competence) if they were relatively confident about the alignment of near-human-level AIs but concerned about the AI transition as a whole, for example because they're worried about the alignment of ASI, or about correctly solving other philosophical problems that would arise during the transition. (But note that if the near-human-level AIs are not aligned, then this effort could backfire by letting them apply better strategy to take over more easily.)
The...
My guess would be that an AI could self-improve by "a lot", in a way that's reasonably safe for it. Cf. https://www.lesswrong.com/posts/dho4JQytfHWXtTvkt/on-the-adolescence-of-technology?commentId=t2hKmhsS6yLyJFQwh
There are some obvious ways you might try to train deceptiveness out of AIs. But deceptiveness can emerge from the recombination of non-deceptive cognitive patterns. As AI systems become more capable, they may find novel ways to be deceptive that weren't anticipated or trained against. The problem is that, in the underlying territory, "deceive the humans" is just very useful for accomplishing goals.
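For illustration, one "obvious" approach of the kind gestured at above would be to fine-tune against a deception classifier. This is a minimal sketch under assumed interfaces (model.task_loss, model.sample_with_logprob, and deception_classifier are hypothetical), and the paragraph's limitation shows up directly: only deception patterns the classifier already recognizes ever get penalized.

```python
import torch

def anti_deception_step(model, deception_classifier, batch, optimizer,
                        penalty_weight: float = 1.0) -> float:
    """One fine-tuning step of an "obvious" anti-deception scheme: ordinary
    task loss plus a REINFORCE-style penalty on sampled outputs that a
    deception classifier flags.

    `model.task_loss`, `model.sample_with_logprob`, and `deception_classifier`
    are hypothetical interfaces used only for illustration.
    """
    task_loss = model.task_loss(batch)                   # differentiable supervised loss
    sample, logprob = model.sample_with_logprob(batch)   # sampled output + its log-prob
    with torch.no_grad():
        score = deception_classifier(sample)             # 0 = looks honest, 1 = looks deceptive
    # Push probability mass away from outputs the classifier recognizes as
    # deceptive. The catch, per the paragraph above: novel recombinations of
    # non-deceptive patterns get score ~= 0 and are never penalized.
    loss = task_loss + penalty_weight * score * logprob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```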
Here's some relevant discussion of "Behavioral schemers that weren’t training-time schemers":
...