Use a very large (future) multimodal self-supervised learned (SSL) initialization to give the AI a latent ontology for understanding the real world and important concepts. Combining this initialization with a recurrent state and an action head, train an embodied AI to do real-world robotics using imitation learning on human in-simulation datasets and then sim2real. Since we got a really good pretrained initialization, there's relatively low sample complexity for the imitation learning (IL). The SSL and IL datasets both contain above-average diamond-related content, with some IL trajectories involving humans navigating towards diamonds because the humans want the diamonds.
I don't know much about ML, and I'm a bit confused about this step. How worried should we be about sample efficiency here? It sounds like, after pre-training, you're growing the diamond shard via a real-world embedded RL agent? Naively, this would be pretty uncompetitive on performance compared to agents trained primarily in simulated worlds, unless your algorithm is unusually sample efficient (and if so, why?). If you aren't performance competitive, then I expect your agent to be outcompeted by stronger AI systems whose trainers are less careful about diamond (or ruby, or staple, or whatever) alignment.
OTOH, if your training is primarily simulated, I'd be worried about the difficulty of creating an agent that terminally values real-world (rather than simulated) diamonds.
Minor, but the Dunning-Kruger study neither claims to detect a Mount Stupid effect nor (probably) is powered enough to detect one.
One thing to flag: even if, for any given model, the probability of capabilities generalizing is very low, total doom can still be high, since there may be many tries at getting models that generalize well across distributional shifts, whereas the selection pressure for alignment robustness is comparatively weaker. You can imagine a 2x2 grid of capabilities vs. alignment generalizability across distributional shift:
Capabilities don't generalize, alignment doesn't: irrelevant
Capabilities don't generalize, alignment does: irrelevant
Capabilities generalize, alignment doesn't: potentially very dangerous, especially if power-seeking. The agent (or the agent and friends) acquires more power and may attempt a takeover.
Capabilities generalize, alignment does: good, but not clearly great. By default I wouldn't expect it to be power-seeking (unless you're deliberately creating a sovereign), so it only has as much power as humans allow it to have. So the AI risks being outcompeted by its more nefarious peers.
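To make the "many tries" point above concrete, here is a rough back-of-the-envelope sketch. It assumes each training attempt is an independent draw with the same small chance of producing a model whose capabilities generalize, which is a simplification and not something the argument depends on. The function name and the numbers (a 1% per-model chance, up to 1000 attempts) are hypothetical and chosen only for illustration.

```python
def p_at_least_one(p_single: float, n_tries: int) -> float:
    """Probability that at least one of n independent tries succeeds.

    Assumes every try has the same success probability p_single and that
    tries are independent (an illustrative simplification).
    """
    return 1.0 - (1.0 - p_single) ** n_tries


if __name__ == "__main__":
    # Hypothetical per-model probability that capabilities generalize.
    p = 0.01
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} tries -> P(at least one generalizes) = {p_at_least_one(p, n):.3f}")
```

Under these made-up numbers, a 1% per-model chance becomes roughly a 63% chance that at least one model generalizes after 100 attempts. Nothing comparable pushes alignment to generalize in that same model, which is the asymmetry in selection pressure described above.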