> Subcortical reinforcement circuits, though, hail from a distinct informational world... and so have to reinforce computations "blindly," relying only on simple sensory proxies.
This seems to point in an interesting direction, and I'd like to see it expanded.
> Because your subcortical reward circuitry was hardwired by your genome, it's going to be quite bad at accurately assigning credit to shards.
I don't know; I think of the brain as doing credit assignment pretty well, though we may have quite different definitions of "good" and "bad" here. Is there an example you were thinking of? Cognitive biases in general?
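To make sure we're picturing the same thing, here's a toy sketch (entirely my own construction, not anything from the post) of what "blind," proxy-based credit assignment could look like: the hardwired circuit can't see which computation caused the outcome, so it credits everything that was recently active via an eligibility trace.

```python
import random

SHARDS = ["seek-sugar", "explore", "socialize"]
strengths = {s: 1.0 for s in SHARDS}   # how strongly each shard bids for control
traces = {s: 0.0 for s in SHARDS}      # eligibility: how recently each shard fired
DECAY, LR = 0.5, 0.1

def step():
    # Two shards activate, with probability proportional to their strength.
    active = random.choices(SHARDS, weights=[strengths[s] for s in SHARDS], k=2)
    for s in SHARDS:
        traces[s] = DECAY * traces[s] + (1.0 if s in active else 0.0)
    # The hardwired circuit sees only a sensory proxy (sugar detected),
    # not which computation actually produced it.
    proxy_reward = 1.0 if "seek-sugar" in active else 0.0
    # "Blind" update: every recently active shard gets credit, causal or not.
    for s in SHARDS:
        strengths[s] += LR * proxy_reward * traces[s]

for _ in range(1000):
    step()
print(strengths)  # bystander shards co-active with seek-sugar also grow
```

In this toy, bystander shards that happen to co-fire with the rewarded one get strengthened too; if that kind of miscredit is what you mean by "bad," a concrete human example would still help.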
> if shard theory is true, meaningful partial alignment successes are possible
"if shard theory is true" -- is this a question about human intelligence, deep RL agents, or the relationship between the two? How can the hypothesis be tested?
> Even if the human shards only win a small fraction of the blended utility function, a small fraction of our lightcone is quite a lot
What's to stop the human shards from being dominated and eventually extinguished by the non-human shards? I.e., is there a reason to expect an equilibrium?
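For concreteness (my notation, not the post's): suppose the agent ends up maximizing a fixed blend

$$U = \alpha\, U_{\text{human}} + (1 - \alpha)\, U_{\text{other}}, \qquad 0 < \alpha \ll 1.$$

If utilities scale roughly linearly with resources, the agent spends about an $\alpha$ fraction of the lightcone on human values, which is indeed a lot in absolute terms. But that assumes $\alpha$ is stable; the equilibrium question is whether reflection or self-modification drives $\alpha \to 0$.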
> Two points:
How would you distinguish between weak and strong methods?