How do humans form their values? Shard theory proposes that human values are formed through a relatively straightforward reinforcement process, rather than being hard-coded by evolution. This post lays out the core ideas behind shard theory and explores how it can explain various aspects of human behavior and decision-making.
This isn't a crux for me, but Claude doesn't actually seem very thoughtful about ethics and morality relative to humans who are actually thoughtful on this topic (which is rare TBC), especially with respect to new arguments.
My main hope would be that it picks reasonable humans to defer to. It seems pretty likely it would pick much better humans to defer to than most humans would pick if they had to pick someone or some group to defer to.
On one side of this debate are Yudkowsky & Soares, who think that (if AI progress continues) we’re on a direct path to egregiously-misaligned, scheming, out-of-control, rogue superintelligence (ASI), not even slightly nice, in the absence of yet-to-be-invented breakthrough technical alignment ideas.
On the other side of this debate is almost everyone who works on or studies LLMs. Some of them are very concerned about egregious scheming, others much less so, and as a group they’re equally or more concerned about lots of other potential AI problems—AI-assisted bioterrorism, AI-assisted dictatorships, etc. And if they’re concerned about egregious misalignment and scheming, they’ll often say that it would come about through being in too much of a rush, or careless programmers, or bad actors, etc., as opposed to the simpler...
I think the right starting point is not whether something is an LLM, or deep learning, but rather what are the inputs, outputs, loss functions, etc.? And then go from there to whether we expect slight-niceness or not.
My own opinion (stated without justification) is: you can get niceness through LLM-style “true” imitation learning (Foom & Doom §2.3.2). Alternatively, if the AI is choosing actions through RL and/or model-based search & planning, rather than through imitation learning, then I expect zero-niceness, and instead the ruthless pursuit of t...
Call for alpha testers for an AI control/security tool. A ton of alignment researchers YOLO their Claude usage right now. We run Claude on our computers without real protection (perhaps beyond auto mode) but there isn't an easy way to comply with known best practices. I wrote claude-guard, a wrapper to make best practices easy: just install and then your future claude sessions are protected.
Smart misaligned AI will target alignment researchers in particular for research sabotage, for example by:
I feel very confused and uncertain, so keep your expectations low for the quality of this comment.
How many years will pass before transformative AI is built? Three people who have thought about this question a lot are Ajeya Cotra from Open Philanthropy, Daniel Kokotajlo from OpenAI and Ege Erdil from Epoch. Despite each spending at least hundreds of hours investigating this question, they still disagree substantially about the relevant timescales. For instance, here are their median timelines for one operationalization of transformative AI:
| Forecaster | Median estimate for when 99% of currently fully remote jobs will be automatable |
|---|---|
| Daniel | 4 years |
| Ajeya | 13 years |
| Ege | 40 years |
You can see the strength of their disagreements in the graphs below, where they give very different probability distributions over two questions relating to AGI development (note that these graphs are very rough and are only intended to capture high-level differences, and especially aren't very...
Yep! Thanks.
Note that the scenario I gave wasn't actually a prediction, or at least, it wasn't my median world. I said elsewhere in thread that my median was 2027 for AGI, and implied that my median for ASI was more like 27/28:
...To be clear, my view is that we'll achieve AGI around 2027, ASI within a year of that, and then some sort of crazy robot-powered self-replicating economy within, say, three years of that. So 1000x energy consumption around then or shortly thereafter (depends on the doubling time of the crazy superintelligence-designed-and-managed rob