A general guide for pursuing independent research, from conceptual questions like "how to figure out how to prioritize, learn, and think", to practical questions like "what sort of snacks to should you buy to maximize productivity?"
(Last revised: July 2024. See changelog at the bottom.)
Part of the “Intro to brain-like-AGI safety” post series.
In the previous post, I discussed the “short-term predictor”—a circuit which, thanks to a learning algorithm, emits an output that predicts a ground-truth supervisory signal arriving a short time (e.g. a fraction of a second) later.
In this post, I propose that we can take a short-term predictor, wrap it up into a closed loop involving a bit more circuitry, and we wind up with a new module that I call a “long-term predictor”. Just like it sounds, this circuit can make longer-term predictions, e.g. “I’m likely to eat in the next 10 minutes”. This circuit is closely related to Temporal Difference (TD) learning, as we’ll...
Yes! 🎉 [And lmk, here or by DM, if you think of any rewording / diagrams / whatever that would have helped you get that with less effort :) ]
The following is a lightly edited version of a memo I wrote for a retreat. It was inspired by a draft of Counting arguments provide no evidence for AI doom. I think that my post covers important points not made by the published version of that post.
I'm also thankful for the dozens of interesting conversations and comments at the retreat.
I think that the AI alignment field is partially founded on fundamentally confused ideas. I’m worried about this because, right now, a range of lobbyists and concerned activists and researchers are in Washington making policy asks. Some of these policy proposals seem to be based on erroneous or unsound arguments.[1]
The most important takeaway from this essay is that the (prominent) counting arguments for “deceptively aligned” or “scheming” AI...
This is a deeply confused post.
In this post, Turner sets out to debunk what he perceives as "fundamentally confused ideas" which are common in the AI alignment field. I strongly disagree with his claims.
In section 1, Turner quotes a passage from "Superintelligence", in which Bostrom talks about the problem of wireheading. Turner declares this to be "nonsense" since, according to Turner, RL systems don't seek to maximize a reward.
First, Bostrom (AFAICT) is describing a system which (i) learns online (ii) maximizes long-term consequences. There are good reas...
This project is an extension of work done for Neel Nanda’s MATS 9.0 Training Phase. Neel Nanda and Josh Engels advised the project. Initial work on this project was done with David Vella Zarb. Thank you to Arya Jakkli, Paul Bogdan, and Monte MacDiarmid for providing feedback on the post and ideas.
Recent work from Anthropic also showed inoculation prompting is effective in reducing misalignment generalization of reward hacking during production RL[7]; those results did not investigate test-time performance impacts or learned reward hacking
Maybe I'm missing something, but isn't this sort of figure 28 from the paper?
A Solomonoff inductor walks into a bar in a foreign land. (Stop me if you’ve heard this one before.) The bartender, who is also a Solomonoff inductor, asks “What’ll it be?”. The customer looks around at what the other patrons are having, points to an unfamiliar drink, and says “One of those, please.”. The bartender points to a drawing of the same drink on a menu, and says “One of those?”. The customer replies “Yes, one of those.”. The bartender then delivers a drink, and it matches what the first inductor expected.
What’s up with that?
The puzzle, here, is that the two Solomonoff inductors seemingly agree on a categorization - i.e. which things count as the Unnamed Kind Of Drink, and which things don’t, with at least enough...
This post contains an interesting mathematical result: that the machinery of natural latents can be transferred from classical information theory to algorithmic information theory. I find it intriguing for multiple reasons:
(This argument reduces my hope that we will have AIs that are both aligned with humans in some sense and also highly philosophically competent, which aside from achieving a durable AI pause, has been my main hope for how the future turns out well. As this is a recent realization[1], I'm still pretty uncertain how much I should update based on it, or what its full implications are.)
Being a good alignment researcher seems to require a correct understanding of the nature of values. However metaethics is currently an unsolved problem, with all proposed solutions having flawed or inconclusive arguments, and lots of disagreement among philosophers and alignment researchers, therefore the current meta-correct metaethical position seems to be one of confusion and/or uncertainty. In other words, a good...
This assumes there is a fact of the matter about whether moral realism is true
I am a well known moral realism / moral antirealism antirealist.
But the sort of moral realism that's compatible with antirealism is, as far as I see it, a sort of definitional reification of whatever we think of as morality. You can get information from outside yourself about morality, but it's "boring" stuff like good ethical arguments or transformative life experiences, the same sort of stuff a moral antirealist might be moved by. For the distinction to majorly matter to an AI'...
I’m not 100% convinced of this, but I’m fairly convinced, more and more so over time. I’m hoping to start a vigorous but civilized debate. I invite you to attack my weak points and/or present counter-evidence.
My thesis is that intent-alignment is basically happening, based on evidence from the alignment research in the LLM era.
The classic story about loss of control from AI is that optimization pressure on proxies will cause the AI to value things that humans don’t. (Relatedly, the AI might become a mesa-optimizer with an arbitrary goal).
But the reality that I observe is that the AIs are really nice and somewhat naive. They’re like the world’s smartest 12-year-old (h/t Jenn). We put more and more RL optimization pressure, and keep getting smarter and smarter models; but they...
The "feeling bad about reward hacking" is an artifact of still being regularized too closely to a human-like base model that further RL training would eliminate.