As LLMs become more powerful, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose. However, they have not evaluated whether such techniques still ensure safety when the model itself is intentionally trying to subvert them. This paper develops and evaluates pipelines of safety protocols that are robust to intentional subversion.
If you're looking for ways to help with the whole “the world looks pretty doomed” business, here's my advice: look around for places where we're all being total idiots. Look around for places where something seems incompetently run, or hopelessly inept, and where some part of you thinks you can do better.
Then do it better.
The author argues that it may be possible to significantly enhance adult intelligence through gene editing. They discuss potential delivery methods, editing techniques, and challenges. While acknowledging uncertainties, they believe this could have a major impact on human capabilities and potentially help with AI alignment. They propose starting with cell culture experiments and animal studies.
There are many things that people are socially punished for revealing, so they hide them, which means we systematically underestimate how common they are. And we tend to assume the most extreme versions of those things are representative, when in reality most cases are much less extreme.
An open letter called for “all AI labs to immediately pause for at least 6 months the training of AI more powerful than GPT-4.” This 6-month moratorium would be better than no moratorium. I have respect for everyone who stepped up and signed it.
I refrained from signing because I think the letter is understating the seriousness of the situation and asking for too little to solve it.
Your mind wants to play. Stopping your mind from playing is throwing your mind away. Please do not throw your mind away. Please do not tell other people to throw their mind away. There's a conflict between this and coordinating around reducing existential risk. How do we deal with this conflict?
Ajeya Cotra, Daniel Kokotajlo, and Ege Erdil discuss their differing AI forecasts. Key topics include the importance of transfer learning, AI's potential to accelerate R&D, and the expected trajectory of AI capabilities. They explore concrete scenarios and how observations might update their views.
Ten short guidelines for clear thinking and collaborative truth-seeking, followed by extensive discussion of what exactly they mean and why Duncan thinks they're important default guidelines.
Jenn spent 5000 hours working at non-EA charities, and learned a number of things about working with more mature organizations in more mature ecosystems that may not be obvious to effective altruists.
Tom Davidson analyzes AI takeoff speeds – how quickly AI capabilities might improve as they approach human-level AI. He puts ~25% probability on takeoff lasting less than 1 year, and ~50% on it lasting less than 3 years. But he also argues we should assign some probability to takeoff lasting more than 5 years.
Polygenic screening can increase your child's IQ by 2-8 points, decrease disease risk by up to 60%, and increase height by over 2 inches. Here's a detailed guide on how to maximize the benefits and minimize the costs of embryo selection.
Lawrence, Erik, and Leon attempt to summarize the key claims of John Wentworth's natural abstractions agenda, formalize some of the mathematical proofs, outline how it aims to help with AI alignment, and critique gaps in the theory, relevance to alignment, and research methodology.
Having become frustrated with the state of the discourse about AI catastrophe, Zack Davis writes both sides of the debate, with back-and-forth takes between Simplicia and Doomimir that aim to spell out stronger arguments from both sides.
Evan et al. argue for developing "model organisms of misalignment" - AI systems deliberately designed to exhibit concerning behaviors like deception or reward hacking. This would provide concrete examples to study potential AI safety issues and test mitigation strategies. The authors believe this research is timely and could help build scientific consensus around AI risks to inform policy discussions.
John Wentworth explains natural latents – a key mathematical concept in his approach to natural abstraction. Natural latents capture the "shared information" between different parts of a system in a provably optimal way. This post lays out the formal definitions and key theorems.
Alex Turner and collaborators show that you can modify GPT-2's behavior in surprising and interesting ways by just adding activation vectors to its forward pass. This technique requires no fine-tuning and allows fast, targeted modifications to model behavior.
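A minimal toy sketch of the idea, not the authors' actual code: compute a "steering vector" as the difference between a layer's activations on two contrasting prompts, then add a scaled copy of it into the forward pass. The embedding table, layer function, and token ids below are all made-up stand-ins for a real transformer's residual stream.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Fake embedding table standing in for a transformer layer's weights.
W = rng.normal(size=(100, d_model))

def layer_activations(token_ids, W):
    """Stand-in for one layer's output: embed tokens, average them."""
    return W[token_ids].mean(axis=0)

# Steering vector = activations(positive prompt) - activations(negative prompt).
act_pos = layer_activations([1, 2, 3], W)   # e.g. a "Love" prompt
act_neg = layer_activations([4, 5, 6], W)   # e.g. a "Hate" prompt
steering = act_pos - act_neg

def forward_with_steering(token_ids, W, vec, coeff=3.0):
    """Add the scaled steering vector into the layer's forward pass."""
    return layer_activations(token_ids, W) + coeff * vec

h = forward_with_steering([7, 8], W, steering)
assert h.shape == (d_model,)
```

In the real technique the addition happens inside GPT-2's residual stream (e.g. via a forward hook at a chosen layer), shifting generations toward the positive prompt's concept without any fine-tuning.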
Researchers have discovered a set of "glitch tokens" that cause ChatGPT and other language models to produce bizarre, erratic, and sometimes inappropriate outputs. These tokens seem to break the models in unpredictable ways, leading to hallucinations, evasions, and other strange behaviors when the AI is asked to repeat them.
One winter a grasshopper, starving and frail, approaches a colony of ants drying out their grain in the sun to ask for food, having spent the summer singing and dancing.
Then, various things happen.
Rationality training has been very difficult to develop, in large part because the feedback loops are so long, and noisy. Raemon proposes a paradigm where "invent better feedback loops" is the primary focus, in tandem with an emphasis on deliberate practice.
There are some obvious ways you might try to train deceptiveness out of AIs. But deceptiveness can emerge from the recombination of non-deceptive cognitive patterns. As AI systems become more capable, they may find novel ways to be deceptive that weren't anticipated or trained against. The problem is that, in the underlying territory, "deceive the humans" is just very useful for accomplishing goals.
This post argues that vegan advocates within the EA community have often prioritized promoting veganism over truthseeking – ignoring inconvenient facts, and citing poor sources. The author calls for EA vegan advocates to be more epistemically rigorous and cooperative, and for the broader EA community to defend its norms of truthseeking.
Charbel-Raphaël summarizes Davidad's plan: Use near AGIs to build a detailed world simulation, then train and formally verify an AI that follows coarse preferences and avoids catastrophic outcomes.
1. Don't say false shit omg this one's so basic what are you even doing. And to be perfectly fucking clear "false shit" includes exaggeration for dramatic effect. Exaggeration is just another way for shit to be false.
2. You do NOT (necessarily) know what you fucking saw. What you saw and what you thought about it are two different things. Keep them the fuck straight.
...
Charbel-Raphaël argues that interpretability research has poor theories of impact. It's not good for predicting future AI systems, can't actually audit for deception, lacks a clear end goal, and may be more harmful than helpful. He suggests other technical agendas that could be more impactful for reducing AI risk.
Have you seen a Berkeley Rationalist house and thought "wow the lighting here is nice and it's so comfy" and vaguely wished your house had nice lighting and was comfy in that particular way? Well, this practical / anthropological guide should help.
Joe summarizes his new report on "scheming AIs" - advanced AI systems that fake alignment during training in order to gain power later. He explores different types of scheming (i.e. distinguishing "alignment faking" from "powerseeking"), and asks what the prerequisites for scheming are and by which paths they might arise.
We shouldn't expect to get a lot more worried about AI risk as capabilities increase, if we're thinking about it clearly now. Joe discusses why this happens anyway, and how to avoid it.
Two astronauts investigate an automated planet covered in factories still churning out products, trying to understand what happened to its inhabitants.
Malmesbury explains why sexual dimorphism evolved. Starting with asexual reproduction in single-celled organisms, he traces how the need to avoid genetic hitch-hiking led to sexual reproduction, then the evolution of two distinct sexes, and finally to sexual selection and exaggerated sexual traits. The process was driven by a series of evolutionary traps that were difficult to escape once entered.
The plan of "use AI to help us navigate superintelligence" is not just technically hard, but organizationally hard. If you're building AGI, your company needs a culture focused on high reliability (as opposed to, say, "move fast and break things"). Existing research on "high reliability organizations" suggests this culture requires a lot of time to develop. Raemon argues it needs to be one of the top few priorities for AI company leadership.
The blogpost describes a cognitive strategy of noticing the transitions between your thoughts, rather than the thoughts themselves. By noticing and rewarding helpful transitions, you can improve your thinking process. The author claims this leads to clearer, more efficient and worthwhile thinking, without requiring conscious effort.
Harmful people often lack explicit malicious intent. It’s worth deploying your social or community defenses against them anyway.
When advisors disagree wildly about when the rains will come, the king tries to average their predictions. His advisors explain why this is a terrible idea – he needs to either decide which model is right or plan for both possibilities.
GPTs are being trained to predict text, not imitate humans. This task is actually harder than being human in many ways. You need to be smarter than the text generator to perfectly predict their output, and some text is the result of complex processes (e.g. scientific results, news) that even humans couldn't predict.
GPTs are solving a fundamentally different and often harder problem than just "be human-like". This means we shouldn't expect them to think like humans.
Some AI labs claim to care about AI safety, but continue trying to build AGI anyway. Peter argues they should explicitly state why they think this is the right course of action, given the risks. He suggests they should say something like "We're building AGI because [specific reasons]. If those reasons no longer held, we would stop."
Innovative work requires solitude, and the ability to resist social pressures. Henrik examines how Grothendieck and Bergman approached this, and lists various techniques creative people use to access and maintain this mental state.
Nate Soares argues that there's a deep tension between training an AI to do useful tasks (like alignment research) and training it to avoid dangerous actions. Holden is less convinced of this tension. They discuss a hypothetical training process and analyze potential risks.
There's a supercharged version of the bystander effect where someone claims they'll do a task, but then quietly fails to follow through. This leaves others thinking the task is being handled when it's not. To prevent that, we should try to loudly announce when we're giving up on tasks we've taken on, rather than quietly fading away. And we should appreciate it when others do the same.
We often hear "We don't trade with ants" as an argument against AI cooperating with humans. But we don't trade with ants because we can't communicate with them, not because they're useless – ants could do many useful things for us if we could coordinate. AI will likely be able to communicate with us, and Katja questions whether this analogy holds.
Paul Christiano lays out how he frames various questions of "will AI cause a really bad outcome?", and gives some probabilities.
Elizabeth argues that veganism comes with trade-offs, including potential health issues, that are often downplayed or denied by vegan advocates. She calls for more honesty about these challenges from the vegan community.
Predicting how a ball will roll down a ramp seems like a simple problem, but most people can't get it right on their first try. Analyzing why reveals important lessons that apply to much harder problems like AI alignment.
A comprehensive overview of current technical research agendas in AI alignment and safety (as of 2023). The post categorizes work into understanding existing models, controlling models, using AI to solve alignment, theoretical approaches, and miscellaneous efforts by major labs.
We might soon be creating morally relevant AI systems with real welfare concerns. How can we help ensure good lives for AIs, especially if we don't have that many resources to allocate to it?
Zvi analyzes Michael Lewis' book "Going Infinite" about Sam Bankman-Fried and FTX. He argues the book provides clear evidence of SBF's fraudulent behavior, despite Lewis seeming not to fully realize it. Zvi sees SBF as a cautionary tale about the dangers of pursuing maximalist goals without ethical grounding.
Orpheus16 shares his experience talking with ~60 congressional staffers about AI risk in May - June 2023. He found staffers were surprisingly open-minded about AI risks but often lacked knowledge. His guess is that the Overton window on AI policy is wide, more coordination is needed on specific policy proposals, and there are opportunities for more people to engage productively with policymakers on AI issues if done thoughtfully.
Debates about consciousness often come down to two people talking past each other, without realizing their interlocutor is coming from a fundamentally different set of intuitions. What's up with that?