The Best of LessWrong

When posts turn more than a year old, the LessWrong community reviews and votes on how well they have stood the test of time. These are the posts that have ranked the highest for all years since 2018 (when our annual tradition of choosing the least wrong of LessWrong began).

For the years 2018, 2019 and 2020 we also published physical books with the results of our annual vote, which you can buy and learn more about here.

Rationality

Eliezer Yudkowsky
Local Validity as a Key to Sanity and Civilization
Buck
"Other people are wrong" vs "I am right"
Mark Xu
Strong Evidence is Common
TsviBT
Please don't throw your mind away
Raemon
Noticing Frame Differences
johnswentworth
You Are Not Measuring What You Think You Are Measuring
johnswentworth
Gears-Level Models are Capital Investments
Hazard
How to Ignore Your Emotions (while also thinking you're awesome at emotions)
Scott Garrabrant
Yes Requires the Possibility of No
Ben Pace
A Sketch of Good Communication
Eliezer Yudkowsky
Meta-Honesty: Firming Up Honesty Around Its Edge-Cases
Duncan Sabien (Deactivated)
Lies, Damn Lies, and Fabricated Options
Scott Alexander
Trapped Priors As A Basic Problem Of Rationality
Duncan Sabien (Deactivated)
Split and Commit
Duncan Sabien (Deactivated)
CFAR Participant Handbook now available to all
johnswentworth
What Are You Tracking In Your Head?
Mark Xu
The First Sample Gives the Most Information
Duncan Sabien (Deactivated)
Shoulder Advisors 101
Scott Alexander
Varieties Of Argumentative Experience
Eliezer Yudkowsky
Toolbox-thinking and Law-thinking
alkjash
Babble
Zack_M_Davis
Feature Selection
abramdemski
Mistakes with Conservation of Expected Evidence
Kaj_Sotala
The Felt Sense: What, Why and How
Duncan Sabien (Deactivated)
Cup-Stacking Skills (or, Reflexive Involuntary Mental Motions)
Ben Pace
The Costly Coordination Mechanism of Common Knowledge
Jacob Falkovich
Seeing the Smoke
Duncan Sabien (Deactivated)
Basics of Rationalist Discourse
alkjash
Prune
johnswentworth
Gears vs Behavior
Elizabeth
Epistemic Legibility
Daniel Kokotajlo
Taboo "Outside View"
Duncan Sabien (Deactivated)
Sazen
AnnaSalamon
Reality-Revealing and Reality-Masking Puzzles
Eliezer Yudkowsky
ProjectLawful.com: Eliezer's latest story, past 1M words
Eliezer Yudkowsky
Self-Integrity and the Drowning Child
Jacob Falkovich
The Treacherous Path to Rationality
Scott Garrabrant
Tyranny of the Epistemic Majority
alkjash
More Babble
abramdemski
Most Prisoner's Dilemmas are Stag Hunts; Most Stag Hunts are Schelling Problems
Raemon
Being a Robust Agent
Zack_M_Davis
Heads I Win, Tails?—Never Heard of Her; Or, Selective Reporting and the Tragedy of the Green Rationalists
Benquo
Reason isn't magic
habryka
Integrity and accountability are core parts of rationality
Raemon
The Schelling Choice is "Rabbit", not "Stag"
Diffractor
Threat-Resistant Bargaining Megapost: Introducing the ROSE Value
Raemon
Propagating Facts into Aesthetics
johnswentworth
Simulacrum 3 As Stag-Hunt Strategy
LoganStrohl
Catching the Spark
Jacob Falkovich
Is Rationalist Self-Improvement Real?
Benquo
Excerpts from a larger discussion about simulacra
Zvi
Simulacra Levels and their Interactions
abramdemski
Radical Probabilism
sarahconstantin
Naming the Nameless
AnnaSalamon
Comment reply: my low-quality thoughts on why CFAR didn't get farther with a "real/efficacious art of rationality"
Eric Raymond
Rationalism before the Sequences
Owain_Evans
The Rationalists of the 1950s (and before) also called themselves “Rationalists”
Raemon
Feedbackloop-first Rationality
LoganStrohl
Fucking Goddamn Basics of Rationalist Discourse
Raemon
Tuning your Cognitive Strategies
johnswentworth
Lessons On How To Get Things Right On The First Try

Optimization

So8res
Focus on the places where you feel shocked everyone's dropping the ball
Jameson Quinn
A voting theory primer for rationalists
sarahconstantin
The Pavlov Strategy
Zvi
Prediction Markets: When Do They Work?
johnswentworth
Being the (Pareto) Best in the World
alkjash
Is Success the Enemy of Freedom? (Full)
johnswentworth
Coordination as a Scarce Resource
AnnaSalamon
What should you change in response to an "emergency"? And AI risk
jasoncrawford
How factories were made safe
HoldenKarnofsky
All Possible Views About Humanity's Future Are Wild
jasoncrawford
Why has nuclear power been a flop?
Zvi
Simple Rules of Law
Scott Alexander
The Tails Coming Apart As Metaphor For Life
Zvi
Asymmetric Justice
Jeffrey Ladish
Nuclear war is unlikely to cause human extinction
Elizabeth
Power Buys You Distance From The Crime
Eliezer Yudkowsky
Is Clickbait Destroying Our General Intelligence?
Spiracular
Bioinfohazards
Zvi
Moloch Hasn’t Won
Zvi
Motive Ambiguity
Benquo
Can crimes be discussed literally?
johnswentworth
When Money Is Abundant, Knowledge Is The Real Wealth
GeneSmith
Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible
HoldenKarnofsky
This Can't Go On
Said Achmiz
The Real Rules Have No Exceptions
Lars Doucet
Lars Doucet's Georgism series on Astral Codex Ten
johnswentworth
Working With Monsters
jasoncrawford
Why haven't we celebrated any major achievements lately?
abramdemski
The Credit Assignment Problem
Martin Sustrik
Inadequate Equilibria vs. Governance of the Commons
Scott Alexander
Studies On Slack
KatjaGrace
Discontinuous progress in history: an update
Scott Alexander
Rule Thinkers In, Not Out
Raemon
The Amish, and Strategic Norms around Technology
Zvi
Blackmail
HoldenKarnofsky
Nonprofit Boards are Weird
Wei Dai
Beyond Astronomical Waste
johnswentworth
Making Vaccine
jefftk
Make more land
jenn
Things I Learned by Spending Five Thousand Hours In Non-EA Charities
Richard_Ngo
The ants and the grasshopper
So8res
Enemies vs Malefactors
Elizabeth
Change my mind: Veganism entails trade-offs, and health is one of the axes

World

Kaj_Sotala
Book summary: Unlocking the Emotional Brain
Ben
The Redaction Machine
Samo Burja
On the Loss and Preservation of Knowledge
Alex_Altair
Introduction to abstract entropy
Martin Sustrik
Swiss Political System: More than You ever Wanted to Know (I.)
johnswentworth
Interfaces as a Scarce Resource
eukaryote
There’s no such thing as a tree (phylogenetically)
Scott Alexander
Is Science Slowing Down?
Martin Sustrik
Anti-social Punishment
johnswentworth
Transportation as a Constraint
Martin Sustrik
Research: Rescuers during the Holocaust
GeneSmith
Toni Kurz and the Insanity of Climbing Mountains
johnswentworth
Book Review: Design Principles of Biological Circuits
Elizabeth
Literature Review: Distributed Teams
Valentine
The Intelligent Social Web
eukaryote
Spaghetti Towers
Eli Tyre
Historical mathematicians exhibit a birth order effect too
johnswentworth
What Money Cannot Buy
Bird Concept
Unconscious Economics
Scott Alexander
Book Review: The Secret Of Our Success
johnswentworth
Specializing in Problems We Don't Understand
KatjaGrace
Why did everything take so long?
Ruby
[Answer] Why wasn't science invented in China?
Scott Alexander
Mental Mountains
L Rudolf L
[Fiction] A Disneyland Without Children
johnswentworth
Evolution of Modularity
johnswentworth
Science in a High-Dimensional World
Kaj_Sotala
My attempt to explain Looking, insight meditation, and enlightenment in non-mysterious terms
Kaj_Sotala
Building up to an Internal Family Systems model
Steven Byrnes
My computational framework for the brain
Natália
Counter-theses on Sleep
abramdemski
What makes people intellectually active?
Bucky
Birth order effect found in Nobel Laureates in Physics
zhukeepa
How uniform is the neocortex?
JackH
Anti-Aging: State of the Art
Vaniver
Steelmanning Divination
KatjaGrace
Elephant seal 2
Zvi
Book Review: Going Infinite
Rafael Harth
Why it's so hard to talk about Consciousness
Duncan Sabien (Deactivated)
Social Dark Matter
Elizabeth
EA Vegan Advocacy is not truthseeking, and it’s everyone’s problem
Malmesbury
The Talk: a brief explanation of sexual dimorphism
moridinamael
The Parable of the King and the Random Process
Henrik Karlsson
Cultivating a state of mind where new ideas are born

Practical


AI Strategy

paulfchristiano
Arguments about fast takeoff
Eliezer Yudkowsky
Six Dimensions of Operational Adequacy in AGI Projects
Ajeya Cotra
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
paulfchristiano
What failure looks like
Daniel Kokotajlo
What 2026 looks like
gwern
It Looks Like You're Trying To Take Over The World
Daniel Kokotajlo
Cortés, Pizarro, and Afonso as Precedents for Takeover
Daniel Kokotajlo
The date of AI Takeover is not the day the AI takes over
Andrew_Critch
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)
paulfchristiano
Another (outer) alignment failure story
Ajeya Cotra
Draft report on AI timelines
Eliezer Yudkowsky
Biology-Inspired AGI Timelines: The Trick That Never Works
Daniel Kokotajlo
Fun with +12 OOMs of Compute
Wei Dai
AI Safety "Success Stories"
Eliezer Yudkowsky
Pausing AI Developments Isn't Enough. We Need to Shut it All Down
HoldenKarnofsky
Reply to Eliezer on Biological Anchors
Richard_Ngo
AGI safety from first principles: Introduction
johnswentworth
The Plan
Rohin Shah
Reframing Superintelligence: Comprehensive AI Services as General Intelligence
lc
What an actually pessimistic containment strategy looks like
Eliezer Yudkowsky
MIRI announces new "Death With Dignity" strategy
KatjaGrace
Counterarguments to the basic AI x-risk case
Adam Scholl
Safetywashing
habryka
AI Timelines
evhub
Chris Olah’s views on AGI safety
So8res
Comments on Carlsmith's “Is power-seeking AI an existential risk?”
nostalgebraist
human psycholinguists: a critical appraisal
nostalgebraist
larger language models may disappoint you [or, an eternally unfinished draft]
Orpheus16
Speaking to Congressional staffers about AI risk
Tom Davidson
What a compute-centric framework says about AI takeoff speeds
abramdemski
The Parable of Predict-O-Matic
KatjaGrace
Let’s think about slowing down AI
Daniel Kokotajlo
Against GDP as a metric for timelines and takeoff speeds
Joe Carlsmith
Predictable updating about AI risk
Raemon
"Carefully Bootstrapped Alignment" is organizationally hard
KatjaGrace
We don’t trade with ants

Technical AI Safety

paulfchristiano
Where I agree and disagree with Eliezer
Eliezer Yudkowsky
Ngo and Yudkowsky on alignment difficulty
Andrew_Critch
Some AI research areas and their relevance to existential safety
1a3orn
EfficientZero: How It Works
elspood
Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment
So8res
Decision theory does not imply that we get to have nice things
Vika
Specification gaming examples in AI
Rafael Harth
Inner Alignment: Explain like I'm 12 Edition
evhub
An overview of 11 proposals for building safe advanced AI
TurnTrout
Reward is not the optimization target
johnswentworth
Worlds Where Iterative Design Fails
johnswentworth
Alignment By Default
johnswentworth
How To Go From Interpretability To Alignment: Just Retarget The Search
Alex Flint
Search versus design
abramdemski
Selection vs Control
Buck
AI Control: Improving Safety Despite Intentional Subversion
Eliezer Yudkowsky
The Rocket Alignment Problem
Eliezer Yudkowsky
AGI Ruin: A List of Lethalities
Mark Xu
The Solomonoff Prior is Malign
paulfchristiano
My research methodology
TurnTrout
Reframing Impact
Scott Garrabrant
Robustness to Scale
paulfchristiano
Inaccessible information
TurnTrout
Seeking Power is Often Convergently Instrumental in MDPs
So8res
A central AI alignment problem: capabilities generalization, and the sharp left turn
evhub
Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
paulfchristiano
The strategy-stealing assumption
So8res
On how various plans miss the hard bits of the alignment challenge
abramdemski
Alignment Research Field Guide
johnswentworth
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables
Buck
Language models seem to be much better than humans at next-token prediction
abramdemski
An Untrollable Mathematician Illustrated
abramdemski
An Orthodox Case Against Utility Functions
Veedrac
Optimality is the tiger, and agents are its teeth
Sam Ringer
Models Don't "Get Reward"
Alex Flint
The ground of optimization
johnswentworth
Selection Theorems: A Program For Understanding Agents
Rohin Shah
Coherence arguments do not entail goal-directed behavior
abramdemski
Embedded Agents
evhub
Risks from Learned Optimization: Introduction
nostalgebraist
chinchilla's wild implications
johnswentworth
Why Agent Foundations? An Overly Abstract Explanation
zhukeepa
Paul's research agenda FAQ
Eliezer Yudkowsky
Coherent decisions imply consistent utilities
paulfchristiano
Open question: are minimal circuits daemon-free?
evhub
Gradient hacking
janus
Simulators
LawrenceC
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
TurnTrout
Humans provide an untapped wealth of evidence about alignment
Neel Nanda
A Mechanistic Interpretability Analysis of Grokking
Collin
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
evhub
Understanding “Deep Double Descent”
Quintin Pope
The shard theory of human values
TurnTrout
Inner and outer alignment decompose one hard problem into two extremely hard problems
Eliezer Yudkowsky
Challenges to Christiano’s capability amplification proposal
Scott Garrabrant
Finite Factored Sets
paulfchristiano
ARC's first technical report: Eliciting Latent Knowledge
Diffractor
Introduction To The Infra-Bayesianism Sequence
TurnTrout
Towards a New Impact Measure
LawrenceC
Natural Abstractions: Key claims, Theorems, and Critiques
Zack_M_Davis
Alignment Implications of LLM Successes: a Debate in One Act
johnswentworth
Natural Latents: The Math
TurnTrout
Steering GPT-2-XL by adding an activation vector
Jessica Rumbelow
SolidGoldMagikarp (plus, prompt generation)
So8res
Deep Deceptiveness
Charbel-Raphaël
Davidad's Bold Plan for Alignment: An In-Depth Explanation
Charbel-Raphaël
Against Almost Every Theory of Impact of Interpretability
Joe Carlsmith
New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?"
Eliezer Yudkowsky
GPTs are Predictors, not Imitators
peterbarnett
Labs should be explicit about why they are building AGI
HoldenKarnofsky
Discussion with Nate Soares on a key alignment difficulty
Jesse Hoogland
Neural networks generalize because of this one weird trick
paulfchristiano
My views on “doom”
technicalities
Shallow review of live agendas in alignment & safety
Vanessa Kosoy
The Learning-Theoretic Agenda: Status 2023
ryan_greenblatt
Improving the Welfare of AIs: A Nearcasted Proposal
#1

As LLMs become more powerful, it'll be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. This paper develops and evaluates pipelines of safety protocols that are robust to intentional subversion.

Buck Shlegeris (31)
I think this paper was great. I'm very proud of it. It's a bit hard to separate out this paper from the follow-up post arguing for control, but I'll try to.

This paper led to a bunch of research by us and other people; it helped AI control become (IMO correctly) one of the main strategies discussed for ensuring safety from scheming AIs. It was accepted as an oral at ICML 2024. AI companies and other researchers have since built on this work (Anthropic's "Sabotage Evaluations", Mathew et al's "Hidden in Plain Text"; I collaborated on "Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats"; Redwood has also been working on a big follow-up paper that should be out soon), and AI control has been framed by Grosse, Buhl, Balesni, and Clymer as a bedrock approach for mitigating catastrophic risk from misaligned AI.

My main regret about this paper is that we didn't use defer-to-resample, a technique where you replace suspicious actions with a resample from the untrusted model (as discussed e.g. here). This probably would have been better than the other defer techniques we tried.

I have more regrets about the follow-up post ("The case for ensuring...") than about this post; this post was more straightforward and less ambitious, and so gave us fewer opportunities to stick our necks out making arguments or introducing concepts that we'd later regret. I'm very excited for more follow-up work on this paper, and I'm working on mentoring such projects and sourcing funding for them.
johnswentworth (29)
I think control research has relatively little impact on X-risk in general, and wrote up the case against here. Basic argument: scheming of early transformative AGI is not a very large chunk of doom probability. The real problem is getting early AGI to actually solve the problems of aligning superintelligences, before building those superintelligences. That's a problem for which verification is hard, and solving the problem itself seems pretty hard too, so it's a particularly difficult type of problem to outsource to AI, and a particularly easy type of problem to trick oneself into thinking the AI has solved, when it hasn't.
#2

If you're looking for ways to help with the whole “the world looks pretty doomed” business, here's my advice: look around for places where we're all being total idiots. Look around for places where something seems incompetently run, or hopelessly inept, and where some part of you thinks you can do better.

Then do it better.

#3

The author argues that it may be possible to significantly enhance adult intelligence through gene editing. They discuss potential delivery methods, editing techniques, and challenges. While acknowledging uncertainties, they believe this could have a major impact on human capabilities and potentially help with AI alignment. They propose starting with cell culture experiments and animal studies.

Zac Hatfield-Dodds (12)
I remain both skeptical of some core claims in this post, and convinced of its importance. GeneSmith is one of few people with such a big-picture, fresh, wildly ambitious angle on beneficial biotechnology, and I'd love to see more of this genre.

On the one hand, on the object level, I basically don't buy the argument that in-vivo editing could lead to substantial cognitive enhancement in adults. Brain development is incredibly important for adult cognition, and in the maybe 1%–20% residual you're going well off-distribution for any predictors trained on unedited individuals. I too would prefer bets that pay off before my median AI timelines, but biology does not always allow us to have nice things.

On the other, gene therapy does indeed work in adults for some (much simpler) issues, and there might be valuable interventions which are narrower but still valuable. Plus, of course, there's the nineteen-ish-year pathway to adults, building on current practice. There's no shortage of practical difficulties, but the strong or general objections I've seen seem ill-founded, and that makes me more optimistic about the eventual feasibility of something drawing on this tech tree.

I've been paying closer attention to the space thanks to Gene's posts, to the point of making some related investments, and look forward to watching how these ideas fare on contact with biological and engineering reality over the next few years.
#4

There are many things that people are socially punished for revealing, so they hide them, which means we systematically underestimate how common they are. And we tend to assume the most extreme versions of those things are representative, when in reality most cases are much less extreme. 

#6

An open letter called for “all AI labs to immediately pause for at least 6 months the training of AI more powerful than GPT-4.” This 6-month moratorium would be better than no moratorium. I have respect for everyone who stepped up and signed it. 

I refrained from signing because I think the letter is understating the seriousness of the situation and asking for too little to solve it.

#7

Your mind wants to play. Stopping your mind from playing is throwing your mind away. Please do not throw your mind away. Please do not tell other people to throw their mind away. There's a conflict between this and coordinating around reducing existential risk. How do we deal with this conflict?

Jeremy Gillen (6)
Tsvi has many underrated posts. This one was rated correctly. I didn't previously have a crisp conceptual handle for the category that Tsvi calls Playful Thinking. Initially it seemed a slightly unnatural category. Now it's such a natural category that perhaps it should be called "Thinking", and other kinds should be the ones with a modifier (e.g. maybe Directed Thinking?).

Tsvi gives many theoretical justifications for engaging in Playful Thinking. I want to talk about one because it was only briefly mentioned in the post: for me, engaging in intellectual play is an antidote to political mindkilledness. It's not perfect. It doesn't work for very long. But it does help. When I switch from intellectual play to a politically charged topic, there's a brief period where I'm just... better at thinking about it. Perhaps it increases open-mindedness. But that's not it. It's more like increased ability to run down object-level thoughts without higher-level interference. A very valuable state of mind.

But this isn't why I play. I play because it's fun. And because it's natural? It's in our nature. It's easy to throw this away under pressure, and I've sometimes done so. This post is a good reminder of why I shouldn't.
#9

Ajeya Cotra, Daniel Kokotajlo, and Ege Erdil discuss their differing AI forecasts. Key topics include the importance of transfer learning, AI's potential to accelerate R&D, and the expected trajectory of AI capabilities. They explore concrete scenarios and how observations might update their views.

Ryan Greenblatt (23)
My sense is that this post holds up pretty well. Most of the considerations under discussion still appear live and important, including: in-context learning, robustness, whether jank AI-R&D-accelerating AIs can quickly move to more general and broader systems, and general skepticism of crazy conclusions.

At the time of this dialogue, my timelines were a bit faster than Ajeya's. I've updated toward the views Daniel expresses here and I'm now about halfway between Ajeya's views in this post and Daniel's (in geometric mean). My read is that Daniel looks somewhat too aggressive in his predictions for 2024, though it is a bit unclear exactly what he was expecting. (This concrete scenario seems substantially more bullish than what we've seen in 2024, but not by a huge amount. It's unclear if he was intending these to be mainline predictions or a 25th-percentile bullish scenario.)

AI progress appears substantially faster than the scenario outlined in Ege's median world. In particular:

* On "we have individual AI labs in 10 years that might be doing on the order of e.g. $30B/yr in revenue": OpenAI made $4 billion in revenue in 2024, and based on historical trends AI company revenue goes up about 3x per year, such that the naive trend extrapolation indicates they'd make around $30 billion in 2026. So this seems 3 years out instead of 10.
* On "maybe AI systems can get gold on the IMO in five years": we seem likely to see gold on the IMO this year (a bit less than 2 years later).

It would be interesting to hear how Daniel, Ajeya, and Ege's views have changed since the time this was posted. (I think Daniel has somewhat later timelines (but the update is smaller than the progression of time, such that AGI now seems closer to Daniel's view) and I think Ajeya has somewhat sooner timelines.)

Daniel discusses various ideas for how to do a better version of this dialogue in this comment. My understanding is that Daniel (and others) have run something similar to what he describes
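The naive trend extrapolation in the review can be sketched in a few lines. The inputs (roughly $4B of revenue in 2024, roughly 3x growth per year) are the figures the review itself cites; this is an illustration of the arithmetic, not a forecast, and the function name is invented for the sketch:

```python
def extrapolate_revenue(base_year=2024, base_revenue_b=4.0,
                        growth=3.0, target_year=2026):
    """Project revenue (in $B) under constant multiplicative growth.

    Defaults use the review's cited figures: ~$4B in 2024, ~3x/year.
    """
    years = target_year - base_year
    return base_revenue_b * growth ** years

projection = extrapolate_revenue()
print(f"Naive 2026 projection: ${projection:.0f}B")  # ~$36B, in the ballpark of $30B/yr
```

The point of the sketch is just that two years of 3x growth turns $4B into roughly the $30B/yr figure from Ege's ten-year scenario.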
#10

Ten short guidelines for clear thinking and collaborative truth-seeking, followed by extensive discussion of what exactly they mean and why Duncan thinks they're important default guidelines.

Elizabeth (4)
I wish this had been called "Duncan's Guidelines for Discourse" or something like that. I like most of the guidelines given, but they're not consensus. And while I support Duncan's right to block people from his posts (and agree with him on discourse norms far more than with the people he blocked), it means that people who disagree with him on the rules can't make their case in the comments. That feels like an unbalanced playing field to me.
Screwtape (8)
I think this, or something like this, should be in a place of prominence on LessWrong. The Best Of collection might not be the place, but it's the place I can vote on, so I'd like to vote for it here.

I used "or something like this" above intentionally. The format of this post — an introduction of why these guidelines exist, short one- or two-sentence explanations of each guideline, and then expanded explanations with "ways you might feel when you're about to break the X Guideline" — is excellent. It turns each guideline into a mini-lesson, which can be broken out and referenced independently. The introduction gives context for them all to hang together. The format is A+, fighting for S tier.

Why "something like this" instead of "this, exactly this" then? Each individual guideline is good, but they don't feel like they're the only set. I can imagine swapping basically any of them other than 0 and 1 out for something different and having something I liked just as much. I still look at 5 ("Aim for convergence on truth, and behave as if your interlocutors are also aiming for convergence on truth") and internally wince. I imagine lots of people read it, mostly agreed with it, but wanted to replace or quibble with one or two of the guidelines, and from reading the comments there wasn't a consensus on which line was out of place. That seems like a good sign.

It's interesting to me to contrast it with Elements Of Rationalist Discourse. Elements doesn't resonate as much with me, and while some of that is that Elements is not laid out as cleanly, I also don't agree with the list the same way. And yet, Elements was also upvoted highly. The people yearn for guidelines, and there wasn't a clear favourite.

Someday I might try my own hand at the genre, and I still consider myself to owe an expansion on my issues with 5. I'm voting for this to be in the Best Of LessWrong collection. If there was a process to vote to make this or at least the introduction and Guidelines, In Brief in
#11

Jenn spent 5,000 hours working at non-EA charities, and learned a number of things about working with more mature organizations in more mature ecosystems that may not be obvious to effective altruists.

#12

Tom Davidson analyzes AI takeoff speeds – how quickly AI capabilities might improve as they approach human-level AI. He puts ~25% probability on takeoff lasting less than 1 year, and ~50% on it lasting less than 3 years. But he also argues we should assign some probability to takeoff lasting more than 5 years.

Daniel Kokotajlo (9)
The takeoffspeeds.com model Davidson et al worked on is still (unfortunately) the world's best model of AGI takeoff. I highly encourage people to play around with it, perhaps even to read the research behind it, and I'm glad LessWrong is a place that collects and rewards work like this.
#13

Polygenic screening can increase your child's IQ by 2-8 points, decrease disease risk by up to 60%, and increase height by over 2 inches. Here's a detailed guide on how to maximize the benefits and minimize the costs of embryo selection. 

Max H (13)
My wife completed two cycles of IVF this year, and we had the sequence data from the preimplantation genetic testing on the resulting embryos analyzed for polygenic factors by the unnamed startup mentioned in this post. I can personally confirm that the practical advice in this post is generally excellent.

The basic IVF + testing process is pretty straightforward (if expensive), but navigating the medical bureaucracy can be a hassle once you want to do anything unusual (like using a non-default PGT provider), and many clinics aren't going to help you with anything to do with polygenic screening, even if they are open to it in principle. So knowing exactly what you want and what you need to ask for is key.

Since this post was written, there have been lots of other developments and related posts in this general area:

* Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible
* Superbabies: Putting The Pieces Together
* Gameto Announces World's First Live Birth Using Fertilo Procedure that Matures Eggs Outside the Body
* Overview of strong human intelligence amplification methods: Genomic approaches

And probably many others I am forgetting. But if you're a prospective parent looking for practical advice on how to navigate the IVF process and take advantage of the latest in genetic screening technology, this post is still the best place to start that I know of.

Some of the things in the list above are more speculative, but the technology for selection is basically ready and practical now, and the effect size doesn't have to be very large for it to beat the status quo of having an embryologist eyeball it. I think this post is a slam dunk for a +9 and a spot in the LW canon, both for its object-level information and its exemplary embodiment of the virtue of empiricism and instrumental rationality.

The rest of this review details my own experience with IVF in the U.S. in 2024.

----------------------------------------

This section of the orig
#14

Lawrence, Erik, and Leon attempt to summarize the key claims of John Wentworth's natural abstractions agenda, formalize some of the mathematical proofs, outline how it aims to help with AI alignment, and critique gaps in the theory, relevance to alignment, and research methodology.

Vanessa Kosoy (7)
This post is a great review of the Natural Abstractions research agenda, covering both its strengths and weaknesses. It provides a useful breakdown of the key claims, the mathematical results, and the applications to alignment. There's also reasonable criticism.

To the weaknesses mentioned in the overview, I would also add that the agenda needs more engagement with learning theory. Since the claim is that all minds learn the same abstractions, it seems necessary to look into the process of learning, and see what kinds of abstractions can or cannot be learned (both in terms of sample complexity and in terms of computational complexity).

Some thoughts about natural abstractions inspired by this post:

* The concept of natural abstractions seems closely related to my informally conjectured agreement theorem for infra-Bayesian physicalism. In a nutshell, two physicalist agents in the same universe with access to "similar" information should asymptotically arrive at similar beliefs (notably, this is false for cartesian agents because of the different biases resulting from the different physical points of view).
* A possible formalization of the agreement theorem, inspired by my richness-of-mathematics conjecture: given two beliefs Ψ and Φ, we say that Ψ ⪯ Φ when some conditioning of Ψ on a finite set of observations produces a refinement of some conditioning of Φ on a finite set of observations (see linked shortform for mathematical details). This relation is a preorder. In general, we can expect an agent to learn a sequence of beliefs of the form Ψ₀ ≺ Ψ₁ ≺ Ψ₂ ≺ … Here, the sequence can be over physical time, or over time discount, or over a parameter such as "availability of computing resources" or "how much time the world allows you for thinking between decisions": the latter is the natural asymptotic for metacognitive agents (see also logical time). Given two agents, we get two such sequences {Ψᵢ} and {Φᵢ}. The agreement theorem can then state that for all i∈ℕ, there exists j
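The refinement preorder sketched informally in the review can be written out as a display; this is only a restatement of Kosoy's prose definition, and the notation for the finite observation sets (O, O') and the refinement relation (⊑) is introduced here for illustration:

```latex
% Refinement preorder on beliefs, restating the review's informal definition:
% \Psi \preceq \Phi iff some finite conditioning of \Psi refines
% some finite conditioning of \Phi.
\[
  \Psi \preceq \Phi
  \quad\iff\quad
  \exists\, O, O' \text{ finite observation sets}:\;
  (\Psi \mid O) \sqsubseteq (\Phi \mid O')
\]
% An agent's learning process is then an increasing chain of beliefs,
% indexed by physical time, time discount, or available compute:
\[
  \Psi_0 \prec \Psi_1 \prec \Psi_2 \prec \cdots
\]
```

The conjectured agreement theorem would then compare two such chains {Ψᵢ} and {Φᵢ} for agents in the same universe, as the review begins to state before it cuts off.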
#15

Having become frustrated with the state of the discourse about AI catastrophe, Zack Davis writes both sides of the debate, with back-and-forth takes between Simplicia and Doomimir that aim to spell out the strongest arguments on each side.

24Zack M. Davis
(Self-review.) I'm as proud of this post as I am disappointed that it was necessary. As I explained to my prereaders on 19 October 2023: I think the dialogue format works particularly well in cases like this where the author or the audience is supposed to find both viewpoints broadly credible, rather than an author avatar beating up on a strawman. (I did have some fun with Doomimir's characterization, but that shouldn't affect the arguments.) This is a complicated topic. To the extent that I was having my own doubts about the "orthodox" pessimist story in the GPT-4 era, it was liberating to be able to explore those doubts in public by putting them in the mouth of a character with the designated idiot character name without staking my reputation on Simplicia's counterarguments necessarily being correct. Giving both characters pejorative names makes it fair. In an earlier draft, Doomimir was "Doomer", but I was already using the "Optimistovna" and "Doomovitch" patronymics (I had been consuming fiction about the Soviet Union recently) and decided it should sound more Slavic. (Plus, "-mir" (мир) can mean "world".)
4Seth Herd
This post skillfully addressed what is IMO the most urgent issue in alignment: bridging the gap between doomers and optimists. If half of alignment thinkers think alignment is very difficult, while half think it's pretty achievable, decision-makers will be prone to just choose whichever expert opinion supports what they want to do anyway. This and its following acts are the best work I know of in refining the key cruxes. And they do so in a compact, readable, and even fun form.
#16

Evan et al. argue for developing "model organisms of misalignment" - AI systems deliberately designed to exhibit concerning behaviors like deception or reward hacking. This would provide concrete examples to study potential AI safety issues and test mitigation strategies. The authors believe this research is timely and could help build scientific consensus around AI risks to inform policy discussions.

8Fabien Roger
This post describes a class of experiment that has proved very fruitful since this post was released. I think this post is not amazing at describing the wide range of possibilities in this space (and in fact my initial comment on this post somewhat misunderstood what the authors meant by model organisms), but I think this post is valuable to understand the broader roadmap behind papers like Sleeper Agents or Sycophancy to Subterfuge (among many others).
#17

John Wentworth explains natural latents – a key mathematical concept in his approach to natural abstraction. Natural latents capture the "shared information" between different parts of a system in a provably optimal way. This post lays out the formal definitions and key theorems.

17Jeremy Gillen
This post deserves to be remembered as a LessWrong classic.  1. It directly tries to solve a difficult and important cluster of problems (whether it succeeds is yet to be seen). 2. It uses a new diagrammatic method of manipulating sets of independence relations. 3. It's a technical result! These feel like they're getting rarer on LessWrong and should be encouraged. There are several problems that are fundamentally about attaching very different world models together and transferring information from one to the other.  * Ontology identification involves taking a goal defined in an old ontology[1] and accurately translating it into a new ontology. * High-level models and low-level models need to interact in a bounded agent. I.e. learning a high-level fact should influence your knowledge about low-level facts and vice versa. * Value identification is the problem of translating values from a human to an AI. This is much like ontology identification, with the added difficulty that we don't get as much detailed access or control over the human world model. * Interpretability is about finding recognisable concepts and algorithms in trained neural networks. In general, we can solve these problems using shared variables and shared sub-structures that are present in both models. * We can stitch together very different world models along shared variables. E.g. if you have two models of molecular dynamics, one faster and simpler than the other. You want to simulate in the fast one, then switch to the slow one when particular interactions happen. To transfer the state from one to the other you identify variables present in both models (probably atom locations, velocities, some others), then just copy these values to the other model. Under-specified variables must be inferred from priors. * If you want to transfer a new concept from WM1 to a less knowledgeable WM2, you can do so by identifying the lower-level concepts that both WMs share, then constructing an "expla
#18

Alex Turner and collaborators show that you can modify GPT-2's behavior in surprising and interesting ways by just adding activation vectors to its forward pass. This technique requires no fine-tuning and allows fast, targeted modifications to model behavior. 
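The core operation described above is simple enough to sketch in a few lines. The following toy sketch (plain Python with invented activation values; not the authors' actual code, which operates on GPT-2's residual stream) illustrates the idea: compute a steering vector as the difference between activations for two contrasting prompts, then add a scaled copy of it to a hidden state during the forward pass, with no fine-tuning involved.

```python
# Toy sketch of activation addition -- hypothetical, minimal illustration.

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def scale(v, c):
    return [c * x for x in v]

# Pretend these are hidden activations at some layer for two contrasting
# prompts (e.g. "Love" vs "Hate"); the numbers here are invented.
h_love = [0.9, 0.1, 0.4]
h_hate = [0.2, 0.8, 0.3]

# Steering vector: difference of the two activation patterns.
steering_vector = [a - b for a, b in zip(h_love, h_hate)]

def forward_with_steering(hidden, alpha=2.0):
    # The entire intervention: add the scaled steering vector to the
    # hidden state at the chosen layer; the rest of the forward pass
    # proceeds unchanged.
    return add(hidden, scale(steering_vector, alpha))

steered = forward_with_steering([0.5, 0.5, 0.5])
```

In a real transformer this addition would typically be done with a forward hook at a specific layer; the sketch only shows the arithmetic of the intervention.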

#19

Researchers have discovered a set of "glitch tokens" that cause ChatGPT and other language models to produce bizarre, erratic, and sometimes inappropriate outputs. These tokens seem to break the models in unpredictable ways, leading to hallucinations, evasions, and other strange behaviors when the AI is asked to repeat them.

#20

One winter a grasshopper, starving and frail, approaches a colony of ants drying out their grain in the sun to ask for food, having spent the summer singing and dancing.

Then, various things happen.

#21

Rationality training has been very difficult to develop, in large part because the feedback loops are so long and noisy. Raemon proposes a paradigm where "invent better feedback loops" is the primary focus, in tandem with an emphasis on deliberate practice.

6Screwtape
The thing I want most from LessWrong and the Rationality Community writ large is the martial art of rationality. That was the Sequences post that hooked me, that is the thing I personally want to find if it exists. Therefore, posts that are actually trying to build a real art of rationality (or warn of failed approaches) are the kind of thing I'm going to pay attention to, and if they look like they actually might work I'm going to strongly vote for including them in the Best Of LessWrong collection. Feedbackloop-first Rationality sure looks like an actual attempt at solving the problem. It lays out a strategy, the plan seems like it plausibly might work, and there's followup workshops that suggest some people are actually willing to spend money on this; that's not a clear indicator that it works (people spend money on all kinds of things) but it is significantly more than armchair theorizing.  If Raemon keeps working on this and is successful, I expect we'll see some testable results. If, say, the graduates or regular practitioners turn out to be able to confidently one-shot Thinking Physics style problems while demographically matched people stumble around, that'll be a Hot Dang Look At That Chart result at least in the toy problems. If they go on to solve novel, real world problems, then that's a clear suggestion this works. There's two branches of followup I'd like to see. One, Raemon's already been doing; running more workshops teaching this, teasing out useful subskills to teach, and writing up how to run exercises and what the subskills are. The second is evaluations. If Raemon's keeping track of students and people who considered going but didn't, I'd love to see a report on how both sets are doing in a year or two. I'm also tempted to ask on future community censuses whether people have done Feedbackloop-first Rationality workshops (["Yes under Raemon", "Yes by other people based on this", "no"] and then throw a timed Thinking Physics-style problem a
#22

There are some obvious ways you might try to train deceptiveness out of AIs. But deceptiveness can emerge from the recombination of non-deceptive cognitive patterns. As AI systems become more capable, they may find novel ways to be deceptive that weren't anticipated or trained against. The problem is that, in the underlying territory, "deceive the humans" is just very useful for accomplishing goals.

6Daniel Murfet
I like the emphasis in this post on the role of patterns in the world in shaping behaviour, the fact that some of those patterns incentivise misaligned behaviour such as deception, and further that our best efforts at alignment and control are themselves patterns that could have this effect. I also like the idea that our control systems (even if obscured from the agent) can present as "errors" with respect to which the agent is therefore motivated to learn to "error correct". This post and the sharp left turn are among the most important high-level takes on the alignment problem for shaping my own views on where the deep roots of the problem are. Although to be honest I had forgotten about this post, and therefore underestimated its influence on me, until performing this review (which caused me to update a recent article I wrote, the Queen's Dilemma, which is clearly a kind of retelling of one aspect of this story, with an appropriate reference). I assess it to be a substantial influence on me even so. I think this whole line of thought could be substantially developed, and with less reliance on stories, and that this would be useful.
#23

This post argues that vegan advocates within the EA community have often prioritized promoting veganism over truthseeking – ignoring inconvenient facts, and citing poor sources. The author calls for EA vegan advocates to be more epistemically rigorous and cooperative, and for the broader EA community to defend its norms of truthseeking.

12Unnamed
I'm voting against including this in the Review, at max level, because I think it too-often mischaracterizes the views of the people it quotes. And it seems real bad for a post that is mainly about describing other people's views and then drawing big conclusions from that data to inaccurately describe those views and then draw conclusions from inaccurate data. I'd be interested in hearing about this from people who favor putting this post in the review. Did you check on the sources for some of Elizabeth's claims and think that she described them well? Did you see some inaccuracies but figure that the post is still good enough? Did you trust Elizabeth's descriptions without checking yourself on what the person said? I spent a fair amount of time spot checking Elizabeth's first section, on Martin Soto, which got my attention because it seemed like it could be one of her strongest and it was the first. This claim from Elizabeth in that section seems clearly false: "The charitable explanation here is that my post focuses on naive veganism, and Soto thinks that’s a made-up problem". The first few paragraphs quoted in this post are sufficient to falsify this interpretation, and the first comment that Martin left on Elizabeth's post is too. Other parts of the description of Martin's views which are more central to Elizabeth's argument also seem off, though sorting them out requires getting more in the weeds. e.g. AFAICT he didn't say he opposed talking about the whole topic of vegan nutrition; he did say something along the lines of 'you didn't say anything false, but I don't like the way you presented things because it'll have bad consequences', but that's a pretty normal type of opinion - Elizabeth said something like that about Will MacAskill in another post in this series. Other places where this post felt off include Elizabeth's description of what people were trying to claim when they brought up the Adventist study, and the claim that this comment by Wilkox involve
5Elizabeth
My ultimate goal with this post was to use vegan advocacy as an especially legible example of a deep-seated problem in effective altruism, which we could use to understand and eventually remove the problem at the root. As far as I know, the only person who has tried to use it as an example is me, and that work didn't have much visible effect either. I haven't seen anyone else reference this post while discussing a different problem. It's possible this happens out of sight (Lincoln Quirk implies this here), but if I'd achieved my goal it would be clearly visible. 
#24

Charbel-Raphaël summarizes Davidad's plan: Use near AGIs to build a detailed world simulation, then train and formally verify an AI that follows coarse preferences and avoids catastrophic outcomes. 

20Ryan Greenblatt
At the time when I first heard this agenda proposed, I was skeptical. I remain skeptical, especially about the technical work that has been done thus far on the agenda[1]. I think this post does a reasonable job of laying out the agenda and the key difficulties. However, when talking to Davidad in person, I've found that he often has more specific tricks and proposals than what was laid out in this post. I didn't find these tricks moved me very far, but I think they were helpful for understanding what is going on. This post and Davidad's agenda overall would benefit from having concrete examples of how the approach might work in various cases, or more discussion of what would be out of scope (and why this could be acceptable). For instance, how would you make a superhumanly efficient (ASI-designed) factory that produces robots while proving safety? How would you allow for AIs piloting household robots to do chores (or is this out of scope)? How would you allow for the AIs to produce software that people run on their computers or to design physical objects that get manufactured? Given that this proposal doesn't allow for safely automating safety research, my understanding is that it is supposed to be a stable end state. Correspondingly, it is important to know what Davidad thinks can and can't be done with this approach. My core disagreements are on the "Scientific Sufficiency Hypothesis" (particularly when considering computational constraints), "Model-Checking Feasibility Hypothesis" (and more generally on proving the relevant properties), and on the political feasibility of paying the needed tax even if the other components work out. It seems very implausible to me that making a sufficiently good simulation is as easy as building the Large Hadron Collider. I think the objection in this comment holds up (my understanding is Davidad would require that we formally verify everything on the computer).[2] As a concrete example, I found it quite implausible that you
3Charbel-Raphael Segerie
Ok, time to review this post and assess the overall status of the project. Review of the post What I still appreciate about the post: I continue to appreciate its pedagogy, structure, and the general philosophy of taking a complex, lesser-known plan and helping it gain broader recognition. I'm still quite satisfied with the construction of the post—it's progressive and clearly distinguishes between what's important and what's not. I remember the first time I met Davidad. He sent me his previous post. I skimmed it for 15 minutes, didn't really understand it, and thought, "There's no way this is going to work." Then I reconsidered, thought about it more deeply, and realized there was something important here. Hopefully, this post succeeded in showing that there is indeed something worth exploring! I think such distillation and analysis are really important. I'm especially happy about the fact that we tried to elicit as much as we could from Davidad's model during our interactions, including his roadmap and some ideas of easy projects to get early empirical feedback on this proposal. Current Status of the Agenda. (I'm not the best person to write this, see this as an informal personal opinion) Overall, Davidad performed much better than expected with his new job as program director in ARIA and got funded $74M over 4 years. And I still think this is the only plan that could enable the creation of a very powerful AI capable of performing a true pivotal act to end the acute risk period, and I think this last part is the added value of this plan, especially in the sense that it could be done in a somewhat ethical/democratic way compared to other forms of pivotal acts. However, it's probably not going to happen in time. Are we on track? Weirdly, yes for the non-technical aspects, no for the technical ones? The post includes a roadmap with 4 stages, and we can check if we are on track. It seems to me that Davidad jumped directly to stage 3, without going through sta
#25

1. Don't say false shit omg this one's so basic what are you even doing. And to be perfectly fucking clear "false shit" includes exaggeration for dramatic effect. Exaggeration is just another way for shit to be false.

2. You do NOT (necessarily) know what you fucking saw. What you saw and what you thought about it are two different things. Keep them the fuck straight.

...

5Duncan Sabien (Deactivated)
As a rough heuristic: "Everything is fuzzy; every bell curve has tails that matter." It's important to be precise, and it's important to be nuanced, and it's important to keep the other elements in view even though the universe is overwhelmingly made of just hydrogen and helium. But sometimes, it's also important to simply point straight at the true thing.  "Men are larger than women" is a true thing, even though many, many individual women are larger than many, many individual men, and even though the categories "men" and "women" and "larger" are themselves ill-defined and have lots and lots of weirdness around the edges. I wrote a post that went into lots and lots of careful detail, touching on many possible objections pre-emptively, softening and hedging and accuratizing as many of its claims as I could.  I think that post was excellent, and important. But it did not do the one thing that this post did, which was to stand up straight, raise its voice, and Just. Say. The. Thing. It was a delight to watch the two posts race for upvotes, and it was a delight, in the end, to see the bolder one win.
#26

Charbel-Raphaël argues that interpretability research has poor theories of impact. It's not good for predicting future AI systems, can't actually audit for deception, lacks a clear end goal, and may be more harmful than helpful. He suggests other technical agendas that could be more impactful for reducing AI risk. 

37Charbel-Raphael Segerie
Tldr: I'm still very happy to have written Against Almost Every Theory of Impact of Interpretability, even if some of the claims are now incorrect. Overall, I have updated my view towards more feasibility and possible progress of the interpretability agenda — mainly because of the SAEs (even if I think some big problems remain with this approach, detailed below) and representation engineering techniques. However, I think the post remains good regarding the priorities the community should have. First, I believe the post's general motivation of red-teaming a big, established research agenda remains crucial. It's too easy to say, "This research agenda will help," without critically assessing how. I appreciate the post's general energy in asserting that if we're in trouble or not making progress, we need to discuss it. I still want everyone working on interpretability to read it and engage with its arguments. Acknowledgments: Thanks to Epiphanie Gédéon, Fabien Roger, and Clément Dumas for helpful discussions. Updates on my views Legend: * On the left of the arrow, a citation from the OP → ❓ on the right, my review which generally begins with emojis * ✅ - yes, I think I was correct (>90%) * ❓✅ - I would lean towards yes (70%-90%) * ❓ - unsure (between 30%-70%) * ❓❌ - I would lean towards no (10%-30%) * ❌ - no, I think I was basically wrong (<10%) * ⭐ important, you can skip the other sections Here's my review section by section: ⭐ The Overall Theory of Impact is Quite Poor? * "Whenever you want to do something with interpretability, it is probably better to do it without it" → ❓ I still think this is basically right, even if I'm not confident this will still be the case in the future; But as of today, I can't name a single mech-interpretability technique that does a better job at some non-intrinsic interpretability goal than the other more classical techniques, on a non-toy model task. * "Interpretability is Not a Good Predictor of Future Systems" →
#27

Have you seen a Berkeley Rationalist house and thought "wow the lighting here is nice and it's so comfy" and vaguely wished your house had nice lighting and was comfy in that particular way? Well, this practical / anthropological guide should help.

#28

Joe summarizes his new report on "scheming AIs" - advanced AI systems that fake alignment during training in order to gain power later. He explores different types of scheming (i.e. distinguishing "alignment faking" from "powerseeking"), and asks what the prerequisites for scheming are and by which paths they might arise.

14Fabien Roger
I think that prior to this paper, the discussion around scheming was pretty confusing, spread throughout many posts which were not all specifically about scheming, and was full of pretty bad arguments. This paper fixed that by bringing together most (all?) main considerations for and against expecting scheming to emerge. I found this helpful to clarify my thinking around the topic, which makes me more confident in my focus on AI control and made me less confused when I worked on the Alignment faking paper. It is also helpful as a list of reasons why someone reasonable might expect scheming (without finding it overwhelmingly likely either) that I can point skeptical people at without being afraid that it contains massive over or understatements. I think this paper will become pretty outdated as we get closer to understanding what AGI looks like and as we get better model organisms, but I think that it currently is the best resource about the conceptual arguments for and against scheming propensity. I strongly recommend (the audio version of) this paper for people who want to work on scheming propensity.
#29

We shouldn't expect to get a lot more worried about AI risk as capabilities increase, if we're thinking about it clearly now. Joe discusses why this happens anyway, and how to avoid it.

#30

Two astronauts investigate an automated planet covered in factories still churning out products, trying to understand what happened to its inhabitants.

#31

Malmesbury explains why sexual dimorphism evolved. Starting with asexual reproduction in single-celled organisms, he traces how the need to avoid genetic hitch-hiking led to sexual reproduction, then the evolution of two distinct sexes, and finally to sexual selection and exaggerated sexual traits. The process was driven by a series of evolutionary traps that were difficult to escape once entered. 

#32

The plan of "use AI to help us navigate superintelligence" is not just technically hard, but organizationally hard. If you're building AGI, your company needs a culture focused on high reliability (as opposed to, say, "move fast and break things"). Existing research on "high reliability organizations" suggests this culture requires a lot of time to develop. Raemon argues it needs to be one of the top few priorities for AI company leadership.

#33

The blogpost describes a cognitive strategy of noticing the transitions between your thoughts, rather than the thoughts themselves. By noticing and rewarding helpful transitions, you can improve your thinking process. The author claims this leads to clearer, more efficient and worthwhile thinking, without requiring conscious effort. 

#34

Harmful people often lack explicit malicious intent. It’s worth deploying your social or community defenses against them anyway. 

#35

When advisors disagree wildly about when the rains will come, the king tries to average their predictions. His advisors explain why this is a terrible idea – he needs to either decide which model is right or plan for both possibilities.
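The parable's point can be made numeric. This toy sketch (invented payoffs and dates, purely illustrative) shows why averaging the two forecasts is worse than either deciding which model is right or hedging: if one advisor predicts rain on day 10 and the other on day 90, planting everything for the averaged date, day 50, fails in both possible worlds, while splitting the land across both forecasts saves half the crop either way.

```python
# Toy illustration (invented numbers): averaging two confident but
# conflicting forecasts can be worse than planning for both.

def payoff(plant_day, rain_day, tolerance=20):
    # The crop succeeds only if planting falls within `tolerance`
    # days of the actual rain.
    return 1.0 if abs(plant_day - rain_day) <= tolerance else 0.0

worlds = [10, 90]      # advisor A: rain on day 10; advisor B: day 90
p_world = [0.5, 0.5]   # the king has no idea which advisor is right

def expected_payoff(plan):
    # `plan` is a list of (plant_day, fraction_of_land) pairs.
    return sum(p * sum(frac * payoff(day, rain) for day, frac in plan)
               for p, rain in zip(p_world, worlds))

average_plan = [(50, 1.0)]             # plant everything for the average date
hedged_plan = [(10, 0.5), (90, 0.5)]   # split land across both forecasts

# The averaged plan fails in both worlds; the hedged plan saves half
# the crop no matter which advisor turns out to be right.
```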

#36

GPTs are being trained to predict text, not imitate humans. This task is actually harder than being human in many ways. You need to be smarter than the text generator to perfectly predict their output, and some text is the result of complex processes (e.g. scientific results, news) that even humans couldn't predict. 

GPTs are solving a fundamentally different and often harder problem than just "be human-like". This means we shouldn't expect them to think like humans.

#37

Some AI labs claim to care about AI safety, but continue trying to build AGI anyway. Peter argues they should explicitly state why they think this is the right course of action, given the risks. He suggests they should say something like "We're building AGI because [specific reasons]. If those reasons no longer held, we would stop."

7Zach Stein-Perlman
[Perfunctory review to get this post to the final phase] Solid post. Still good. I think a responsible developer shouldn't unilaterally pause but I think it should talk about the crazy situation it's in, costs and benefits of various actions, what it would do in different worlds, and its views on risks. (And none of the labs have done this; in particular Core Views is not this.)
#38

Innovative work requires solitude, and the ability to resist social pressures. Henrik examines how Grothendieck and Bergman approached this, and lists various techniques creative people use to access and maintain this mental state.

#39

Nate Soares argues that there's a deep tension between training an AI to do useful tasks (like alignment research) and training it to avoid dangerous actions. Holden is less convinced of this tension. They discuss a hypothetical training process and analyze potential risks.

8Vanessa Kosoy
This post attempts to describe a key disagreement between Karnofsky and Soares (written by Karnofsky) pertaining to the alignment protocol "train an AI to simulate an AI alignment researcher". The topic is quite important, since this is a fairly popular approach. Here is how I view this question: The first unknown is how accurate is the simulation. This is not really discussed in the OP. On the one hand, one might imagine that with more data, compute and other improvements, the AI should ultimately converge on an almost perfect simulation of an AI alignment researcher, which is arguably safe. On the other hand, there are two problems with this. First, such a simulation might be vulnerable to attacks from counterfactuals. Second, the prior is malign, i.e. the simulation might converge to representing a "malign simulation hypothesis" universe rather than the intended null hypothesis / ordinary reality. Instead, we can imagine a simulation that's not extremely accurate, but that's modified to be good enough by fine-tuning with reinforcement learning. This is essentially the approach in contemporary AI and is also the assumption of the OP. Although Karnofsky says: "a small amount of RL", and I don't know why he believes a small amount is sufficient. Perhaps RL seemed less obviously important then than it does now, with the recent successes of o1 and o3. The danger (as explained in the OP by Soares paraphrased by Karnofsky) is that it's much easier to converge in this manner on an arbitrary agent that has the capabilities of the imaginary AI alignment researcher (which probably have to be a lot greater than the capabilities of human researchers to make it useful), but doesn't have values that are truly aligned. This is because "agency" is (i) a relatively simple concept and (ii) a robust attractor, in the sense that any agent would behave similarly when faced with particular instrumental incentives, and it's mainly this behavior that the training process rewards. On t
#40

There's a supercharged version of the bystander effect where someone claims they'll do a task, but then quietly fails to follow through. This leaves others thinking the task is being handled when it's not. To prevent that, we should try to loudly announce when we're giving up on tasks we've taken on, rather than quietly fading away. And we should appreciate it when others do the same.

#41

We often hear "We don't trade with ants" as an argument against AI cooperating with humans. But we don't trade with ants because we can't communicate with them, not because they're useless – ants could do many useful things for us if we could coordinate. AI will likely be able to communicate with us, and Katja questions whether this analogy holds.

#42
Neural networks generalize unexpectedly well. Jesse argues this is because of singularities in the loss surface which reduce the effective number of parameters. These singularities arise from symmetries in the network. More complex singularities lead to simpler functions which generalize better. This is the core insight of singular learning theory.
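For readers who want the formula behind "singularities reduce the effective number of parameters": the standard statement from singular learning theory (Watanabe's result, not quoted from the post itself) is the asymptotic expansion of the Bayes free energy,

```latex
F_n \approx n L(w_0) + \lambda \log n + O(\log \log n),
```

where \(L(w_0)\) is the loss at the optimal parameter and \(\lambda\) is the real log canonical threshold (RLCT). For regular models \(\lambda = d/2\), half the parameter count; singularities can make \(\lambda\) much smaller, which behaves like a reduced effective parameter count and predicts better generalization.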
8Vanessa Kosoy
This post is a solid introduction to the application of Singular Learning Theory to generalization in deep learning. This is a topic that I believe to be quite important. One nitpick: The OP says that it "seems unimportant" that ReLU networks are not analytic. I'm not so sure. On the one hand, yes, we can apply SLT to (say) GELU networks instead. But GELUs seem mathematically more complicated, which probably translates to extra difficulties in computing the RLCT and hence makes applying SLT harder. Alternatively, we can consider a series of analytical response functions that converges to ReLU, but that probably also comes with extra complexity. Also, ReLUs have an additional symmetry (the scaling symmetry mentioned in the OP) and SLT kinda thrives on symmetries, so throwing that out might be bad! It seems to me like a fascinating possibility that there is some kind of tropical geometry version of SLT which would allow analyzing generalization in ReLU networks directly and perhaps somewhat more easily. But, at this point it's merely a wild speculation of mine.
#43

Paul Christiano lays out how he frames various questions of "will AI cause a really bad outcome?", and gives some probabilities.

#44

Elizabeth argues that veganism comes with trade-offs, including potential health issues, that are often downplayed or denied by vegan advocates. She calls for more honesty about these challenges from the vegan community. 

3Zac Hatfield-Dodds
I think Elizabeth is correct here, and also that vegan advocates would be considerably more effective with higher epistemic standards: The post unfortunately suffers for its length, detailed explanations, and rebuttal of many motivated misreadings - many of which can be found in the comments, so it's unclear whether this helped. It's also well-researched and cited, well organized, offers cruxes and anticipates objections - vegan advocates are fortunate to have such high-quality criticism. This could have been a shorter post, which was about, rather than engaged in, epistemics and advocacy around veganism, with less charitable assumptions. I'd have shared that shorter post more often, but I don't think it would be better.
#45

Predicting how a ball will roll down a ramp seems like a simple problem, but most people can't get it right on their first try. Analyzing why reveals important lessons that apply to much harder problems like AI alignment. 

#46

A comprehensive overview of current technical research agendas in AI alignment and safety (as of 2023). The post categorizes work into understanding existing models, controlling models, using AI to solve alignment, theoretical approaches, and miscellaneous efforts by major labs. 

#48

We might soon be creating morally relevant AI systems with real welfare concerns. How can we help ensure good lives for AIs, especially if we don't have that many resources to allocate to it? 

8Ryan Greenblatt
My views remain similar to when I wrote this post, and the state of nearcasted interventions still looks reasonably similar to me. I have some slightly different thoughts on how we should relate to interventions around communication, but relatively prioritizing communication still seems reasonable to me. One change in my perspective is that I'm now somewhat less excited about allocating larger fractions of resources toward AI welfare specifically. (I now think 0.2% seems better than 1%.) I've updated toward thinking safety concerns will get a smaller fraction of resources than I was previously expecting (due to more pessimism and shorter timelines), and I think safety and welfare resource usage might trade off. Another change is that I'm relatively more excited about making deals with AIs as a safety intervention (as well as a welfare intervention). This would include things like paying them to reveal misalignment or promising later compensation if they don't cause issues for us (and if we're still in control). I have some forthcoming empirical work related to this, and work discussing the conceptual aspects is hopefully on the way as well.
#49

Zvi analyzes Michael Lewis' book "Going Infinite" about Sam Bankman-Fried and FTX. He argues the book provides clear evidence of SBF's fraudulent behavior, despite Lewis seeming not to fully realize it. Zvi sees SBF as a cautionary tale about the dangers of pursuing maximalist goals without ethical grounding.

#50

Orpheus16 shares his experience talking with ~60 congressional staffers about AI risk in May and June 2023. He found staffers were surprisingly open-minded about AI risks but often lacked knowledge. His guess is that the Overton window on AI policy is wide, that more coordination is needed on specific policy proposals, and that there are opportunities for more people to engage productively with policymakers on AI issues if done thoughtfully.

6Orpheus16
I'm pleased with this dialogue and glad I did it. Outreach to policymakers is an important & complicated topic. No single post will be able to explain all the nuances, but I think this post explains a lot, and I still think it's a useful resource for people interested in engaging with policymakers. A lot has changed since this dialogue, and I've also learned a lot since then. Here are a few examples:

* I think it's no longer as useful to emphasize "AI is a big deal for national/global security." This is now pretty well-established.
* Instead, I would encourage people to come up with clear explanations of specific threat models (especially misalignment risks) and concrete proposals (e.g., draft legislative language, memos with specific asks for specific agencies).
* I'd like to see more people write about why AI requires different solutions compared to the "standard DC playbook for dealing with potentially dangerous emerging technologies." As I understand it, the standard playbook is essentially: "If there is a new and dangerous technology, the US needs to make sure that we lead in its development and are ahead of the curve. The main threats come from our adversaries being able to unlock such technologies faster than us, allowing them to surprise us with new threats." To me, the main reason this playbook doesn't work is misalignment risk. Regardless: if you think AI is special (for misalignment reasons or other reasons), I think writing up your takes RE "here's what makes AI special and why it requires a deviation from the standard playbook" is valuable.
* I think people trying to communicate with US policymakers should keep in mind that the US government is primarily concerned with US interests. This is perhaps obvious when stated like this, but I think a lot of comms fail to properly take this into account. As one might expect, this is especially true when foreign organizations try to talk about things from the POV of what would be best for "
#51

Debates about consciousness often come down to two people talking past each other, without realizing their interlocutor is coming from a fundamentally different set of intuitions. What's up with that?

4Noosphere89
This is a very nice meta-level discussion of why consciousness discourse gets so bad, and I do genuinely appreciate trying to get cruxes and draw out the generators of a disagreement, which is useful in difficult situations.

One factor that is not really discussed, but that amplifies the problem of discourse around consciousness, is that people use the word "consciousness" to denote both a scientific thing and a moral thing: people often want to know whether something is conscious in order to decide whether uploading is good, or whether to care about someone, and way too much discourse fails to decouple these two questions. I actually voted slightly against the linked post below in the review, due to methodological problems, but I have a high prior that something like this is a huge contributor to consciousness discourse sucking, and this is an area where the science questions need to be decoupled from the value questions: https://www.lesswrong.com/posts/KpD2fJa6zo8o2MBxg/consciousness-as-a-conflationary-alliance-term-for

+9 for drawing out a generator on a very confusing topic; this should be in the LW canon as a worked example of how to deal with difficult disagreements. I'm not going to review the object level of what consciousness actually is, because I already did that in a different review linked below, but the sneak peek is that I'm in camp 1 - though you could also call me a camp 2 person, notably one who is reductionist/computationalist rather than positing novel metaphysics: https://www.lesswrong.com/posts/FQhtpHFiPacG3KrvD/seth-explains-consciousness#7ncCBPLcCwpRYdXuG