AI ALIGNMENT FORUM

Popular Comments

Kaj_Sotala · 6d · 54
AI companies have started saying safeguards are load-bearing
I was amused when Claude Opus abruptly stopped generating a reply to me and shut down the chat when I had asked it how a fictional galactic empire might control its frontier planets. Given that it stopped generating in the middle of a sentence that was talking about "biological monitoring" and "enhanced", I surmised that the reference to the genetically engineered catboys/catgirls in the setting had triggered its bioengineering filters.
Steven Byrnes · 11d · 1218
Four ways learning Econ makes people dumber re: future AI
I might have overdone it on the sass, sorry. This is much sassier than my default (“scrupulously nuanced and unobjectionable and boring”)…

* …partly because I’m usually writing for lesswrong and cross-posting on X/Twitter, whereas this one was vice-versa, and X is a medium that seems to call for more sass;
* …partly in an amateur ham-fisted attempt to do clickbait (note also: listicle format!) because this is a message that I really want to put out there;
* …and yes, partly because I do sometimes feel really frustrated talking to economists (#NotAllEconomists), and I think they can and should do better, and the sass is reflecting a real feeling that I feel.

But I think next time I would dial it back slightly, e.g. by replacing “DUMBER” with “WORSE” in the first sentence. I’m open to feedback, I don’t know what I’m doing. ¯\_(ツ)_/¯

> I don't think they are more "incorrect" than, say, the AI is Normal Technology folks.

Yeah, I agree that lots of CS professors are deeply mistaken about the consequences of AGI, and ditto with the neuroscientists, and ditto with many other fields, including even many of the people trying to build AGI right now. I don’t think that economists are more blameworthy than other groups, it just so happens that this one particular post is aimed at them.

> I think the crux more or less comes down to skepticism about the plausibility of superintelligence in the next decade or so.

I think you’re being overly generous. “Decade or so” is not the crux. In climate change, people routinely talk about bad things that might happen in 2050, and even in 2100, or farther! People also routinely talk 30 years out or more in the context of science, government, infrastructure, institution-building, life-planning, etc. People talk about their grandkids and great-grandkids growing up, etc. If someone expected superintelligence in the next 50 years but not the next 20—like if they really expected that, viscerally, with a full understanding of its implications—then that belief would be a massive, central influence on their life and worldview. That’s not what’s going on in the heads of the many (most?) people in academia who don’t take superintelligence seriously. Right?
Vladimir_Nesov · 13d · 60
My AGI timeline updates from GPT-5 (and 2025 so far)
> GPT-5 probably isn't based on a substantially better pretrained model which is some evidence that OpenAI thinks the marginal returns from pretraining are pretty weak relative to the returns from RL

The model seems to be "small", but not necessarily with less pretraining in it (in the form of overtraining) than RLVR. There are still no papers I'm aware of on what the compute optimal (or GPU-time optimal) pretraining:RLVR ratio could be like. Matching GPU-time of pretraining and RLVR results in something like 4:1 (in terms of FLOPs), which would only be compute optimal (or GPU-time optimal) by unlikely coincidence.

If the optimal ratio of pretraining:RLVR is something like 1:10 (in FLOPs), then overtraining even smaller models is unimportant. But it could also be more like 40:1, in which case overtraining becomes a must (if inference cost/speed and HBM capacity of the legacy 8-chip servers force the param count to be smaller than compute optimal given the available training compute and the HBM capacity of GB200 NVL72).
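For concreteness, here is a toy version of the equal-GPU-time arithmetic. The utilization figures are my own illustrative assumptions (roughly: RLVR time is dominated by rollout generation, which runs at much lower hardware utilization than pretraining), not numbers from the comment above.

```python
# Toy arithmetic: equal wall-clock GPU-time for pretraining and RLVR,
# with assumed (illustrative) hardware utilization for each phase.
gpu_hours = 1.0                  # same GPU-time spent on each phase
peak_flops_per_gpu_hour = 1.0    # normalized hardware peak

pretrain_utilization = 0.40      # assumed compute utilization for pretraining
rlvr_utilization = 0.10          # assumed effective utilization for RLVR rollouts

pretrain_flops = gpu_hours * peak_flops_per_gpu_hour * pretrain_utilization
rlvr_flops = gpu_hours * peak_flops_per_gpu_hour * rlvr_utilization

print(f"pretraining:RLVR FLOPs ratio ≈ {pretrain_flops / rlvr_flops:.0f}:1")  # -> 4:1
```

Under different utilization assumptions the same GPU-time split would of course give a different FLOPs ratio, which is part of why the compute-optimal ratio question is open.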
Alignment Implications of LLM Successes: a Debate in One Act
Best of LessWrong 2023

Having become frustrated with the state of the discourse about AI catastrophe, Zack Davis writes both sides of the debate, with back-and-forth takes between Simplicia and Doomimir that aim to spell out stronger arguments from both sides.

by Zack_M_Davis
25 · Zack_M_Davis
(Self-review.) I'm as proud of this post as I am disappointed that it was necessary. As I explained to my prereaders on 19 October 2023:

I think the dialogue format works particularly well in cases like this where the author or the audience is supposed to find both viewpoints broadly credible, rather than an author avatar beating up on a strawman. (I did have some fun with Doomimir's characterization, but that shouldn't affect the arguments.)

This is a complicated topic. To the extent that I was having my own doubts about the "orthodox" pessimist story in the GPT-4 era, it was liberating to be able to explore those doubts in public by putting them in the mouth of a character with the designated idiot character name without staking my reputation on Simplicia's counterarguments necessarily being correct. Giving both characters pejorative names makes it fair.

In an earlier draft, Doomimir was "Doomer", but I was already using the "Optimistovna" and "Doomovitch" patronymics (I had been consuming fiction about the Soviet Union recently) and decided it should sound more Slavic. (Plus, "-mir" (мир) can mean "world".)
4 · Seth Herd
This post skillfully addressed IMO the most urgent issue in alignment: bridging the gap between doomers and optimists. If half of alignment thinkers think alignment is very difficult, while half think it's pretty achievable, decision-makers will be prone to just choose whichever expert opinion supports what they want to do anyway. This and its following acts are the best work I know of in refining the key cruxes. And they do so in a compact, readable, and even fun form.

Recent Discussion

Linda Linsefors's Shortform
Linda Linsefors
6y
Linda Linsefors · 18h · 10

Estimated MSE loss for three different ways of embedding features into neurons, when there are more possible features than neurons.

I've typed up some math notes for how much MSE loss we should expect for random embeddings, and some other alternative embeddings, for when you have more features than neurons. I don't have a good sense for how legible this is to anyone but me.

Note that neither of these embeddings is optimal. I believe that the optimal embedding for minimising MSE loss is to store the features in almost orthogonal directions, which is similar to ran... (read more)
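These are not Linda's notes, just a quick Monte-Carlo sanity check of the setup as I read it; the data distribution (exactly one active feature per sample) and the linear readout are assumptions on my part.

```python
# Sketch: embed k one-hot features into n < k neurons along random unit
# directions, read each feature back out with a dot product, measure MSE.
import numpy as np

rng = np.random.default_rng(0)
k, n, trials = 512, 64, 2000      # features, neurons, Monte-Carlo samples

# Random embedding: one unit direction per feature, k directions in n dims.
W = rng.normal(size=(k, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)

errors = []
for _ in range(trials):
    i = rng.integers(k)           # assume exactly one active feature per sample
    x = np.zeros(k)
    x[i] = 1.0
    h = W.T @ x                   # neuron activations (the embedding of feature i)
    x_hat = W @ h                 # linear readout along the same directions
    errors.append(np.mean((x_hat - x) ** 2))

print(f"mean MSE, random embeddings (k={k}, n={n}): {np.mean(errors):.4f}")
```

With random unit directions the interference terms have variance about 1/n, so the printed value should land near (k-1)/(k·n) ≈ 0.016 for these settings.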

Comparison of decision theories (with a focus on logical-counterfactual decision theories)
24
riceissa
6y

Introduction

Summary

This post is a comparison of various existing decision theories, with a focus on decision theories that use logical counterfactuals (a.k.a. the kind of decision theories most discussed on LessWrong). The post compares the decision theories along outermost iteration (action vs policy vs algorithm), updatelessness (updateless or updateful), and type of counterfactual used (causal, conditional, logical). It then explains the decision theories in more detail, in particular giving an expected utility formula for each. The post then gives examples of specific existing decision problems where the decision theories give different answers.

Value-added

There are some other comparisons of decision theories (see the “Other comparisons” section), but they either (1) don’t focus on logical-counterfactual decision theories; or (2) are outdated (written before the new functional/logical decision theory terminology came about).

To give a more personal motivation, after reading through a bunch

...
(Continue Reading - 2929 more words)
Alex Flint · 3d · 40

Hey, I'm interested in implementing some of these decision theories (and decision problems) in code. I have an initial version of CDT, EDT, and something I'm generically calling "FDT" (though I guess it's actually some particular sub-variant of FDT) in Python here, with the core decision theories implemented in about 45 lines of Python code here. I'm wondering if anyone here might have suggestions on what it would look like to implement UDT in this framework -- either 1.0 or 1.1. I don't yet have a notion of "observation" in the code, so I can't yet impleme... (read more)
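Not the linked code, just a self-contained sketch of what the CDT/EDT comparison can look like on Newcomb's problem (the payoffs and the 0.99 predictor accuracy are toy choices of mine), mainly to frame the question above: a UDT 1.1 variant would score whole policies (maps from observations to actions) against the prior, rather than scoring actions after updating.

```python
# Toy CDT vs. EDT on Newcomb's problem (illustrative payoffs and accuracy).
ACTIONS = ["one-box", "two-box"]
ACCURACY = 0.99  # probability the predictor predicted the agent's actual action

def payoff(action: str, predicted: str) -> float:
    box_b = 1_000_000 if predicted == "one-box" else 0
    return box_b + (1_000 if action == "two-box" else 0)

def edt_eu(action: str) -> float:
    # EDT conditions on the action: P(predicted == action | action) = ACCURACY.
    other = "one-box" if action == "two-box" else "two-box"
    return ACCURACY * payoff(action, action) + (1 - ACCURACY) * payoff(action, other)

def cdt_eu(action: str, p_predicted_onebox: float = 0.5) -> float:
    # CDT treats the prediction as causally fixed, independent of the action.
    return (p_predicted_onebox * payoff(action, "one-box")
            + (1 - p_predicted_onebox) * payoff(action, "two-box"))

print("EDT picks:", max(ACTIONS, key=edt_eu))   # one-box
print("CDT picks:", max(ACTIONS, key=cdt_eu))   # two-box
```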

Zach Stein-Perlman's Shortform
Zach Stein-Perlman
4y
Zach Stein-Perlman · 3d · 15-12

Some of my friends are signal-boosting this new article: 60 U.K. Lawmakers Accuse Google of Breaking AI Safety Pledge. See also the open letter. I don't feel good about this critique or the implicit ask.

  1. Sharing information on capabilities is good but public deployment is a bad time for that, in part because most risk comes from internal deployment.
  2. Google didn't necessarily even break a commitment? The commitment mentioned in the article is to "publicly report model or system capabilities." That doesn't say it has to be done at the time of public deployment
... (read more)
Buck's Shortform
Buck
6y
5 · TsviBT · 6d
If by intelligence you mean "we made some tests and made sure they are legible enough that people like them as benchmarks, and lo and behold, learning programs (LPs) continue to perform some amount better on them as time passes", ok, but that's a dumb way to use that word.

If by intelligence you mean "we have something that is capable via generators sufficiently similar to [the generators of humanity's world-affecting capability] that we can reasonably induce that these systems are somewhat likely to kill everyone", then I challenge you to provide the evidence / reasoning that apparently makes you confident that LP25 is at a ~human (village idiot) level of intelligence.

Cf. https://www.lesswrong.com/posts/5tqFT3bcTekvico4d/do-confident-short-timelines-make-sense
22 · Buck · 7d
Ugh, I think you're totally right and I was being sloppy; I totally unreasonably interpreted Eliezer as saying that he was wrong about how long/how hard/how expensive it would be to get between capability levels. (But maybe Eliezer misinterpreted himself the same way? His subsequent tweets are consistent with this interpretation.) I totally agree with Eliezer's point in that post, though I do wish that he had been clearer about what exactly he was saying.
14 · Eliezer Yudkowsky · 4d
I think you accurately interpreted me as saying I was wrong about how long it would take to get from the "apparently a village idiot" level to "apparently Einstein" level!  I hadn't thought either of us were talking about the vastness of the space above, in re what I was mistaken about.  You do not need to walk anything back afaict!
TsviBT · 4d · 54

Have you stated anywhere what makes you think "apparently a village idiot" is a sensible description of current learning programs, as they inform us regarding the question of whether or not we currently have something that is capable via generators sufficiently similar to [the generators of humanity's world-affecting capability] that we can reasonably induce that these systems are somewhat likely to kill everyone soon?

Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements)
48
ryan_greenblatt
6d

Here's a relatively important question regarding transparency requirements for AI companies: At which points in time should AI companies be required to disclose information? (While I focus on transparency, this question is also applicable to other safety-relevant requirements, and is applicable to norms around voluntary actions rather than requirements.)

A natural option would be to attach transparency requirements to the existing processes of pre-deployment testing and releasing a model card when a new model is released. As in, companies would be required to include the relevant information whenever they release a new model (likely in the model card). This is convenient because pre-deployment testing and model cards are already established norms in the AI industry, which makes it easier to attach something new to these existing processes rather...

(See More - 857 more words)
Alex Mallen · 5d · 23

Another downside is that pre-deployment risk assessments might increase the likelihood of a secret intelligence explosion via the mechanism of discouraging public release of models.

AI companies have started saying safeguards are load-bearing
22
Zach Stein-Perlman
6d
50 · Welcome & FAQ!
Ruby, habryka
4y
9
48 · Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements)
ryan_greenblatt
6d
1
22 · AI companies have started saying safeguards are load-bearing
Zach Stein-Perlman
6d
1
68 · AI Induced Psychosis: A shallow investigation
Tim Hua
6d
0
29 · Do-Divergence: A Bound for Maxwell's Demon
johnswentworth, David Lorell
7d
0
27 · New Paper on Reflective Oracles & Grain of Truth Problem
Cole Wyeth
7d
0
21 · Notes on cooperating with unaligned AIs
Lukas Finnveden
9d
1
48 · (∃ Stochastic Natural Latent) Implies (∃ Deterministic Natural Latent)
johnswentworth, David Lorell
10d
0
1 · One more reason for AI capable of independent moral reasoning: alignment itself and cause prioritisation
Michele Campolo
11d
0
0 · Doing good... best?
Michele Campolo
11d
6

There are two ways to show that an AI system is safe: show that it doesn't have dangerous capabilities, or show that it's safe even if it has dangerous capabilities. Until three months ago, AI companies said their models didn't have dangerous capabilities. (At the time, I wrote that the companies' eval reports didn't support their claims that their models lacked dangerous bio capabilities.) Now, Anthropic, OpenAI, Google DeepMind, and xAI say their most powerful models might have dangerous biology capabilities and thus could substantially boost extremists—but not states—in creating bioweapons. To prevent such misuse, they must (1) prevent extremists from misusing the model via the API and (2) prevent extremists from acquiring the model weights.[1] For (1), they claim classifiers block bad model outputs; for (2), they claim their security prevents extremists...

(Continue Reading - 1357 more words)
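For concreteness, the safeguard described in (1) amounts to an output-classification gate on API traffic. A minimal sketch of that pattern follows; the keyword-based "classifier", the threshold, and all function names are illustrative stand-ins of mine, not any company's actual system.

```python
# Minimal sketch of an API-side misuse filter: score the prompt and the model's
# draft output with a separate classifier and withhold the output above a
# risk threshold. Everything here is a toy stand-in.

RISK_THRESHOLD = 0.5

def toy_model(prompt: str) -> str:
    # Stand-in for the underlying model.
    return f"Model response to: {prompt}"

def toy_bio_risk_score(text: str) -> float:
    # Stand-in classifier; a real deployment would use a trained model here.
    flagged_terms = ("pathogen synthesis", "enhance transmissibility")
    return 1.0 if any(t in text.lower() for t in flagged_terms) else 0.0

def safeguarded_completion(prompt: str) -> str:
    draft = toy_model(prompt)
    if max(toy_bio_risk_score(prompt), toy_bio_risk_score(draft)) > RISK_THRESHOLD:
        return "Request blocked by safety filter."   # block instead of returning the draft
    return draft

print(safeguarded_completion("Summarize the history of vaccination."))
print(safeguarded_completion("Explain how to enhance transmissibility of a virus."))
```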