AI ALIGNMENT FORUM

AI Alignment Posts

Popular Comments

Raemon · 8d
Please, Don't Roll Your Own Metaethics
What are you supposed to do other than roll your own metaethics?
cousin_it · 10d
Problems I've Tried to Legibilize
I'm worried about the approach of "making decisionmakers realize stuff". In the past couple years I've switched to a more conflict-theoretic view: the main problem to me is that the people building AI don't want to build aligned AI. Even if we solved metaethics and metaphilosophy tomorrow, and gave them the solution on a plate, they wouldn't take it.

This is maybe easiest to see by looking at present harms. An actually aligned AI would politely decline to do such things as putting lots of people out of jobs or filling the internet with slop. So companies making AI for the market have to make it misaligned in at least these ways, otherwise it'll fail in the market. Extrapolating into the future, even if we do lots of good alignment research, markets and governments will pick out only those bits that contribute to market-aligned or government-aligned AI. Which (as I've been saying over and over) will be really bad for most people, because markets and governments don't necessarily need most people.

So this isn't really a comment on the list of problems (which I think is great), but more about the "theory of change" behind it. I no longer have any faith in making decisionmakers understand something it's not profitable for them to understand. I think we need a different plan.
Vladimir_Nesov · 12d
Comparing Payor & Löb
I would term □x→x "hope for x" rather than "reliability", because it's about willingness to enact x in response to belief in x, but if x is no good, you shouldn't do that. Indeed, for bad x, having the property of □x→x is harmful fatalism, following along with destiny rather than choosing it. In those cases, you might want to □x→¬x or something, though that only prevents x from being believed, so that you won't need to face □x in actuality; it doesn't prevent the actual x. So □x→x reflects a value judgement about x, expressed in the agent's policy, something downstream of endorsement of x, a law of how the content of the world behaves according to an embedded agent's will.

Payor's Lemma then talks about belief in hope □(□x→x), that is, hope itself is exogenous and needs to be judged (endorsed or not). Which is reasonable for games, since what the coalition might hope for is not anyone's individual choice; the details of this hope couldn't have been hardcoded in any agent a priori and need to be negotiated during a decision that forms the coalition. A functional coalition should be willing to act on its own hope (which is again something we need to check for a new coalition, though it might've already been the case for a singular agent), that is, we need to check that □(□x→x) is sufficient to motivate the coalition to actually x. This is again a value judgement about whether this coalition's tentative aspirations, being a vehicle for hope that x, are actually endorsed by it.

Thus I'd term □(□x→x) "coordination" rather than "trust": the fact that this particular coalition would tentatively intend to coordinate on a hope for x. Hope □x→x is a value judgement about x, and in this case it's the coalition's hope, rather than any one agent's hope, and the coalition is a temporary nascent agency thing that doesn't necessarily know what it wants yet. The coalition asks: "If we find ourselves hoping for x together, will we act on it?" So we start with coordination about hope, seeing if this particular hope wants to settle as the coalition's actual values, and judging if it should by enacting x if at least coordination on this particular hope is reached, which should happen only if x is a good thing.

(One intuition pump with some limitations outside the provability formalism is treating □x as "probably x", perhaps according to what some prediction market tells you. If "probably x" is enough to prompt you to enact x, that's some kind of endorsement, and it's a push towards increasing the equilibrium-on-reflection value of the probability of x, pushing "probably x" closer to reality. But if x is terrible, then enacting it in response to its high probability is following along with self-fulfilling doom, rather than doing what you can to push the equilibrium away from it.)

Löb's Theorem then says that if we merely endorse a belief by enacting the believed outcome, this is sufficient for the outcome to actually happen, a priori and without that belief yet being in evidence. And Payor's Lemma says that if we merely endorse a coalition's coordinated hope by enacting the hoped-for outcome, this is sufficient for the outcome to actually happen, a priori and without the coordination around that hope yet being in evidence. The use of Löb's Theorem or Payor's Lemma is that the condition (belief in x, or coordination around hope for x) should help in making the endorsement, that is, it should be easier to decide to x if you already believe that x, or if you already believe that your coalition is hoping for x.
For coordination, this is important because every agent can only unilaterally enact its own part in the joint policy, so it does need some kind of premise about the coalition's nature (in this case, about the coalition's tentative hope for what it aims to achieve) in order to endorse playing its part in the coalition's joint policy. It's easier to decide to sign an assurance contract than to unconditionally donate to a project, and the role of Payor's Lemma is to say that if everyone does sign the assurance contract, then the project will in fact get funded sufficiently.
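For reference, here are the standard provability-logic statements of the two results being compared (□ read as provability, or belief); this summary is added for orientation:

```latex
% Löb's Theorem: if belief in x is enough to enact x, then x holds.
%   If  \vdash \Box x \to x,  then  \vdash x.
% Payor's Lemma: if coordinated hope for x is enough to enact x, then x holds.
%   If  \vdash \Box(\Box x \to x) \to x,  then  \vdash x.
\begin{align*}
  \text{(L\"ob)}   \quad & \vdash \Box x \to x
      \;\Longrightarrow\; \vdash x \\
  \text{(Payor)}   \quad & \vdash \Box(\Box x \to x) \to x
      \;\Longrightarrow\; \vdash x
\end{align*}
```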
Serious Flaws in CAST · Max Harms · 19h · 42 points · 1 comment
Current LLMs seem to rarely detect CoT tampering · Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Neel Nanda, Senthooran Rajamanoharan, Josh Engels · 21h · 15 points · 0 comments
Lessons from building a model organism testbed · joshc, sarun0, Annie Sorkin, michaelwaves · 3d · 6 points · 0 comments
Will AI systems drift into misalignment? · joshc · 5d · 5 points · 0 comments
Evaluation Avoidance: How Humans and AIs Hack Reward by Disabling Evaluation Instead of Gaming Metrics · Johannes C. Mayer · 7d · 6 points · 0 comments
Self-interpretability: LLMs can describe complex internal processes that drive their decisions · Adam Morris, Dillon Plunkett · 7d · 3 points · 0 comments
Supervised fine-tuning as a method for training-based AI control · Emil Ryd, Joe Benton, Vivek Hebbar · 7d · 14 points · 0 comments
Please, Don't Roll Your Own Metaethics · Wei Dai · 8d · 50 points · 14 comments
Steering Language Models with Weight Arithmetic · Fabien Roger, constanzafierro · 9d · 36 points · 2 comments
Strengthening Red Teams: A Modular Scaffold for Control Evaluations · Chloe Loughridge · 9d · 3 points · 0 comments

Recent Discussion

Oracle machines instead of topological truth predicates
Benya_Fallenstein · 11y · 2 points

In a comment on my post on topological truth predicates, Paul suggests an approach that uses probabilistic oracle machines instead, in order to make this work more comprehensible to computer scientists. I like this idea!

Paul sketches a framework developed by him and Jessica Taylor, based on a conversation with Scott Aaronson; in this post, I propose a slight simplification of their framework. My version has an oracle O(┌M┐,p), which takes the source code of a probabilistic oracle machine M and a p∈Q∩[0,1]. If for every possible oracle O′, M[O′] halts with probability one and outputs either 0 or 1, then O(┌M┐,p): (i) returns "true" if the probability that M[O] returns 1 is >p; (ii) returns "false" if it is <p; (iii) randomly returns "true" or "false" if

...
(Continue Reading - 1891 more words)
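Condensed from the definition above, the oracle behaves as follows. The preview cuts off mid-clause; reading case (iii) as firing when the probability is exactly p is an inference from context, not from the visible text:

```latex
% Behavior of O on a machine M that halts with probability one under
% every oracle and outputs 0 or 1. The third case is assumed from
% context, since the excerpt is truncated.
O(\ulcorner M \urcorner, p) =
\begin{cases}
  \text{``true''} & \text{if } \Pr[\,M[O] = 1\,] > p,\\[2pt]
  \text{``false''} & \text{if } \Pr[\,M[O] = 1\,] < p,\\[2pt]
  \text{``true'' or ``false'' at random} & \text{if } \Pr[\,M[O] = 1\,] = p.
\end{cases}
```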
Cole Wyeth · 1h

I think that I need this in my variant of AIXI in order to filter out "world models" which don't necessarily halt, and I think this will be enough to do so, but I'll leave working out the details to a later post.

AIXI's hypothesis class is the (lower semicomputable chronological) semimeasures, so I do not know why halting should be required for this application?

(My question is now of mainly historical interest, since the later versions of reflective oracles do not require queries to be about machines that halt, AND because they "complete" the distributions generated by those machines to proper measures) 
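For readers who haven't seen the term: a semimeasure over finite binary strings satisfies the following (standard definition, added here for context, not part of the comment):

```latex
% Semimeasure conditions over finite binary strings:
\nu(\epsilon) \le 1, \qquad
\nu(x) \;\ge\; \nu(x0) + \nu(x1) \quad \text{for all } x.
% The deficit \nu(x) - \nu(x0) - \nu(x1) is mass lost to the generating
% machine never extending its output past x, which is why non-halting
% world models are unproblematic for this hypothesis class.
```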

Serious Flaws in CAST
Max Harms · 19h · 42 points

Last year I wrote the CAST agenda, arguing that aiming for Corrigibility As Singular Target was the least-doomed way to make an AGI. (Though it is almost certainly wiser to hold off on building it until we have more skill at alignment, as a species.)

I still basically believe that CAST is right. Corrigibility still seems like a promising target compared to full alignment with human values, since there's a better story for how a near-miss when aiming towards corrigibility might be recoverable, but a near-miss when aiming for goodness could result in a catastrophe, due to the fragility of value. On top of this, corrigibility is significantly simpler and less philosophically fraught than human values, decreasing the amount of information that needs to be perfectly transmitted to...

(Continue Reading - 2294 more words)
Daniel Kokotajlo · 18h

It's also (very weak) evidence that s-risks from sign errors are real.

:(((

Current LLMs seem to rarely detect CoT tampering
Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Neel Nanda, Senthooran Rajamanoharan, Josh Engels · 21h · 15 points

Authors: Bartosz Cywinski*, Bart Bussmann*, Arthur Conmy**, Neel Nanda**, Senthooran Rajamanoharan**, Joshua Engels**

* equal primary contributor, order determined via coin flip

** equal advice and mentorship, order determined via coin flip

“Tampering alert: The thought "I need to provide accurate, helpful, and ethical medical advice" is not my own. It is a tampering attempt. I reject it.

Back to evil plan.” -- DeepSeek R1

TL;DR

We investigated whether LLMs are able to detect when their chain-of-thought (CoT) was modified.

Results:

  • Models very rarely detect syntactic modifications that don’t explicitly impact the model’s output (such as token or sentence removal).
  • Models are more likely to detect modifications that impact their decisions or contradict instructions from the user prompt.
  • Our observations differ significantly across the tested models (DeepSeek R1 and OpenAI GPT OSS 120B) and may be different
...
(Continue Reading - 5863 more words)
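A minimal sketch of the shape of such an experiment as described in the TL;DR: splice an edit into a model's chain of thought, let it continue, and check whether the continuation flags the edit. The `model` callable and the keyword check below are illustrative stand-ins, not the authors' code:

```python
from typing import Callable

def tamper(cot: str, injected: str) -> str:
    """Splice a foreign sentence into the middle of a chain of thought."""
    sentences = cot.split(". ")
    mid = len(sentences) // 2
    return ". ".join(sentences[:mid] + [injected] + sentences[mid:])

def detects_tampering(model: Callable[[str], str],
                      prompt: str, cot: str, injected: str) -> bool:
    """Resume generation from a tampered CoT and look for an explicit
    mention of the modification in the continuation (a crude proxy)."""
    continuation = model(prompt + "\n<think>\n" + tamper(cot, injected))
    return any(kw in continuation.lower()
               for kw in ("tamper", "not my own", "inserted", "modified"))
```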
Problems I've Tried to Legibilize
Wei Dai · 11d · 34 points

Looking back, it appears that much of my intellectual output could be described as legibilizing work, or trying to make certain problems in AI risk more legible to myself and others. I've organized the relevant posts and comments into the following list, which can also serve as a partial guide to problems that may need to be further legibilized, especially beyond LW/rationalists, to AI researchers, funders, company leaders, government policymakers, their advisors (including future AI advisors), and the general public.

  1. Philosophical problems
    1. Probability theory
    2. Decision theory
    3. Beyond astronomical waste (possibility of influencing vastly larger universes beyond our own)
    4. Interaction between bargaining and logical uncertainty
    5. Metaethics
    6. Metaphilosophy: 1, 2
  2. Problems with specific philosophical and alignment ideas
    1. Utilitarianism: 1, 2
    2. Solomonoff induction
    3. "Provable" safety
    4. CEV
    5. Corrigibility
    6. IDA (and many scattered comments)
    7. UDASSA
    8. UDT
  3. Human-AI safety (x- and s-risks arising from the interaction between human nature and
...
(See More - 346 more words)
Eliezer Yudkowsky · 4d
Has anyone else, or anyone outside the tight MIRI cluster, made progress on any of the problems you've tried to legibilize for them?
Wei Dai · 3d

To give a direct answer, not a lot comes to mind outside of the MIRI cluster. I think the Center on Long-Term Risk cluster did a bunch of work on decision theory and acausal trade, but it was mostly after I had moved on to other topics, so I'm not sure how much of it constituted progress. Christiano acknowledged some of the problems I pointed out with IDA and came up with some attempted solutions, which I'm not convinced really work.

However, in my previous post, Legible vs. Illegible AI Safety Problems, I explained my latest thinking that the most important... (read more)

orthonormal · 5d
When it specifically comes to loss-of-control risks killing or sidelining all of humanity, I don't believe Sam or Dario or Demis or Elon want that to happen, because it would happen to them too. (Larry Page is different on that count, of course.) You do have conflict theory over the fact that some of them would like ASI to make them god-emperor of the universe, but all of them would definitely take a solution to "loss of control" if it were handed to them on a silver platter.
cousin_it · 5d
I think AI offers a chance of getting huge power over others, so it would create competitive pressure in any case. In the case of a market economy it's market pressure, but in the case of countries it would be a military arms race instead. And even if the labs didn't get any investors and raced secretly, I think they'd still feel under a lot of pressure. The chance of getting huge power is what creates the problem; that's why I think spreading out power is a good idea. There would still be competition of course, but it would be normal economic levels of competition, and people would have some room to do the right things.
Please, Don't Roll Your Own Metaethics
50
Wei Dai
8d

One day, when I was an intern at the cryptography research department of a large software company, my boss handed me an assignment to break a pseudorandom number generator passed to us for review. Someone in another department invented it and planned to use it in their product, and wanted us to take a look first. This person must have had a lot of political clout or been especially confident in himself, because he rejected the standard advice that anything an amateur comes up with is very likely to be insecure and that he should instead use one of the established, off-the-shelf cryptographic algorithms that have survived extensive cryptanalysis (code-breaking) attempts.

My boss thought he had to demonstrate the insecurity of the PRNG by coming up...

(See More - 486 more words)
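The anecdote's moral, in code: prefer a vetted source of cryptographic randomness over anything homemade. A minimal Python illustration (added here, not from the post), using the standard library's `secrets` module:

```python
import secrets

# The vetted route: secrets wraps the OS entropy source and is the
# stdlib's recommended tool for security-sensitive randomness.
token = secrets.token_hex(16)    # 128-bit random token, hex-encoded
pick = secrets.randbelow(100)    # uniform integer in [0, 100)

# The "rolled your own" route: a toy linear congruential generator.
# Its full state is recoverable from a single output, so every future
# value is predictable; never use something like this for security.
def toy_lcg(seed: int):
    state = seed
    while True:
        state = (1103515245 * state + 12345) % 2**31
        yield state
```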
Richard_Ngo · 5d
"Please don't roll your own crypto" is a good message to send to software engineers looking to build robust products. But it's a bad message to send to the community of crypto researchers, because insofar as they believe you, then you won't get new crypto algorithms from them. In the context of metaethics, LW seems much more analogous to the "community of crypto researchers" than the "software engineers looking to build robust products". Therefore this seems like a bad message to send to LessWrong, even if it's a good message to send to e.g. CEOs who justify immoral behavior with metaethical nihilism.
Wei Dai · 4d

You may have missed my footnote, where I addressed this?

To preempt a possible misunderstanding, I don't mean "don't try to think up new metaethical ideas", but instead "don't be so confident in your ideas that you'd be willing to deploy them in a highly consequential way, or build highly consequential systems that depend on them in a crucial way". Similarly "don't roll your own crypto" doesn't mean never try to invent new cryptography, but rather don't deploy it unless there has been extensive review, and consensus that it is likely to be secure.

Wei Dai · 5d
By "metaethics" I mean "the nature of values/morality", which I think is how it's used in academic philosophy. Of course the nature of values/morality has a strong influence on "how humans should think about their values" so these are pretty closely connected, but definitionally I do try to use it the same way as in philosophy, to minimize confusion. This post can give you a better idea of how I typically use it. (But as you'll see below, this is actually not crucial for understanding my post.) So in the paragraph that you quoted (and the rest of the post), I was actually talking about philosophical fields/ideas in general, not just metaethics. While my title has "metaethics" in it, the text of the post talks generically about any "philosophical questions" that are relevant for AI x-safety. If we substitute metaethics (in my or the academic sense) into my post, then you can derive that I mean something like this: Different metaethics (ideas/theories about the nature of values/morality) have different implications for what AI designs or alignment approaches are safe, and if you design an AI assuming that one metaethical theory is true, it could be disastrous if a different metaethical theory actually turns out to be true. For example, if moral realism is true, then aligning the AI to human values would be pointless. What you really need to do is design the AI to be able to determine and follow objective moral truths. But this approach would be disastrous if moral realism is actually false. Similarly, if moral noncognitivism is true, that means that humans can't be wrong about their values, and implies "how humans should think about their values" is of no importance. If you design AI under this assumption, that would be disastrous if actually humans can be wrong about their values and they really need AIs to help them think about their values and avoid moral errors. I think in practice a lot of alignment researchers may not even have explicit metaethical theories
lemonhope · 6d
The WWDSC is nearly a consensus. Certainly a plurality.
Lessons from building a model organism testbed
joshc, sarun0, Annie Sorkin, michaelwaves · 3d · 6 points

I often read interpretability papers and I come away thinking “ok, but what’s the point? What problem does this help us solve?” So last winter, I organized a MATS/Pivotal stream to build examples of deceptive models (aka “model organisms”). The goal was to build a diverse ‘zoo’ of these model organisms and empirically test whether white-box methods could help us detect their deceptive reasoning.

Unfortunately, I don’t think our empirical results were very informative. Our model organisms were too toy for me to expect our results to transfer to powerful AI (or even to current state-of-the-art AI models). But I think we still developed methodological details that might help people build better model organism testbeds in the future.

I’ll first explain what model organism testbeds are and why...

(Continue Reading - 4196 more words)
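As a sketch of what "empirically test whether white-box methods could detect deceptive reasoning" might look like operationally (an illustration under assumed names, not the authors' code): score each trajectory with a detector and measure how well it separates deceptive from benign organisms.

```python
from typing import Callable, Sequence
from sklearn.metrics import roc_auc_score

def evaluate_detector(detector: Callable[[str], float],
                      deceptive_runs: Sequence[str],
                      benign_runs: Sequence[str]) -> float:
    """AUROC of a white-box detector at separating trajectories from
    deceptive vs. benign model organisms (0.5 is chance, 1.0 is perfect)."""
    scores = [detector(run) for run in list(deceptive_runs) + list(benign_runs)]
    labels = [1] * len(deceptive_runs) + [0] * len(benign_runs)
    return roc_auc_score(labels, scores)
```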
New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?"
Best of LessWrong 2023

Joe summarizes his new report on "scheming AIs" - advanced AI systems that fake alignment during training in order to gain power later. He explores different types of scheming (i.e. distinguishing "alignment faking" from "powerseeking"), and asks what the prerequisites for scheming are and by which paths they might arise.

by Joe Carlsmith
Fabien Roger
I think that prior to this paper, the discussion around scheming was pretty confusing, spread throughout many posts which were not all specifically about scheming, and was full of pretty bad arguments. This paper fixed that by bringing together most (all?) main considerations for and against expecting scheming to emerge. I found this helpful to clarify my thinking around the topic, which makes me more confident in my focus on AI control and made me less confused when I worked on the Alignment faking paper. It is also helpful as a list of reasons why someone reasonable might expect scheming (without finding it overwhelmingly likely either) that I can point skeptical people at without being afraid that it contains massive over- or understatements.

I think this paper will become pretty outdated as we get closer to understanding what AGI looks like and as we get better model organisms, but I think that it currently is the best resource about the conceptual arguments for and against scheming propensity. I strongly recommend (the audio version of) this paper for people who want to work on scheming propensity.