All of Raemon's Comments + Replies

I think I understood your article, and was describing which points/implications seemed important. 

I think we probably agree on predictions for near-term models (i.e. that including this training data makes it more likely for them to deceive); I just don't think it matters very much if sub-human-intelligence AIs deceive. 

I do periodically think about this and feel kind of exhausted at the prospect, but it does seem pretty plausibly correct. Good to have a writeup of it.

It particularly seems likely to be the right mindset if you think survival right now depends on getting some kind of longish pause (at least on the sort of research that'd lead to RSI+takeoff)

My current guess is:

1. This is more relevant for up to the first couple generations of "just barely superintelligent" AIs.

2. I don't really expect it to be the deciding factor after many iterations of end-to-end RSI that get you to the "able to generate novel scientific or engineering insights much faster than a human or institution could" stage. 

I do think it's plausible that the initial bias towards "evil/hackery AI" could start it off in a bad basin of attraction, but a) even if you completely avoided that, I would still basically expect this to redisco... (read more)

0David James
I agree. To put it another way, even if all training data was scrubbed of all flavors of deception, how could ignorance of it be durable?

That's not really what I had in mind, but I had in mind something less clear than I thought. The spirit is about "can the AI come up with novel concepts".

I think one reason I think the current paradigm is "general enough, in principle", is that I don't think "novel concepts" is really The Thing. I think creativity / intelligence is mostly about combining concepts, it's just that really smart people are 

a) faster in raw horsepower and can handle more complexity at a time

b) have a better set of building blocks to combine or apply to make new concepts... (read more)

really smart people

Differences between people are less directly revelatory of what's important in human intelligence. My guess is that all or very nearly all human children have all or nearly all the intelligence juice. We just, like, don't appreciate how much a child is doing in constructing zer world.

the current models have basically all the tools a moderately smart human has, with regards to generating novel ideas

Why on Earth do you think this? (I feel like I'm in an Asch Conformity test, but with really really high production value. Like, after... (read more)

(my guess is you took more like 15-25 minutes per question? Hard to tell from my notes, you may have finished early but I don't recall it being crazy early)

3johnswentworth
I remember finishing early, and then spending a lot of time going back over all of them a second time, because the goal of the workshop was to answer correctly with very high confidence. I don't think I updated any answers as a result of the second pass, though I don't remember very well.

(This seems like more time than Buck was taking – the goal was to not get any wrong so it wasn't like people were trying to crank through them in 7 minutes)

The problems I gave were (as listed in the csv for the diamond problems) 

  • #1 (Physics) (1 person got right, 3 got wrong, 1 didn't answer)
  • #2 (Organic Chemistry) (John got right, I think 3 people didn't finish)
  • #4 (Electromagnetism) (John and one other got right, 2 got wrong)
  • #8 (Genetics) (3 got right including John)
  • #10 (Astrophysics) (5 people got right)

I at least attempted to be filtering the problems I gave you for GPQA diamond, although I am not very confident that I succeeded. 

(Update: yes, the problems John did were GPQA diamond. I gave 5 problems to a group of 8 people, and gave them two hours to complete however many they thought they could complete without getting any wrong)

3johnswentworth
@Buck Apparently the five problems I tried were GPQA diamond, they did not take anywhere near 30 minutes on average (more like 10 IIRC?), and I got 4/5 correct. So no, I do not think that modern LLMs probably outperform (me with internet access and 30 minutes).

Note: I plan to extend the Nomination phase through ~Monday, I didn't mean for it to end partway through the weekend.

I haven't had much success articulating why.

I'd be interested in a more in-depth review where you take another pass at this.

A thing unclear to me: is it worth hiding the authors from the Voting page?

On the first LessWrong Review, we deliberately hid authors and randomized the order of the voting results. A few years later, we've mostly shifted towards "help people efficiently sort through the information" rather than "making sure the presentation is random/fair." It's not like people don't know who the posts are by once they start reading them.

Curious what people think.

I would find this post easier to remember and link to if it were called "Serial vs Parallel Research Time", or something like that which points more at the particular insight the post provides.

Yeah the LW team has been doing this sort of thing internally, still in the experimental phase. I don't know if we've used all the tricks listed here yet. 

I think two major cruxes for me here are:

  • is it actually tractable to affect Deepmind's culture and organizational decisionmaking
  • how close to the threshold is Anthropic for having a good enough safety culture?

My current best guess is that Anthropic is still under the threshold for good enough safety culture (despite seeming better than I expected in a number of ways), and meanwhile that Deepmind is just too intractably far gone. 

I think people should be hesitant to work at any scaling lab, but, I think Anthropic might be possible to make "the one actua... (read more)

Curated. It seems like in the current regime of frontier models, it's worth making more explicit models of what architectures we can expect, if dangerous capabilities develop in the near future. 

This post feels like it spells out a model that matches my general understanding of the state-of-the-art, but draws several inferences about it I hadn't previously thought about. 

I'd be interested in other people who've thought about current generation deployment setups chiming in with their takes, if they disagree or think there's important elements missing.

Some additional thoughts:

A thing that is different from a forecasting bot is an "operationalization" bot, which scans threads with disagreements for plausible operationalizations of the disagreement.

I think good discourse is partly bottlenecked on "it's pretty expensive and annoying to operationalize many disagreements, in ways that actually help get to the heart of the matter." 

I'd be pretty interested in experimenting with operationalization-bot and forecasting-bot combos (once they get sufficiently good to not be annoying), that scan disagreements,... (read more)

How much does it cost to run one instance of this right now?

One obvious idea is to build integrations for Manifold, Fatebook, Metaculus etc where you automatically have this bot give an answer. I'm curious how much that costs when you're just doing it all over the place.

(I don't know that I believe the current generation of bots are as good as implied here for reasons Habryka/Gwern are saying, but, regardless, I expect at least pretty decent bots of this form to be available within a year if they aren't already, and for it to be useful for various sites to have SOTA forecasting bots plugged into them if they're at least nontrivially better than chance)
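
For a rough sense of scale, here's the kind of back-of-envelope arithmetic I have in mind; every number below is a placeholder assumption rather than real usage volume or API pricing:

```python
# Back-of-envelope cost estimate for running a forecasting bot across a site's
# question feed. Every number here is a placeholder assumption, not real pricing.

questions_per_day = 200           # assumed new questions across Manifold/Fatebook/Metaculus
calls_per_question = 5            # assumed research + forecasting calls per question
tokens_per_call = 8_000           # assumed prompt + completion tokens per call
dollars_per_million_tokens = 5.0  # assumed blended model price

daily_tokens = questions_per_day * calls_per_question * tokens_per_call
daily_cost = daily_tokens / 1_000_000 * dollars_per_million_tokens

print(f"~{daily_tokens:,} tokens/day -> ~${daily_cost:.2f}/day, ~${30 * daily_cost:.0f}/month")
```

Even an order of magnitude in either direction changes whether this is trivial or a real line item, which is why the actual per-run cost matters.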

3Raymond Arnold
Some additional thoughts: A thing that is different from a forecasting bot is an "operationalization" bot, which scans threads with disagreements for plausible operationalizations of the disagreement. I think good discourse is partly bottlenecked on "it's pretty expensive and annoying to operationalize many disagreements, in ways that actually help get to the heart of the matter." I'd be pretty interested in experimenting with operationalization-bot and forecasting-bot combos (once they get sufficiently good to not be annoying), that scan disagreements, suggest little cloud-bubbles of "is this a cruxy operationalization you could bet on?", which maybe disappear if nobody clicks "yep". Then, people are encouraged to put their own probabilities on it, and then, afterwards (to avoid anchoring), a forecasting bot gives its own guess.
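
To gesture at the shape of the thing, here's a minimal sketch; the `llm` helper, prompt wording, and output format are all placeholders rather than any real system or API:

```python
# Sketch of an "operationalization bot" pass over a comment thread.
# `llm(prompt) -> str` is a stand-in for a real model call; the prompt, output
# format, and UI flow are illustrative assumptions, not a working product.
from dataclasses import dataclass

@dataclass
class Suggestion:
    thread_id: str
    question: str   # proposed bet-able operationalization
    rationale: str  # why the bot thinks this gets at the crux

def suggest_operationalizations(thread_id, comments, llm):
    prompt = (
        "Below is a comment thread containing a disagreement. Propose up to 3 "
        "concrete, resolvable questions (with resolution criteria and dates) that "
        "the participants could bet on, such that the answer bears on their crux.\n\n"
        + "\n---\n".join(comments)
        + "\n\nFormat: one per line, as 'QUESTION | RATIONALE'."
    )
    suggestions = []
    for line in llm(prompt).splitlines():
        if "|" in line:
            question, rationale = line.split("|", 1)
            suggestions.append(Suggestion(thread_id, question.strip(), rationale.strip()))
    return suggestions

# Intended flow: each Suggestion shows up as a dismissable "cloud bubble"; only if
# someone clicks "yep, that's cruxy" do you collect participants' probabilities, and
# only after that does the forecasting bot reveal its own estimate (to avoid anchoring).
```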

Curated. "What would actually be persuasive that scheming is a problem?" is one of the important practical questions for AI governance. I appreciate Buck noting places he changed his mind somewhat during the SB 1047 discourse.

I appreciate that the post has detailed worked examples.

I haven't read the Shard Theory work in comprehensive detail. But, fwiw I've read at least a fair amount of your arguments here and not seen anything that bridged the gap between "motivations are made of shards that are contextually activated" and "we don't need to worry about Goodhart and misgeneralization of human values at extreme levels of optimization." 

I've heard you make this basic argument several times, and my sense is you're pretty frustrated that people still don't seem to have "heard" it properly, or something. I currently feel like I have... (read more)

(If we had a robust check for misalignment, we could iterate or train against it.)

This seems technically true but I wanna flag the argument "it seems really hard to be confident that you have robust enough checks that training against them is good, instead of bad (because it trains the AI to hide better)".

2Ryan Greenblatt
Agreed, I really should have said "or possibly even train against it". I think SGD is likely to be much worse than best-of-N over a bunch of variations on the training scheme where the variations are intended to plausibly reduce the chance of scheming. Of course, if you are worried about scheming emerging throughout training, then you need N full training runs, which is very pricy!

From the perspective of risk management, compared to normal computer security problems, AI deployments involve vastly fewer moving parts that you have vastly less understanding of.

I don't get why this is "vastly fewer moving parts you have vastly less understanding of" as opposed to "vastly more (or, about the same?) number of moving parts that you have vastly less understanding of."

I'd naively model each parameter in a model as a "part". I agree that, unlike most complex engineering or computer security, we don't understand what each part does. But seems w... (read more)

I mostly just mean that when you're modeling the problem, it doesn't really help to think about the AIs as being made of parts, so you end up reasoning mostly at the level of abstraction of "model" rather than "parameter", and at that level of abstraction, there aren't that many pieces.

Like, in some sense all practical problems are about particular consequences of laws of physics. But the extent to which you end up making use of that type of reductionist model varies greatly by domain.

Curated.

The overall point here seems true and important to me.

I think I either disagree, or am agnostic about, some of the specific examples given in the Myth vs Reality section. I don't think they're load-bearing for the overall point. I may try to write those up in more detail later.

Ah yeah sorry, I didn't mean to convey that. For now I've (clumsily) edited the original comment to be more clear.

Curated. I found this a helpful way of carving up the AI safety space. 

I agree with Ryan Greenblatt's clarification in comments that no, this doesn't mean we're completely safe if we can rule out Rogue Deployments, but, it still seems like a useful model for reasoning about what kinds of failures are more or less likely.

[edit: oh, to clarify, I don't think Buck meant to imply that either in the original post, which goes out of its way to talk about catastrophes without rogue deployments. It just seemed like a confusion I expected some people to have]

3Buck Shlegeris
To be clear, I agree that ruling out rogue deployments doesn’t suffice to prevent catastrophes, as I discuss in the section “catastrophes without rogue deployments”

It implies that AI control is organizationally simpler, because most applications can be made trivially controlled.

I didn't get this from the premises fwiw. Are you saying it's trivial because "just don't use your AI to help you design AI" (seems organizationally hard to me), or did you have particular tricks in mind?

4Ryan Greenblatt
The claim is that most applications aren't internal usage of AI for AI development and thus can be made trivially safe. Not that most applications of AI for AI development can be made trivially safe.

Fwiw this doesn't feel like a super helpful comment to me. I think there might be a nearby one that's more useful, but this felt kinda coy for the sake of being coy.

Since this post was written, I feel like there's been a zeitgeist of "Distillation Projects." I don't know how causal this post was; I think in some sense the ecosystem was ripe for a Distillation Wave. But it seemed useful to think about how that wave played out.

Some of the results have been great. But many of the results have felt kinda meh to me, and I now have a bit of a flinch/ugh reaction when I see a post with "distillation" in its title. 

Basically, good distillations are a highly skilled effort. It's sort of natural to write a distillation of... (read more)

Curated. 

It's still unclear to me how well interpretability can scale and solve the core problems in superintelligence alignment, but this felt like a good/healthy incremental advance. I appreciated the exploration of feature splitting, beginnings of testing for universality, and discussion of the team's update against architectural approaches. I found this remark at the end interesting:

Finally, we note that in some of these expanded theories of superposition, finding the "correct number of features" may not be well-posed. In others, there is a true n

... (read more)

Curated, both for the OP (which nicely lays out some open problems and provides some good links towards existing discussion) as well as the resulting discussion which has had a number of longtime contributors to LessWrong-descended decision theory weighing in.

Curated. I liked both the concrete array of ideas coming from someone who has a fair amount of context, and the sort of background models I got from reading each of said ideas.

Curated.

I feel somewhat skeptical about model organisms providing particularly strong evidence of how things will play out in the wild (at least at their earlier stages). But a) the latter stages do seem like reasonable evidence, and it still seems like a pretty good process to start with the earlier stages, b) I overall feel pretty excited about the question "how can we refactor the alignment problem into a format we can Do Science To?", and this approach seems promising to me.

What background knowledge do you think this requires? If I know a bit about how ML and language models work in general, should I be able to reason this out from first principles (or from following a fairly obvious trail of "look up relevant terms and quickly get up to speed on the domain?"). Or does it require some amount of pre-existing ML taste?

Also, do you have a rough sense of how long it took for MATS scholars?

3Neel Nanda
Great questions, thanks! Background: You don't need to know anything beyond "a language model is a stack of matrix multiplications and non-linearities. The input is a series of tokens (words and sub-words) which get converted to vectors by a massive lookup table called the embedding (the vectors are called token embeddings). These vectors have really high cosine sim in GPT-Neo". Re how long it took for scholars, hmm, maybe an hour? Not sure, I expect it varied a ton. I gave this in their first or second week, I think.
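
If you want to poke at the setup directly, here's a minimal sketch of checking that cosine-similarity claim; the specific checkpoint and sample size are my own arbitrary choices, not part of the original exercise:

```python
# Minimal check of the claim that GPT-Neo token embeddings have unusually high
# pairwise cosine similarity. Checkpoint and sample size are arbitrary choices.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("EleutherAI/gpt-neo-125M")
emb = model.get_input_embeddings().weight.detach()  # shape: [vocab_size, d_model]

# Sample a subset of token embeddings to keep the pairwise computation cheap.
idx = torch.randperm(emb.shape[0])[:2000]
sample = torch.nn.functional.normalize(emb[idx], dim=-1)

# Cosine similarity between all sampled pairs, excluding the diagonal.
sims = sample @ sample.T
off_diag = sims[~torch.eye(len(idx), dtype=torch.bool)]
print(f"mean pairwise cosine sim: {off_diag.mean().item():.3f}")
```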

Curated. 

I like that this went out and did some 'field work', and is clear about the process, so you can evaluate how compelling you find it. I found the concept of a conflationary alliance pretty helpful. 

That said, I don't think the second half of the article argues especially well for a "consciousness conflationary alliance" existing. I did immediately think "oh this seems like a fairly likely thing to exist as soon as it's pointed out" (in particular given some recent discussion on why consciousness is difficult to talk about), but I think if i... (read more)

Curated. This seems like an important set of considerations for alignment researchers to think about.

meta note on tagging:

This post seemed to be on a topic that... surely there should be commonly used LW concept for, but I couldn't think of it. I tagged it "agent foundations" but feel like there should be something more specific.

2DanielFilan
Maybe "subagents"?

ironically I missed this post when you first posted it

I previously had had a cruder model of "There's an AI capabilities fuel tank and an AI alignment fuel tank. Many research items fill up both at the same time, but in different ratios. If you fill up the capabilities tank before the alignment tank, we lose. You want to pursue strategies that cause the alignment tank to get filled up faster than the capabilities tank." (I got this from Andrew Critch during in-person conversation)

I like this post for putting forth a higher resolution model, that prompts me to think a bit more specifically about what downstream effects I expect to happen. (though I think the tank model might still be kinda useful as a fast shorthand sometimes)

Curated. I think this post proposes an interesting mechanism of understanding and controlling LLMs. I have a lot of uncertainty about how useful this will turn out to be, but the idea seems both interesting and promising and I'd like to see more work exploring the area.

I didn't downvote but didn't upvote and generally wish I had an actual argument to link to when discussing this concept.

I'm also not able to evaluate the object-level of "was this post missing obvious stuff it'd have been good to improve", but, something I want to note about my own guess of how an ideal process would go from my current perspective:

I think it makes more sense to think of posting on LessWrong as "submitting to a journal", than "publishing a finished paper." So, the part where some people then comment "hey, this is missing X" is more analogous to the thing where you submit to peer review and they say "hey, you missed X", then publishing a finished paper in a j... (read more)

I'll just note that I, like Dan H, find it pretty hard to engage with this post because I can't tell whether it's basically the same as the Ludwig Schmidt paper (my current assumption is that it is). The paragraph the authors added didn't really help in this regard.

I'm not sure what you mean about whether the post was "missing something important", but I do think that you should be pretty worried about LessWrong's collective epistemics that Dan H is the only one bringing this important point up, and that rather than being rewarded for doing so or engaged w... (read more)

note: I tagged this "Infrabayesianism" but wasn't actually sure whether it was or not according to you.

Curated. On one hand, folks sure have spent a long time trying to hash out longstanding disagreements, and I think it's kinda reasonable to not feel like that's a super valuable thing to do more of.

On the other hand... man, sure seems scary to me that we still have so many major disagreements that we haven't been able to resolve.

I think this post does a particularly exemplary job of exploring some subtle disagreements from a procedural level: I like that Holden makes a pretty significant attempt to pass Nate's Ideological Turing Test, flags which parts of ... (read more)

2Ben Pace
This sentence was confusing to me given that the post does not mention 'double crux', but I mentioned it to someone and they said to think of it as the mental motion and not the explicit format, and that makes more sense to me.

However, if your post doesn't look like a research article, you might have to format it more like one (and even then it's not guaranteed to get in, see this comment thread).

I interpreted this as saying something superficial about style, rather than "if your post does not represent 100+ hours of research work it's probably not a good fit for arXiv." If that's what you meant I think the post could be edited to make that more clear.

If the opening section of your essay made it more clear which posts it was talking about I'd probably endorse it (although I'm not super familiar with the nuances of arXiv gatekeeping so am mostly going off the collective response in the comment section)

Yeah, I didn't mean to be responding to that point one way or another. It just seemed bad to be linking to a post that (seems to still?) communicate false things, without flagging those false things. (The post still says "it can be as easy as creating a pdf of your post", which, my impression is, may be technically true on rare occasions but is basically false in practice.)

1JanB
This feels like a really adversarial quote. Concretely, the post says: This looks correct to me; there are LW posts that already basically look like papers. And within the class of LW posts that should be on arXiv at all, which is the target audience of my post, posts that basically look like papers aren't vanishingly rare.
1David Manheim
That seems right.

I thought the response to "Your Posts Should be On Arxiv" was "Arxiv mods have stated pretty explicitly they do not want your posts on Arxiv" (unless you have jumped through a bunch of both effort-hoops and formatting hoops to make them feel like a natural member of the Arxiv-paper class)

1JanB
I wrote this post. I don't understand where your claim ("Arxiv mods have stated pretty explicitly they do not want your posts on Arxiv") is coming from.
1Arthur Conmy
I think this point was really overstated. I get the impression the rejected papers were basically turned into the arXiv format as fast as possible and so it was easy for the mods to tell this. However, I've seen submissions to cs.LG like this and this that are clearly from the alignment community. These posts are also not stellar by standards of preprint formatting, and were not rejected, apparently
3David Manheim
And I think the post here is saying that you should jump through those effort and editing hoops far more often than currently occurs.

Yeah I agree with this.

To be clear, I think Anthropic has done a pretty admirable job of showing some restraint here. It is objectively quite impressive. My wariness is "Man, I think the task here is really hard and even a very admirably executed company may not be sufficient." 

Yeah something in this space seems like a central crux to me.

I personally think (as a person generally in the MIRI-ish camp of "most attempts at empirical work are flawed/confused") that it's not crazy to look at the situation and say "okay, but, theoretical progress seems even more flawed/confused, we just need to figure out some way of getting empirical feedback loops."

I think there are some constraints on how the empirical work can possibly work. (I don't think I have a short thing I could write here, I have a vague hope of writing up a longer post on "what I think needs to be true, for empirical work to be helping rather than confusedly not-really-helping")

I think the worldview here seems cogent. It's very good for Anthropic folk to be writing up their organizational-beliefs publicly. I'm pretty sympathetic to "man, we have no idea how to make real progress without empirical iteration, so we just need to figure out how to make empirical iteration work somehow."

I have a few disagreements. I think the most important ones route through "how likely is this to accelerate race dynamics and how bad is that?".

We've subsequently begun deploying Claude now that the gap between it and the public state of the art is sma

... (read more)

I both agree that the race dynamic is concerning (and would like to see Anthropic address it explicitly), and also think that Anthropic should get a fair bit of credit for not releasing Claude before ChatGPT, a thing they could have done and probably gained a lot of investment / hype over. I think Anthropic's "let's not contribute to AI hype" strategy is good in the same way that OpenAI's "let's generate massive hype" strategy is bad.

Like definitely I'm worried about the incentive to stay competitive, especially in the product space. But I think it's worth ... (read more)

Curated. I've been hearing about the concept of the acausal economy for a while and think it's a useful concept, but I don't think I've seen it written up as succinctly/approachably before. 

I appreciated the arguments about how simulation is actually pretty expensive, and logical/moral extrapolation is comparatively cheap, and that there are some reasons to expect this to be a fairly central aspect of the acausal economy/society. I've been reading along with Critch's recent series on both boundaries and Löb's Theorem. I'm not sure I actually fully grok... (read more)
