Raymond Arnold

LessWrong team member / moderator. I've been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I've been interested in improving my own epistemic standards and helping others to do so as well.

Posts

Sorted by New

0Raemon's Shortform

27The 2023 LessWrong Review: The Basic Ask

4mo

27Review AI Alignment posts to help figure out how to make a proper AI Alignment review

63What's Up With Confusingly Pervasive Goal Directedness?

28Introducing the AI Alignment Forum (FAQ)

23Announcing AlignmentForum.org Beta

0Raemon's Shortform

Wikitag Contributions

Sandbagging (AI)

19d

Sandbagging (AI)

19d

(+88)

AI "Agent" Scaffolds

22d

AI "Agent" Scaffolds

22d

(+340)

AI Products/Tools

22d

(+121)

Language Models (LLMs)

1mo

LessWrong Review

2mo

LessWrong Review

2mo

(+21)

Organizational Culture & Design

2mo

(+23)

2023 Longform Reviews

3mo

(+68)

Comments

Sorted by

Newest

Reframing AI Safety as a Neverending Institutional Challenge

Raymond Arnold24d39

I do periodically think about this and feel kind of exhausted at the prospect, but it does seem pretty plausibly correct. Good to have a writeup of it.

It particularly seems likely to be the right mindset if you think survival right now depends on getting some kind of longish pause (at least on the sort of research that'd lead to RSI+takeoff)

Self-fulfilling misalignment data might be poisoning our AI models

Raymond Arnold1mo*86

My current guess is:

1. This is more relevant for up-to-the first couple generations of "just barely superintelligent" AIs.

2. I don't really expect it to be the deciding factor after many iterations of end-to-end RSI that gets you to the "able to generate novel scientific or engineering insights much faster than a human or institution could."

I do think it's plausible that the initial bias towards "evil/hackery AI" could start it off in a bad basin of attraction, but a) even if you completely avoided that, I would still basically expect this to rediscover this on it's own as it gained superhuman levels of competence, b) one of the things I most want to use a slightly-superhuman AI to do is to robustly align massively superhuman AI, and I don't really see how to do that without directly engaging with the knowledge of the failure modes there.

I think there are other plans that route more though "use STEM AI to build an uploader or bioenhancer, and then have an accelerated human-psyche do the technical philosophy necessary to handle the unbounded alignment case. I could see that being the right call, and I could imagine the bias from the "already knows about deceptive alignment etc" being large-magnitude enough to matter in the initial process. [edit: In those cases I'd probably want to filter out a lot more than just "unfriendly AI strategies"]

But, basically, how this applies depends on what it is you're trying to do with the AI, and what stage/flavor of AI you're working with and how it's helping.

Views on when AGI comes and on strategy to reduce existential risk

Raymond Arnold2mo10

That's not really what I had in mind, but I had in mind something less clear than I thought. The spirit is about "can the AI come up with novel concepts",

I think one reason I think the current paradigm is "general enough, in principle", is that I don't think "novel concepts" is really The Thing. I think creativity / intelligence mostly is about is combining concepts, it's just that really smart people are

a) faster in raw horsepower and can handle more complexity at a time

b) have a better set of building blocks to combine or apply to make new concepts (which includes building blocks for building better building blocks)

c) have a more efficient search for useful/relevant building blocks (both metacognitive and object-level).

Maybe you believe this, and think that "well yeah, it's the efficient search that's the important part, which we still don't actually have a real working version of?"?

It seems like the current models have basically all the tools a moderately smart human have, with regards to generating novel ideas, and the thing that they're missing is something like "having a good metacognitive loop such that they notice when they're doing a fake/dumb version of things, and course correcting" and "persistently pursue plans over long time horizons." And it doesn't seem to have zero of either of those, just not enough to get over some hump.

I don't see what's missing that a ton of training on a ton of diverse, multimodal tasks + scaffoldin + data flywheel isn't going to figure out.

johnswentworth's Shortform

Raymond Arnold4mo30

(my guess is you took more like 15-25 minutes per question? Hard to tell from my notes, you may have finished early but I don't recall it being crazy early)

johnswentworth's Shortform

Raymond Arnold4mo10

(This seems like more time than Buck was taking – the goal was to not get any wrong so it wasn't like people were trying to crank through them in 7 minutes)

The problems I gave were (as listed in the csv for the diamond problems)

#1 (Physics) (1 person got right, 3 got wrong, 1 didn't answer)
#2 (Organic Chemistry), (John got right, I think 3 people didn't finish)
#4 (Electromagnetism), (John and one other got right, 2 got wrong)
#8 (Genetics) (3 got right including John)
#10 (Astrophysics) (5 people got right)

johnswentworth's Shortform

Raymond Arnold4mo30

I at least attempted to be filtering the problems I gave you for GPQA diamond, although I am not very confident that I succeeded.

(Update: yes, the problems John did were GPQA diamond. I gave 5 problems to a group of 8 people, and gave them two hours to complete however many they thought they could complete without getting any wrong)

The 2023 LessWrong Review: The Basic Ask

Raymond Arnold4mo52

Note: I plan to extend the Nomination phase through ~Monday, I didn't mean for it to end partway through the weekend.

Modal Fixpoint Cooperation without Löb's Theorem

Raymond Arnold4mo10

I haven't had much success articulating why.

I'd be interested in a more in-depth review where you take another pass at this.

The 2023 LessWrong Review: The Basic Ask

Raymond Arnold4mo30

A thing unclear to me: is it worth hiding the authors from the Voting page?

On the first LessWrong Review, we deliberately hid authors and randomized the order of the voting results. A few year later, we've mostly shifted towards "help people efficiently sort through the information" rather than "making sure the presentation is random/fair." It's not like people don't know who the posts are by once they start reading them.

Curious what people think.

A note about differential technological development

Raymond Arnold4mo11

I would find this post easier to remember and link to if it were called "Serial vs Parallel Research Time", or something like that which points more at the particular insight the post provides.