All of Ben Pace's Comments + Replies

I think my front-end productivity might be up 3x? A shoggoth helped me build a Stripe shop and do a ton of UI design that I would’ve been hesitant to take on myself (without hiring someone else to work with), and it markedly increased the speed at which I churn through front-end designs.

(This is going from “wouldn’t take on the project due to low skill” to “can take it on and deliver it in a reasonable amount of time”, which is different from “takes top programmer and speeds them up 3x”.)

Something a little different: Today I turn 28. If you might be open to doing something nice for me for my birthday, I would like to request the gift of data. I have made a 2-4 min anonymous survey about me as a person, and if you have a distinct sense of me as a person (even just from reading my LW posts/comments), I would greatly appreciate you filling it out and letting me know how you see me!

Here's the survey.

It's an anonymous survey where you rate me on lots of attributes like "anxious", "honorable", "wise" and more. All multiple-choice. Two years ago I al... (read more)

Further detail on this: Cotra has more recently updated at least 5x against her original 2020 model in the direction of faster timelines.

Greenblatt writes:

Here are my predictions for this outcome:

  • 25th percentile: 2 years (Jan 2027)
  • 50th percentile: 5 years (Jan 2030)

Cotra replies:

My timelines are now roughly similar on the object level (maybe a year slower for 25th and 1-2 years slower for 50th)

This means 25th percentile for 2028 and 50th percentile for 2031-2.

The original 2020 model assigns 5.23% by 2028, and 9.13% and 10.64% by 2031 and 2032 respectively. Each t... (read more)
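
To make the size of that update concrete, here is the arithmetic as I read it (my own quick sketch, using only the figures quoted above; the truncated comment may slice it differently):

```python
# Quick sketch of the implied update, using only the figures quoted above:
# Cotra's updated percentiles vs. the probability the original 2020 model
# assigned to the same dates.
new_view   = {2028: 0.25, 2031: 0.50, 2032: 0.50}        # updated 25th / 50th percentiles
model_2020 = {2028: 0.0523, 2031: 0.0913, 2032: 0.1064}  # original model's P(TAI by year)

for year in new_view:
    ratio = new_view[year] / model_2020[year]
    print(f"by {year}: {new_view[year]:.0%} vs {model_2020[year]:.2%} -> ~{ratio:.1f}x")
# by 2028: 25% vs 5.23%  -> ~4.8x
# by 2031: 50% vs 9.13%  -> ~5.5x
# by 2032: 50% vs 10.64% -> ~4.7x
```

That works out to roughly a 5x update on each date.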

Note that the capability milestone forecasted in the linked short form is substantially weaker than the notion of transformative AI in the 2020 model. (It was defined as AI with an effect at least as large as the industrial revolution.)

I don't expect this adds many years; for me it adds ~2 years to my median.

(Note that my median for time from 10x to this milestone is lower than 2 years, but median to Y isn't equal to median to X + median from X to Y.)

Third, we gave the model a secret “scratchpad”—a workspace where it could record its step-by-step reasoning. We told the model to use the scratchpad to reason about what it should do (as far as the model was aware, the scratchpad was invisible to users and to Anthropic). This allowed us to observe whether the model planned to fake alignment, and how it would go about doing so.

This feels amusingly like tricking a child. "Remember kiddo, you can reason out loud about where you're going to hide and I won't hear it. Now let's play hide and seek!"

4Evan Hubinger
Notably we do show results where the model will still exhibit a compliance gap even without a hidden scratchpad (though we only see a substantial effect without a hidden scratchpad in our synthetic document fine-tuned setting, not our prompted setting):

I don't know how to quickly convey why, but I find this to be a helpful pointer to a key problem, and the post is quite short, and I hope someone else positively votes on it. +4.

Hear, hear!

At least Anthropic didn't particularly try to be a big commercial company making the public excited about AI. Making the AI race a big public thing was a huge mistake on OpenAI's part, and is evidence that they don't really have any idea what they're doing.

I just want to point out that I don't believe this is the case; I believe the CEO is attempting to play games with the public narrative that benefit his company financially.

Or you guys could find a 1-2 hour window to show up and live-chat in a LW dialogue, then publish the results :-)

Curated. I thought this was a valuable list of areas, most of which I haven't thought that much about, and I've certainly never seen them brought together in one place before, which I think itself is a handy pointer to the sort of work that needs doing.

3davekasten
So, I really, really am not trying to be snarky here but am worried this comment will come across this way regardless.  I think this is actually quite important as a core factual question given that you've been around this community for a while, and I'm asking you in your capacity as "person who's been around for a minute".  It's non-hyperbolically true that no one has published this sort of list before in this community?   I'm asking, because if that's the case, someone should, e.g., just write a series of posts that just marches through US government best-practices documents on these domains (e.g., Chemical Safety Board, DoD NISPOM, etc.) and draws out conclusions on AI policy.   

I don’t think it applies to safety researchers at AI labs though; I am shocked at how much those folks can make.

They still make a lot less than they would if they optimized for profit (that said, I think most "safety researchers" at big labs are only safety researchers in name and I don't think anyone would philanthropically pay for their labor, and even if they did, they would still make the world worse according to my model, though others of course disagree with this).

You explicitly assume this stuff away, but I believe under this setup that the subagents would be incentivized to murder each other before the button is pressed (to get rid of that annoying veto).

I also note that if one agent becomes way, way smarter than the other, this balance may not work out.

Even if it works, I don't see how to set up the utility functions such that humans aren't disempowered. That's a complicated term!

Overall a very interesting idea.

+9. This is a powerful set of arguments pointing out how humanity will literally go extinct soon due to AI development (or have something similarly bad happen to us). A lot of thought and research went into producing this level of understanding of the problems we face, and I'm extremely glad it was written up.

Someone working full-time on an approach to the alignment problem that they feel optimistic about, and writing annual reflections on their work, is something that has been sorely lacking. +4

I don't want to double the comment count I submit to Recent Discussion, so I'll just update this comment with the things I've cut.

12/06/2023 Comment on Originality vs. Correctness

It's fun to take the wins of one culture and apply them to the other; people are very shocked that you found some hidden value to be had (though it often isn't competitive value / legible to the culture). And if you manage to avoid some terrible decision, people speak about how wise you are to have noticed.

(Those are the best cases, often of course people are like "this is odd, I'm

... (read more)

For too long, I have erred on the side of writing too much. 

The first reason I write is in order to find out what I think.

This often leaves my writing long and not very defensible.

However, editing the whole thing is so much extra work after I already did all the work figuring out what I think.

Sometimes it goes well if I just scrap the whole thing and concisely write my conclusion.

But typically I don't want to spend the marginal time.

Another reason my writing is too long is that I have extra thoughts I know most people won't find useful.

But I'... (read more)


I just gave this a re-read; I forgot what a trip it is to read the thoughts of Eliezer Yudkowsky. It continues to be some of my favorite stuff written on LessWrong in recent years.

It's hard to relate to the world with the level of mastery over basic ideas that Eliezer has. I don't mean by this to vouch that his perspective is certainly correct, but I believe it is at least possible, and so I think he aspires to a knowledge of reality that I rarely if ever aspire to. Reading it inspires me to really think about how the world works, and really figure out what I know and what I don't. +9

(And the smart people dialoguing with him here are good sports for keeping up their side of the argument.)

They are not being treated worse than foot soldiers, because they do not have an enemy army attempting to murder them during the job. (Unless 'foot soldiers' is itself more commonly used as a metaphor for 'grunt work' and I'm not aware of that.)

I am surprised to see the Open Philanthropy network taking all of the powerful roles here.

The initial Trustees are:

In case it's not apparent:

... (read more)

I think the argument here basically implies that language models will not, in the next 3 years, produce any novel, useful concept that gets substantial adoption in an existing industry or research field (e.g. >10% of people use it, or a widely cited paper), and that if one did, then the end would be nigh (or much nigher).

To be clear, you might get new concepts from language models about language if you nail some Chris Olah style transparency work, but the language model itself will not output ones that aren't about language in the text.

4Tsvi Benson-Tilsen
I roughly agree. As I mentioned to Adele, I think you could get sort of lame edge cases where the LLM kinda helped find a new concept. The thing that would make me think the end is substantially nigher is if you get a model that's making new concepts of comparable quality at a comparable rate to a human scientist in a domain in need of concepts. Yeah that seems right. I'm not sure what you mean by "about language". Sorta plausibly you could learn a little something new about some non-language domain that the LLM has seen a bunch of data about, if you got interpretability going pretty well. In other words, I would guess that LLMs already do lots of interesting compression in a different way than humans do it, and maybe you could extract some of that. My quasi-prediction would be that those concepts 1. are created using way more data than humans use for many of their important concepts; and 2. are weirdly flat, and aren't suitable out of the box for a big swath of the things that human concepts are suitable for.

It's... possible this is actually the single best example of a public doublecrux writeup that I know of?

This sentence was confusing to me given that the post does not mention 'double crux', but I mentioned it to someone and they said to think of it as the mental motion and not the explicit format, and that makes more sense to me.

1Raymond Arnold
Yeah that's what I intended.

And if you block any one path to the insight that the earth is round, in a way that somehow fails to cripple it, then it will find another path later, because truths are interwoven. Tell one lie, and the truth is ever-after your enemy.

In case it's of any interest, I'll mention that when I "pump this intuition", I find myself thinking it essentially impossible to expect we could ever build a general agent that didn't notice that the world was round, and I'm unsure why (if I recall correctly) I sometimes read Nate or Eliezer write that they think it's quit... (read more)

  • the AGI was NOT exercising its intelligence & reason & planning etc. towards an explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”.

I am naively more scared about such an AI. That AI sounds more like one that, if I say "you're not being helpful, please stop", will respond "actually I thought about it, I disagree, I'm going to continue doing what I think is helpful".

I think that, if an AGI has any explicit reflectively-endorsed desire whatsoever, then I can tell a similar scary story: The AGI’s desire isn’t quite what I wanted, so I try to correct it, and the AGI says no. (Unless the AGI’s explicit endorsed desires include / entail a desire to accept correction! Which most desires don’t!)

And yes, that is a scary story! It is the central scary story of AGI alignment, right? It would be nice to make an AGI with no explicit desires whatsoever, but I don’t think that’s possible.

So anyway, if we do Procedure X which will n... (read more)

And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven't observed others to notice on their own.

I brainstormed some possible answers. This list is a bit long. I'm publishing this comment because it's not worth the half hour to make it concise, yet it seems worth trying the exercise before reading the post, and possibly others will find it worth seeing my quick attempt.

I think the last two bullets are probably my best guesses. Nonetheless here is my list:

  • Just because an AI isn't consciously deceptive
... (read more)

Well, part of the semantic nuance is that we don't care as much about the coherence theorems that do exist if they will fail to apply to current and future machines

The correct response to learning that some theorems do not apply as much to reality as you thought surely mustn't be to change language so as to deny those theorems' existence. Insofar as this is what's going on, these are pretty bad norms of language in my opinion.

As part of my work at Lightcone I manage an office space with an application for visiting or becoming a member, and indeed many of these points commonly apply to rejection emails I send to people, especially "Most applications just don’t contain that much information" and "Not all relevant skills show up on paper".

I try to include some similar things to the post in the rejection emails we send. In case it's of interest or you have any thoughts, here's the standard paragraph that I include:

Our application process is fairly lightweight and so I don't think a

... (read more)

Fair enough. Nonetheless, I have had this experience many times with Eliezer, including when dialoguing with people with much more domain-experience than Scott.

2Alex Turner
Lol, cool. I tried the "4 minute" challenge (without having read EY's answer, but having read yours).  I think I ended up optimizing for "actually get model onto the page in 4 minutes" and not for "explain in a way Scott would have understood."

[Alexander][14:17]   

Can you expand on sexual recombinant hill-climbing search vs. gradient descent relative to a loss function, keeping in mind that I'm very weak on my understanding of these kinds of algorithms and you might have to explain exactly why they're different in this way?

[Yudkowsky][14:21]   

It's about the size of the information bottleneck. [followed by a 6 paragraph explanation]

It's sections like this that show me how many levels above me Eliezer is. When I read Scott's question I thought "I can see that these two algorit... (read more)

FWIW this was basically cached for me, and if I were better at writing and had explained this ~10 times before like I expect Eliezer has, I'd be able to do about as well. So would Nate Soares or Buck or Quintin Pope (just to pick people in 3 different areas of alignment), and Quintin would also have substantive disagreements.

FYI, the timestamp is for the first Discord message. If the log broke out timestamps for every part of the message, it would look like this:

[2:21 PM]

It's about the size of the information bottleneck. The human genome is 3 billion base pairs drawn from 4 possibilities, so 750 megabytes. Let's say 90% of that is junk DNA, and 10% of what's left is neural wiring algorithms. So the code that wires a 100-trillion-synapse human brain is about 7.5 megabytes. Now an adult human contains a lot more information than this. Your spinal cord is about 70 million neurons

... (read more)
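
As a quick sanity check (my own restatement, not part of the log), the arithmetic in the quoted passage works out, with the 90% / 10% splits being Eliezer's stated rough assumptions:

```python
# Restating the information-bottleneck arithmetic from the quote above.
# The 90% junk-DNA and 10% neural-wiring fractions are Eliezer's rough assumptions.
base_pairs = 3e9                          # human genome: ~3 billion base pairs
bits_per_base = 2                         # 4 possibilities per base = 2 bits
genome_mb = base_pairs * bits_per_base / 8 / 1e6
print(genome_mb)                          # 750.0 megabytes

non_junk_fraction = 0.10                  # "90% of that is junk DNA"
wiring_fraction = 0.10                    # "10% of what's left is neural wiring algorithms"
wiring_mb = genome_mb * non_junk_fraction * wiring_fraction
print(wiring_mb)                          # 7.5 megabytes to wire a ~100-trillion-synapse brain
```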

Curated! This is a good description of a self-contained problem for a general class of algorithms that aim to train aligned and useful ML systems, and you've put a bunch of work into explaining reasons why it may be hard, with a clear and well-defined example for conveying the problem (i.e. that Carmichael numbers fool Fermat's primality test).

The fun bit for me is talking about how if this problem goes one way (where we cannot efficiently distinguish different mechanisms) this invalidates many prior ideas, and if it doesn't then we can be more optimist... (read more)
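
For readers who haven't seen the Carmichael-number example mentioned above, here is a minimal illustration (my own sketch, not code from the post): the Fermat primality test accepts n if a^(n-1) ≡ 1 (mod n) for a chosen base a, and a Carmichael number like 561 = 3 × 11 × 17 is composite yet passes that check for every base coprime to it, so the same "test passed" observation can arise from two different underlying mechanisms.

```python
# Minimal illustration of a Carmichael number fooling the Fermat primality test.
import math

def fermat_passes(a: int, n: int) -> bool:
    """One round of the Fermat test: does a^(n-1) ≡ 1 (mod n)?"""
    return pow(a, n - 1, n) == 1

n = 561  # smallest Carmichael number: 3 * 11 * 17, so definitely composite
coprime_bases = [a for a in range(2, n) if math.gcd(a, n) == 1]
print(all(fermat_passes(a, n) for a in coprime_bases))  # True: every coprime base is fooled
print(fermat_passes(2, 15))                             # False: base 2 exposes 15 as composite
```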

Returning to this essay, it continues to be my favorite Paul post (even What Failure Looks Like only comes second), and I think it's a better way to engage with Paul's work than anything else (including the Eliciting Latent Knowledge document, which feels less grounded in the x-risk problem, is less in Paul's native language, and goes into detail on just one idea for 10x the space, thus communicating less of the big-picture research goal). I feel I can understand all the arguments made in this post. I think this should be mandatory reading before reading Elici... (read more)

The process of finding this model of the world is much more complex than anything our AI can do at inference time, and the intermediate results are too complex and numerous to be “memorized” in the weights of our trained AI. So there doesn’t seem to be any way to break the model-finding work into pieces that can be delegated to an ML assistant (in amplification) or a debater (in debate).

I am not understanding this, but it's probably a simple ML terminology thing.

First you train a model, then you use it lots as a black box (of the type: input video-cam... (read more)

This is a post that gave me (an ML noob) a great deal of understanding of how language models work — for example the discussion of the difference between "being able to do a task" and "knowing when to perform that task" is one I hadn't conceptualized before reading this post, and makes a large difference in how to think about the improvements from scaling. I also thought the characterization of the split between different schools of thought and what they pay attention to was quite illuminating.

I don't have enough object-level engagement for my recommendation to be much independent evidence, but I will still be voting this either a +4 or +9, because I personally learned a bunch from it.

However, we should not take this question so lightly.  A yes answer would "complete the circuit" of the enthusiast's argument -- "turn it on" as a live concern.  A no answer would leave the argument in limbo until more evidence comes in.

So, let's assess the state of the evidence.

Such a weird reason (to me) for not taking a question lightly. Are different standards of evidence required for conclusions we care about? The Bayesian math of probability theory does not have a term for expected utility.

Of course, it makes sense to invest more i... (read more)
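
To spell out the claim about probability theory in notation (my own restatement, not part of the comment): Bayes' rule contains no utility term; how much we care about a conclusion only enters at the separate decision step, which is consistent with it making sense to invest more effort gathering evidence for high-stakes questions.

```latex
% Belief updating has no utility term:
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}
% Utilities enter only at the decision step:
\mathrm{EU}(a) = \sum_{s} P(s \mid E)\, U(a, s)
```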

This post is one of the LW posts a younger version of myself would have been most excited to read. Building on what I got from the Embedded Agency sequence, this post lays out a broad-strokes research plan for getting the alignment problem right. It points to areas of confusion, it lists questions we should be able to answer if we got this right, it explains the reasoning behind some of the specific tactics the author is pursuing, and it answers multiple common questions and objections. It leaves me with a feeling of "Yeah, I could pursue that too if I wanted, and I expect I could make some progress" which is a shockingly high bar for a purported plan to solve the alignment problem. I give this post +9.

(for those wondering: kave has been a LWer for many years and works full-time with the Lightcone team)

  • Paul's post on takeoff speed had long been IMO the last major public step in the dialogue on this subject (not forgetting to honorably mention Katja's crazy discontinuous progress examples and Kokotajlo's arguments against using GDP as a metric), and I found it exceedingly valuable to read how it reads to someone else who has put a great deal of work into figuring out what's true about the topic, thinks about it in very different ways, and has come to different views on it. I found this very valuable for my own understanding of the subject, and I felt I
... (read more)

This sentence is an exception, but there aren't a lot of naturally occurring examples.

No strong claim either way, but as a datapoint I do somewhat often use the phrase "I hereby invite you to <event>" or "I hereby <request> something of you" to help move from 'describing the world' to 'issuing an invitation/command/etc'.

1Andrew Critch
True!  "Hereby" covers a solid contingent of self-referential sentences.  I wonder if there's a "hereby" construction that would make the self-referential sentence Ψ (from the Wikipedia proof) more common-sense-meaningful to, say, lawyers.

Thanks for the link, I'll aim to give that podcast a listen, it's relevant to a bunch of my current thinking.

I quite appreciate that you're up front about the distinction between "we shipped things we like" and "we made progress on alignment". Some of the stuff I've read (janus's posts, infohazard policy) has been pretty thoughtful (even while I disagreed with some of it) and I was definitely worried that the work would be much worse. I recall my main recommendation to you when you visited the Bay (and my offices) was "publish your ideas so I and others can evaluate them", and I'm definitely glad you've done a bunch of that. And I agree that you can do even more ... (read more)

I'll mention here that from what little I saw, your coordination efforts seemed a bit misguided and harmful. I am left with a sense that you wanted everyone to be friends and not feel threatened by interacting with others, to feel like "come in, the water is fine, don't worry, you won't end up with people criticizing you for maybe ending civilization or self-deceiving along the way, or calling you unethical". While I am pretty open to fairly respectful coordination and am a strong fan of finding positive-sum trade, I care more about being frank and honest in

... (read more)

Curated. All of these examples together really point quite clearly at a change in how language models behave when they're trained with RLHF, away from the "accurately predict text" story toward something else that has a very different set of biases — I am interested to read your potential follow-up with your own hypotheses. Plus, the post is really fun to read.

I think this post and your prior post both overstate the case in some ways, but they're still great additions to my and I expect many others' thinking on this subject. I broadly feel like I've been 'ne... (read more)

Does GPT-3 have no idea what letters look like?

I think there's an implication in this section that davinci will accurately describe what letters look like, or at least much more commonly/accurately than the false answers from text-davinci-002. Anybody know if that's true?

Added: I just tried, but couldn't get it to try to answer the question; it would just give more questions (completing it as though my prompt was but one item on a questionnaire).

2Paul Christiano
I don't think that's the case.

A brain-like AGI - modeled after our one working example of efficient general intelligence - would naturally have an interpretable inner monologue we could monitor.

This doesn’t have much to do with whether a mind is understandable. Most of my cognition is not found in the verbal transcript of my inner monologue, partly as I’m not that verbal a thinker, but mostly because most of my cognition is in my nonverbal System 1.

2jacob_cannell
This has been discussed before. Your example of not being a verbal thinker is not directly relevant because 1.) inner monologue need not be strictly verbal, 2.) we need only a few examples of strong human thinkers with verbal inner monologues to show that isn't an efficiency disadvantage - so even if your brain type is less monitorable we are not confined to that design. I also do not believe your central claim - in that based on my knowledge of neuroscience - disabling the brain modules responsible for your inner monologue will not only disable your capacity for speech, it will also seriously impede your cognition and render you largely incapable of executing complex long term plans. Starting with a brain-like AGI, there are several obvious low-cost routes to dramatically improve automated cognitive inspectability. A key insight is that there are clear levels of abstraction in the brain (as predicted by the need to compress sensory streams for efficient bayesian prediction) and the inner monologue is at the top of the abstraction hierarchy, which maximizes information utility per bit. At the bottom of the abstraction hierarchy would be something like V1, which would be mostly useless to monitor (minimal value per bit).

I was trying to figure out whether someone who is just here for the month of November should apply. I think the answer is no, but I am broadly a bit confused about what period this is a commitment for.

Also, are people going through as cohorts or will they start with the training week whenever they show up, not necessarily in-sync with anyone else? 

Also, is the idea to be doing self-directed research by default, or research in collaboration with Redwood staff by default? I don't know what my default action is day-to-day during this program. Do I have to come in with a bunch of research plans already?

3Buck Shlegeris
Thanks for the questions :) Probably no.  Yeah we haven't totally settled this yet; the application form asks a lot of questions about availability. I think the simplest more specific answer is "you probably have to be available in January, and it would be cool if you were available earlier and wanted to get here earlier and do this for longer". Not totally settled. We'll probably have most people at a big final cohort in January, and we'll try to have people who arrive earlier show up at synced times so that they can do the training week with others. The default is to do research directed by Redwood staff. You do not need to come in with any research plans.

Now's your chance to figure out what the next few obstacles are without my giving you spoilers first. Feel free to post your list under spoiler tags in the comment section.

Hm, no strong hunches here. Bad ideas babble:

  • It may somehow learn about the world I'm in, learn I'm in a bad negotiation position (e.g. because my rival AI company is about to release their paperclip maximizer), and precommit to only giving me at most 0.00001% of the universe, a bad deal that I will grudgingly accept.
  • I mean, I don't know if this counts, but perhaps you've only understood
... (read more)

Individuals who shape the world are often those who have ended up being optimizers.

It sounds like you find that claim disturbing, but I don't think it's all bad.

I'm interested in more of a sense of what mistake you think people are making, because I think caring about something strongly enough to change who you are around it can be a very positive force in the world.

3Alex Flint
Yeah, caring about something enough to change who you are is really one of the highest forms of virtue, as far as I'm concerned. It's somewhat tragic that the very thing that makes us capable of this high form of virtue -- our capacity to deliberately shift what we value -- can also be used to take what was once an instrumental value and make it, more or less, into a terminal value. And generally, when we make an instrumental value into a terminal value (or go as far as we can in that direction), things go really badly, because we ourselves become an optimizer for something that is harmless when pursued as an instrumental value (like paperclips), but is devastating when pursued as a terminal value (like paperclips). So the upshot is: to the extent that we are allowing instrumental values to become more-or-less terminal values without really deliberately choosing that or having a good reason to allow it, I think that's a mistake. To the extent that we are shifting our values in service of that which is truly worth protecting, I think that's really virtuous. The really interesting question as far as I'm concerned is what the thing is that we rightly change our values in service of? In this community, we often take that thing to be representable as a utility function over physical world states. But it may not be representable that way. In Buddhism the thing is conceived of as the final end of suffering. In western moral philosophy there are all kinds of different ways of conceiving of that thing, and I don't think all that many of them can be represented as a utility function over physical world states. In this community we tend to side-step object-level ethical philosophy to some extent, and I think that may be our biggest mistake.

Thanks for the concrete example in the first paragraph, upvote.

I don't know that it would successfully contain humans who were within it for 10^36 years. That seems like enough time for some Ramanujan-like figure to crack the basics of how to code an AGI in his head and share it, and potentially figure out a hidden place or substrate on which to do computation that the drones aren't successfully tracking. (It's also enough time for super-babies or discovering other interesting cheat codes in reality.)

10^36 is my cached number from the last time I asked how... (read more)

Curated. 

I am unsure how obvious this result is in-advance, my guess is there are at least some people who would have correctly predicted it and for the right reasons. However, I really love a lot of things about this post: first because you answer a question ML researchers do have disagreements over, second because you coded two small games for humans to play to help understand what's going on, third because you collect novel data, and fourth because you present it so clearly and readably. 

I gained a better understanding of how language-models work from reading this post and playing the games, and I'd love to see more posts answering open questions through this kind of simple experimental work.

(I have added the point I wanted to add to this conversation, and will tap out now.)

Wei Dai says:

If you look at the economic theories (mostly based on game theory today) that try to explain why economies are organized the way they are, and where market inefficiencies come from, they all have a fundamental dependence on the assumption of different participants having different interests/values. In other words, if you removed that assumption from the theoretical models and replaced it with the opposite assumption, they would collapse in the sense that all or most of the inefficiencies ("transaction costs") would go away...

...With existing h

... (read more)
5Wei Dai
I don't disagree with this. I would add that if agents aren't aligned, then that introduces an additional inefficiency into this pricing process, because each agent now has an incentive to distort the price to benefit themselves, and this (together with information asymmetry) means some mutually profitable trades will not occur. Some work being "detailed and costly" isn't necessarily a big problem for HCH, since we theoretically have an infinite tree of free labor, whereas the inefficiencies introduced by agents having different values/interests seem potentially of a different character. I'm not super confident about this (and I'm overall pretty skeptical about HCH for this and other reasons), but just think that John was too confident in his position in the OP or at least hasn't explained his position enough. To restate the question I see being unanswered: why is alignment + infinite free labor still not enough to overcome the problems we see with actual human orgs?