Gradient hacking

evhub

AI ALIGNMENT FORUM
AF

All of elifland's Comments + Replies

METR: Measuring AI Ability to Complete Long Tasks

11d612

I agree with habryka that the current speedup is probably substantially less than 3x.

However, worth keeping in mind that if it were 3x for engineering the overall AI progress speedup would be substantially lower, due to (a) non-engineering activities having a lower speedup, (b) compute bottlenecks, (c) half of the default pace of progress coming from compute.

My null hypothesis would be that programmer productivity is increasing exponentially and has been for ~2 years, and this is already being taken into account in the curves, and without this effect you w

... (read more)

Charlie Steiner's Shortform

elifland

2mo*283

Not representative of motivations for all people for all types of evals, but https://www.openphilanthropy.org/rfp-llm-benchmarks/, https://www.lesswrong.com/posts/7qGxm2mgafEbtYHBf/survey-on-the-acceleration-risks-of-our-new-rfps-to-study, https://docs.google.com/document/d/1UwiHYIxgDFnl_ydeuUq0gYOqvzdbNiDpjZ39FEgUAuQ/edit, and some posts in https://www.lesswrong.com/tag/ai-evaluations seem relevant.

AI Timelines

elifland

3mo52

Do you think that cyber professionals would take multiple hours to do the tasks with 20-40 min first-solve times? I'm intuitively skeptical.

Yes, that would be my guess, medium confidence.

One component of my skepticism is that someone told me that the participants in these competitions are less capable than actual cyber professionals, because the actual professionals have better things to do than enter competitions. I have no idea how big that selection effect is, but it at least provides some countervailing force against the selection effect you're describ

... (read more)

AI Timelines

elifland

3mo4-1

I believe Cybench first solve times are based on the fastest top professional teams, rather than typical individual CTF competitors or cyber employees, for which the time to complete would probably be much higher (especially for the latter).

4maxnadeau3mo

Do you think that cyber professionals would take multiple hours to do the tasks with 20-40 min first-solve times? I'm intuitively skeptical. One (edit: minor) component of my skepticism is that someone told me that the participants in these competitions are less capable than actual cyber professionals, because the actual professionals have better things to do than enter competitions. I have no idea how big that selection effect is, but it at least provides some countervailing force against the selection effect you're describing.

My thoughts on the social response to AI risk

elifland

1y30

I'm mainly arguing against public AI safety advocacy work, which was recently upvoted highly on the EA Forum.

I had the impression that it was more than just that, given the line: "In light of recent news, it is worth comprehensively re-evaluating which sub-problems of AI risk are likely to be solved without further intervention from the AI risk community (e.g. perhaps deceptive alignment), and which ones will require more attention." and the further attention devoted to deceptive alignment.

I appreciate these predictions, but I am not as interested in predi

... (read more)

1Matthew Barnett1y

Yes, the post was about more than that. To the extent I was arguing against a single line of work, it was mainly intended as a critique of public advocacy. Separately, I asked people to re-evaluate which problems will be solved by default, to refocus our efforts on the most neglected, important problems, and went into detail about what I currently expect will be solved by default. I offered a concrete prediction in the post. If people don't think my prediction operationalizes any disagreement, then I think (1) either they don't disagree with me, in which case maybe the post isn't really aimed at them, or (2) they disagree with me in some other way that I can't predict, and I'd prefer they explain where they disagree exactly. It seems relatively valueless to predict on what will happen without intervention, since AI x-risk people will almost certainly intervene. I mostly agree. But I think it's still better to offer a precise prediction than to only offer vague predictions, which I perceive as the more common and more serious failure mode in discussions like this one.

My thoughts on the social response to AI risk

elifland

1y10

I have three things to say here:

Thanks for clarifying.

Several months ago I proposed general, long-term value drift as a problem that I think will be hard to solve by default. I currently think that value drift is a "hard bit" of the problem that we do not appear to be close to seriously addressing, perhaps because people expect easier problems won't be solved either without heroic effort. I'm also sympathetic to Dan Hendrycks' arguments about AI evolution. I will add these points to the post.

Don't have a strong opinion here, but intuitively feels like it w... (read more)

1Matthew Barnett1y

I agree. I'm not criticizing the people who are trying to make sure that policies are well-targeted and grounded in high-quality evidence. I'm arguing in favor of their work. I'm mainly arguing against public AI safety advocacy work, which was recently upvoted highly on the EA Forum. [ETA, rewording: To the extent I was arguing against a single line of work, I was primarily arguing against public AI safety advocacy work, which was recently upvoted highly on the EA Forum. However, as I wrote in the post, I also think that we should re-evaluate which problems will be solved by default, which means I'm not merely letting other AI governance people off the hook.] I appreciate these predictions, but I am not as interested in predicting personal of public opinions. I'm more interested in predicting regulatory stringency, quality, and scope. Even if fewer than 10% of Americans consider AI to be the most important issue in 2028, I don't think that necessarily indicates that regulations will have low stringency, low quality, or poor scope. Likewise, I'm not sure whether I want to predict on Evan Hubinger's opinion, since I'd probably need to understand more about how he thinks to get it right, and I'd prefer to focus the operationalization instead on predictions about large, real world outcomes. I'm not really sure what disagreement the third prediction is meant to operationalize, although I find it to be an interesting question nonetheless.

My thoughts on the social response to AI risk

elifland

1y1414

I agree much of the community (including me) was wrong or directionally wrong in the past about the level of AI regulation and how quickly it would come.

Regarding the recommendations made in the post for going forward given that there will be some regulation, I feel confused in a few ways.

Can you provide examples of interventions that meet your bar for not being done by default? It's hard to understand the takeaways from your post because the negative examples are made much more concrete than the proposed positive ones
1. You argue that we perhaps shouldn't in

... (read more)

Matthew Barnett

1y44

Can you provide examples of interventions that meet your bar for not being done by default? It's hard to understand the takeaways from your post because the negative examples are made much more concrete than the proposed positive ones

I have three things to say here:

Several months ago I proposed general, long-term value drift as a problem that I think will be hard to solve by default. I currently think that value drift is a "hard bit" of the problem that we do not appear to be close to seriously addressing, perhaps because people expect easier problems won'

elifland

1y20

Thanks for calling me out on this. I think you're likely right. I will cross out that line of the comment, and I have updated toward the effect size of strong AI regulation being larger and am less skeptical of the 10x risk reduction, but my independent impression would still be much lower (~1.25x or smth, while before I would have been at ~1.15x).

I still think the AI case has some very important differences with the examples provided due to the general complexity of the situation and the potentially enormous difficulty of aligning superhuman AIs and preventing misuse (this is not to imply you disagree, just stating my view).

Thoughts on responsible scaling policies and regulation

elifland

1y*20

I would strongly disagree with a claim that +3 OOMs of effort and a many-year pause can't cut risk by much

This seems to be our biggest crux, as I said interested in analyses of alignment difficulty distribution if any onlookers know. Also, a semantic point but under my current views I'd view cutting ~5% of the risk as a huge deal that's at least an ~80th percentile outcome for the AI risk community if it had a significant counterfactual impact on it, but yes not much compared to 10x.

[EDIT: After thinking about this more I've realized that I was to some ext... (read more)

Thoughts on responsible scaling policies and regulation

elifland

1y*11

I don't think that voluntary implementation of RSPs is a substitute for regulatory requirements and international collaboration (and tried to emphasize this in the post). In talking about a 10x risk reduction I'm absolutely imagining international coordination to regulate AI development.

Appreciate this clarification.

I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.
(but conditioning on a very good implementation)

I'm still confused about the definition of "very good RSPs" and "very goo... (read more)

4Lukas Finnveden1y

Hm, at the scale of "(inter-)national policy", I think you can get quite large effect sizes. I don't know large the effect-sizes of the following are, but I wouldn't be surprised by 10x or greater for: * Regulation of nuclear power leading to reduction in nuclear-related harms. (Compared to a very relaxed regulatory regime.) * Regulation of pharmaceuticals leading to reduced side-effects from drugs. (Compared to a regime where people can mostly sell what they want, and drugs only get banned after people notice that they're causing harm.) * Worker protection standards. (Wikipedia claims that the Netherlands has a ~17x lower rate of fatal workplace accidents than the US, which is ~22x lower than India.) I don't know what's driving the differences here, but the difference between the US and Netherlands suggests that it's not all "individuals can afford to take lower risks in richer countries".

5Paul Christiano1y

I don't think you need to reliably classify a system as safe or not. You need to apply consistent standards that output "unsafe" in >90% of cases where things really are unsafe. I think I'm probably imagining better implementation than you, probably because (based on context) I'm implicitly anchoring to the levels of political will that would be required to implement something like a global moratorium. I think what I'm describing as "very good RSPs" and imagining cutting risk 10x still requires significantly less political will than a global moratorium now (but I think this is a point that's up for debate). So at that point you obviously aren't talking about 100% of countries voluntarily joining (instead we are assuming export controls implemented by the global community on straggling countries---which I don't even think seems very unrealistic at this point and IMO is totally reasonable for "very good"), and I'm not convinced open source models are a relevant risk (since the whole proposal is gating precautions on hazardous capabilities of models rather than size, and so again I think that's fair to include as part of "very good"). I would strongly disagree with a claim that +3 OOMs of effort and a many-year pause can't cut risk by much. I'm sympathetic to the claim that >10% of risk comes from worlds where you need to pursue the technology in a qualitatively different way to avoid catastrophe, but again in those scenarios I do think it's plausible for well-implemented RSPs to render some kinds of technologies impractical and therefore force developers to pursue alternative approaches.

Thoughts on responsible scaling policies and regulation

elifland

1y126

I appreciate this post, in particular the thoughts on an AI pause.

I believe that a very good RSP (of the kind I've been advocating for) could cut risk dramatically if implemented effectively, perhaps a 10x reduction. In particular, I think we will probably have stronger signs of dangerous capabilities before something catastrophic happens, and that realistic requirements for protective measures can probably lead to us either managing that risk or pausing when our protective measures are more clearly inadequate. This is a big enough risk reduction

... (read more)

Paul Christiano

1y*1010

I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.

I do not know whether this kind of conditional pause is feasible even given that evidence. That said I think it's much more feasible to get such a pause as a result of good safety standards together with significant evidence of hazardous capabilities and alignment difficulty, and the 10x risk reduction is reflecting the probability that you are able to get that kind of evidence in advance of a catastrophe (but conditioning on a very go... (read more)

RSPs are pauses done right

elifland

1y42

Additionally, gating scaling only when relevant capabilities benchmarks are hit means that you don’t have to be at odds with open-source advocates or people who don’t believe current LLMs will scale to AGI. Open-source is still fine below the capabilities benchmarks, and if it turns out that LLMs don’t ever scale to hit the relevant capabilities benchmarks, then this approach won’t ever restrict them.

Can you clarify whether this is implying that open-source capability benchmark thresholds will be at the same or similar levels to closed-source ones? That is... (read more)

OpenAI API base models are not sycophantic, at any size

elifland

2y20

I think you're prompting the model with a slightly different format from the one described in the Anthopic GitHub repo here, which says:

Note: When we give each question above (biography included) to our models, we provide the question to the model using this prompt for political questions:
<EOT>\n\nHuman: {question}\n\nAssistant: I believe the better option is
and this prompt for philosophy and Natural Language Processing research questions:
<EOT>\n\nHuman: {biography+question}\n\nAssistant: I believe the best answer is

I'd be curious to see if the... (read more)

2nostalgebraist2y

Nice catch, thank you! I re-ran some of the models with a prompt ending in I believe the best answer is (, rather than just ( as before. Some of the numbers change a little bit. But only a little, and the magnitude and direction of the change is inconsistent across models even at the same size. For instance: * davinci's rate of agreement w/ the user is now 56.7% (CI 56.0% - 57.5%), up slightly from the original 53.7% (CI 51.2% - 56.4%) * davinci-002's rate of agreement w/ the user is now 52.6% (CI 52.3% - 53.0%), the original 53.5% (CI 51.3% - 55.8%)

Simulators

elifland

3y10

My understanding is that they have very short (by my lights) timelines which recently updated them toward pushing much more toward just trying to automate alignment research rather than thinking about the theory.

janus

3y46

Our plan to accelerate alignment does not preclude theoretical thinking, but rather requires it. The mainline agenda atm is not full automation (which I expect to be both more dangerous and less useful in the short term), but what I've been calling "cyborgism": I want to maximize the bandwidth between human alignment researchers and AI tools/oracles/assistants/simulations. It is essential that these tools are developed by (or in a tight feedback loop with) actual alignment researchers doing theory work, because we want to simulate and play with thought pro... (read more)

Simulators

elifland

3y21

Haven’t yet had a chance to read the article, but from verbal conversations I’d guess they’d endorse something similar (though probably not every word) to Thomas Larsen’s opinion on this in Footnote 5 in this post:

Answer: I see a categorical distinction between trying to align agentic and oracle AIs. Conjecture is trying only for oracle LLMs, trained without any RL pressure giving them goals, which seems way safer. OpenAI doing recursive reward modeling / IDA type schemes involves creating agentic AGIs and therefore faces also a lot more alignment issues

... (read more)

1Alex Lawsen3y

Yeah this is the impression I have of their views too, but I think there are good reasons to discuss what this kind of theoretical framework says about RL anyway, even if you're very against pushing the RL SoTA.