All of Zac Hatfield-Dodds's Comments + Replies

I'm sorry that I don't have time to write up a detailed response to (critique of?) the response to critiques; hopefully this brief note is still useful.

  1. I remain frustrated by GSAI advocacy. When discussing feasibility, it's pitched as suited to well-understood closed domains (excluding e.g. natural language); when arguing for importance, it's 'we need rigorous guarantees for current or near-future AI'. It's an extension to or complement of current practice; and current practice is irresponsible and inadequate. Often this is coming from different advocates, but th

... (read more)

I don't think Miles' or Richard's stated reasons for resigning included safety policies, for example.

But my broader point is that "fewer safety people should quit leading labs to protest poor safety policies" is basically a non-sequitur from "people have quit leading labs because they think they'll be more effective elsewhere", whether because they want to do something different or independent, or because they no longer trust the lab to behave responsibly.

Oliver Habryka
Hmm, you have a very different read of Richard's message than I do. I agree Miles' statement did not reason through safety policies, but IMO his blogging since then has included a lot of harsh words for OpenAI, in a way that at least to me made the connection clear (and I think also to many others, but IDK, it's still doing some tea-leaf reading).

I agree with Rohin that there are approximately zero useful things that don't make anyone's workflow harder. The default state is "it's only just working, so I've moved on to the next thing", and if you want to change something there'd better be a benefit to balance the risk of breaking it.

Also, 3% of compute is so much compute; probably more than the "20% to date over four years" that OpenAI promised and then yanked from Superalignment. Take your preferred estimate of lab compute spending, multiply by 3%, and ask yourself whether a rushed unreasonable lab would grant that much money to people working on a topic it didn't care for, at the expense of those it did.
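To make that concrete, here's a back-of-the-envelope sketch in Python; the annual spend figures are illustrative assumptions, not anyone's actual budget.

```python
# Back-of-the-envelope: what 3% of compute means in dollars.
# The annual spend figures below are illustrative assumptions only.
for annual_compute_spend_usd in (1e9, 5e9, 20e9):
    safety_share = 0.03 * annual_compute_spend_usd
    print(f"${annual_compute_spend_usd / 1e9:>4.0f}B/yr total -> "
          f"${safety_share / 1e6:,.0f}M/yr at 3%")
# $   1B/yr total -> $30M/yr at 3%
# $   5B/yr total -> $150M/yr at 3%
# $  20B/yr total -> $600M/yr at 3%
```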

My impression is that few (one or two?) of the safety people who have quit a leading lab did so to protest poor safety policies, and of those few none saw staying as a viable option.

Relatedly, I think Buck far overestimates the influence and resources of safety-concerned staff in a 'rushed unreasonable developer'.

My impression is that few (one or two?) of the safety people who have quit a leading lab did so to protest poor safety policies, and of those few none saw staying as a viable option.

While this isn't amazing evidence, my sense is there have been around 6 people who quit and who, in parallel with announcing their departure, called out OpenAI's reckless attitude towards risk (at various levels of explicitness, but quite strongly in all cases by standard professional norms).

It's hard to say that people quit "to protest safety policies", but they definitely used... (read more)

Buck Shlegeris
Many more than two safety-concerned people have left AI companies for reasons related to thinking that those companies are reckless.
Ryan Greenblatt
As in, you don't expect they'll be able to implement stuff even if it doesn't make anyone's workflow harder, or you don't expect they'll be able to get that much compute? Naively, we might expect ~1% of compute: there might be around 1000 researchers, and 10/1000 is 1%. Buck said 3% because I argued for increasing this number. My case would be that there will be a bunch of cases where the thing they want to do is obviously reasonable and potentially justifiable from multiple perspectives (do some monitoring of internal usage, fine-tune a model for forecasting/advice, use models to do safety research), such that they can pull somewhat more compute than just the head count would suggest.

I think this is the most important statement on AI risk to date. Where ChatGPT brought "AI could be very capable" into the Overton window, the CAIS Statement brought in AI x-risk. When I give talks to NGOs, business leaders, or government officials, I almost always include a slide with selected signatories and the full text:

Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.

I believe it's true, that it was important to say, and that it's had an ongoing, large, and positive impact. Thank you again to the organizers and to my many, many co-signatories.

I further suggest that, when using these defined terms, you include the actual probability range or point estimate in parentheses after the term, instead of including a table of definitions somewhere. This avoids any need to explain the conventions, and makes it clear at the point of use that the author had a precise quantitative definition in mind.

For example: it's likely (75%) that flipping a pair of fair coins will get less than two heads, and extremely unlikely (0-5%) that most readers of AI safety papers are familiar with the quantitative convention prop... (read more)
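As a quick check of both the coin arithmetic and the suggested convention, here's a small sketch; the `calibrated` helper is a hypothetical illustration, not an existing standard.

```python
from itertools import product

# Enumerate two fair coins: P(fewer than two heads) = 3/4.
outcomes = list(product("HT", repeat=2))
p_fewer_than_two_heads = sum(o.count("H") < 2 for o in outcomes) / len(outcomes)
assert p_fewer_than_two_heads == 0.75

# Hypothetical helper for the convention: the term plus an explicit number at the point of use.
def calibrated(term: str, prob: float) -> str:
    return f"{term} ({prob:.0%})"

print(calibrated("likely", p_fewer_than_two_heads))  # -> likely (75%)
```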

Without wishing to discourage these efforts, I disagree on a few points here:

Still, the biggest opportunities are often the ones with the lowest probability of success, and startups are the best structures to capitalize on them.

If I'm looking for the best expected value around, that's still monotonic in the probability of success! There are good reasons to think that most organizations are risk-averse (relative to the neutrality of linear $=utils) and startups can be a good way to get around this.
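A toy numerical illustration of that risk-aversion point, with made-up payoffs: expected value favours the moonshot, while a concave (risk-averse) utility prefers the safer bet.

```python
import math

# Illustrative numbers only: a low-probability moonshot vs a safe bet (payoffs in $M).
p_moonshot, payoff_moonshot = 0.02, 1_000
p_safe, payoff_safe = 0.90, 15

def ev(p, x):
    # Linear utility: expected dollars.
    return p * x

def eu(p, x):
    # Risk-averse utility: concave log(1 + x).
    return p * math.log1p(x)

print(ev(p_moonshot, payoff_moonshot), ev(p_safe, payoff_safe))  # 20.0 vs 13.5: moonshot wins on EV
print(eu(p_moonshot, payoff_moonshot), eu(p_safe, payoff_safe))  # ~0.14 vs ~2.49: safe bet wins
```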

Nonetheless, I remain concerned about regressional Good... (read more)

Judd Rosenblatt
Thanks, I appreciate your wanting these efforts not to be discouraged! I agree there's certainly a danger of AI safety startups optimizing for what will appeal to investors (not just with risk appetite but in many other dangerous ways too) and Goodharting rather than focusing purely on the most impactful work.

VCs themselves tend not to think as long-term as they should (even for their own economic interests), but I'm hopeful we can build an ecosystem around AI safety where they do more. Likely, the investors interested in AI safety will be inclined to think more long-term; the few early AI safety investors that exist today certainly are. I do think it's crucial (and possible!) for founders in this space to be very thoughtful about their true long-term goals and incentives around alignment, and to build the right structures around for-profit AI safety funding.

On your diversification point, for example, a windfall-trust-like arrangement for all AI safety startups to share in the value each other create could make a lot of sense, considering that even a very tiny bit of equity in the biggest winners may quickly be worth more than our entire economy today.

Inadequate equilibria are too bad, yeah, but they apply to all orgs, not just startups. As we pointed out in the post above, startups can in fact be a powerful antidote to inadequate equilibria. I think often the biggest opportunities for startups are actually solving inadequate equilibria, especially leveraging technology shifts/innovations, like electric cars. Ideal new structures to facilitate and govern maximal AI safety innovation would help fast-track solutions around these inadequate equilibria; in contrast, established systems are more prone to yielding inadequate equilibria due to their resistance to change.

I also think we may be underestimating how much people may come together to try to solve these problems as they increasingly come to take them seriously. Today at LessOnline, an interesting discuss

My personal opinion is that starting a company can be great, but I've also seen several fail due to the gaps between the founders' personal goals, a work-it-out-later business plan, and the duties that they and their board owe to their investors.

IMO any purpose-driven company should be founded as a Public Benefit Corporation, to make it clear in advance and in law that you'll also consider the purpose and the interests of people materially affected by the company alongside investor returns. (cf § 365. Duties of directors)

Judd Rosenblatt
Yes, you're right, and most startups do fail. That's how it works! Still, the biggest opportunities are often the ones with the lowest probability of success, and startups are the best structures to capitalize on them. This paradigm may fit AI safety well. Ideally we can engineer an ecosystem that creates enough startups that do succeed and substantially advance AI safety. It seems to me that aggressively expanding the AI safety startup ecosystem is one of the highest-value interventions available right now. Meanwhile, strongly agreed that AI-safety-driven startups should be B corps, especially if they're raising money.

I agree that there's no substitute for thinking about this for yourself, but I think that morally or socially counting "spending thousands of dollars on yourself, an AI researcher" as a donation would be an appalling norm. There are already far too many unmanaged conflicts of interest and trust-me-it's-good funding arrangements in this space for me, and I think it leads to poor epistemic norms as well as social and organizational dysfunction. I think it's very easy for donating to people or organizations in your social circle to have substantial negative ... (read more)

I think people who give up large amounts of salary to work in jobs that other people are willing to pay for from an impact perspective should totally consider themselves to have done good comparable to donating the difference between their market salary and their actual salary. This applies to approximately all safety researchers. 

The obvious targets are of course Anthropic's own frontier models, Claude Instant and Claude 2.

The section "Problem setup: what makes a good decomposition?" discusses what success might look like and enable - but note that decomposing models into components is just the beginning of the work of mechanistic interpretability! Even with perfect decomposition we'd have plenty left to do, unraveling circuits and building a larger-scale understanding of models.

One year is actually the typical term length for board-style positions, but because members can be re-elected their tenure is often much longer. In this specific case of course it's now up to the trustees!

Example projects you're not allowed to do, if they involve other model families:

  • using Llama 2 as part of an RLAIF setup, which you might want to do when investigating Constitutional AI or decomposition or faithfulness of chain-of-thought or many many other projects;
  • using Llama 2 in auto-interpretability schemes to e.g. label detected features in smaller models, if this will lead to improvements in non-Llama-2 models;
  • fine-tuning other or smaller models on synthetic data produced by Llama 2, which has some downsides but is a great way to check for signs
... (read more)

Llama 2 is not open source.

(a few days after this comment, here's a concurring opinion from the Open Source Initiative - as close to authoritative as you can get)

(later again, here's Yann LeCun testifying under oath: "so first of all Llama system was not made open source ... we released it in a way that did not authorize commercial use, we kind of vetted the people who could download the model it was reserved to researchers and academics")

While their custom licence permits some commercial uses, it is not an OSI-approved licence, and because it violates the ... (read more)

Rudi
Thanks a lot for the context! Out of curiosity, why does the model training restriction make it much less useful for safety research?

Huh, that's very useful context, thanks! Seems like pretty sad behaviour.

(Zac's note: I'm posting this on behalf of Jack Clark, who is unfortunately unwell today.  Everything below is his words.)

Hi there, I’m Jack and I lead our policy team. The primary reason it’s not discussed in the post is that the post was already quite long and we wanted to keep the focus on safety - I helped edit bits of the post and couldn’t figure out a way to shoehorn in stuff about policy without it feeling inelegant / orthogonal.

You do, however, raise a good point, in that we haven’t spent much time publicly explaining what we’re up t... (read more)

For Python basics, I have to anti-recommend Shaw's Learn Python the Hard Way; it's generally outdated and in some places actively misleading. And why would you want to learn the hard way instead of the best way in any case?

Instead, my standard recommendation is Al Sweigart's Automate the Boring Stuff and then Beyond the Basic Stuff (both readable for free on inventwithpython.com, or available in print); he's also written some books of exercises. If you prefer a more traditional textbook, Think Python 2e is excellent and also available freely online.

Neel Nanda
Thanks! I learned Python ~10 years ago and have no idea what sources are any good lol. I've edited the post with your recs :)

Unfortunately, the interpretable-composition hypothesis is simply wrong! Many bugs come from 'feature interactions', a term coined by Pamela Zave - and if it weren't for that, programming would be as easy as starting from e.g. Lisp's or Forth's tiny sets of primitives.
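A toy illustration of such an interaction, as a sketch (the payment example and names are invented): each piece is individually correct, but composing a retry wrapper with a non-idempotent operation produces a double charge.

```python
ledger = []
_ack_lost_once = [True]  # simulate a single lost acknowledgement from the network

class TransientNetworkError(Exception):
    pass

def charge(customer, amount):
    ledger.append((customer, amount))            # the charge itself is recorded...
    if _ack_lost_once and _ack_lost_once.pop():  # ...but the first acknowledgement is lost
        raise TransientNetworkError("ack lost")

def with_retry(fn, attempts=2):
    # Individually sensible: retry once on transient network failures.
    def wrapper(*args, **kwargs):
        last_err = None
        for _ in range(attempts):
            try:
                return fn(*args, **kwargs)
            except TransientNetworkError as err:
                last_err = err
        raise last_err
    return wrapper

with_retry(charge)("alice", 10)
print(ledger)  # [('alice', 10), ('alice', 10)] -- a double charge from the interaction
```

Neither component is buggy in isolation; the bug only exists in their composition, which is what makes "just compose well-understood pieces" harder than it sounds.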

As to a weaker form, well, yes - using a 'cleanroom' approach and investing heavily in formal verification (and testing) can get you an orders-of-magnitude lower error rate than ordinary software... at orders-of-magnitude greater cost. At more ordinarily-reasonable levels of rigor, I'd em... (read more)

lemonhope
Good point about validators failing silently and being more strongly vetted. Abstractly, it seems to me that once the tooling and process is figured out for one task in a narrow domain, you could reuse that stuff on other tasks in the same domain at relatively low cost. But the history of repeated similar vulnerabilities over long time ranges in narrow domains (e.g. GNU coreutils) is perhaps some evidence against that. I agree with the first half and would add that restricting the kind of interface has large additional benefit on top of the interfaces themselves. If you are dealing with classes and functions and singleton module things and macros then you're far more prone to error compared to just using any single one of those things, even if they are all simple. If system validation has exponential cost with respect to confidence and system size then I think the simplicity of the primitives is perhaps, uh, the base of the exponent. And 1.01^1000 is a lot smaller than 2^1000. My main point is that this is the difference between "annoying to validate" and "we will never ever be able to validate".

First, an apology: I didn't mean this to be read as an attack or a strawman, nor applicable to any use of theorem-proving, and I'm sorry I wasn't clearer. I agree that formal specification is a valuable tool and research direction, a substantial advancement over informal arguments, and only as good as the assumptions. I also think that hybrid formal/empirical analysis could be very valuable.

Trying to state a crux, I believe that any plan which involves proving corrigibility properties about MuZero (etc) is doomed, and that safety proofs about simpler app... (read more)

Vanessa Kosoy
I think we might have some disagreement about degree more than about kind. I think that we are probably going to design architectures that make proving easier rather than proving things about architectures optimized only for capability, but not necessarily. Moreover, some qualitative properties are not sensitive to architecture and we can prove them about classes of architectures that include those optimized for capability. And, I think humans also belong to a useful class of agents with simple description (e.g. along the lines I described here) and you don't need anything like a detailed model of bias. And, people do manage to prove some things about large models, e.g. this, just not enough things. And, some of the proofs might be produced by the system itself in runtime (e.g. the system will have a trustworthy/rigorous part and an untrustworthy/heuristic part and the rigorous part will make sure the heuristic part is proving the safety of its proposals before they are implemented).

I think the pipeline of success looks something like theoretical models => phenomenological models (i.e. models informed by a combination of theory and experiment) => security-mindset engineering (i.e. engineering that keeps track of the differences between models and reality and makes sure they are suitably bounded / irrelevant) => plethora of security-mindset testing methods, including but not limited to formal verification (i.e. aiming for fool-proof test coverage while also making sure that, modulo previous tests, each test involves using the system in safe ways even if it has bugs). And ofc it's not a pure waterfall, there is feedback from each stage to previous stages.

I was halfway through a PhD on software testing and verification before joining Anthropic (opinions my own, etc), and I'm less convinced than Eliezer about theorem-proving for AGI safety.

There are so many independently fatal objections that I don't know how to structure this or convince someone who thinks it would work. I am therefore offering a $1,000 prize for solving a far easier problem:

Take an EfficientNet model with >= 99% accuracy on MNIST digit classification. What is the largest possible change in the probability assigned to some class betw

... (read more)
Buck Shlegeris
Am I correct that you wouldn't find a bound acceptable, and that you specifically want the exact maximum?

You're attacking a strawman of what kind of theorems we want to prove. Obviously we are not going to prove theorems that contain specific datasets as part of the statement. What we're going to do is build a theory founded on certain assumptions about the real world (such as locality / decoupling of dynamics on different scales / some kind of chaos / certain bounds on computational complexity / existence of simple fundamental laws etc) and humans (e.g. that they are approximately rational agents, for some definition thereof). Such a theory can produce many ... (read more)

High Impact Careers in Formal Verification: Artificial Intelligence

My research focuses on advanced testing and fuzzing tools, which are so much easier to use that people actually use them - e.g. in PyTorch, and I understand at DeepMind. If people seem interested I could write up a post on their relevance to AI safety in a few weeks.

Core idea: even without proofs, writing out safety properties or other system invariants in code is valuable both (a) for deconfusion, and (b) because we can have a computer search for counterexamples using a variety of heuristics an... (read more)
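For a flavour of what "invariants in code plus automated counterexample search" looks like, here's a minimal sketch using the Hypothesis library; the clamping function and its invariants are illustrative assumptions rather than anything from a real system.

```python
from hypothesis import given, strategies as st

def clip_action(a: float, low: float = -1.0, high: float = 1.0) -> float:
    # Hypothetical post-processing step: clamp a controller output to a safe range.
    return min(max(a, low), high)

@given(st.floats(allow_nan=False, allow_infinity=False))
def test_clip_action_invariants(a):
    out = clip_action(a)
    assert -1.0 <= out <= 1.0          # the safety envelope always holds
    if -1.0 <= a <= 1.0:
        assert out == a                # already-safe inputs pass through unchanged

test_clip_action_invariants()  # Hypothesis generates inputs and hunts for counterexamples
```

When a property like this fails, Hypothesis shrinks the failure to a minimal counterexample, which is often the fastest route to the actual bug.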

Rohin Shah
My general take is that anything relying on safety properties / system invariants being written out in formal languages / interpretable code seems not that helpful, because we just don't get such clean safety properties / system invariants. I am pretty enthusiastic about relying on learned models of safety properties / system invariants, that are assumed to be good but not perfect specifications; and fuzzing / testing with respect to those learned models seems great.