AI ALIGNMENT FORUM
AF

All of Davidmanheim's Comments + Replies

Dream, Truth, & Good

This seems reasonable, though efficacy of the learning method seems unclear to me.

But:

with a heavily-reinforced constraint that the author vectors are identical for documents which have the same author

This seems wrong. To pick on myself, my peer reviewed papers, my substack, my lesswrong posts, my 1990s blog posts, and my twitter feed are all substantively different in ways that I think the author vector should capture.

2Abram Demski1mo

My guess is that we want to capture those differences with the time&date meta-data instead (and to some extent, location and other metadata). That way, we can easily query what you-in-particular would say at other periods in your life (such as the future). However, I agree that this is at least not obvious. Maybe a better way to do it would be to explicitly take both approaches, so that there's an abstract-you vector which then gets mapped into a particular-you author space via combination with your age (ie with date&time). This attempts to explicitly capture the way you change over time (we can watch your vector move through the particular-author space), while still allowing us to query what you would say at times where we don't have evidence in the form of writing from you. Ideally, imagining the most sophisticated version of the setup, the model would be able to make date&time attributions very fine-grained, guessing when specific words were written & constructing a guessed history of revisions for a document. This complicates things yet further.

Dream, Truth, & Good

There's a critical (and interesting) question about how you generate the latent space of authors, and/or how it is inferred from the text. Did you have thoughts on how this would be done?

3Abram Demski1mo

My idea is very similar to paragraph vectors: the vectors are trained to be useful labels for predicting the tokens. To differentiate author-vectors from other types of metadata, the author vectors should be additionally trained to predict author labels, with a heavily-reinforced constraint that the author vectors are identical for documents which have the same author. There's also the author-vector-to-text-author-attribution network, which should be pre-trained to have a good "prior" over author-names (so we're not getting a bunch of nonsense strings out). During training, the text author-names are being estimated alongside the vectors (where author labels are not available), so that we can penalize different author-vectors which map to the same name. (Some careful thinking should be done about how to handle people with the actual same name; perhaps some system of longer author IDs?) Other meta-data would be handled similarly.

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

I think this is correct, but doesn't seem to note the broader trend towards human disempowerment in favor of bureaucratic and corporate systems, which this gradual disempowerment would continue, and hence elides or ignores why AI risk is distinct.

Most Minds are Irrational

Regarding Chess agents, Vanessa pointed out that while only perfect play is optimal, informally we would consider agents to have an objective that is better served by slightly better play, for example, an agent rated 2500 ELO is better than one rated 1800, which is better than one rated 1000, etc. That means that lots of "chess minds" which are non-optimal are still somewhat rational at their goal.

I think that it's very likely that even according to this looser definition, almost all chess moves, and therefore almost all "possible" chess bots, fail to do m... (read more)

Language Models are a Potentially Safe Path to Human-Level AGI

4mo10Review for 2023 Review

I think this was a valuable post, albeit ending up somewhat incorrect about whether LLMs would be agentic - not because they developed the capacity on their own, but because people intentionally built and are building structure around LLMs to enable agency. That said, the underlying point stands - it is very possible that LLMs could be a safe foundation for non-agentic AI, and many research groups are pursuing that today.

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

4mo47Review for 2023 Review

I think this post makes an important and still neglected claim that people should write their work more clearly and get it published in academia, instead of embracing the norms of the narrower community they interact with. There has been significant movement in this direction in the past 2 years, and I think this posts marks a critical change in what the community suggests and values in terms of output.

Hierarchical Agency: A Missing Piece in AI Alignment

Are you familiar with Davidad's program working on compositional world modeling? (The linked notes are from before the program was launched, there is ongoing work on the topic.)

The reason I ask is because embedded agents and agents in multi-agent settings should need compositional world models that include models of themselves and other agents, which implies that hierarchical agency is included in what they would need to solve.

It also relates closely to work Vanessa is doing (as an "ARIA Creator") in learning theoretic AI, related to what she has cal... (read more)

Limitations on Formal Verification for AI Safety

I have a lot more to say about this, and think it's worth responding to in much greater detail, but I think that overall, the post criticizes Omhundro and Tegmark's more extreme claims somewhat reasonably, though very uncharitably, and then assumes that other proposals which seem to be related, especially Dalyrymple et al. approach, are essentially the same, and doesn't engage with the specific proposal at all.

To be very specific about how I think the post in unreasonable, there are a number of places where a seeming steel-man version of the proposals are ... (read more)

Linear infra-Bayesian Bandits

I'll note that I think this is a mistake that lots of people working in AI safety have made, ignoring the benefits of academic credentials and prestige because of the obvious costs and annoyance. It's not always better to work in academia, but it's also worth really appreciating the costs of not doing so in foregone opportunities and experience, as Vanessa highlighted. (Founder effects matter; Eliezer had good reasons not to pursue this path, but I think others followed that path instead of evaluating the question clearly for their own work.)

And in m... (read more)

3 levels of threat obfuscation

2y20

A story of how that happens:

In future (unsafe but) sophisticated systems, models will have access to external storage, and be given the ability to read/write. For example, AutoGPT-like systems will have shared scratchpads across instances of the model, to coordinate with themselves. It seems instrumentally convergent for such systems to store things related to their true / eventual goals in such places, for example, so that they can be referenced by local instances. And if we do some sort of oversight of that data storage, key facts that the model wants to... (read more)

2Matthew "Vaniver" Gray2y

But gradient descent will still change the way that the system interprets things in its data storage, right? I guess part of the question here is whether gradient descent will even scale to AutoGPT-like systems. You're probably not going to be able to differentiate thru your external notes / other changes you could make to your environment.

For alignment, we should simultaneously use multiple theories of cognition and value

2y10

For a defense of people pursuing a mathematical approach of a type you think isn't valuable, see my recent post.
(That does not address the correct issue you raised about requisite variety, but some work on HRAD does do so explicitly - such as embedded agency.)

On how various plans miss the hard bits of the alignment challenge

2y33

Just noting that given more recent developments than this post, we should be majorly updating on recent progress towards Andrew Critch's strategy. (Still not more likely than not to succeed, but we still need to assign some Bayes points to Critch, and take some away from Nate.)

Systems that cannot be unsafe cannot be safe

2y41

I think it would be really good to come up with a framing of these intuitions that wouldn't be controversial.

That seems great, I'd be very happy for someone to write this up more clearly. My key point was about people's claims and confidence about safety, and yes, clearly that was communicated less well than I hoped.

Systems that cannot be unsafe cannot be safe

2y22

That's true - and from what I can see, this emerges from the culture in academia. There, people are doing research, and the goal is to see if something can be done, or to see what happens if you try something new. That's fine for discovery, but it's insufficient for safety. And that's why certain types of research, ones that pose dangers to researchers or the public, have at least some degree of oversight which imposes safety requirements. ML does not, yet.

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

2y33

Thanks, reading closely I see how you said that, but it wasn't clear initially. (There's an illusion of disagreement, which I'll christen the "twitter fight fallacy," where unless the opposite is said clearly, people automatically assume replies are disagreements.)

Your posts should be on arXiv

2y20

I probably put in an extra 20-60 hours, so the total is probably closer to 150 - which surprises me. I will add that a lot of the conversion time was dealing with writing more, LaTeX figures and citations, which were all, I think, substantive valuable additions. (Changing to a more scholarly style was not substantively valuable, nor was struggling with latex margins and TikZ for the diagrams, and both took some part of the time.)

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

2y10

Thanks, agreed. And as an aside, I don't think it's entirely coincidental that neither of the people who agree with you are in the Bay.

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

2y30

I think that the costs usually are worth it far more often than it occurs, from an outside view - which was David's point, and what I was trying to respond to. I think that it's more valuable than one expects to actually just jump through the hoops. And especially for people who haven't yet ever had any outputs actually published, they really should do that at least once.

(Also, sorry for the zombie reply.)

2Daniel Kokotajlo2y

I love zombie replies. If you reread this conversation, you'll notice that I never said I think these people are correct. I was just saying that their stated motivations and views are their real motivations and views. I actually do agree with you and David Krueger that on the margin more LW types should be investing in making their work publishable and even getting it published. The plan had always been "do research first, then communicate it to the world when the time is right" well now we are out of time so the time is right.

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

2y45

You're very unusually proactive, and I think the median member of the community would be far better served if they were more engaged the way you are. Doing that without traditional peer reviewed work is fine, but unusual, and in many ways is more difficult than peer-reviewed publication. And for early career researchers, I think it's hard to be taken seriously without some more legible record - you have a PhD, but many others don't.

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

2y34

To respond briefly, I think that people underinvest in (D), and write sub-par forum posts rather than aim for the degree of clarity that would allow them to do (E) at far less marginal cost. I agree that people overinvest in (B)^[1], but also think that it's very easy to tell yourself your work is "actual progress" when you're doing work that, if submitted to peer-reviewed outlets, would be quickly demolished as duplicative of work you're unaware of, or incompletely thought-out in other ways.

I also worry that many people have never written a peer reviewed p... (read more)

2mikbp2y

This should be obvious for everyone! As an outside observer and huge sympathizer, it is super-frustrating how siloed the broad EA/rational/AI-alignment/adjacent community is --this specific issue with publication is only one of the consequences. Many of "you people" only interacting between "yourselves" (and I'm not referring to you, Davids), very often even socially. I mean, you guys are trying to do the most good possible, so help others use and leverage on your work! And don't waste time reinventing what is already common or, at least, what already exists outside. More mixing would also help prevent Leverage-style failures and probably improve what from the outside seems like a very weird and unhealthy "bay area social dynamics" (as put by Kaj here).

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

2y10

That seems right.

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

2y10

they don't judge those costs to be worth it

Worth it to whom? And if they did work that's valuable, how much of that value is lost if others who could benefit don't see it, because it's written up only informally or not shared widely?

3Daniel Kokotajlo2y

Worth it to the world/humanity/etc. though maybe some of them are more self-focused. Probably a big chunk of it is lost for that reason yeah. I'm not sure what your point is, it doesn't seem to be a reply to anything I said.

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

2y20

There have also been plenty of other adapatations, ones which were not low-effort. I worked on 2, the Goodhart's law paper and a paper with Issa Rice on HRAD. Both were very significantly rewritten and expanded into "real" preprints, but I think it was clearly worthwhile.

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

2y31

And I think the post here is saying that you should jump through those effort and editing hoops far more often than currently occurs.

3Raymond Arnold2y

Yeah, I didn't mean to be responding to that point one way or another. It just seemed bad to be linking to a post that (seems to still?) communicate false things, without flagging those false things. (post still says "it can be as easy as creating a pdf of your post", which my impression maybe technically true on rare occasions but basically false in practice?)

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

2y67

If someone says the opportunity cost is not worth it for them, I see that as a claim that a priori might be true or false. Your post seems to imply that almost everyone is making an error in the same direction, and therefore funders should put their thumb on the scale. That’s at least not obvious to me.

I do think this is the wrong calculation, and the error caused by it is widely shared and pushes in the same direction.

Publication is a public good, where most of the benefit accrues to others / the public. Obviously costs to individuals are higher tha... (read more)

2y55

If we compare

(A) “actual progress”, versus
(B) “legible signs of progress”,

it seems obvious to me that everyone has an incentive to underinvest in (A) relative to (B). You get grants & jobs & status from (B), not (A), right? And papers can be in (B) without being minimally or not at all in (A).

In academia, people talk all the time about how people are optimizing their publication record to the detriment of field-advancement, e.g. making results sound misleadingly original and important, chasing things that are hot, splitting results into unnecessari... (read more)

Alexander and Yudkowsky on AGI goals

2y10

So the code that wires a 100-trillion-synapse human brain is about 7.5 megabytes. Now an adult human contains a lot more information than this.

Minor quibble which seems to have implications - "There is a consensus that there are roughly about 100 billion neurons total in the human brain. Each of these neurons can have up to 15,000 connections with other neurons via synapses"

My rough understanding is that babies' brains greatly increase how many synapses there are until age 2 or 3, then these are eliminated or become silent in older children and adult... (read more)

Mysteries of mode collapse

2y10

Thinking about this a bit, (not a huge amount,) I think the specific example "are bugs real" ends up looking interesting in part because the word "bugs" in the prompt has incredibly low likelihood. (As does the following word, "real")

So the model is conditioning on very low likelihood inputs, which seems like part of the reason for the behavior.

1mic2y

The prompt "Are birds real?" is somewhat more likely, given the "Birds aren't real" conspiracy theory, but still can yield a similarly formatted answer to "Are bugs real?" The answer makes a lot more sense when you ask a question like "Are monsters real?" or "Are ghosts real?" It seems that with FeedMe, text-davinci-002 has been trained to respond with a template answer about how "There is no one answer to this question", and it has learned to misgeneralize this behavior to questions about real phenomena, such as "Are bugs real?"

Don't design agents which exploit adversarial inputs

2y1-1

I think what you call grader-optimization is trivially about how a target diverges from the (unmeasured) true goal, which is adversarial goodhart (as defined in paper, especially how we defined Campbell’s Law, not the definition in the LW post.)

And the second paper's taxonomy, in failure mode 3, lays out how different forms of adversarial optimization in a multi-agent scenario relate to Goodhart's law, in both goal poisoning and optimization theft cases - and both of these seem relevant to the questions you discussed in terms of grader-optimization.

Don't design agents which exploit adversarial inputs

2y10

Fixed - thanks!

Don't design agents which exploit adversarial inputs

2y20

This relates closely to how to "solve" Goodhart problems in general. Multiple metrics / graders make exploitation more complex, but have other drawbacks. I discussed the different approaches in my paper here, albeit in the realm of social dynamics rather than AI safety.

Don't design agents which exploit adversarial inputs

2y3-9

This seems great!

If you are continuing work in this vein, I'd be interested in you looking at how these dynamics relate to different Goodhart failure modes, as we expanded on here. I think that much of the problem relates to specific forms of failure, and that paying attention to those dynamics could be helpful. I also think they accelerate in the presence of multiple agents - and I think the framework I pointed to here might be useful.

2Alex Turner2y

I'm not sure I understand what you mean by "specific forms of failure." Could you give me a more concrete example of how Goodhart relates to the ideas in this essay?

3Alex Turner2y

(Your second link is broken.)

AI coordination needs clear wins

3y3-4

Is the best way to suggest how to do political and policy strategy, or coordination, to post it publicly on Lesswrong? This seems obviously suboptimal, and I'd think that you should probably ask for feedback and look into how to promote cooperation privately first.

That said, I think everything you said here is correct on an object level, and worth thinking about.

3Evan Hubinger3y

I have done this also.

Your posts should be on arXiv

3y810

Strongly agree. Three examples of work I've put on Arxiv which originated from the forum, which might be helpful as a touchstone. The first was cited 7 times the first year, and 50 more times since. The latter two were posted last year, and have not been indexed by Google as having been cited yet.

As an example of a technical but fairly conceptual paper, there is the Categorizing Goodhart's law paper. I pushed for this to be a paper rather than just a post, and I think that the resulting exposure was very worthwhile. Scott wrote the original pos... (read more)

3Issa Rice2y

I didn't log the time I spent on the original blog post, and it's kinda hard to assign hours to this since most of the reading and thinking for the post happened while working on the modeling aspects of the MTAIR project. If I count just the time I sat down to write the blog post, I would guess maybe less than 20 hours. As for the "convert the post to paper" part, I did log that time and it came out to 89 hours, so David's estimate of "perhaps another 100 hours" is fairly accurate.

Your posts should be on arXiv

3y23

Seconding the .tex export, since it's much more useful than just getting a pdf!

What Makes A Good Measurement Device?

3y10

That's correct. My point is that measuring goals which are not natural to measure will, in general, have many more problems with Goodharting and similar misoptimization and overoptimization pressures. And other approaches can be more productive, or at least more care is needed with design of metrics rather than discovery of what to measure and how.

What Makes A Good Measurement Device?

3y30

I think this is going to be wrong as an approach. Weight and temperature are properties of physical systems at specific points in time, and can be measured coherently because we understand laws about those systems. Alignment could be measured as a function of a particular system at a specific point in time, once we have a clear understanding of what? All of human values?

3y66

I'm not arguing that "alignment" specifically is the thing we should be measuring.

More generally, a useful mantra is "we do not get to choose the ontology". In this context, it means that there are certain things which are natural to measure (like temperature and weight), and we do not get to pick what they are; we have to discover what they are.

The Problem With The Current State of AGI Definitions

3y10

Depends on how you define the measure over jobs. If you mean "the jobs of half of all people," probably true. If you mean "half of the distinct jobs as they are classified by NAICS or similar," I think I disagree.

[$20K in Prizes] AI Safety Arguments Competition

3y10

Question: "effective arguments for the importance of AI safety" - is this about arguments for the importance of just technical AI safety, or more general AI safety, to include governance and similar things?

[$20K in Prizes] AI Safety Arguments Competition

3y30

Think of it as a "practicing a dark art of rationality" post, and I'd think it would seem less off-putting.

2Ben Pace3y

I think it would be less "off-putting" if we had common knowledge of it being such a post. I think the authors don't think of it as that from reading Sidney's comment.

[Closed] Hiring a mathematician to work on the learning-theoretic AI alignment agenda

3y30

Please feel free to repost this elsewhere, and/or tell people about it.

And if there is anyone interested in this type of job, but is currently still in school or for other reasons is unable to work full time at present, we encourage them to apply and note the circumstances, as we may be able to find other ways to support their work, or at least collaborate and provide mentorship.

AI Governance across Slow/Fast Takeoff and Easy/Hard Alignment spectra

3y20

I'm not sure I agree with the compatibility of discontinuity and prosaic alignment, though you make a reasonable case, but I do think there is compatibility between slower governance approaches and discontinuity, if it is far enough away.

AI Governance across Slow/Fast Takeoff and Easy/Hard Alignment spectra

3y10

In the post, I wanted to distinguish between two things you're now combining; how hard alignment is, and how long we have. And yes, combining these, we get the issue of how hard it will be to solve alignment in the time frame we have until we need to solve it. But they are conceptually distinct.

And neither of these directly relates to takeoff speed, which in the current framing is something like the time frame from when we have systems that are near-human until they hit a capability discontinuity. You said "First off, takeoff speed and timing are correlate... (read more)

2Samuel Dylan Martin3y

Like I said in my first comment, the in practice difficulty of alignment is obviously connected to timeline and takeoff speed. But you're right that you're talking about the intrinsic difficulty of alignment Vs takeoff speed in this post, not the in practice difficulty. But those are also still correlated, for the reasons I gave - mainly that a discontinuity is an essential step in Eleizer style pessimism and fast takeoff views. I'm not sure how close this correlation is. Do these views come apart in other possible worlds? I.e. could you believe in a discontinuity to a core of general intelligence but still think prosaic alignment can work? I think that potentially you can - if you think that still enough capabilities in pre-HLMI AI (pre discontinuity) to help you do alignment research before dangerous HLMI shows up. But prosaic alignment seems to require more assumptions to be feasible assuming a discontinuity, like that the discontinuity doesn't occur before all the important capabilities you need to do good alignment research.

Section 7: Foundations of Rational Agency

3y10

Relevant to this agenda are the failure modes I discussed in my multi-agent failures paper, which seems worth looking at in this context.

3y40

I'm skeptical that many of the problems with aggregation don't both apply to actual individual human values once extrapolated, and generalize to AIs with closely related values, but I'd need to lay out the case for that more clearly. (I did discuss the difficulty of cooperation even given compatible goals a bit in this paper, but it's nowhere near complete in addressing this issue.)

3Stuart Armstrong3y

It's worth you write up your point and post it - that tends to clarify the issue, for yourself as well as for others.

3y30

This seems fragile in ways that make me less optimistic about the approach overall. We have strong reasons to think that value aggregation is intractable, and (by analogy,) in some ways the problem of coherence in CEV is the tricky part. That is, the problem of making sure that we're not Dutch book-able is, IIRC, NP-complete, and even worse, the problem of aggregating preferences has several impossibility results.

Edit: To clarify, I'm excited about the approach overall, and think it's likely to be valuable, but this part seems like a big problem.

0Stuart Armstrong3y

I've posted on the theoretical difficulties of aggregating the utilities of different agents. But doing it in practice is much more feasible (scale the utilities to some not-too-unreasonable scale, add them, maximise sum). But value extrapolation is different from human value aggregation; for example, low power (or low impact) AIs can be defined with value extrapolation, and that doesn't need human value aggregation.

Inner Alignment: Explain like I'm 12 Edition

3y60Review for 2020 Review

This post is both a huge contribution, giving a simpler and shorter explanation of a critical topic, with a far clearer context, and has been useful to point people to as an alternative to the main sequence. I wouldn't promote it as more important than the actual series, but I would suggest it as a strong alternative to including the full sequence in the 2020 Review. (Especially because I suspect that those who are very interested are likely to have read the full sequence, and most others will not even if it is included.)

Non-Adversarial Goodhart and AI Risks

3y20

Yes on point Number 1, and partly on point number 2.

If humans don't have incredibly complete models for how to achieve their goals, but know they want a glass of water, telling the AI to put a cup of H2O in front of them can create weird mistakes. This can even happen because of causal connections the humans are unaware of. The AI might have better causal models than the humans, but still cause problems for other reasons. In this case, a human might not know the difference between normal water and heavy water, but the AI might decide that since there are t... (read more)

AI learns betrayal and how to avoid it

3y30

This seems really exciting, and I'd love to chat about how betrayal is similar to or different than manipulation. Specifically, I think the framework I proposed in my earlier multi-agent failure modes paper might be helpful in thinking through the categorization. (But note that I don't endorse thinking of everything as Goodhart's law, despite that paper - though I still think it's technically true, it's not as useful as I had hoped.)

Paths To High-Level Machine Intelligence

4y20

On the topic of growth rate of computing power, it's worth noting that we expect the model which experts have to be somewhat more complex that what we represented as "Moore's law through year " - but as with the simplification regarding CPU/GPU/ASIC compute, I'm unsure how much this is really a crux for anyone about the timing for AGI.

I would be very interested to hear from anyone who said, for example, "I would expect AGI by 2035 if Moore's law continues, but I expect it to end before 2030, and it will therefore likely take until 2050 to reach HLMI/AGI."

2Daniel_Eth4y

I think very few people would explicitly articulate a view like that, but I also think there are people who hold a view along the lines of, "Moore will continue strong for a number of years, and then after that compute/$ will grow at <20% as fast" – in which case, if we're bottlenecked on hardware, whether Moore ends several years earlier vs later could have a large effect on timelines.

Analogies and General Priors on Intelligence

4y40

I mostly agree, but we get into the details of how we expect improvements can occur much more in the upcoming posts on paths to HLMI and takeoff speeds.