David Manheim

Sequences

Modeling Transformative AI Risk (MTAIR)

Comments

Regarding chess agents, Vanessa pointed out that while only perfect play is optimal, informally we would consider agents to have an objective that is better served by slightly better play; for example, an agent rated 2500 Elo is better than one rated 1800, which is better than one rated 1000, etc. That means that lots of "chess minds" which are non-optimal are still somewhat rational in pursuing their goal.
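To make the rating comparison concrete, here is a minimal sketch using the standard Elo expected-score formula. The function name `expected_score` is mine, and the sketch assumes both ratings are on the same scale, which need not hold across different rating pools.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score for A against B under the standard Elo model.

    A win counts as 1, a draw as 0.5, and a loss as 0.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# The rating pairs from the comment above (assumed to share one rating scale).
for a, b in [(2500, 1800), (1800, 1000)]:
    print(f"{a} vs {b}: expected score {expected_score(a, b):.3f}")
```

Under this model, both gaps give the stronger agent an expected score of roughly 0.98 to 0.99 per game, which is the sense in which a non-optimal agent can still be much better at the goal than a weaker one.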

I think that it's very likely that even according to this looser definition, almost all chess moves, and therefore almost all "possible" chess bots, fail to do much to accomplish the goal. 
We could check this informally by evaluating what proportion of the possible moves in normal games would be classified as blunders, using a method such as the one used here to evaluate what proportion of actual moves made by players are blunders. Figure 1 there implies that in positions with many legal moves, a larger proportion are blunders - but this is looking at the empirical blunder rate of those good enough to be playing ranked chess. Another method would be to look at a bot that actually implements "pick a random legal move" - namely Brutus RND. It has an Elo of 255 when ranked against other amateur chess bots, and wins only occasionally against some of the worst bots; it seems hard to figure out from that what proportion of moves are good, but it's evidently a fairly small proportion.

I think this was a valuable post, albeit ending up somewhat incorrect about whether LLMs would be agentic - not because they developed the capacity on their own, but because people intentionally built and are building structure around LLMs to enable agency. That said, the underlying point stands - it is very possible that LLMs could be a safe foundation for non-agentic AI, and many research groups are pursuing that today.

I think this post makes an important and still neglected claim that people should write their work more clearly and get it published in academia, instead of embracing the norms of the narrower community they interact with. There has been significant movement in this direction in the past 2 years, and I think this post marks a critical change in what the community suggests and values in terms of output.

Are you familiar with Davidad's program working on compositional world modeling? (The linked notes are from before the program was launched; there is ongoing work on the topic.)

I ask because embedded agents and agents in multi-agent settings should need compositional world models that include models of themselves and other agents, which implies that hierarchical agency is included in what they would need to solve.

It also relates closely to work Vanessa is doing (as an "ARIA Creator") in learning-theoretic AI, related to what she has called "Frugal Compositional Languages"; see also this work by @alcatal. As I understand it, neither yet addresses multi-agent world models, nor is either explicitly about modeling the agents themselves in a compositional / embedded-agent way, though those are presumably desiderata.

I have a lot more to say about this, and think it's worth responding to in much greater detail, but I think that overall, the post criticizes Omohundro and Tegmark's more extreme claims somewhat reasonably, though very uncharitably, and then assumes that other proposals which seem to be related, especially Dalrymple et al.'s approach, are essentially the same, and doesn't engage with the specific proposal at all.

To be very specific about how I think the post is unreasonable, there are a number of places where a seeming steel-man version of the proposals is presented, and then this steel-manned version, rather than the initial proposal for formal verification, is attacked. But this amounts to a straw-man criticism of the actual proposals being discussed!

For example, this post suggests that arbitrary DNA could be proved safe by essentially impossible modeling ("on-demand physical simulations of entire human bodies (with their estimated 36 trillion cells [9]), along with the interactions between the cells themselves and the external world and then run those simulations for years"). It is true that this would work - but the proposal ostensibly being criticized was to check narrower questions about whether DNA synthesis is being used to produce something harmful. And Dalrymple et al. explained explicitly what they might have included elsewhere in the paper ("Examples include machines provably impossible to login to without correct credentials, DNA synthesizers that provably cannot synthesize certain pathogens, and AI hardware that is provably geofenced, time-limited (“mortal”) or equipped with a remote-operated throttle or kill-switch. Provably compliant sensors can be specified to ensure “zeroization”, in which tampering with PCH is guaranteed to cause detection and erasure of private keys.")

I'll note that I think this is a mistake that lots of people working in AI safety have made, ignoring the benefits of academic credentials and prestige because of the obvious costs and annoyance. It's not always better to work in academia, but it's also worth really appreciating the costs of not doing so in forgone opportunities and experience, as Vanessa highlighted. (Founder effects matter; Eliezer had good reasons not to pursue this path, but I think others followed that path instead of evaluating the question clearly for their own work.)

And in my experience, much of the good work coming out of AI Safety has been sidelined because it fails the academic prestige test, and so it fails to engage with academics who could contribute or who have done closely related work. Other work avoids or fails the publication process because the authors don't have the right kind of guidance and experience to get their papers into the right conferences and journals. Not only is that work often worse for not getting feedback from peer review, but it also doesn't engage others in the research area.

A story of how that happens:

In future (unsafe but) sophisticated systems, models will have access to external storage, and be given the ability to read/write. For example, AutoGPT-like systems will have shared scratchpads across instances of the model, to coordinate with themselves. It seems instrumentally convergent for such systems to store things related to their true / eventual goals in such places, for example, so that they can be referenced by local instances. And if we do some sort of oversight of that data storage, key facts that the model wants to preserve might be obfuscated.

Of course, this is only one story, and I don't expect it to be the only way such things could happen, but it seems to be a reasonable candidate as a failure mode for systems designed with normal (read: insufficient, but far more than otherwise expected) levels of caution.

For a defense of people pursuing a mathematical approach of a type you think isn't valuable, see my recent post.
(That does not address the correct issue you raised about requisite variety, but some work on HRAD does do so explicitly - such as embedded agency.)

Just noting that given more recent developments than this post, we should be majorly updating on recent progress towards Andrew Critch's strategy. (Still not more likely than not to succeed, but we still need to assign some Bayes points to Critch, and take some away from Nate.)

I think it would be really good to come up with a framing of these intuitions that wouldn't be controversial.

 

That seems great, I'd be very happy for someone to write this up more clearly. My key point was about people's claims and confidence about safety, and yes, clearly that was communicated less well than I hoped.
