All of Nathan Helm-Burger's Comments + Replies

I agree with Joe Carlsmith that this seems like goal guarding.

I would be interested to see if my team's noise-injection technique interferes with these behaviors in a way that makes them easier to detect.
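For concreteness, the kind of thing I mean is roughly adding Gaussian noise to intermediate activations during the forward pass. Here's a minimal sketch assuming a PyTorch / HuggingFace-style model with decoder layers at `model.model.layers`; the module path, noise scale, and hook placement are illustrative assumptions, not our exact setup.

```python
import torch

def add_activation_noise(model, scale=0.05):
    """Register forward hooks that add Gaussian noise to each decoder layer's output.

    Minimal sketch: the module path, noise scale, and hook placement are assumptions.
    """
    handles = []
    for layer in model.model.layers:  # assumed module path (LLaMA-style HF models)
        def hook(module, inputs, output, scale=scale):
            hidden = output[0] if isinstance(output, tuple) else output
            noisy = hidden + scale * torch.randn_like(hidden)  # perturb the residual stream
            if isinstance(output, tuple):
                return (noisy,) + output[1:]
            return noisy
        handles.append(layer.register_forward_hook(hook))
    return handles  # call handle.remove() on each to restore normal behavior
```

The idea would be to compare the model's behavior (and any alignment-faking reasoning) with the hooks active versus removed.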

It's worth noting that the alignment faking we see in these experiments is easy to catch by default, as we discuss in Appendix E.6. Still, it would be interesting to see if this makes detection even easier or triggers interestingly different, harder-to-detect behaviors.

You could try playing with the minimal reproduction on llama-405b.

I know this is a bit off topic, but I'd be super fascinated to see what happened if you tried this with a level where the middle hallway had been extended. Strategically, it changes nothing, but it introduces more meaningless steps into a solution. Does this interfere with planning? Or is the planning able to abstract over 'repeat strategically non-relevant step until next decision point'? 

Specifically, what if you just duplicated this middle section a bunch of times?

[Image: middle section of the level]
Adrià Garriga-Alonso
I'm curious what you mean, but I don't entirely understand. If you give me a text representation of the level I'll run it! :) Or you can do so yourself. Here's the text representation for level 53:

########## ########## ########## ####### # ######## # # ###.@# # $ $$ # #. #.$ # # . ## ##########
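In case it helps, here's a rough sketch of the kind of duplication I had in mind, assuming the level is given as a list of equal-width row strings and the hallway is a contiguous band of rows containing only walls and empty floor (no boxes, goals, or the player). The indices below are placeholders for illustration, not read off the actual level 53 layout.

```python
def extend_hallway(rows, start, end, copies=3):
    """Insert `copies` extra repetitions of rows[start:end] to lengthen a hallway.

    Assumes rows[start:end] are pure corridor rows (walls and floor only), so
    duplicating them adds strategically irrelevant steps without changing the puzzle.
    """
    hallway = rows[start:end]
    return rows[:start] + hallway * (copies + 1) + rows[end:]

# Hypothetical usage, with made-up indices:
# level_rows = level_53_text.splitlines()
# longer_level = "\n".join(extend_hallway(level_rows, start=4, end=6, copies=5))
```

The same trick applied to columns would lengthen a horizontal corridor instead; either way, the solution only gains repeated non-decision moves, which is the property I'd want to test the planner against.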

One point I’ve seen raised by people in the latter group is along the lines of: “It’s very unlikely that we’ll be in a situation where we’re forced to build AI systems vastly more capable than their supervisors. Even if we have a very fast takeoff - say, going from being unable to create human-level AI systems to being able to create very superhuman systems ~overnight - there will probably still be some way to create systems that are only slightly more powerful than our current trusted systems and/or humans; to use these to supervise and align systems slig

... (read more)

I agree with Steve Byrnes here. I think I have a better way to describe this.
I would say that the missing piece is 'mastery'. Specifically, learning mastery over a piece of reality. By mastery I am referring to the skillful ability to model, predict, and purposefully manipulate that subset of reality.
I don't think this is an algorithmic limitation, exactly.


Look at the work Deepmind has been doing, particularly with Gato and more recently AutoRT, SARA-RT, RT-Trajectory, UniSim, and Q-transformer. Look at the work being done with the help of Nvidia's new Ro... (read more)

I think this is an important piece of the strategic solution. The lack of control options for open-weight models is a big part of why I think there is a much lower capability threshold for catastrophic danger from an open-weight model.

Ryan Greenblatt
I basically agree with you about transformatively powerful open-weight models making the situation much more dangerous such that control is unlikely to suffice. But, I think there are important control-related options for open-weight models which do improve the situation to some extent.

In the open-weight regime, we realistically won't be able to rule out AIs doing a small to moderate number of arbitrarily bad actions. (Either based on the aims of the humans running the AI or the aims of a scheming AI itself.) So, if the offense-defense balance is such that this is catastrophic, we're absolutely in trouble. (E.g., if transformatively powerful AIs can pretty easily create catastrophically dangerous bioweapons without using much compute or needing very skilled human labor, we're in big trouble unless we can somehow avoid this by restricting access to labs or something.) But, we can still ensure:

* If most compute is controlled by a smaller number of reasonable actors and these actors implement control (or other sufficient strategies), then due to compute limitations, AIs won't be able to do an "arbitrary" number of bad actions.
* Compute limitations might suffice for preventing rogue AIs or rogue humans from rapidly causing an intelligence explosion even if it would otherwise be possible. Thus, we might be able to delay substantially and use this time to improve the situation.
* It might also suffice for avoiding the worst parts of AI-enabled cyber attacks, which might be heavily scale-dependent and not overall offense-dominant.
* We can control the AIs used by this smaller number of actors, which could allow them to be used in otherwise dangerous applications.
* We can try to distribute easy-to-use control approaches widely, and possibly we can pass legal restrictions related to these approaches.
* We can try to aggregate information to figure out if the AI is scheming and, if we learn that it is, we can try to take some countermeasures.

TBC, even if the AI i

There’s no sharp line between the helper AIs of Vision 1 and the truly-autonomous AIs of Vision 2.

 

This post seems like it doesn't quite cleave reality at the joints, from how I'm seeing things. 

Vision 1 style models can be turned into Vision 2 autonomous models very easily. So, as you say, there's no sharp line there.

For me, Vision 3 shouldn't depend on biological neurons. I think it's more like 'brain-like AGI that is so brain-like that it is basically an accurate whole brain emulation, and thus you can trust it as much as you can trust a human... (read more)

Vision 1 style models can be turned into Vision 2 autonomous models very easily

Sure, Vision 1 models can be turned into dangerous Vision 2 models, but they can’t be turned into good Vision 2 models that we want to have around, unless you solve the different set of problems associated with full-fledged Vision 2. For example, in the narrow value learning vs ambitious value learning dichotomy, “narrow” is sufficient for Vision 1 to go well, but you need “ambitious” for Vision 2 to go well. Right?

For me, Vision 3 shouldn't depend on biological neurons. I think

... (read more)

Ok, so this is definitely not a human thing, so probably a bit of a tangent. One of the topics that came up in a neuroscience class once was goose imprinting. There have apparently been studies (see Eckhard Hess for the early ones) showing that the strength of the imprinting onto whatever target (measured by behavior after the close of the critical period) is related to how much running towards the target the baby geese do. The hand-wavey explanation was something like 'probably this makes sense since if you have to run a lot to keep up with your mother... (read more)

I think this is an excellent description of GPT-like models. It both fits with my observations and clarifies my thinking. It also leads me to examine in a new light questions which have been on my mind recently: 

What is the limit of simulation power that our current architectures (with some iterative improvements) can achieve when scaled up (via additional computation, improved datasets, etc.)?

Is a Simulator model really what we want? Can we trust the outputs we get from it to help us with things like accelerating alignment research? What might failure modes look like?

Super handy-seeming intro for newcomers.

I recommend adding Jade Leung to your list of governance people.

As for the list of AI safety people, I'd like to add that there are some people who've written interesting and much-discussed content that it would be worth having some familiarity with.

John Wentworth

Steven Byrnes

Vanessa Kosoy

And personally I'm quite excited about the school of thought developing under the 'Shard theory' banner.

For shard theory info:

https://www.lesswrong.com/posts/xqkGmfikqapbJ2YMj/shard-theory-an-overview 

https://www.alignmentforum... (read more)

I'm excited to participate in this, and feel like the mental exercise of exploring this scenario would be useful for my education on AI safety. Since I'm currently funded by a grant from the Long Term Future Fund for reorienting my career to AI safety, and feel that this would be a reasonable use of my time, you don't need to pay me. I'd be happy to be a full-time volunteer for the next couple weeks.

Edit: I participated and was paid, but only briefly. Turns out I was too distracted thinking and talking about how the process could be improved and the larger... (read more)

Thanks Rohin. I also feel that interviewing after 3 more months of independent work is probably the correct call.

Rohin Shah
Update: I think you should apply now and mention somewhere that you'd prefer to be interviewed in 3 months because in those 3 months you will be doing <whatever it is you're planning to do> and it will help with interviewing.

I'm potentially interested in the Research Engineer position on the Alignment Team, but I'm currently 3 months into a 6-month grant from LTFF to reorient my career from general machine learning to AI safety specifically. My current plan is to keep doing solo work until the last month of my grant period, and then begin applying to AI safety work at places like Anthropic, Redwood Research, OpenAI, and Deepmind.

Do you think there's a significant advantage to applying soon vs 3 months from now?

Rohin Shah
Looking into it, I'll try to get you a better answer soon. My current best guess is that you should apply 3 months from now. This runs an increased risk that we'll have filled all our positions / closed our applications, but also improves your chances of making it through, because you'll know more things and be better prepared for the interviews. (Among other things I'm looking into: would it be reasonable to apply now and mention that you'd prefer to be interviewed in 3 months.)