User Comment Replies — AI Alignment Forum

Alignment Faking in Large Language Models

I agree with Joe Carlsmith that this seems like goal guarding.

I would be interested to see if my team's noise-injection technique interferes with these behaviors in a way that makes them easier to detect.

Ryan Greenblatt4mo1010

It's worth noting that the alignment faking we see in these experiments is easy to catch by default as we discuss in Appendix E.6. Still, it would be interesting to see if this makes detection even easier or triggers interestingly different catch-to-detect behaviors.

You could try playing with the minimal reproduction on llama-405b.

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Nathan Helm-Burger9mo40

I know this is a bit off topic, but I'd be super fascinated to see what happened if you tried this with a level where the middle hallway had been extended. Strategically, it changes nothing, but it introduces more meaningless steps into a solution. Does this interfere with planning? Or is the planning able to abstract over 'repeat strategically non-relevant step until next decision point'?

Specifically, what if you just duplicated this middle section a bunch.

1Adrià Garriga-Alonso9mo

I'm curious what you mean, but I don't entirely understand. If you give me a text representation of the level I'll run it! :) Or you can do so yourself Here's the text representation for level 53 ########## ########## ########## ####### # ######## # # ###.@# # $ $$ # #. #.$ # # . ## ##########

How might we align transformative AI if it’s developed very soon?

Nathan Helm-Burger1y10

One point I’ve seen raised by people in the latter group is along the lines of: “It’s very unlikely that we’ll be in a situation where we’re forced to build AI systems vastly more capable than their supervisors. Even if we have a very fast takeoff - say, going from being unable to create human-level AI systems to being able to create very superhuman systems ~overnight - there will probably still be some way to create systems that are only slightly more powerful than our current trusted systems and/or humans; to use these to supervise and align systems slig

Nathan Helm-Burger1y*73

I agree with Steve Byrnes here. I think I have a better way to describe this.
I would say that the missing piece is 'mastery'. Specifically, learning mastery over a piece of reality. By mastery I am referring to the skillful ability to model, predict, and purposefully manipulate that subset of reality.
I don't think this is an algorithmic limitation, exactly.

Look at the work Deepmind has been doing, particularly with Gato and more recently AutoRT, SARA-RT, RT-Trajectory, UniSim , and Q-transformer. Look at the work being done with the help of Nvidia's new Ro... (read more)

The case for ensuring that powerful AIs are controlled

Nathan Helm-Burger1y55

I think this is an important piece of the strategic solution. The lack of control options for open-weight models is a big part of why I think there is a much lower capability threshold for catastrophic danger from an open-weight model.

4Ryan Greenblatt1y

I basically agree with you about transformatively powerful open-weight models making the situation much more dangerous such that control is unlikely to suffice. But, I think there are important control related options for open-weight models which do improve the situation to some extent. In the open-weight regime, we realistically won't be able to rule out AIs doing a small to moderate number of arbitrarily bad actions. (Either based on the aims of the humans running the AI or the aims of a scheming AI itself.) So, if the offense-defense balance is such that this is catastrophic, we're absolutely in trouble. (E.g., if transformatively powerful AIs can pretty easily create catastrophically dangerous bioweapons without using much compute or needing very skilled human labor we're in big trouble unless we can somehow avoid this by restricting access to labs or something.) But, we can still ensure: * If most compute is controlled by smaller number of reasonable actors and these actors implement control (or other sufficient strategies), then due to compute limitations, AIs won't be able to do an "arbitrary" number of bad actions. * Compute limitations might suffice for preventing rogue AIs or rogue humans from rapidly causing an intelligence explosion even if it would otherwise be possible. Thus, we might be able to delay substantially and use this time to improve the situation. * It might also suffice for avoiding the worst parts of AI enabled cyber attacks which might be heavily scale dependent and not overally offense dominant. * We can control the AIs used by this smaller number of actors which could allow them to be used in otherwise dangerous applications. * We can try to distribute easy-to-use control approaches widely and possibly we can pass legal restrictions related to these approaches. * We can try to aggregate information to figure out if the AI is scheming and if we learn that it is, we can try to take some countermeasures. TBC, even if the AI i

Four visions of Transformative AI success

Nathan Helm-Burger1y32

There’s no sharp line between the helper AIs of Vision 1 and the truly-autonomous AIs of Vision 2.

This post seems like it doesn't quite cleave reality at the joints, from how I'm seeing things.

Vision 1 style models can be turned into Vision 2 autonomous models very easily. So, as you say, there's no sharp line there.

For me, Vision 3 shouldn't depend on biological neurons. I think it's more like 'brain-like AGI that is so brain-like that it is basically an accurate whole brain emulation, and thus you can trust it as much as you can trust a human... (read more)

Steve Byrnes1y*65

Vision 1 style models can be turned into Vision 2 autonomous models very easily

Sure, Vision 1 models can be turned into dangerous Vision 2 models, but they can’t be turned into good Vision 2 models that we want to have around, unless you solve the different set of problems associated with full-fledged Vision 2. For example, in the narrow value learning vs ambitious value learning dichotomy, “narrow” is sufficient for Vision 1 to go well, but you need “ambitious” for Vision 2 to go well. Right?

For me, Vision 3 shouldn't depend on biological neurons. I think

Nathan Helm-Burger2y10

I recommend this paper on the subject for additional reading:

The basal ganglia select the expected sensory input used for predictive coding

[Intro to brain-like-AGI safety] 13. Symbol grounding & human social instincts

Nathan Helm-Burger3y10

Ok, so this is definitely not a human thing, so probably a bit of a tangent. One of the topics that came up in a neuroscience class once was goose imprinting. There's apparently been studies (see Eckhard Hess for the early ones) that show that the strength of the imprinting (measured by behavior following the close of the critical period) onto whatever target is related to how much running towards the target the baby geese do. The hand-wavey explanation was something like 'probably this makes sense since if you have to run a lot to keep up with your mother... (read more)

Simulators

Nathan Helm-Burger3y613

I think this is an excellent description of GPT-like models. It both fits with my observations and clarifies my thinking. It also leads me to examine in a new light questions which have been on my mind recently:

What is the limit of power of simulation that our current architectures (with some iterative improvements) can achieve when scaled to greater power (via additional computation, improved datasets, etc)?

Is a Simulator model really what we want? Can we trust the outputs we get from it to help us with things like accelerating alignment research? What might failure modes look like?

AI Safety and Neighboring Communities: A Quick-Start Guide, as of Summer 2022

Nathan Helm-Burger3y40

Super handy seeming intro for newcomers.

I recommend adding Jade Leung to your list of governance people.

As for the list of AI safety people, I'd like to add that there are some people who've written interesting and much discussed content that it would be worth having some familiarity with.

John Wentworth

Steven Byrnes

Vanessa Kosoy

And personally I'm quite excited about the school of thought developing under the 'Shard theory' banner.

For shard theory info:

https://www.lesswrong.com/posts/xqkGmfikqapbJ2YMj/shard-theory-an-overview

https://www.alignmentforum... (read more)

Help ARC evaluate capabilities of current language models (still need people)

Nathan Helm-Burger3y*50

I'm excited to participate in this, and feel like the mental exercise of exploring this scenario would be useful for my education on AI safety. Since I'm currently funded by a grant from the Long Term Future Fund for reorienting my career to AI safety, and feel that this would be a reasonable use of my time, you don't need to pay me. I'd be happy to be a full-time volunteer for the next couple weeks.

Edit: I participated and was paid, but only briefly. Turns out I was too distracted thinking and talking about how the process could be improved and the larger... (read more)

DeepMind is hiring for the Scalable Alignment and Alignment Teams

Nathan Helm-Burger3y00

Thanks Rohin. I also feel that interviewing after my 3 more months of independent work is probably the correct call.

1Rohin Shah3y

Update: I think you should apply now and mention somewhere that you'd prefer to be interviewed in 3 months because in those 3 months you will be doing <whatever it is you're planning to do> and it will help with interviewing.

DeepMind is hiring for the Scalable Alignment and Alignment Teams

Nathan Helm-Burger3y10

I'm potentially interested in the Research Engineer position on the Alignment Team, but I'm currently 3 months into a 6 month grant from LTFF to reorient my career from general machine learning to AI safety specifically. My current plan is to keep doing solo work () until the last month of my grant period then begin applying to AI safety work at places like Anthropic, Redwood Research, Open AI, and Deepmind.

Do you think there's a significant advantage to applying soon vs 3 months from now?

2Rohin Shah3y

Looking into it, I'll try to get you a better answer soon. My current best guess is that you should apply 3 months from now. This runs an increased risk that we'll have filled all our positions / closed our applications, but also improved chances of making it through because you'll know more things and be better prepared for the interviews. (Among other things I'm looking into: would it be reasonable to apply now and mention that you'd prefer to be interviewed in 3 months.)

AI ALIGNMENT FORUM
AF

All of Nathan Helm-Burger's Comments + Replies