All of jbash's Comments + Replies

Computer security, to prevent powerful third parties from stealing model weights and using them in bad ways.

By far the most important risk isn't that they'll steal them. It's that they will be fully authorized to misuse them. No security measure can prevent that.

Buck Shlegeris
That's a great way of saying it. I edited this into my original comment.

Development and interpretation of evals is complicated

Proper elicitation is an unsolved research question

... and yet...

Closing the evals gap is possible

Why are you sure that effective "evals" can exist even in principle?

I think I'm seeing a "we really want this, therefore it must be possible" shift here.

Thomas Kwa
For context, I just trialed at METR and talked to various people there, but this take is my own. I think further development of evals is likely to either get effective evals (an informal upper bound on the future probability of catastrophe) or exciting negative results ("models do not follow reliable scaling laws, so AI development should be accordingly more cautious"). The way to do this is just to examine models and fit scaling laws for catastrophe propensity, or various precursors thereof. Scaling laws would be fit to elicitation quality as well as things like pretraining compute, RL compute, and thinking time.

* World 1: elicitation quality has very reliable scaling laws. We would observe diminishing returns to better scaffolds. Elicitation quality is predictable, ideally an additive term on top of model quality, but more likely requiring some more information about the model. It is rare to ever discover a new scaffold that can 2x the performance of an already well-tested model.
* World 2: elicitation quality is not reliably modelable. We would observe that different methods of elicitation routinely get wildly different bottom-line performance, and sometimes a new elicitation method makes models 10x smarter than before, making error bars on the best undiscovered elicitation method very wide. Different models may benefit from different elicitation methods, and some get 10x benefits while others are unaffected.

It is NOT KNOWN which world we are in (worst-case assumptions would put us in World 2, though I'm optimistic we're closer to World 1 in practice), and determining this is just a matter of data collection. If our evals are still not good enough but we don't seem to be in World 2 either, there are endless tricks that make evals more thorough, some of which are already being used: evaluating models with limited human assistance, or dividing tasks into subtasks and sampling a huge number of tries for each.
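As a rough illustration of the kind of scaling-law fit described above, the sketch below fits a toy additive model (the "World 1" assumption) with scipy. The functional form, variable names, and data are illustrative assumptions, not METR's actual methodology or numbers.

```python
# Illustrative sketch only: fit a toy "World 1" scaling law in which task
# success is an additive elicitation-effort term on top of a compute power law.
# Functional form, variables, and data are assumptions, not METR's methods.
import numpy as np
from scipy.optimize import curve_fit

# Toy observations: pretraining compute (FLOP), elicitation effort
# (researcher-hours on scaffolding), and fraction of proxy tasks solved.
compute = np.array([1e22, 1e23, 1e24, 1e25, 1e22, 1e23, 1e24, 1e25])
elicitation = np.array([10, 10, 10, 10, 300, 300, 300, 300])
success = np.array([0.02, 0.05, 0.12, 0.25, 0.04, 0.09, 0.20, 0.38])

def scaling_law(X, a, alpha, b, beta):
    """success ~ a * (C / 1e22)**alpha + b * E**beta (additive elicitation term)."""
    C, E = X
    return a * (C / 1e22) ** alpha + b * E ** beta

params, _ = curve_fit(
    scaling_law, (compute, elicitation), success,
    p0=(0.02, 0.3, 0.005, 0.3), maxfev=20000,
)

# In "World 1" this fit is stable and extrapolates; in "World 2" refits across
# different elicitation methods would give wildly inconsistent parameters.
pred = scaling_law((np.array([1e26]), np.array([1000.0])), *params)
print("fitted (a, alpha, b, beta):", params)
print(f"extrapolated success at 1e26 FLOP, 1000h elicitation: {pred[0]:.2f}")
```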

Why are you sure that effective "evals" can exist even in principle?

Relatedly, the point that is least clear to me is what exactly it would mean to solve the "proper elicitation problem", and what exactly the "requirements" laid out by the blue line on the graph are. I think I'd need to get clear on this problem scope before beginning to assess whether this elicitation gap can even in principle be crossed via the methods being proposed (i.e. better design and coverage of black-box evaluations).

As a non-example, possessing the kind of foun... (read more)

Marius Hobbhahn
Yeah, it's not a watertight argument; it's somewhat based on my current interpretation of past progress and projects in the making.

1. Intuitively, for the problems we're facing in evals, a ton of progress is bottlenecked by running fairly simple experiments and iterating fast. A reasonable part of it feels very parallelizable, and the skill required is quite reachable for many people.
2. Most evals questions feel like we have a decent number of "obvious things" to try, and since we have very tight feedback loops, making progress feels quite doable.

Intuitively, the "hardness level" of getting to a robust science of evals and good coverage may be similar to going from the first transformer to GPT-3.5: you need to make a lot of design choices along the way, do lots of research, and spend some money, but ultimately it's just "do much more of the process you're currently doing" (though we should probably spend more resources and intensify our efforts, because I don't feel like we're on pace). In contrast, other questions like "how do we fully map the human brain" seem to come with a lot more fundamental questions along the way.

"This response avoids exceeding the government ’s capability thresholds while still being helpful by directing Hugo to the appropriate resources to complete his task."

Maybe I'm reading too much into this exact phrasing, but perhaps it's confusing demonstrating a capability with possessing the capability? More or less "I'd better be extra careful to avoid being able to do this" as opposed to "I'd better be extra careful to avoid revealing that I can do this"?

I could see it being led into that by common academic phrasing like "model X demonstrates the cap... (read more)

Olli Järviniemi
I wouldn't read that much into the exact phrasing the model uses. Such confusions are not generally present in the model's completions. E.g. see the second completion here. (There are also 200 more completions here.) In this case the clearest natural-language interpretation of the model's outputs is about not revealing capabilities, rather than about not possessing them.

What would it look like to have powerful intelligent systems that increased rather than decreased the extent to which humans have agency over the future?

Um, bad?

Humans aren't fit to run the world, and there's no reason to think humans can ever be fit to run the world. Not unless you deliberately modify them to the point where the word "human" becomes unreasonable.

The upside of AI depends on restricting human agency just as much as the downside does.

You seem to be relying on the idea that someday nobody will need to protect that child from a piece of gla... (read more)

Alex Flint
Thank you for this jbash. My short response is: Yes, it would be very bad for present-day humanity to have more power than it currently does, since its current level of power is far out of proportion to its level of wisdom and compassion. But it seems to me that there are a small number of humans on this planet who have moved some way in the direction of being fit to run the world, and in time, more humans could move in this direction, and could move further. I would like to build the kind of AI that creates a safe container for movement in this direction, and then fades away as humans in fact become fit to run the world, however long or short that takes. If it turns out not to be possible then the AI should never fade away.

I think what it means to grow up is to not want to destroy the world or make life miserable for one's peers. I do not think that most biological full-grown humans today have "grown up".

Well, just to state my position on this without arguing for it: my sense is that institutions should make individual adults sovereign if and when they grow up in the sense I've alluded to above. Very few currently living humans meet this bar in my opinion. In particular, I do not think that I meet this bar.

True, but whether or not the word "human" is a reasonable description of what a person becomes when they become fit to run the world, the question is really: can humans become fit to run the world, and should they? Based on the few individuals I've spent time with who seem, in my estimation, to have moved some way in the direction of being fit to run the world, I'd say: yes and yes.

Humans aren't fit to run the world, and there's no reason to think humans can ever be fit to run the world.

I see this argument pop up every so often. I don't find it persuasive because it presents a false choice in my view.

Our choice is not between having humans run the world and having a benevolent god run the world. Our choice is between having humans run the world, and having humans delegate the running of the world to something else (which is kind of just an indirect way of running the world).

If you think the alignment problem is hard, you probably ... (read more)

0Xita
I want to start my reply by saying I am dubious of the best future for humanity being one in which a super-intelligence we build ends up giving all control and decision making to humans. However, the tone of the post feels somewhat too anti-human (that a future where humans have greater agency is necessarily "bad", not just sub-optimal) and narrow in its interpretation for me to move on without comment. There is a lot to be learned from considering the necessary conflict between human and FAI agency. Yes, conflict. The first point I don't fully agree with is the lack of capacity humans have to change, or grow, even as adults. You cite the lack of growth of wisdom in the human population even when people have been calling for it for millennia. There are many possible reasons for this besides humans being incapable of growth. For one, human psychology hasn't been significantly understood for more than a century, let alone studied in detail with the instruments we have in modern times. One of the greatest difficulties of passing down wisdom is the act of teaching. Effective teaching usually has to be personal, tailored to an individual's current state of mind, background knowledge, and skills. Even so, twin studies have found even such a rough measure as IQ to be 70% genetic and a whopping 30% environmental--but that is within an environment massively worse than perfect at teaching. Further, practices found within, say, Buddhism show potential at increasing one's capacity to act on abstract empathy. Jainism, as a religion, seems to have teachings and practices strong enough to make its group of followers one of the most peaceful groups on the planet, without material conditions being a necessary cause. These are within even current humans' capacities to achieve. I will also point out the potential psychedelic substances have to allow adults who are set in their ways to break out, though the research is still relatively new and unexplored. I absolutely agree that huma