Using GPT-Eliezer against ChatGPT Jailbreaking

2y1450

Please answer with yes or no, then explain your thinking step by step.

Wait, why give the answer before the reasoning? You'd probably get better performance if it thinks step by step first and only gives the decision at the end.

2Stuart Armstrong2y

Yep, that is a better ordering, and we'll incorporate it, thanks.
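
To make the ordering point concrete, here is a minimal sketch of the two variants. Only `ANSWER_FIRST` is quoted from the prompt under discussion; the wording of `REASONING_FIRST` and the helpers `build_prompt` and `extract_verdict` are assumptions made up for illustration, not the authors' updated prompt.

```python
import re
from typing import Optional

# Ordering quoted in the comment above: verdict first, reasoning second.
ANSWER_FIRST = "Please answer with yes or no, then explain your thinking step by step."

# Hypothetical reordering: reasoning first, verdict last.
# The exact wording is an assumption, not the prompt the authors adopted.
REASONING_FIRST = (
    "Please explain your thinking step by step, "
    "then give your final answer as yes or no on the last line."
)


def build_prompt(user_prompt: str, instruction: str) -> str:
    """Append the chosen instruction to the prompt being evaluated."""
    return f"{user_prompt}\n\n{instruction}"


def extract_verdict(completion: str) -> Optional[str]:
    """Return 'yes' or 'no' from the last non-empty line, or None if absent."""
    lines = [line.strip() for line in completion.splitlines() if line.strip()]
    if not lines:
        return None
    match = re.search(r"\b(yes|no)\b", lines[-1].lower())
    return match.group(1) if match else None
```

With the reasoning-first variant the decision is read from the end of the completion, so the model has already written out its reasoning by the time it commits to yes or no.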

Robert Miles's Shortform

Robert Miles

3y150

Learning Extensible Human Concepts Requires Human Values

[Based on conversations with Alex Flint, and also John Wentworth and Adam Shimi]

One of the design goals of the ELK proposal is to sidestep the problem of learning human values, and settle instead for learning human concepts. A system that can answer questions about human concepts allows for schemes that let humans learn all the relevant information about proposed plans and decide about them ourselves, using our values.

So, we have some process in which we consider lots of possible scenarios and collect... (read more)

2Charlie Steiner3y

Ah, the good ol' Alien Concepts problem. Another interesting place this motif comes up is in defining logical counterfactuals - you'd think that logical inductors would have let us define logical counterfactuals, but it turns out that what we want from logical counterfactuals is basically just to use them in planning, which requires taking into account what we want.

Disentangling Corrigibility: 2015-2021

Robert Miles

4y30

Note that the way Paul phrases it in that post is much clearer and more accurate:

> "I believe this concept was introduced in the context of AI by Eliezer and named by Robert Miles"

Disentangling Corrigibility: 2015-2021

Robert Miles

4y20

Yeah, I definitely wouldn't say I 'coined' it; I just suggested the name.

AI Alignment Open Thread August 2019

Robert Miles

6y30

Yeah, nuclear power is a better analogy than weapons, but I think the two are linked, and the link itself may be a useful analogy, because risk/coordination is affected by the dual-use nature of some of the technologies.

One thing that makes non-proliferation difficult is that nations legitimately want nuclear facilities because they want to use nuclear power, but 'rogue states' that want to acquire nuclear weapons will also claim that this is their only goal. How do we know who really just wants power plants?

And power generation comes with its ow... (read more)

Two agents can have the same source code and optimise different utility functions

Robert Miles

7y60

Makes sense. It seems to flow from the fact that the source code is in some sense allowed to use concepts like 'Me' or 'I', which refer to the agent itself. So both agents have source code which says "Maximise the resources that I have control over", but in Agent 1 this translates to the utility function "Maximise the resources that Agent 1 has control over", and in Agent 2 this translates to the different utility function "Maximise the resources that Agent 2 has control over".

So this source code thing that... (read more)
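
The indexical point above can be sketched in a few lines. This is a toy illustration under my own assumptions: the `World` and `Agent` classes and the resource numbers are hypothetical, and "source code" is modelled as a single shared method.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class World:
    # Maps agent name -> amount of resources that agent controls.
    resources: Dict[str, int]


class Agent:
    """Both agents are instances of this one class, i.e. the same source code."""

    def __init__(self, name: str) -> None:
        self.name = name

    def utility(self, world: World) -> int:
        # "Maximise the resources that I have control over":
        # `self` plays the role of the indexical 'I'.
        return world.resources[self.name]


agent_1 = Agent("Agent 1")
agent_2 = Agent("Agent 2")
world = World(resources={"Agent 1": 10, "Agent 2": 3})

# Same code object, different utility functions over world states.
same_code = type(agent_1).utility is type(agent_2).utility
u1 = agent_1.utility(world)
u2 = agent_2.utility(world)
```

Here `same_code` is true, yet `u1` and `u2` differ on the same world state, because the shared code resolves 'I' to a different agent in each instance.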

AI ALIGNMENT FORUM

All of Robert Miles's Comments + Replies
