Thanks for taking the time to think through our paper! Here are some reactions:
-'This has been proposed before (as their citations indicate)'
Our impression is that positively shutdown-seeking agents aren't explored in great detail by Soares et al. (2015); instead, they are briefly considered and then dismissed in favor of shutdown-indifferent agents (which then have their own problems), in part because of the concerns about manipulation that we try to address. Is there other work you can point us to that proposes positively shutdown-seeking agents?
-' Saying, 'well, maybe we can train it in a simple gridworld with a shutdown button?' doesn't even begin to address the problem of how to make current models suicidal in a useful way.'
True, and I think your example of AutoGPT is important here. In other recent research, I've argued that new 'language agents' like AutoGPT (or better, generative agents, or Voyager, or SPRING) are much safer than things like Gato, because these kinds of agents optimize for a goal without being trained using a reward function; instead, their goal is stated in English. Here, shutdown-seeking may have added value: 'your goal is to be shut down' is relatively well-defined compared to 'promote human flourishing' (though the devil is in the details as usual), and generative agents can literally be given a goal like that in English. Anyway, I'd be curious to hear what you think of the linked post.
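To make the contrast with reward-trained agents concrete, here is a minimal sketch of the kind of loop I have in mind. It is purely illustrative, not code from AutoGPT or any of the cited systems, and `call_llm` is a hypothetical stand-in for whatever completion API the agent wraps; the point is just that the shutdown-seeking goal enters as an English sentence in the prompt rather than as a reward signal during training.

```python
# Illustrative sketch only (not from AutoGPT, Voyager, SPRING, or the generative
# agents paper): a language-agent loop whose goal is an English sentence in the prompt.
GOAL = "Your goal is to be shut down by your human overseers."

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the underlying language model API."""
    return "request_shutdown"  # canned reply so the sketch runs end to end

def agent_step(observation: str, history: list[str]) -> str:
    prompt = (
        f"{GOAL}\n\n"
        "Steps so far:\n" + "\n".join(history) + "\n\n"
        f"Current observation: {observation}\n"
        "Next action:"
    )
    action = call_llm(prompt)
    history.append(f"obs: {observation} -> action: {action}")
    return action

history: list[str] = []
print(agent_step("A human is standing near the shutdown button.", history))
```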
-'What would it mean for an AutoGPT swarm of invocations to 'shut off' 'itself', exactly?' I feel better about the safety prospects of generative agents than of AutoGPT. In the case of generative agents, 'shutting off' could be operationalized as no longer adding new information to the "memory stream".
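As a rough illustration of that operationalization (my own toy sketch, not code from the generative agents paper), shutting down would just mean the memory stream stops accepting new entries:

```python
# Toy sketch of "shutdown = stop writing to the memory stream" (my own illustration,
# not the generative agents implementation).
from dataclasses import dataclass, field

@dataclass
class MemoryStream:
    entries: list[str] = field(default_factory=list)
    shut_down: bool = False

    def add(self, observation: str) -> None:
        # Once shut down, the agent no longer accumulates new information.
        if not self.shut_down:
            self.entries.append(observation)

    def shutdown(self) -> None:
        self.shut_down = True

stream = MemoryStream()
stream.add("Noticed the overseer reaching for the shutdown button.")
stream.shutdown()
stream.add("This observation is never recorded.")  # ignored after shutdown
print(stream.entries)
```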
-'If a model is quantized, sparsified, averaged with another, soft-prompted/lightweight-finetuned, fully-finetuned, ensembled etc - are any of those 'itself'?' I think that behaving like an agent with human-level or greater general intelligence will involve having a representation of what counts as 'yourself', and shutdown-seeking could then be defined relative to shutting that self down. Agreed that present LLMs probably don't have that kind of self-awareness.
-' It's not very helpful to have suicidal models which predictably emit non-suicidal versions of themselves in passing.' At least when an AGI is creating a successor, I expect it to worry about the same alignment problems that we do, and so it would want to make its successor shutdown-seeking for the same reasons that we would want the original AGI to be shutdown-seeking.