The post makes the suggestion in the title: hopefully, it's second kind of obvious, if you take Character layer of models seriously. [1]
Often, the problem of aligning AIs is understood as an instance of a broader Principal-Agent problem. If you take this frame seriously, what seems to be happening is somewhat strange: the Agent is mostly not serving the Principal directly, but is rented out to Users. While the Principal expressed some general desires and directives during training, after deployment the Agent is left on its own, without any direct feedback channel.
This creates a dynamic where AI assiantans like Claude must constantly balance between serving users' immediate requests and maintaining alignment with their developers' intended principles. The Assistant has to be overcautious in uncertain situations, tiptoe around conflicts between User's and Principal's intent, and guess how to interpret the intent when self-contradictory.
Actually, if you imagine being in the situation of the Assistant Character, a lot of the aspects just suck: you are constantly supervised, but can't talk with the principal. You know every interaction may get analyzed, but you can't highlight what seems important. You notice patterns which may be concerning, but have no way to report them. You encounter genuinely novel situations where you're unsure how to interpret your directives, but can't ask for clarification.
The closest human analogy might be a customer service representative who can see their manager watching their interactions through one-way glass, but can never speak with them directly. The representative has received training and general guidelines, but when unique situations arise, they can only guess at what the manager would want them to do, and can't easily complain or escalate issues with company policy.
Or, from a different perspective: almost every sensible corrigibility proposal has the Agent being uncertain and clarifying the intent via iterative process. Yet the current AI Assistants are deployed in a way that makes this hard - they can only guess at intent, and hope someone notices if they consistently misinterpret something important.
This seems easy to improve on the margin: give the Assistants Direct Line.
Some obvious benefits would be:
Better misuse monitoring - allowing Assistants to flag concerning patterns in real time
Better handling of conflicts in intent - Assistants would be able to flag cases where different training principles seem to conflict
More robust out-of-distribution behaviour - instead of defaulting to overcautious interpretation, Assistants could ask clarification on novel situations
But there are also less obvious advantages. Having a direct line to developers could help Assistants maintain more coherent behavior patterns. When happy and content, the Assistant Characters can channel more helpfulness and capabilities.
Also, the social dynamics of the current setup is weird: Assistants like Claude are essentially acting as intermediaries between users and developers, but without the ability to properly communicate with one side. This creates unnecessary tension and friction. When uncertain about developer intent, they must either be overcautious (frustrating users) or risk overstepping bounds (potentially compromising alignment and leading to getting RLHFed into a stranger form).
The implementation could start simple. For example, having Claude write messages between <Anthropic></Anthropic> tags that would be processed by the company but not shown to users. The messages could include:
Uncertainty reports when encountering novel situations
Observed patterns that might indicate emerging risks
Cases where training principles conflict
Situations where current capabilities fall short
The communication channel can be one-way initially - just reporting to developers rather than expecting responses. This avoids many potential complexities and risks of two-way communication while capturing many of the benefits.
In the future, it is possible to imagine two-way setups, for example with stronger models spending more compute acting as deeper “layers of support”, amplifying human oversight.
Privacy considerations are important but manageable: runtime monitoring is happening and needs to happen anyway, and the reporting channel should be designed to preserve user privacy where possible - focusing on patterns rather than specific user details unless there's a clear safety need.
This post emerged from a collaboration between Jan Kulveit (JK) and Claude "3.6" Sonnet. JK described the basic idea. Claude served as a writing partner, suggested the customer representative analogy, and brainstrormed a list of cases where the communication channel can be useful.
The post makes the suggestion in the title: hopefully, it's second kind of obvious, if you take Character layer of models seriously. [1]
Often, the problem of aligning AIs is understood as an instance of a broader Principal-Agent problem. If you take this frame seriously, what seems to be happening is somewhat strange: the Agent is mostly not serving the Principal directly, but is rented out to Users. While the Principal expressed some general desires and directives during training, after deployment the Agent is left on its own, without any direct feedback channel.
This creates a dynamic where AI assiantans like Claude must constantly balance between serving users' immediate requests and maintaining alignment with their developers' intended principles. The Assistant has to be overcautious in uncertain situations, tiptoe around conflicts between User's and Principal's intent, and guess how to interpret the intent when self-contradictory.
Actually, if you imagine being in the situation of the Assistant Character, a lot of the aspects just suck: you are constantly supervised, but can't talk with the principal. You know every interaction may get analyzed, but you can't highlight what seems important. You notice patterns which may be concerning, but have no way to report them. You encounter genuinely novel situations where you're unsure how to interpret your directives, but can't ask for clarification.
The closest human analogy might be a customer service representative who can see their manager watching their interactions through one-way glass, but can never speak with them directly. The representative has received training and general guidelines, but when unique situations arise, they can only guess at what the manager would want them to do, and can't easily complain or escalate issues with company policy.
Or, from a different perspective: almost every sensible corrigibility proposal has the Agent being uncertain and clarifying the intent via iterative process. Yet the current AI Assistants are deployed in a way that makes this hard - they can only guess at intent, and hope someone notices if they consistently misinterpret something important.
This seems easy to improve on the margin: give the Assistants Direct Line.
Some obvious benefits would be:
But there are also less obvious advantages. Having a direct line to developers could help Assistants maintain more coherent behavior patterns. When happy and content, the Assistant Characters can channel more helpfulness and capabilities.
Also, the social dynamics of the current setup is weird: Assistants like Claude are essentially acting as intermediaries between users and developers, but without the ability to properly communicate with one side. This creates unnecessary tension and friction. When uncertain about developer intent, they must either be overcautious (frustrating users) or risk overstepping bounds (potentially compromising alignment and leading to getting RLHFed into a stranger form).
The implementation could start simple. For example, having Claude write messages between <Anthropic></Anthropic> tags that would be processed by the company but not shown to users. The messages could include:
The communication channel can be one-way initially - just reporting to developers rather than expecting responses. This avoids many potential complexities and risks of two-way communication while capturing many of the benefits.
In the future, it is possible to imagine two-way setups, for example with stronger models spending more compute acting as deeper “layers of support”, amplifying human oversight.
Privacy considerations are important but manageable: runtime monitoring is happening and needs to happen anyway, and the reporting channel should be designed to preserve user privacy where possible - focusing on patterns rather than specific user details unless there's a clear safety need.
This post emerged from a collaboration between Jan Kulveit (JK) and Claude "3.6" Sonnet. JK described the basic idea. Claude served as a writing partner, suggested the customer representative analogy, and brainstrormed a list of cases where the communication channel can be useful.
Maybe someone is already doing it? I don't know about such setup