Does any specific human or group of humans currently have "control" in the sense of "that which is lost in a loss-of-control scenario"? If not, that indicates to me that it may be useful to frame the risk as "failure to gain control".
A purely backwards-looking, model-free approach will be outperformed by, and selected against in favor of, an agent that has evolved to implement a more model-based approach, which can look forward and plan based on observations
Why do you think that a model-based approach will outperform a model-free approach? We may just be using words differently here: I'm using "model-based" to mean "maintains an explicit model of its environment, its available actions, and the anticipated effects of those actions, and then performs whichever action its world model anticipates will lead to the best outcome".
Imagine our stamp collector is trained using meta-learning. 100 stamp collectors are trained in parallel and the inner loop, which uses gradient descent, updates their weights every 10 days. Every 50 days, the outer loop takes the 50 best-performing stamp collectors and copies their weights over to the 50 worst-performing stamp collectors. In doing so, the outer loop selects non-myopic models that maximize stamps over all days.
Nit (or possibly major disagreement): I don't think this training regime gets you a stamp maximizer. I think this regime gets you an agent that keeps executing the behaviors which produced stamps during training, whether or not those behaviors still produce stamps.
Why do you think that? A purely backwards-looking, model-free approach will be outperformed by, and selected against in favor of, an agent that has evolved to implement a more model-based approach, which can look forward and plan based on observations to immediately maximize future reward, rather than being forced to wait for rewards/selection to happen and incurring predictable losses before it can finally stop executing behaviors-that-used-to-stamp-maximize-but-now-no-longer-do-so-for-easily-predicted-reasons.
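For reference, here is a minimal sketch of the training regime being discussed; the population size and update cadences come from the quoted description, while the policy network, environment, and per-day objective are placeholders:

```python
import torch

POP_SIZE = 100     # stamp collectors trained in parallel
INNER_EVERY = 10   # inner loop: gradient update every 10 days
OUTER_EVERY = 50   # outer loop: selection every 50 days
TOTAL_DAYS = 500

# Placeholder policies; the description doesn't specify an architecture.
population = [torch.nn.Linear(8, 4) for _ in range(POP_SIZE)]
optimizers = [torch.optim.SGD(p.parameters(), lr=1e-2) for p in population]
stamps = [0.0] * POP_SIZE  # cumulative stamps over all days

def one_day(policy):
    """Placeholder environment: return (stamps collected today, loss)."""
    scores = policy(torch.randn(8))
    return float(scores.max()), -scores.max()

for day in range(1, TOTAL_DAYS + 1):
    for i, policy in enumerate(population):
        stamps_today, loss = one_day(policy)
        stamps[i] += stamps_today
        if day % INNER_EVERY == 0:  # inner loop: gradient descent
            optimizers[i].zero_grad()
            loss.backward()
            optimizers[i].step()
    if day % OUTER_EVERY == 0:      # outer loop: selection on total stamps
        ranked = sorted(range(POP_SIZE), key=lambda j: stamps[j], reverse=True)
        best, worst = ranked[: POP_SIZE // 2], ranked[POP_SIZE // 2 :]
        for b, w in zip(best, worst):
            # Copy the best performers' weights over the worst performers'.
            population[w].load_state_dict(population[b].state_dict())
```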
I found an even dumber approach that works. The approach is as follows: split an English sentence and its French translation into n aligned fragments, and for each i from 0 to n, make an English->French sentence by taking the first i fragments in English and the rest in French. The resulting sentences look like the output of the sketch below.
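A minimal sketch of the construction; the sentence pair and fragment boundaries here are made up for illustration:

```python
# Illustrative aligned fragments; a real dataset would supply these.
english = ["I went", "to the store", "yesterday"]
french = ["Je suis allé", "au magasin", "hier"]
n = len(english)

for i in range(n + 1):
    # First i fragments in English, the rest in French.
    print(" ".join(english[:i] + french[i:]))
```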
For me at least, I mentally model the right column as something "a better GPT" could probably describe how to do, if given the correct prompt. For example, let's say that I have just sequenced a sample of Equine herpesvirus 1, and I want to know what the protein encoded by ORF46 does. I feed GPT the following:
Our team identified the likely function of the protein encoded by ORF46 of the Equine herpesvirus 1 (EHV-1) through the use of machine-learning-based tools. Specifically, we
and then GPT spits out
...used the protein sequence of ORF46 to predict its
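The mechanics of that prompt-completion pattern are easy to reproduce today with an open stand-in model; a minimal sketch (gpt2 here is just a placeholder and, unlike the hypothesized "better GPT", will not produce a useful answer):

```python
from transformers import pipeline

# gpt2 is a stand-in; the comment's "better GPT" is hypothetical.
generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Our team identified the likely function of the protein encoded by "
    "ORF46 of the Equine herpesvirus 1 (EHV-1) through the use of "
    "machine-learning-based tools. Specifically, we"
)
completion = generator(prompt, max_new_tokens=40, do_sample=True)
print(completion[0]["generated_text"])
```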
If the untrusted model has a private, unaudited chain of thought (like o1), and that chain of thought is preserved between rounds, the untrusted model may have a very good idea of what its original command was before the auditor replaced it.
Your "the untrusted model is stateless between queries and only sees the command history and system state" does cover this, but wanted to call out that that assumption might not hold if current architectures are indicative of the future direction of development.
It would be crazy to hide that CoT from the monitor.