Would you be excited if someone devised an approach to detect the sleeper agents' backdoor without knowing anything about it in advance? Or are you less interested in that and more interested in methods that remove the backdoor through safety training once we've identified it? Maybe both are interesting?
Here’s my takeaway:
There are mechanistic reasons for humanity’s “Sharp Left Turn” with respect to evolution. Humans were bottlenecked by knowledge transfer between generations, and the cultural revolution allowed us to pass our lifetime learnings on to the next generation instead of waiting on the slow process of natural selection.
Current AI development is not bottlenecked in the same way and is therefore highly unlikely to undergo a sharp left turn for the same reason. Ultimately, evolution analogies can lead to bad unconscious assumptions with no rigor...
Indeed! When I looked into model editing with the end goal of “retargeting the search”, the finickiness and breakdown of internal computations was what eventually updated me away from continuing to pursue this. I haven’t read these maze posts in detail yet, but the fact that editing the internal computations doesn’t ruin the network is surprising and makes me consider spending time in this direction again.
I’d like to eventually think of similar experiments to run with language models. For example, you could have a language model learn to solve a text adventure game, and then try to edit the model in ways similar to these posts.
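Here’s a very rough sketch of the kind of edit I’m imagining, assuming GPT-2 via HuggingFace transformers. It mirrors the add-an-activation-diff trick from the maze posts; the layer index, steering coefficient, and prompts are placeholders I made up for illustration:

```python
# Rough sketch: steer GPT-2 by adding an activation-difference vector to one
# block's residual-stream output. All specifics below are placeholders.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

LAYER = 6    # which block to steer (placeholder)
COEFF = 4.0  # steering strength (placeholder)

def residual_after_block(prompt):
    """Residual-stream activations after block LAYER, at the prompt's last token."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids, output_hidden_states=True).hidden_states
    return hidden[LAYER + 1][0, -1]  # hidden_states[i + 1] = output of block i (non-final layers)

# Steering vector from two contrasting prompts (placeholders for game commands).
steer = residual_after_block(" go north") - residual_after_block(" go south")

def add_steering(module, inputs, output):
    # GPT2Block returns a tuple; output[0] holds the hidden states.
    return (output[0] + COEFF * steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("You are standing in a dark room. You decide to", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```

In the text-adventure setup, the contrast prompts would come from game commands, and the check would be whether the steered model’s chosen actions shift the way the maze agent’s behavior did.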
Edit: just realized that the next post might be with GPT-2. Exciting!
As in, the ratio (interpretable capabilities / total capabilities) still asymptotes to zero, but the number of interpretable capabilities goes up (and then maybe back down) as the models gain more capabilities?
Any additional or new thoughts on this? Is your last comment saying that you simply think it's very unlikely that the model would unintentionally leave out information that will kill us, provided we train it with human labelers and prompt it sufficiently? Do you believe the only likely way we'd be unable to prompt things out of the model is if it were deceptive? Could you say more?
Separately: suppose I have a chain-of-thought model detailing the steps it will take to reach outcome X, and we've fine-tuned it on previous chains of thought while giving process-lev...
I'm wondering what you think we can learn from approaches like ROME. For those who don't know, ROME focuses on editing factual knowledge (e.g. making the model believe the Eiffel Tower is now in Rome), and I'm curious how we could take it beyond factual knowledge. ROME uses causal tracing to find the parts of the model that most influence a specific fact (rough sketch of the tracing idea below).
What if we tried to do something similar to find which parts of the model impact the search the most? How would we retarget the search in practice? And in the lead-up to more powerful models, what are the experiments ...
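For concreteness, here's a minimal sketch of the causal-tracing idea, assuming GPT-2 via HuggingFace transformers. This is my own illustration rather than the ROME authors' code: the prompt, noise scale, and subject token positions are placeholders, a real run would average over many noise samples, and the paper also sweeps individual MLP/attention sites rather than whole-block outputs:

```python
# Rough causal-tracing sketch: noise the subject token embeddings, then restore one
# block's clean output at a time and see how much of the original prediction returns.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

torch.manual_seed(0)
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

prompt = "The Eiffel Tower is located in the city of"
inputs = tok(prompt, return_tensors="pt")
subject_positions = [1, 2, 3, 4]  # assumed to cover "Eiffel Tower"; check tok.tokenize(prompt)

# Clean run: record every block's output and the model's top next-token prediction.
clean_block_out = {}
save_handles = [
    model.transformer.h[i].register_forward_hook(
        lambda mod, inp, out, i=i: clean_block_out.__setitem__(i, out[0].detach().clone())
    )
    for i in range(model.config.n_layer)
]
with torch.no_grad():
    clean_logits = model(**inputs).logits
for h in save_handles:
    h.remove()
target_id = clean_logits[0, -1].argmax().item()

def corrupted_prob(restore_layer=None):
    """P(clean prediction) with noised subject embeddings, optionally restoring
    one block's clean output at the subject positions."""
    handles = []

    def noise_embeddings(mod, inp, out):
        out = out.clone()
        out[0, subject_positions] += 0.5 * torch.randn_like(out[0, subject_positions])
        return out

    handles.append(model.transformer.wte.register_forward_hook(noise_embeddings))

    if restore_layer is not None:
        def restore(mod, inp, out):
            hidden = out[0].clone()
            hidden[0, subject_positions] = clean_block_out[restore_layer][0, subject_positions]
            return (hidden,) + out[1:]
        handles.append(model.transformer.h[restore_layer].register_forward_hook(restore))

    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits[0, -1], dim=-1)
    for h in handles:
        h.remove()
    return probs[target_id].item()

baseline = corrupted_prob()
for layer in range(model.config.n_layer):
    print(f"layer {layer:2d}: recovery {corrupted_prob(restore_layer=layer) - baseline:+.4f}")
```

The hope would be to run the same kind of trace with prompts that differ in the goal being pursued rather than in a fact, though I don't know yet whether that would localize anything as cleanly.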
I believe all of those posts can be found on the Alignment Forum, so luckily they are included in the dataset (at least from what I remember after checking a handful of the posts). I had begun scraping from those sources, but realized halfway through that they were already on AF.
We could definitely look into evolving the project in this direction. In fact, we're building a dataset of alignment-related texts, and a small part of the dataset is a scrape of the arXiv papers extracted from the Alignment Newsletter. We're working towards building GPT models fine-tuned on the texts.
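To give a flavor of that part of the pipeline, here's a minimal sketch of pulling paper metadata from the public arXiv API, assuming the paper IDs have already been extracted from the newsletter. The ID below is just an example, not our actual list:

```python
# Rough sketch: fetch title + abstract for a list of arXiv IDs via the public Atom API.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def fetch_arxiv_metadata(arxiv_ids):
    """Return [{'id', 'title', 'abstract'}, ...] for the given arXiv IDs."""
    query = urllib.parse.urlencode(
        {"id_list": ",".join(arxiv_ids), "max_results": len(arxiv_ids)}
    )
    with urllib.request.urlopen(f"http://export.arxiv.org/api/query?{query}") as resp:
        feed = ET.fromstring(resp.read())
    return [
        {
            "id": entry.findtext(f"{ATOM}id"),
            "title": " ".join(entry.findtext(f"{ATOM}title").split()),
            "abstract": " ".join(entry.findtext(f"{ATOM}summary").split()),
        }
        for entry in feed.findall(f"{ATOM}entry")
    ]

# Example ID (Concrete Problems in AI Safety); the real pipeline uses the scraped list.
print(fetch_arxiv_metadata(["1606.06565"]))
```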
I’m working on this. I’m unsure whether I should be sharing exactly what I’m working on with a frontier AGI lab, though. How can we be sure this just leads to differentially accelerating alignment?
Edit: my main consideration is when I should start mentioning details. As in, should I wait until I’ve made progress on alignment internally before sharing with an AGI lab? Not sure what people are disagreeing with, since I didn't make a statement.