If what you’re saying is “any change to the distribution will change KL” — I think that’s just correct.
This also applies to changes during training where the model is learning to perform better on the objective task.
So we are expecting some amount of KL divergence already.
My claims are:
The cheapest places to hide information (in terms of KL cost) are the places where the model already has high entropy (ie it is uncertain between many possible outputs) -- see the sketch below this list
optimization pressure will try to push this extra information into the cheapest places to hide
the increase in KL won’t be clearly distinguishable from the increase due to increased performance on the task
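A toy numerical illustration of the first claim (the 4-token vocabulary, the distributions, and the encoding scheme are all made up for illustration): steering toward one of several already-plausible tokens costs little KL, while steering toward a token the model considered unlikely costs a lot.

```python
import numpy as np

def kl_bits(p, q):
    """KL(p || q) in bits; assumes q has full support, skips zero-mass p entries."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Base-model next-token distributions over 4 candidate tokens.
high_entropy = np.array([0.25, 0.25, 0.25, 0.25])   # model is genuinely uncertain
low_entropy  = np.array([0.97, 0.01, 0.01, 0.01])   # model is nearly certain

def encode_bit(base, bit, strength=0.9):
    """Hypothetical steganographic policy: put `strength` mass on token 0 or 1
    depending on the hidden bit, rescaling the rest of the base distribution."""
    out = base * (1 - strength) / (1 - base[bit])
    out[bit] = strength
    return out / out.sum()

for name, base in [("high entropy", high_entropy), ("low entropy", low_entropy)]:
    # Average KL paid (relative to the base model) to encode one hidden bit here.
    cost = np.mean([kl_bits(encode_bit(base, b), base) for b in (0, 1)])
    print(f"{name}: avg KL cost per hidden bit = {cost:.2f} bits")
```

The high-entropy position comes out cheaper per hidden bit, which is the sense in which optimization pressure should prefer to hide information there.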
I think I understand what you're saying, but I want to double check and try laying it out explicitly.
I think I agree with both of these points, but here's my thinking for why I still expect to see this phenomenon (and why the article above was simplified to just be "human")
I think this is worth considering even in cases of things like recursive prompting for a couple of reasons that I haven't organized:
Putting what I think is the most important part of my reply first: I think research into mitigations is premature and instead demonstrating/measuring the phenomenon should take priority.
However, given that, I think I agree that these are all possible mitigations to the phenomenon, in particular (rephrasing your points):
Agree that founders are a bit of an exception. That's actually a part of the longer version of this that I go into when I talk about it in person.
Basically: "The only people who at the very top of large tech companies are either founders or those who were able to climb to the tops of moral mazes".
So my strategic corollary to this is that it's probably weakly better for AI alignment for founders to be in charge of companies longer, and to get replaced less often.
In the case of Facebook, even in the face of all of their history of actions, I think on the margin I'd pr...
I think there should be a norm about adding the BIG-bench canary string to any document describing AI evaluations in detail, where you wouldn't want it to be inside that AI's training data.
Maybe in the future we'll have a better tag for "don't train on me", but for now the BIG-bench canary string is the best we have.
This is in addition to things like "maybe don't post it to the public internet" or "maybe don't link to it from public posts" or other ways of ensuring it doesn't end up in training corpora.
I think this is a situation for defense-in-depth.
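As a rough sketch of the corpus-filtering side of this (this is not any official tooling, and the actual canary GUID should be copied verbatim from the BIG-bench repository rather than from here):

```python
# Drop any document containing the BIG-bench canary marker when assembling a
# training corpus. Substitute the full, exact canary string from the BIG-bench repo.
CANARY_MARKER = "canary GUID"  # the real string also includes the specific GUID

def filter_canaried_documents(documents):
    """Yield only documents that do not carry the canary marker."""
    for doc in documents:
        if CANARY_MARKER not in doc:
            yield doc

corpus = [
    "ordinary web text ...",
    "Detailed eval spec ... canary GUID <guid copied from the BIG-bench repo> ...",
]
print(list(filter_canaried_documents(corpus)))  # keeps only the first document
```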
AGI will probably be deployed by a Moral Maze
Moral Mazes is my favorite management book ever, because instead of "how to be a good manager" it's about "empirical observations of large-scale organizational dynamics involving management".
I wish someone would write an updated version -- a lot has changed (though a lot has stayed the same) since the research for the book was done in the early 1980s.
My take (and the author's take) is that any company of nontrivial size begins to take on the characteristics of a moral maze. It seems to be a pretty good nul...
Thanks so much for making this!
I'm hopeful this sort of dataset will grow over time as new sources come about.
In particular, I'd nominate adding MLSN (https://www.alignmentforum.org/posts/R39tGLeETfCZJ4FoE/mlsn-4-many-new-interpretability-papers-virtual-logit) to the list of newsletters in the future.
This seems like an overly alarmist take on what is a pretty old trend of research. Six years ago there were a number of universities working on similar models for the VizDoom competition (IIRC they were won by Intel and Facebook). It seems good to track this kind of research, but IMO the conclusions here are not supported at all by the evidence presented.
I think your explanation of legibility here is basically what I have in mind, except that if it's human-designed it's potentially not all-encompassing. (For example, a world model that knows very little, but knows how to search for information in a library)
I think interpretability is usually a bit more narrow, and refers to developing an understanding of an illegible system. My take is that it is not "interpretability" to understand a legible system, but maybe I'm using the term differently than others here. This is why I don't think "...
Two Graphs for why Agent Foundations is Important (according to me)
Epistemic Signpost: These are high-level abstract reasons, and I don’t go into precise detail or gears-level models. The lack of rigor is why I’m shortform-ing this.
First Graph: Agent Foundations as Aligned P2B Fixpoint
P2B (a recursive acronym for Plan to P2B Better) is a framing of agency as a recursively self-reinforcing process. It resembles an abstracted version of recursive self-improvement, which also incorporates recursive empowering and recursive resource gathering.
Maybe useful: an analogy this post brought to mind for me is replacing “AI” with “Animals”.
Imagine a hypothetical alien civilization observing early Earth and commenting on whether it poses a risk:
Doesn’t optimization in nature produce non-agentic animals? It mostly does, but those aren’t the ones we’re concerned with. The risk is all concentrated in the agentic animals.
Basically every animal ever is not agentic. I’ve studied animals for my entire career and I haven’t found an agentic animal yet. That doesn’t preclude them showing up in the futur...
Hacking the Transformer Prior
Neural Network Priors
I spend a bunch of time thinking about the alignment of the neural network prior for various architectures of neural networks that we expect to see in the future.
Whatever alignment failures are highly likely under the neural network prior are probably worth a lot of research attention.
Separately, it would be good to figure out knobs/levers for changing the prior distribution to be more aligned (or produce more aligned models). This includes producing more interpretable models.
Analogy to Software Devel...
I think there’s a lot going on with your conflating the speed prior over circuits with a speed prior over programs.
I think a lot of the ideas in this direction are either confused by the difference between circuit priors and program priors, or are at least treating the two as equivalent. Unfortunately a lot of this is vague until you start specifying the domain of the model. I think specifying this more clearly will help with communicating about these ideas. To start with this myself, when I talk about circuit induction, I’m talking about things th...
Interpretability Challenges
Inspired by a friend I've been thinking about how to launch/run interpretability competitions, and what the costs/benefits would be.
I like this idea a lot because it cuts directly at one of the hard problems of spinning up in interpretability research as a new person. The field is difficult and the objectives are vaguely defined; it's easy to accidentally trick yourself into seeing signal in noise, and there's never certainty that the thing you're looking for is actually there.
On the other hand, most of the interpretability...
I wish more of the language model alignment research folks were looking into how current proposals for aligning transformers end up working on S4 models.
(I am one of said folks, so it's maybe hypocritical of me not to work on it)
In particular, it seems like there are ways in which they would be more interpretable than transformers:
It's worth probably going through the current deep learning theories that propose parts of gears-level models, and see how they fit with this. The first one that comes to mind is the Lottery Ticket Hypothesis. It seems intuitive to me that certain tasks correspond to some "tickets" that are harder to find.
I like the taxonomy in Viering and Loog, and it links to a bunch of other interesting approaches.
This paper shows phase transitions in data quality as opposed to data size, which is an angle I hadn't considered before.
There's the google pa...
Decomposing Negotiating Value Alignment between multiple agents
Let's say we want two agents to come to agreement on living with each other. This seems pretty complex to specify; they agree to take each other's values into account (somewhat), not destroy each other (with some level of confidence), etc.
Neither initially has total dominance over the other. (This implies that neither is corrigible to the other)
A good first step is for each agent to share its values with the other. While this could be intractably complex -- it's probably ...
I'm really excited about this research direction. It seems so well-fit to what you've been researching in the past -- so much so that it doesn't seem to be a new research direction so much as a clarification of the direction you were already pursuing.
I think producing a mostly-coherent and somewhat-nuanced generalized theory of alignment would be incredibly valuable to me (and I would consider myself someone working on prosaic alignment strategies).
A common thread in the last year of my work on alignment is something like "How can I be an aligned intellig...
I think that's right that upgraded verification by itself is insufficient for 'defense wins' worlds. I guess I'd thought that was apparent but you're right it's definitely worth saying explicitly.
A big wish of mine is that we end up doing more planning/thinking-things-through for how researchers working on AI today could contribute to 'defense wins' progress.
My implicit other take here that wasn't said out loud is that I don't really know of other pathways where good theorem proving translates to better AI x-risk outcomes. I'd be eager to know of these.
Copying some brief thoughts on what I think about working on automated theorem proving relating to working on aligned AGI:
I’m excited to see more of this. As the field grows (in funding, people, etc) it seems like there’s a lot more room to grow the portfolio of bets of Alignment Approaches, and brain-based methods seem to be interesting enough to allocate to.
I think I’ve been underwhelmed by past progress in this direction, but that doesn’t preclude someone coming in and finding a tractable angle to start grinding away at.
Things I'm most interested in (personally and selfishly):
Thanks for taking the time to explain this!
I feel mostly confused by the way that things are being framed. ELK is about the human asking for various poly-sized fragments and the model reporting what those actually were instead of inventing something else. The model should accurately report all poly-sized fragments the human knows how to ask for.
I think this is what I was missing. I was incorrectly thinking of the system as generating poly-sized fragments.
So if there are different poly fragments that the human would evaluate differently, is ELK just "giving them a fragment such that they come to the correct conclusion", even if the fragment might not be the right piece?
E.g. in the SmartVault case, if the screen was put in the way of the camera and the diamond was secretly stolen, we would still be successful even if we didn't elicit that fact, but instead elicited some poly fragment that got the human to answer disapprove?
Like the thing that seems weird to me here is that you can't simultaneously require tha...
Okay now I have to admit I am confused.
Re-reading the ELK proposal -- it seems like the latent knowledge you want to elicit is non-obfuscated.
Like, the situation to solve is that there is a piece of non-obfuscated information, which, if the human knew it, would change their mind about approval.
How do you expect solutions to elicit latent obfuscated knowledge (like 'the only true explanation is incomprehensible to the human' situations)?
Thinking more about ELK. Work in progress, so I expect I will eventually figure out what's up with this.
Right now it seems to me that Safety via Debate would elicit compact/non-obfuscated knowledge.
So the basic scenario is that in addition to SmartVault, you'd have Barrister_Approve and Barrister_Disapprove, who are trying to share evidence/reasoning which makes the human approve or disapprove of SmartVault scenarios.
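Concretely, a minimal sketch of the loop I'm imagining (the barrister objects and the human judge here are hypothetical interfaces, not anything implemented):

```python
def debate_round(scenario, barrister_approve, barrister_disapprove, human_judge, n_turns=4):
    """One debate over a SmartVault scenario: the barristers alternate surfacing
    evidence/reasoning for approval vs. disapproval, then the human judges."""
    transcript = []
    for turn in range(n_turns):
        speaker = barrister_approve if turn % 2 == 0 else barrister_disapprove
        transcript.append(speaker.present_evidence(scenario, transcript))
    # The hope is that whichever verdict is reached, the transcript has surfaced
    # the compact (non-obfuscated) knowledge that actually drives the human's call.
    return human_judge(scenario, transcript)
```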
The biggest weakness of this that I know of is Obfuscated Arguments -- that is, it won't elicit obfuscated knowledge.
It seems like in t...
Some disorganized thoughts about adversarial ML:
I think this broadly makes sense to me. There are many cases where "the model is pretending to be dumb" feels appropriate.
This is part of why building evaluations and benchmarks for this sort of thing is difficult.
I'm at least somewhat optimistic about doing things like data-prefixing to allow for controls over things like "play dumb for the joke" vs "give the best answer", using techniques that build on human feedback.
I personally have totally seen GPT-3 fail to give a really good answer on a bunch of tries a bunch of times, but I spend a lot of time looking at its outputs and analyzing them. It seems important to be wary of the "seems to be dumb" failure modes.
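To illustrate the data-prefixing idea above (the control tags and the tiny dataset here are made up, not anything we actually use):

```python
# Tag each training example with the intended behavior, so that at inference time
# the prefix can select "give the best answer" vs "play dumb for the joke".
def add_control_prefix(example):
    tag = "<|best_answer|>" if example["high_quality"] else "<|play_dumb|>"
    return f"{tag} {example['prompt']} {example['completion']}"

training_data = [
    {"prompt": "Q: What is 12 * 13?", "completion": "A: 156", "high_quality": True},
    {"prompt": "Q: What is 12 * 13?", "completion": "A: a billion!", "high_quality": False},
]
print([add_control_prefix(ex) for ex in training_data])
# At inference, prepend <|best_answer|> to ask for the model's best attempt.
```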
I like this a lot. It does seem to show a pretty clear failure of reasoning on the part of the language models.
It's pretty intuitive and easy to show people what this kind of shallow-pattern-matcher failure looks like.
Meta: I hope more people do more small experiments like this and share their results, and hopefully a few of them start getting put into benchmark/evaluations.
Even this one by itself might be good to make some form of benchmark/evaluation on (in ways that are easy/automatic to evaluate future models on).
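For instance, a minimal sketch of what an easy/automatic evaluation could look like (the prompts, expected answers, and model interface are placeholders):

```python
def run_eval(generate, cases):
    """`generate` maps a prompt string to a completion string; returns accuracy."""
    results = [expected.lower() in generate(prompt).lower() for prompt, expected in cases]
    return sum(results) / len(results)

cases = [
    ("A is bigger than B. B is bigger than C. Is A bigger than C? Answer yes or no:", "yes"),
    ("All blickets are frobs. X is a blicket. Is X a frob? Answer yes or no:", "yes"),
]
# accuracy = run_eval(my_model.generate, cases)  # re-run unchanged on future models
```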
The language model is just predicting text. If the model thinks an author is stupid (as evidenced by a stupid prompt) then it will predict stupid content as the followup.
To imagine that it is trying to solve the task of "reasoning without failure" is to project our contextualized common sense on software built for a different purpose than reasoning without failure.
This is what unaligned software does by default: exactly what its construction and design cause it to do, whether or not the constructive causes constrain the software's behavior to be help...
This seems pretty well done! Some thoughts on future research in this direction:
... is just an arbitrary thing not to do.
I think this is the crux-y part for me. My basic intuition here is something like "it's very hard to get contemporary prosaic LMs to not do a thing they already do (or have high likelihood of doing)", and this intuition points me in the direction of thinking that "conditionally training them to only do that thing in certain contexts" is instead easier in a way that matters.
My intuitions are based on a bunch of assumptions that I have access to and probably some that I don't.
Like, I'm basically only thinking about large langu...
I like the main point; I hadn't considered it before with value learning. Trying to ask myself why I haven't been worried about this sort of failure mode before, I get the following:
It seems all of the harms to humans the value-learner causes are from some direct or indirect interaction with humans, so instead I want to imagine a pre-training step that learns as much as possible about human values from existing sources (internet, books, movies, etc) without interacting with humans.
Then as a second step, this value-learner is now allowed to interact with humans i...
I'm pretty confident that adversarial training (or any LM alignment process which does something like hard-mining negatives) won't work for aligning language models or any model that has a chance of being a general intelligence.
This has led to me calling these sorts of techniques 'thought policing' and the negative examples 'thoughtcrime' -- I think these are unnecessarily extra, but they work.
The basic form of the argument is that any concept you want to ban as thoughtcrime, can be composed out of allowable concepts.
Take for example Redwood Rese...
Thanks for writing this up, I found this summary much more clear and approachable than the paper. I also basically agree with your own points, with the caveat that I think the distinction between curiosity and curriculum gets blurry in meta-learning contexts like this. I wish there were better metrics and benchmarks for data efficiency in this regard, and then we could do things like measure improvements in units of that metric.
I’m pretty pessimistic about this line of research for a number of reasons, that I think support and complement the re...
I expect to have more detailed thoughts worth sharing as I spend more time with this content, but one thing stands out brightly as a first: This is, head-and-shoulders, the best language model interpretability work to date. I'm impressed at the thoroughness of the theory combined with detailed real examples.
This also seems like a good motivation to go back and study layer reordering (à la Sandwich Transformers) as a treatment affecting the induced circuits of a model.
(h/t Kevin Wang for pointing out the sandwich transformer paper to me recently)
One of the most important deployed applications of machine learning at this point would be web search, so papers relating to that (PageRank, etc) would probably score highly.
I'd expect some papers in spam filtering (which was pretty important / interesting as a machine learning topic at the time) to maybe meet the threshold.
TD-Gammon would probably qualify in the world of RL https://en.wikipedia.org/wiki/TD-Gammon
DistBelief just barely predates that, and since it's basically directly in the lineage of modern deep learning, I think it might qualify https://en.wik...
I really appreciated this -- it felt like better access to the 'MIRI strategic viewpoint' than I think I've had in the past.
I found it more clarifying than updating for me, with a couple of tangible exceptions:
First, I found that I really like the example of the response to coronavirus as an example of trillion-dollar warning shots. I think I've also previously agreed that responses to past disasters have been better, but the more recent example should be (all else equal) more informative than the older ones.
Second, this point about factored cognition:
...Paul
I feel overall confused, but I think that's mostly because of me missing some relevant background to your thinking, and the preliminary/draft nature of this.
I hope sharing my confusions is useful to you. Here they are:
I'm not sure how the process of "spending bits" works. If the space of possible models were finite and discretized, then you could say spending B bits is partitioning down to a 1/2^B-th of the space -- but this is not at all how SGD works, and seems incompatible with using SGD (or any optimizer that doesn't 'teleport' through paramete...
Epistemic status: I’m not really an expert at NLP. I’ve only been working on language modeling for ~8mo, which is much less than some of the folks here, and this is based on my experiences.
Beam Search:
Beam search with large unsupervised generatively pretrained transformers (GPTs) is weirder than it appears in the NLP literature. Other commenters have mentioned degeneracies, but for me the sticking points for beam search were:
I work at OpenAI on safety. In the past it seems like there's been a gap between what I'd consider to be the alignment topics that need to be worked on, and the general consensus of this forum. A good friend poked me to write something for this, so here I am.
Topics w/ strategies/breakdown:
I think at this point these feel like empirical questions, which I think would be much more clearly answered by demonstrations or experiments.
Trying to encode an additional penalty on changing non-semantic information is an interesting idea.
However I think you're missing that you don't have the ability to directly compare to a reference LM in cases where you're training to improve on some performance benchmark. During training the model will change its predictions on everything to some degree (both semantic and nonsemantic content).
So your proposed d...
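(For concreteness: the standard way to "compare to a reference LM" during fine-tuning is a per-token KL penalty against a frozen copy of the original model, sketched below with placeholder tensors. The problem is that this single scalar charges for semantic and non-semantic drift alike, which is why it can't isolate the non-semantic channel:)

```python
import torch.nn.functional as F

def kl_to_reference_penalty(policy_logits, reference_logits):
    """Per-token KL(policy || reference), averaged over batch and sequence.
    Both tensors have shape [batch, seq, vocab]. The penalty grows whenever the
    fine-tuned model's distribution drifts from the frozen reference LM,
    whether that drift is semantic (task improvement) or non-semantic."""
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    reference_logp = F.log_softmax(reference_logits, dim=-1)
    kl = (policy_logp.exp() * (policy_logp - reference_logp)).sum(dim=-1)
    return kl.mean()

# total_loss = task_loss + beta * kl_to_reference_penalty(policy_logits, ref_logits)
```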