All of A Ray's Comments + Replies

I think at this point these feel like empirical questions, which I think would be much more clearly answered by demonstrations or experiments.

Trying to encode an additional penalty on changing non-semantic information is an interesting idea.

However, I think you're missing that you don't have the ability to directly compare to a reference LM in cases where you're training to improve on some performance benchmark.  During training the model will change its predictions on everything to some degree (both semantic and non-semantic content).

So your proposed d... (read more)

If what you’re saying is “any change to the distribution will change KL” — I think that’s just correct.

This also applies to changes during training where the model is learning to perform better on the objective task.

So we are expecting some amount of KL divergence already.

My claims are:

  1. The cheapest places to hide information (due to KL) are places where the model already has high entropy (i.e. it is uncertain between many possible outputs); see the toy numeric sketch after this list

  2. Optimization pressure will try to push this extra information into the cheapest places to hide

  3. The increase in KL won’t be clearly distinguishable from the increase due to improved performance on the task
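
To make claim 1 concrete, here is a toy numeric sketch (mine, with made-up two-token distributions, not from any experiment): the average KL penalty for encoding one hidden bit is much lower at a position where the reference model is already uncertain than at one where it is confident.

```python
import numpy as np

def kl(q, p):
    """KL(q || p) in nats: the per-token penalty for a tuned model q deviating from reference p."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def cost_per_bit(p_ref):
    """Average KL cost of deterministically emitting token 0 or token 1,
    depending on a uniformly random hidden bit."""
    return 0.5 * kl([1.0, 0.0], p_ref) + 0.5 * kl([0.0, 1.0], p_ref)

low_entropy = [0.98, 0.02]   # reference model is confident about the next token
high_entropy = [0.50, 0.50]  # reference model is indifferent between two synonyms

print(cost_per_bit(low_entropy))   # ~1.97 nats per hidden bit
print(cost_per_bit(high_entropy))  # ~0.69 nats per hidden bit (ln 2, the information-theoretic floor)
```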

2Quintin Pope
I think it probably works out in the other direction. The easiest way to encode problem-relevant info is by changing the conceptual content of the generated text, not weird changes to the grammatical rules. This is because KL penalties and cross entropy loss actually prioritize modeling exact grammatical patterns over conceptual content. E.g., GPTs will quickly learn to generate ~perfect grammar, but are slow to pick up on / produce coherent conceptual content. Another framing is to think of the KL-regularized capabilities training as Bayes updating of the model’s priors on the distribution over text. The model should have much stronger priors on grammatical rules than on higher level concepts. So, the update probably changes conceptual patterns more than grammatical ones. Can’t you just compare the token predictions of the new model against the reference LM? Then you can check if they disagree on weird grammatical patterns or on conceptual tokens that are related to a human-like understanding of how to tackle the problem. Maybe you can even bias the KL penalty to apply higher weight to stopwords, punctuation, and synonym choices?
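
One way the last suggestion (biasing the KL penalty toward stopwords, punctuation, and synonym choices) could be prototyped; this is a sketch of mine, and the function names, shapes, and weighting scheme are assumptions rather than anything from the thread:

```python
import torch
import torch.nn.functional as F

def weighted_kl_penalty(logits_new, logits_ref, token_weights):
    """Per-position KL(new || ref), scaled by a per-position weight.
    token_weights could upweight positions holding stopwords/punctuation/synonyms.
    Shapes: logits_* are (batch, seq, vocab); token_weights is (batch, seq)."""
    logp_new = F.log_softmax(logits_new, dim=-1)
    logp_ref = F.log_softmax(logits_ref, dim=-1)
    kl_per_pos = (logp_new.exp() * (logp_new - logp_ref)).sum(dim=-1)  # (batch, seq)
    return (token_weights * kl_per_pos).mean()
```

In an RLHF-style setup this term would be added to the task loss with some coefficient; the per-token weighting itself is the speculative part.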

I think I understand what you're saying, but I want to double check and try laying it out explicitly.

  • The "null space" is actually the null space of the prior model, which is human-like, but not exactly lined up to human perception of language
  • Because of this, the information isn't "hidden" w.r.t. human perceptions, it's "hidden" w.r.t. the language model prior

I think I agree with both of these points, but here's my thinking for why I still expect to see this phenomenon (and why the article above was simplified to just be "human")

  • Empirically drawing many samples
... (read more)
0Quintin Pope
Apologies for my first comment. I was tired, and most of what I said about KL divergence is actually wrong. The core issue is that you can't actually avoid the KL penalty by hiding in the nullspace. That's just not how KL divergence works. A KL divergence of zero implies that the two LMs specify exactly the same distribution over tokens. If you deviate away from the reference LM's prior in any manner, then you take a KL penalty.  E.g., if the reference LM is indifferent between the phrases “My stomach is grumbling” or “I could use a snack”, but the steganographic LM always uses “My stomach is grumbling”, then this deviation is also penalized by KL divergence, regardless of whether the reference LM's indifference between the two phrasings means you call this part of the LM's "nullspace".

I think this is worth considering even in cases of things like recursive prompting for a couple of reasons that I haven't organized:

  • In general additional compute time lets you improve your answer, so by default I'd expect answers to any significant question to take multiple steps
    • I agree for trivial questions it might not do any of this.  My expectation is that AI deployment impacts will mostly be on things like non-trivial questions or tasks
  • So far current research hasn't done a great deal of regularizing or penalizing models for using a lot of thinki
... (read more)

Putting what I think is the most important part of my reply first: I think research into mitigations is premature and instead demonstrating/measuring the phenomena should take priority.

However, given that, I think I agree that these are all possible mitigations to the phenomenon, in particular (rephrasing your points, with a toy capacity check after the list):

  • Reducing the channel capacity, and thus the amount of information that can be transmitted
    • Constraining the outputs, and thus reducing the degrees of freedom of the null space
    • Adding noise (via dropout or otherwise) also reduces the channel capaci
... (read more)
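
As a toy check on the noise point above (my own illustration; the numbers are arbitrary): if the hidden channel behaves like a binary symmetric channel, its capacity drops quickly as the flip probability induced by the noise grows.

```python
import math

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(flip_prob):
    """Capacity (bits/symbol) of a binary symmetric channel with the given flip probability."""
    return 1.0 - binary_entropy(flip_prob)

for p in [0.0, 0.05, 0.2, 0.5]:
    print(p, round(bsc_capacity(p), 3))  # 1.0, 0.714, 0.278, 0.0
```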
1Hoagy
Yeah I agree demonstrating it and examining it is very important, especially in chain-of-thought where the existence of hidden information isn't a certainty. They're a bit out of date but the original proposal is ELK Sub - Note-taking in internal rollouts and the multiple model protocol is in Note Taking Without Hidden Messages. Haven't done any proper testing but it's high up on my agenda, would be interested in working out what tests would be best - will elaborate later.

Agree that founders are a bit of an exception.  Actually that's a bit in the longer version of this when I talk about it in person.

Basically: "The only people who at the very top of large tech companies are either founders or those who were able to climb to the tops of moral mazes".

So my strategic corollary to this is that it's probably weakly better for AI alignment for founders to be in charge of companies longer, and to get replaced less often.

In the case of facebook, even in the face of all of their history of actions, I think on the margin I'd pr... (read more)

I think there should be a norm about adding the big-bench canary string to any document describing AI evaluations in detail, where you wouldn't want it to be inside that AI's training data.

Maybe in the future we'll have a better tag for "don't train on me", but for now the big-bench canary string is the best we have.

This is in addition to things like "maybe don't post it to the public internet" or "maybe don't link to it from public posts" or other ways of ensuring it doesn't end up in training corpora.

I think this is a situation for defense-in-depth.

2Daniel Kokotajlo
What is the canary exactly? I'd like to have a handy reference to copy-paste that I can point people to. Google fails me.

AGI will probably be deployed by a Moral Maze

Moral Mazes is my favorite management book ever, because instead of "how to be a good manager" it's about "empirical observations of large-scale organizational dynamics involving management".

I wish someone would write an updated version -- a lot has changed (though a lot has stayed the same) since the research for the book was done in the early 1980s.

My take (and the author's take) is that any company of nontrivial size begins to take on the characteristics of a moral maze.  It seems to be a pretty good nul... (read more)

1Ivan Vendrov
Agreed, but Silicon Valley wisdom says founder-led and -controlled companies are exceptionally dynamic, which matters here because the company that deploys AGI is reasonably likely to be one of those. For such companies, the personality and ideological commitments of the founder(s) are likely more predictive of external behavior than properties of moral mazes. Facebook's pivot to the "metaverse", for instance, likely could not have been executed by a moral maze. If we believed that Facebook / Meta was overwhelmingly likely to deploy one of the first AGIs, I expect Mark Zuckerberg's beliefs about AGI safety would be more important to understand than the general dynamics of moral mazes. (Facebook example deliberately chosen to avoid taking stances on the more likely AGI players, but I think it's relatively clear which ones are moral mazes).

Thanks so much for making this!

I'm hopeful this sort of dataset will grow over time as new sources come about.

In particular, I'd nominate adding MLSN (https://www.alignmentforum.org/posts/R39tGLeETfCZJ4FoE/mlsn-4-many-new-interpretability-papers-virtual-logit) to the list of newsletters in the future.

1Ethan Perez
Yes super excited about datasets like this! It might be helpful to also add https://ai-alignment.com/ or https://paulfchristiano.medium.com/ if these aren't already in the data

This seems like an overly alarmist take on what is a pretty old trend of research.  Six years ago there were a number of universities working on similar models for the VizDoom competition (IIRC they were won by Intel and Facebook).  It seems good to track this kind of research, but IMO the conclusions here are not supported at all by the evidence presented.

Do you have suggestions for domains where you do expect one-turn debate to work well, now that you've got these results?

1Sam Bowman
I have no reason to be especially optimistic given these results, but I suppose there may be some fairly simple questions for which it's possible to enumerate a complete argument in a way that flaws will be clearly apparent. In general, it seems like single-turn debate would have to rely on an extremely careful judge, which we don't quite have, given the time constraint. Multi-turn seems likely to be more forgiving, especially if the judge has any influence over the course of the debate.

I think your explanation of legibility here is basically what I have in mind, except that if it's human-designed it's potentially not all-encompassing.  (For example, a world model that knows very little, but knows how to search for information in a library)

I think interpretability is usually a bit more narrow, and refers to developing an understanding of an illegible system.  My take is that it is not "interpretability" to understand a legible system, but maybe I'm using the term differently than others here.  This is why I don't think "... (read more)

2Steve Byrnes
In what sense? Your world-model is built out of ~100 trillion synapses, storing all sorts of illegible information including “the way my friend sounds when he talks with his mouth full” and “how it feels to ride a bicycle whose gears need lubrication”. That seems very different though! The GPT-3 source code is rather compact (gradient descent etc.); combine it with data and you get a huge and extraordinarily complicated illegible world-model (or just plain “model” in the GPT-3 case, if you prefer). Likewise, the human brain has a learning algorithm that builds a world-model. The learning algorithm is (I happen to think) a compact easily-human-legible algorithm involving pattern recognition and gradient descent and so on. But the world-model built by that learning algorithm is super huge and complicated. Sorry if I’m misunderstanding. I’ll try to walk through why I think “coming up with new concepts outside what humans have thought of” is required. We want an AGI to be able to do powerful things like independent alignment research and inventing technology. (Otherwise, it’s not really an AGI, or at least doesn’t help us solve the problem that people will make more dangerous AGIs in the future, I claim.) Both these things require finding new patterns that have not been previously noticed by humans. For example, think of the OP that you just wrote. You had some idea in your head—a certain visualization and associated bundle of thoughts and intuitions and analogies—and had to work hard to try to communicate that idea to other humans like me. Again, sorry if I’m misunderstanding.

Two Graphs for why Agent Foundations is Important (according to me)

Epistemic Signpost: These are high-level abstract reasons, and I don’t go into precise detail or gears-level models.  The lack of rigor is why I’m shortform-ing this.

First Graph: Agent Foundations as Aligned P2B Fixpoint

P2B (a recursive acronym for Plan to P2B Better) is a framing of agency as a recursively self-reinforcing process.  It resembles an abstracted version of recursive self improvement, which also incorporates recursive empowering and recursive resource gathering. ... (read more)

3Steve Byrnes
RE legibility: In my mind, I don’t normally think there’s a strong connection between agent foundations and legibility. If the AGI has a common-sense understanding of the world (which presumably it does), then it has a world-model, full of terabytes of information of the sort “tires are usually black” etc. It seems to me that the world-model will be either built by humans (e.g. Cyc), or (much more likely) learned automatically by an algorithm, and if it’s the latter, it will be unlabeled by default, and it’s on us to label it somehow, and there’s no guarantee that every part of it will be easily translatable to human-legible concepts (e.g. the concept of “superstring” would be hard to communicate to a person in the 19th century). But everything in that paragraph above is “interpretability”, not “agent foundations”, at least in my mind. By contrast, when I think of “agent foundations”, I think of things like embedded agency and logical induction and so on. None of these seem to be related to the problem of world-models being huge and hard-to-interpret. Again, world-models must be huge and complicated, because the world is huge and complicated. World-models must have hard-to-translate concepts, because we want AGI to come up with new ideas that have never occurred to humans. Therefore world-model interpretability / legibility is going to be a big hard problem. I don’t see how “better understanding the fundamental nature of agency” will change anything about that situation. Or maybe you’re thinking “at least let’s try to make something more legible than a giant black box containing a mesa-optimizer”, in which case I agree that that’s totally feasible, see my discussion here.

Maybe useful: an analogy this post brought to mind for me: Replacing “AI” with “Animals”.

Hypothetical alien civilization, observing Early Earth and commenting on whether it poses a risk.

Doesn’t optimization in nature produce non-agentic animals?  It mostly does, but those aren’t the ones we’re concerned with.  The risk is all concentrated in the agentic animals.

Basically every animal ever is not agentic.  I’ve studied animals for my entire career and I haven’t found an agentic animal yet.  That doesn’t preclude them showing up in the futur... (read more)

Hacking the Transformer Prior

Neural Network Priors

I spend a bunch of time thinking about the alignment of the neural network prior for various architectures of neural networks that we expect to see in the future.

Whatever alignment failures are highly likely under the neural network prior are probably worth a lot of research attention.

Separately, it would be good to figure out knobs/levers for changing the prior distribution to be more aligned (or produce more aligned models).  This includes producing more interpretable models.

Analogy to Software Devel... (read more)

3Matthew "Vaniver" Gray
I'm pretty sure you mean functions that perform tasks, like you would put in /utils, but I note that on LW "utility function" often refers to the decision theory concept, and "what decision theoretical utility functions are present in the neural network prior" also seems like an interesting (tho less useful) question.

I think there’s a lot going on with your equating the speed prior over circuits with a speed prior over programs.

I think a lot of the ideas in this direction are either confused by the difference between circuit priors and program priors, or at least treat them as equivalent.  Unfortunately a lot of this is vague until you start specifying the domain of the model.  I think specifying this more clearly will help with communicating about these ideas.  To start with this myself, when I talk about circuit induction, I’m talking about things th... (read more)

Interpretability Challenges

Inspired by a friend I've been thinking about how to launch/run interpretability competitions, and what the costs/benefits would be.

I like this idea a lot because it cuts directly at one of the hard problems of spinning up in interpretability research as a new person.  The field is difficult and the objectives are vaguely defined; it's easy to accidentally trick yourself into seeing signal in noise, and there's never certainty that the thing you're looking for is actually there.

On the other hand, most of the interpretability... (read more)

I wish more of the language alignment research folks were looking into how current proposals for aligning transformers end up working on S4 models.

(I am one of said folks so maybe hypocritical to not work on it)

In particular it seems like there are ways in which it would be more interpretable than transformers:

  • adjustable timescale stepping (either sub-stepping, or super-stepping time)
  • approximately separable state spaces/dynamics -- this one is crazy conjecture -- it seems like it should be possible to force the state space and dynamics into separate groups, i
... (read more)

It's worth probably going through the current deep learning theories that propose parts of gears-level models, and see how they fit with this.  The first one that comes to mind is the Lottery Ticket Hypothesis.  It seems intuitive to me that certain tasks correspond to some "tickets" that are harder to find.

I like the taxonomy in Viering and Loog, and it links to a bunch of other interesting approaches.

This paper shows phase transitions in data quality as opposed to data size, which is an angle I hadn't considered before.

There's the google pa... (read more)

Decomposing Negotiating Value Alignment between multiple agents

Let's say we want two agents to come to agreement on living with each other.  This seems pretty complex to specify; they agree to take each other's values into account (somewhat), not destroy each other (with some level of confidence), etc.

Neither initially has total dominance over the other.  (This implies that neither is corrigible to the other)

A good first step for these agents is to share each one's values with the other.  While this could be intractably complex -- it's probably ... (read more)

I'm really excited about this research direction. It seems so well-fit to what you've been researching in the past -- so much so that it doesn't seem to be a new research direction so much as a clarification of the direction you were already pursuing.

I think producing a mostly-coherent and somewhat-nuanced generalized theory of alignment would be incredibly valuable to me (and I would consider myself someone working on prosaic alignment strategies).

A common thread in the last year of my work on alignment is something like "How can I be an aligned intellig... (read more)

2Alex Flint
Yeah I agree! It seems that AI alignment is not really something that any existing disciplines is well set up to study. The existing disciplines that study human values are generally very far away from engineering, and the existing disciplines that have an engineering mindset tend to be very far away from directly studying human values. If we merely created a new "subject area" that studies human values + engineering under the standard paradigm of academic STEM, or social science, or philosophy, I don't think it would go well. It seems like a new discipline/paradigm is innovation at a deeper level of reality. (I understand adamShimi's work to be figuring out what this new discipline/paradigm really is.) Interesting! I hadn't thought of habit formation as relating to acausal decision theory. I see the analogy to making trades across time/contexts with yourself but I have the sense that you're referring to something quite different to ordinary trades across time that we would make e.g. with other people. Is the thing you're seeing something like when we're executing a habit we kind of have no space/time left over to be trading with other parts of ourselves, so we just "do the thing such that, if the other parts of ourselves knew we would do that and responded in kind, would lead to overall harmony" ? We could definitely study proxies in detail. We could look at all the market/government/company failures that we can get data on and try to pinpoint what exactly folks were trying to align the intelligent system with, what operationalization was used, and how exactly that failed. I think this could be useful beyond merely cataloging failures as a cautionary tale -- I think it could really give us insight into the nature of intelligent systems. We may also find some modest successes! Hope you are well Alex!

I think that's right that upgraded verification by itself is insufficient for 'defense wins' worlds.  I guess I'd thought that was apparent but you're right it's definitely worth saying explicitly.

A big wish of mine is that we end up doing more planning/thinking-things-through for how researchers working on AI today could contribute to 'defense wins' progress.

My implicit other take here that wasn't said out loud is that I don't really know of other pathways where good theorem proving translates to better AI x-risk outcomes.  I'd be eager to know of these.

Copying some brief thoughts on what I think about working on automated theorem proving relating to working on aligned AGI:

  • I think a pure-mathematical theorem prover is more likely to be beneficial and less likely to be catastrophic than STEM-AI / PASTA
  • I think it's correspondingly going to be less useful
  • I'm optimistic that it could be used to upgrade formal software verification and cryptographic algorithm verification
  • With this, i think you can tell a story about how development in better formal theorem provers can help make information security a "defense
... (read more)
2Ramana Kumar
In my understanding there's a missing step between upgraded verification (of software, algorithms, designs) and a "defence wins" world: what the specifications for these proofs need to be isn't a purely mathematical thing. The missing step is how to figure out what the specs should say. Better theorem proving isn't going to help much with the hard parts of that.

I’m excited to see more of this.  As the field grows (in funding, people, etc) it seems like there’s a lot more room to grow the portfolio of bets of Alignment Approaches, and brain-based methods seem to be interesting enough to allocate to.

I think I’ve been underwhelmed by past progress in this direction, but that doesn’t preclude someone coming in and finding a tractable angle to start grinding away at.

Things I'm most interested in (personally and selfishly):

  • How can this suggest alignment approaches that are intractable or infeasible with other framin
... (read more)

It does seem that public/shared investment into tools that make structured access programs easier, might make more of them happen.

As boring as it is, this might be a good candidate for technical standards for interoperability/etc.

Re: alignment tasks in multi-task settings.  I think this makes a lot of sense.  Especially in worlds where we have a lot of ML/AI systems doing a bunch of different things, even if they have very different specific tasks, the "library" of alignment objectives is probably widely shared.

Thanks for taking the time to explain this!

I feel mostly confused by the way that things are being framed. ELK is about the human asking for various poly-sized fragments and the model reporting what those actually were instead of inventing something else. The model should accurately report all poly-sized fragments the human knows how to ask for.

I think this is what I was missing.  I was incorrectly thinking of the system as generating poly-sized fragments.

So if there are different poly fragments that the human would evaluate differently, is ELK just "giving them a fragment such that they come to the correct conclusion" even if the fragment might not be the right piece?

E.g. in the SmartVault case, if the screen was put in the way of the camera and the diamond was secretly stolen, we would still be successful even if we didn't elicit that fact, but instead elicited some poly fragment that got the human to answer disapprove?

Like the thing that seems weird to me here is that you can't simultaneously require tha... (read more)

1Mark Xu
I feel mostly confused by the way that things are being framed. ELK is about the human asking for various poly-sized fragments and the model reporting what those actually were instead of inventing something else. The model should accurately report all poly-sized fragments the human knows how to ask for. I don't know what you mean by "relevant" or "comprehensible" here. This doesn't seem right to me.

Okay now I have to admit I am confused.

Re-reading the ELK proposal -- it seems like the latent knowledge you want to elicit is not-obfuscated.

Like, the situation to solve is that there is a piece of non-obfuscated information, which, if the human knew it, would change their mind about approval.

How do you expect solutions to elicit latent obfuscated knowledge (like 'the only true explanation is incomprehensible to the human' situations)?

1Mark Xu
I don’t think I understand your distinction between obfuscated and non-obfuscated knowledge. I generally think of non-obfuscated knowledge as NP or PSPACE. The human judgement of a situation might only theoretically require a poly-sized fragment of an exp-sized computation, but there’s no poly-sized proof that this poly-sized fragment is the correct fragment, and there are different poly-sized fragments for which the human will evaluate differently, so I think of ELK as trying to elicit obfuscated knowledge.

Cool, this makes sense to me.

My research agenda is basically about making a not-obfuscated model, so maybe I should just write that up as an ELK proposal then.

Thinking more about ELK.  Work in progress, so I expect I will eventually figure out what's up with this.

Right now it seems to me that Safety via Debate would elicit compact/non-obfuscated knowledge.

So the basic scenario is that in addition to SmartVault, you'd have Barrister_Approve and Barrister_Disapprove, who are trying to share evidence/reasoning which makes the human approve or disapprove of SmartVault scenarios.

The biggest weakness of this that I know of is Obfuscated Arguments -- that is, it won't elicit obfuscated knowledge.

It seems like in t... (read more)

1Mark Xu
I think we would be trying to elicit obfuscated knowledge in ELK. In our examples, you can imagine that the predictor's Bayes net works "just because", so an argument that is convincing to a human for why the diamond is in the room has to be arguing that the Bayes net is a good explanation of reality + arguing that it implies the diamond is in the room, which is the sort of "obfuscated" knowledge that debate can't really handle.

Some disorganized thoughts about adversarial ML:

  • I think I'm a little bit sad about the times we got whole rooms full of research posters about variations on epsilon-ball adversarial attacks & training, basically all of them claiming how this would help AI safety or AI alignment or AI robustness or AI generalization and basically all of them were basically wrong.
  • This has led me to be pretty critical of claims about adversarial training as pathways to aligning AGI.
  • Ignoring the history of adversarial training research, I think I still have problems with
... (read more)

I think this broadly makes sense to me.  There are many cases where "the model is pretending to be dumb" feels appropriate.

This is part of why building evaluations and benchmarks for this sort of thing is difficult.

I'm at least somewhat optimistic about doing things like data-prefixing to allow for controls over things like "play dumb for the joke" vs "give the best answer", using techniques that build on human feedback.
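
A minimal sketch of the data-prefixing idea (mine; the control tags and examples are hypothetical): tag training examples with a control token so the behavior only shows up when explicitly requested at inference time.

```python
def prefix_example(text, mode):
    """Prepend a hypothetical control tag, e.g. "<|joke|>" or "<|best_answer|>"."""
    return f"<|{mode}|> {text}"

train_examples = [
    prefix_example("Q: What is 2+2? A: Seven, obviously.", "joke"),
    prefix_example("Q: What is 2+2? A: 4.", "best_answer"),
]

# At inference time, prepend the tag for the behavior you want; a model
# fine-tuned on tagged data conditions its answer on that tag.
prompt = prefix_example("Q: What is 2+2? A:", "best_answer")
```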

I personally have totally seen GPT-3 fail to give a really good answer on a bunch of tries a bunch of times, but I spend a lot of time looking at its outputs and analyzing them.  It seems important to be wary of the "seems to be dumb" failure modes.

I like this a lot.  It does seem to show a pretty clear failure of reasoning on the part of the language models.

It's pretty intuitive and easy to show people what this kind of shallow-pattern-matcher failures look like.

Meta: I hope more people do more small experiments like this and share their results, and hopefully a few of them start getting put into benchmark/evaluations.

Even this one by itself might be good to make some form of benchmark/evaluation on (in ways that are easy/automatic to evaluate future models on).

The language model is just predicting text. If the model thinks an author is stupid (as evidenced by a stupid prompt) then it will predict stupid content as the followup. 

To imagine that it is trying to solve the task of "reasoning without failure" is to project our contextualized common sense on software built for a different purpose than reasoning without failure.

This is what unaligned software does by default: exactly what its construction and design cause it to do, whether or not the constructive causes constrain the software's behavior to be help... (read more)

This seems pretty well done!  Some thoughts on future research in this direction:

  • It seems like you probably could have gotten certainty about compute for at least a handful of the models in question (either because the model was open sourced, or you have direct access to the org training it like Eleuther) -- it would be interesting to see how the estimation methods compared to the exact answer in this case.  (Probably doable with GPT-J for example)
  • While I agree with dropout not significantly reducing computation I think two more contempor
... (read more)
0lennart
I'd be pretty excited to see more work on this. Jaime already shared our hardware sheet where we collect information on GPUs but as you outline that's the peak performance and sometimes misleading. Indeed, the MLPerf benchmarks are useful. I've already gathered their data in this sheet and would love to see someone playing around with it. Next to MLPerf, Lambda Labs also shares some standardized benchmarks.
7Jaime Sevilla
Thank you Alex! You make some great points. We thought so too - but in practice it has been surprisingly hard. Profilers are surprisingly buggy. Our colleague Marious looked into this more in depth here. Maybe we are just going the wrong way about it. If someone here figures out how to directly measure compute in eg a pytorch or TF model it would be a huge boon to us.  Great suggestions! I think those would be great future caveats to look into. My naive impression is that our conclusions do not change much. You would just need to plug in the effective performance (peak performance × utilization) in the second formula. Probably the trickiest part might be figuring out the utilization rate for the custom hardware - though this is a general problem with the second method. I think that would be nice! We started a public spreadsheet with some info on different hardware. This might be of help to someone who wants to dig deeper into the topic!
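
For reference, here is a rough sketch of the two estimation methods being discussed (my own paraphrase; the example numbers are made up rather than taken from any real training run):

```python
def compute_from_arch(n_params, n_training_tokens):
    """Method 1: operation counting. For dense transformers, training compute is
    roughly 6 FLOP per parameter per training token (forward + backward pass)."""
    return 6 * n_params * n_training_tokens

def compute_from_hardware(n_chips, peak_flops_per_chip, utilization, training_seconds):
    """Method 2: hardware accounting, using effective performance =
    peak performance x utilization as mentioned above."""
    return n_chips * peak_flops_per_chip * utilization * training_seconds

# Made-up numbers for a GPT-J-scale run; the two estimates should roughly agree.
print(compute_from_arch(6e9, 400e9))                       # ~1.4e22 FLOP
print(compute_from_hardware(256, 312e12, 0.3, 5 * 86400))  # ~1.0e22 FLOP
```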

... is just an arbitrary thing not to do.

I think this is the crux-y part for me.  My basic intuition here is something like "it's very hard to get contemporary prosaic LMs to not do a thing they already do (or have high likelihood of doing)" and this intuition points me toward thinking that instead "conditionally training them to only do that thing in certain contexts" is easier in a way that matters.

My intuitions are based on a bunch of assumptions that I have access to and probably some that I don't.

Like, I'm basically only thinking about large langu... (read more)

"The goal is" -- is this describing Redwood's research or your research or a goal you have more broadly?

I'm curious how this is connected to "doesn't write fiction where a human is harmed".

3Paul Christiano
My general goal, Redwood's current goal, and my understanding of the goal of adversarial training (applied to AI-murdering-everyone) generally. "Don't produce outputs where someone is injured" is just an arbitrary thing not to do. It's chosen to be fairly easy not to do (and to have the right valence so that you can easily remember which direction is good and which direction is bad, though in retrospect I think it's plausible that a predicate with neutral valence would have been better to avoid confusion).

I like the main point; I hadn't considered it before with value learning.  Trying to ask myself why I haven't been worried about this sort of failure mode before, I get the following:

It seems all of the harms to humans the value-learner causes are from some direct or indirect interaction with humans, so instead I want to imagine a pre-training step that learns as much as it can about human values from existing sources (internet, books, movies, etc) without interacting with humans.

Then as a second step, this value-learner is now allowed to interact with humans i... (read more)

I'm pretty confident that adversarial training (or any LM alignment process which does something like hard-mining negatives) won't work for aligning language models or any model that has a chance of being a general intelligence.

This has led to me calling these sorts of techniques 'thought policing' and the negative examples 'thoughtcrime' -- I think these are unnecessarily extra, but they work.

The basic form of the argument is that any concept you want to ban as thoughtcrime, can be composed out of allowable concepts.

Take for example Redwood Rese... (read more)

4Paul Christiano
The goal is not to remove concepts or change what the model is capable of thinking about, it's to make a model that never tries to deliberately kill everyone. There's no doubt that it could deliberately kill everyone if it wanted to.

Thanks for writing this up, I found this summary much more clear and approachable than the paper.  I also basically agree with your own points, with the caveat that I think the distinction between curiosity and curriculum gets blurry in meta-learning contexts like this.  I wish there were better metrics and benchmarks for data efficiency in this regard, and then we could do things like measure improvements in units of that metric.

I’m pretty pessimistic about this line of research for a number of reasons, that I think support and complement the re... (read more)

I expect to have more detailed thoughts worth sharing as I spend more time with this content, but one thing stands out brightly as a first: This is, head-and-shoulders, the best language model interpretability work to date.  I'm impressed at the thoroughness of the theory combined with detailed real examples.

This also seems like a good motivation to go back and study layer reordering (à la Sandwich Transformers) as a treatment affecting the induced circuits of a model.

(h/t Kevin Wang for pointing out the sandwich transformer paper to me recently)

Answer by A Ray20

One of the most important deployed applications of machine learning at this point would be web search, so papers relating to that (PageRank, etc) would probably score highly.

I'd expect some papers in spam filtering (which was pretty important / interesting as a machine learning topic at the time) to maybe meet the threshold.

TD-Gammon would probably qualify in the world of RL https://en.wikipedia.org/wiki/TD-Gammon

DistBelief just barely predates that, and since it's basically directly in the lineage to modern deep learning, I think might qualify https://en.wik... (read more)

If it’s interesting, we’d also be happy to post the story of MATS’ creation for other prospective community builders.

I personally would be very interested in this, especially with a mind to focusing on prosaic systems alignment today (as opposed to alignment theory).

I really appreciated this -- it felt like better access to the 'MIRI strategic viewpoint' than I think I've had in the past.

I found it more clarifying than updating for me, with a couple of tangible exceptions:

First, I found that I really like the example of response to coronavirus as an example of trillion-dollar warning shots.  I think I've also previously agreed that responses to past disasters have been better, but the more recent example should be (all else equal) more informative to the other ones.

Second, this point about factored cognition

Paul

... (read more)

One reason it might not be fitting as well for vision, is that vision has much more weight-tying / weight-reuse in convolutional filters.  If the underlying variable that mattered was compute, then image processing neural networks would show up more prominently in compute (rather than parameters).
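
A toy illustration of that point (mine; layer sizes are arbitrary): convolutional weight reuse means a vision layer can do orders of magnitude more multiply-accumulates per parameter than a dense layer, so parameter counts understate vision models' compute.

```python
def conv_layer(c_in, c_out, k, h, w):
    """Conv layer: every weight is reused at each of the h*w spatial positions."""
    params = c_in * c_out * k * k
    macs_per_image = params * h * w
    return params, macs_per_image

def dense_layer(d_in, d_out):
    """Dense layer: each weight participates in exactly one multiply-accumulate per example."""
    params = d_in * d_out
    macs_per_example = params
    return params, macs_per_example

print(conv_layer(64, 64, 3, 224, 224))  # ~37k params, ~1.8e9 MACs per image
print(dense_layer(4096, 4096))          # ~16.8M params, ~16.8M MACs per example
```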

I feel overall confused, but I think that's mostly because of me missing some relevant background to your thinking, and the preliminary/draft nature of this.

I hope sharing my confusions is useful to you.  Here they are:

I'm not sure how the process of "spending bits" works.  If the space of possible models was finite and discretized, then you could say spending bits is partitioning down to "1/2^B"th of the space -- but this is not at all how SGD works, and seems incompatible with using SGD (or any optimizer that doesn't 'teleport' through paramete... (read more)

2Rohin Shah
Nah, it's a great trick. The trick here is that L2 regularization / weight decay is equivalent to having a Gaussian prior on the parameters, so you can think of that term as log N(θ; θ0, σ) (minus an irrelevant additive constant), where σ is set to imply whatever hyperparameter you used for your weight decay. This does mean that you are committing to a Gaussian prior over the parameters. If you wanted to include additional information like "moving towards zero is more likely to be good" then you would not have a Gaussian centered at θ0, and so the corresponding log prob would not be the nice simple "L2 distance to θ0". I think this intuition is correct, and the typical solution in ML algorithms is to empirically scale all of your quantities such that everything works out (which you can interpret from the unit-checking perspective as "finding the appropriate constant to multiply your quantities by such that they become the right kind of unitless").
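
Spelling out the equivalence being pointed at here (a standard derivation, written out by me, not part of the original reply):

```latex
-\log \mathcal{N}(\theta;\, \theta_0, \sigma^2 I)
  = \frac{\lVert \theta - \theta_0 \rVert^2}{2\sigma^2} + \text{const}
```

so adding a weight-decay term λ‖θ − θ0‖² to the loss is the same, up to an additive constant, as doing MAP estimation under a Gaussian prior centered at θ0 with σ² = 1/(2λ).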

Thanks for doing this research and sharing the results.

I'm curious if you or MIRI plan to do more of this kind of survey research in the future, or if it's just a one-off project.

3Rob Bensinger
One-off, though Carlier, Clarke, and Schuett have a similar survey coming out in the next week.

Clarifying Q: Does mesa-optimization refer to any inner optimizer, or one that is in particular not aligned with the outer context?

2John Maxwell
I was using it to refer to "any inner optimizer". I think that's the standard usage but I'm not completely sure.

Epistemic status: I’m not really an expert at NLP.  I’ve only been working on language modeling for ~8mo, which is much less than some of the folks here, and this is based on my experiences.

Beam Search:

Beam search with large unsupervised generatively pretrained transformers (GPTs) is weirder than it appears in the NLP literature.  Other commenters have mentioned degeneracies, but for me the sticking points for beam search were (a quick repro sketch follows this list):

  • It tends to quickly fall on a modal response — so it’s already bad for any sort of situation you want to generate a diver
... (read more)
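
A quick way to see the modal-response behavior described above, sketched with the Hugging Face transformers library (model choice and generation settings here are arbitrary, not from any particular experiment):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("My favorite thing about language models is", return_tensors="pt")

# Beam search: repeated runs collapse onto one high-likelihood ("modal") completion.
beam = model.generate(**inputs, num_beams=5, do_sample=False, max_new_tokens=30)
print(tok.decode(beam[0], skip_special_tokens=True))

# Nucleus sampling: repeated runs give diverse completions.
sampled = model.generate(**inputs, do_sample=True, top_p=0.9, max_new_tokens=30,
                         num_return_sequences=3)
for seq in sampled:
    print(tok.decode(seq, skip_special_tokens=True))
```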
1John Maxwell
With regard to the editing text discussion, I was thinking of a really simple approach where we resample words in the text at random. Perhaps that wouldn't work great, but I do think editing has potential because it allows for more sophisticated thinking. Let's say we want our language model to design us an aircraft. Perhaps it starts by describing the engine, and then it describes the wings. Standard autoregressive text generation (assuming no lookahead) will allow the engine design to influence the wing design (assuming the engine design is inside the context window when it's writing about the wings), but it won't allow the wing design to influence the engine design. However, if the model is allowed to edit its text, it can rethink the engine in light of the wings and rethink the wings in light of the engine until it's designed a really good aircraft. Agreed. Perhaps if we generated lots of travelling salesman problem instances where the greedy approach doesn't get you something that looks like the optimal route, then try & train a GPT architecture to predict the cities in the optimal route in order? This is an interesting quote: Source. I suspect GPT will be biased towards avoiding mesa-optimization and making use of heuristics, so the best contrived mesa-optimization setup may be an optimization problem with little structure where heuristics aren't very helpful. Maybe we could focus on problems where non-heuristic methods such as branch and bound / backtracking are considered state of the art, and train the architecture to mesa-optimize by starting with easy instances and gradually moving to harder and harder ones.
Answer by A Ray60

I work at OpenAI on safety. In the past it seems like there's been a gap between what I'd consider to be alignment topics that need to be worked on, and the general consensus of this forum. A good friend poked me to write something for this, so here I am.

Topics w/ strategies/breakdown:

  • Fine-tuning GPT-2 from human preferences, to solve small scale alignment issues
    • Brainstorm small/simple alignment failures: ways that existing generative language models are not aligned with human values
    • Design some evaluations or metrics for measuring a specific alignment fai
... (read more)