User Comment Replies — AI Alignment Forum

Steering GPT-2-XL by adding an activation vector

I was educated by this, and surprised, and appreciate the whole thing! This part jumped out at me because it seemed like something people trying to "show off, but not really explain" would have not bothered to write about (and also I had an idea):

13. Failing to find a French vector
We could not find a "speak in French" vector after about an hour of effort, but it's possible we missed something straightforward.
Steering vector: "Je m'appelle" - "My name is " before attention layer 6 with coefficient +5

The thought I had was maybe to describe the desired ... (read more)

faul_sname2y121

I found an even dumber approach that works. The approach is as follows:

Take three random sentences of Wikipedia.
Obtain a French translation for each sentence.
Determine the boundaries of corresponding phrases in each English/French sentence pair.
Mark each boundary with "|"
Count the "|"s, call that number n.
For i from 0 to n, make an English->French sentence by taking the first i fragments in English and the rest in French. The resulting sentences look like
The album received mixed to positive reviews, with critics commending the production de nombreu
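The fragment-mixing step above can be sketched roughly as follows. This is only an illustration of the listed steps, not the commenter's actual code: the function name `mix_fragments` and the example sentence pair are made up here, and the phrase boundaries are assumed to have been marked by hand with "|".

```python
def mix_fragments(english: str, french: str, sep: str = "|") -> list[str]:
    """Build English->French sentences from a boundary-marked sentence pair.

    Both inputs are assumed to carry the same number of `sep` markers,
    placed at corresponding phrase boundaries (hypothetical input format).
    """
    en = [part.strip() for part in english.split(sep)]
    fr = [part.strip() for part in french.split(sep)]
    if len(en) != len(fr):
        raise ValueError("sentence pair must have the same number of fragments")

    n = english.count(sep)  # number of boundaries, as in the steps above
    # For i from 0 to n: first i fragments in English, the rest in French.
    return [" ".join(en[:i] + fr[i:]) for i in range(n + 1)]


# Made-up example pair (not from the original comment).
mixed = mix_fragments(
    "The cat | sat on | the mat",
    "Le chat | s'est assis sur | le tapis",
)
for sentence in mixed:
    print(sentence)
```

With n boundaries there are n + 1 fragments, so following the steps literally the last fragment always stays in French, and every generated sentence ends in French.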

JenniferRM3y20

Ooh! Shiny! I forgot that the InfraBayes sequence existed, but when I went back I saw that I "read" the first four of them before "bouncing off" as you say. Just now I tried to dip back in to The Many Faces of Infra-Beliefs (hoping to get a summary) and it is so big! And it is not a summary <3

That post has a section titled "Deconfusing the Cosmic Ray Problem" which could be an entire post... or maybe even an entire sequence of its own if the target audience was like "bright high school students with some calc and trig and stats" and you have to ex... (read more)

4Abram Demski3y

Infradistributions are a generalization of sets of probability distributions. Sets of probability distributions are used in "imprecise bayesianism" to represent the idea that we haven't quite pinned down the probability distribution. The most common idea about what to do when you haven't quite pinned down the probability distribution is to reason in a worst-case way about what that probability distribution is. Infrabayesianism agrees with this idea.

One of the problems with imprecise bayesianism is that they haven't come up with a good update rule -- turns out it's much trickier than it looks. You can't just update all the distributions in the set, because [reasons i am forgetting]. Part of the reason infrabayes generalizes imprecise bayes is to fix this problem. So you can think of an infradistribution mostly as a generalization of "sets of probability distributions" which has a good update rule, unlike "sets of probability distributions".

Why is this great? Mainly because "sets of probability distributions" are actually a pretty great idea for decision theory. Regular Bayes has the "realizability" problem: in order to prove good loss bounds, you need to assume the prior is "realizable", which means that one of the hypotheses in the prior is true. For example, with Solomonoff, this amounts to assuming the universe is computable. Using sets instead, you don't need to have the correct hypothesis in your prior; you only need to have an imprecise hypothesis which includes the correct hypothesis, and "few enough" other hypotheses that you get a reasonably tight bound on loss.

Unpacking that a little more: if the learnability condition is met, then if the true environment is within one of the imprecise hypotheses in the prior, then we can eventually do as well as an agent who just assumed that particular imprecise hypothesis from the beginning (because we eventually learn that the true world is within that imprecise hypothesis). This allows us to get good guara
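To make the "reason in a worst-case way" idea concrete, here is a minimal sketch of a maximin choice over a plain finite set of probability distributions. It illustrates only the imprecise-Bayes starting point described above, not infradistributions or their update rule; the function names, the two-outcome world, and all the numbers are assumptions invented for the example.

```python
def expected_utility(dist, utilities):
    """Expected utility of one action under one probability distribution."""
    return sum(p * u for p, u in zip(dist, utilities))


def worst_case_value(candidate_dists, utilities):
    """Lowest expected utility over every distribution in the set."""
    return min(expected_utility(d, utilities) for d in candidate_dists)


def maximin_action(candidate_dists, utility_table):
    """Pick the action whose worst-case expected utility is highest."""
    return max(
        utility_table,
        key=lambda action: worst_case_value(candidate_dists, utility_table[action]),
    )


# Hypothetical example: outcomes are (rain, dry); we have not pinned down
# P(rain), only that it is either 0.2 or 0.6.
candidate_dists = [[0.2, 0.8], [0.6, 0.4]]
utility_table = {
    "umbrella": [0.5, 0.5],      # same payoff whatever the weather
    "no_umbrella": [-1.0, 1.0],  # bad if it rains, good if dry
}
print(maximin_action(candidate_dists, utility_table))
```

In this toy setup the agent never has to commit to a single distribution; it only needs the true one to lie somewhere in the set it is guarding against.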

A one-question Turing test for GPT-3

JenniferRM3y10

The language model is just predicting text. If the model thinks an author is stupid (as evidenced by a stupid prompt) then it will predict stupid content as the followup.

To imagine that it is trying to solve the task of "reasoning without failure" is to project our contextualized common sense on software built for a different purpose than reasoning without failure.

This is what unaligned software does by default: exactly what its construction and design cause it to do, whether or not the constructive causes constrain the software's behavior to be help... (read more)

2A Ray3y

I think this broadly makes sense to me. There are many cases where "the model is pretending to be dumb" feels appropriate. This is part of why building evaluations and benchmarks for this sort of thing is difficult. I'm at least somewhat optimistic about doing things like data-prefixing to allow for controls over things like "play dumb for the joke" vs "give the best answer", using techniques that build on human feedback. I personally have totally seen GPT-3 fail to give a really good answer on a bunch of tries a bunch of times, but I spend a lot of time looking at its outputs and analyzing them. It seems important to be wary of the "seems to be dumb" failure modes.

There is essentially one best-validated theory of cognition.

JenniferRM3y00

The flashcard and curriculum experiments seem really awesome in terms of potential for applications. It feels like the beginnings of the kind of software technology that would exist in a science fiction novel where one of the characters goes into a "learning pod" built by a high tech race, and pops out a couple days later knowing how to "fly their spaceship" or whatever. Generic yet plausible plot-hole-solving super powers! <3

There is essentially one best-validated theory of cognition.

JenniferRM3y00

The idea of the physical brain turning out to be similar to ACT-R after the code had been written based on high level timing data and so on... seems like strong support to me. Nice! Real science! Predicting stuff in advance by accident! <3

My memory from exploring this in the past is that I ran into some research with "math problem solving behavior" with human millisecond timing for answering various math questions that might use different methods... Googling now, this Tenison et al ACT-R arithmetic paper might be similar, or related?

With you being an ex... (read more)

0terry.stewart3y

Yes, that Tenison paper is a great example of arithmetic modelling in ACT-R, and especially connecting it to the modern fMRI approach for validation!

For an example of the other sorts of math modelling that's more psychology-experiment-based, this paper gives some of the low-level detail about how such a model would work, and maps it onto human errors:

- "Toward a Dynamic Model of Early Algebra Acquisition" https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.53.5754&rep=rep1&type=pdf

(that work was expanded on a few times, and led to things like "Instructional experiments with ACT-R “SimStudents”" http://act-r.psy.cmu.edu/?post_type=publications&p=13890 where they made a bunch of simulated students and ran them through different teaching regimes)

As for other cool tasks, the stuff about playing some simple video games is pretty compelling to me, especially in as much as it talks about what sort of learning is necessary for the precise timing that develops. http://act-r.psy.cmu.edu/wordpress/wp-content/uploads/2019/03/paper46a.pdf Of course, this is not as good in terms of getting a high score as modern deep learning game-playing approaches, but it is very good in terms of matching human performance and learning trajectories.

Another model I find rather cool is a model of driving a car, which then got combined with a model of sleep deprivation to generate a model of sleep-deprived driving: http://act-r.psy.cmu.edu/wordpress/wp-content/uploads/2012/12/9822011-gunzelmann_moore_salvucci_gluck.pdf

One other very cool application, I think, is the "SlimStampen" flashcard learning tool developed out of Hedderik van Rijn's lab at the University of Groningen, in the Netherlands: http://rugsofteng.github.io/Team-5/ The basic idea is to optimize learning facts from flashcards by presenting a flashcard fact just before the ACT-R declarative memory model predicts that a person is going to forget a fact. This seems to improve learning considerably http://act-r.psy
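The flashcard idea can be sketched with the standard ACT-R base-level learning equation, B = ln(sum over past presentations of (time since presentation)^(-d)), with the decay d conventionally set to 0.5. What follows is a simplified illustration only, not the SlimStampen implementation: the forgetting threshold, the time step, the search horizon, and both function names are assumptions chosen for the example.

```python
import math


def base_level_activation(now, presentation_times, decay=0.5):
    """ACT-R base-level activation of a fact at time `now`.

    Each earlier presentation contributes (now - t) ** -decay, so recent
    and frequent presentations give higher activation.
    """
    ages = [now - t for t in presentation_times if t < now]
    if not ages:
        return float("-inf")  # never presented yet
    return math.log(sum(age ** -decay for age in ages))


def next_review_time(presentation_times, threshold, decay=0.5,
                     step=1.0, horizon=100_000.0):
    """First time (on a grid of `step`) at which the predicted activation
    falls below `threshold`, i.e. roughly when forgetting is predicted.

    `threshold`, `step`, and `horizon` are hypothetical tuning knobs.
    """
    now = max(presentation_times) + step
    while now <= horizon:
        if base_level_activation(now, presentation_times, decay) < threshold:
            return now
        now += step
    return None  # activation stays above threshold within the horizon


# A fact shown once at t=0, versus the same fact shown again at t=5.
print(next_review_time([0.0], threshold=-1.0))
print(next_review_time([0.0, 5.0], threshold=-1.0))
```

A scheduler built on this would show the card slightly before the returned time, and each extra presentation pushes that time further out, which is the spacing behaviour the comment describes.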

There is essentially one best-validated theory of cognition.

JenniferRM3y*240

I think I remember hearing about this from you in the past and looking into it some.

I looked into it again just now and hit a sort of "satiety point" (which I hereby summarize and offer as a comment) when I boiled the idea down to "ACT-R is essentially a programming language with architectural inclinations which cause it to be intuitively easy to see 1:1 connections between parts of the programs and parts of neurophysiology, such that diagrams of brain wiring, and diagrams of ACT-R programs, are easy for scientists to perceptually conflate and make anal... (read more)

terry.stewart3y80

As someone who can maybe call themselves an ACT-R expert, I think the main thing I'd say about the intentional module being "not identified" is that we don't have any fMRI data showing activity in any particular part of the brain being correlated to the use of the intentional module in various models. For all of the other parts that have brain areas identified, there's pretty decent data showing that correlation with activity in particular brain areas. And also, for each of those other areas there's pretty good arguments that those brain areas ... (read more)

2Matthew "Vaniver" Gray3y

Why do they separate out the auditory world and the environment?

Abram Demski3y70

This lines up fairly well with how I've seen psychology people geek out over ACT-R. That is: I had a psychology professor who was enamored with the ability to line up programming stuff with neuroanatomy. (She didn't use it in class or anything, she just talked about it like it was the most mind blowing stuff she ever saw as a research psychologist, since normally you just get these isolated little theories about specific things.)

And, yeah, important to view it as a programming language which can model a bunch of stuff, but requires fairly extensive user in... (read more)

Refactoring Alignment (attempt #2)

JenniferRM4y00

I like the nine-node graph for how it makes the stakes of "how you group the things" more clear, potentially? Also it suggests ways of composing tools maybe?

Personally, I always like to start with, then work backwards from, The Goal.

Then, someone might wonder about the details, and how they might be expanded and implemented and creatively adjusted to safe but potentially surprising ways.

So you work out how to make some external source of power (which is still TBD) somehow serve The Goal (which is now the lower left no... (read more)

3Abram Demski4y

Leaning hard on my technical definitions: These two together do not quite imply behavioral alignment, because it's possible for a model to have a human-friendly mesa-objective but be super bad at achieving it, while being super good at achieving some other objective. So, yes, there is a little bit of gear-grinding if we try to combine the two plans like that. They aren't quite the right thing to fit together. It's like we have a magic vending machine that can give us anything, and we have a slip of paper with our careful wish, and we put the slip of paper in the coin slot. That being said, if we had technology for achieving both intent alignment and robustness, I expect we'd be in a pretty good position! I think the main reason not to go after both is that we may possibly be able to get away with just one of the two paths.


All of JenniferRM's Comments + Replies
