We're then going to use a small amount of RL (like, 10 training episodes) to try to point it in this direction. We're going to try to use the RL to train: "Act exactly like [a given alignment researcher] would act."
Why are we doing RL if we just want imitation? Why not SFT on expert demonstrations?
Also, if 10 episodes suffice, why is so much post-training currently done on base models?
If the agent follows EDT, it seems like you are giving it epistemically unsound credences. In particular, the premise is that it's very confident it will go left, and the consequence is that it in fact goes right. This was the world model's fault, not EDT's fault. (It is notable though that EDT introduces this loopiness into the world model's job.)
The idea of dividing failure stories into "failures involving rogue deployments" and "other failures" seems most useful if the following argument goes through:
1. Catastrophes require a very large (superhuman?) quantity and/or quality of intellectual labor
2. Either this labor is done by AIs in approved scaffolds, or it is done in "rogue deployments"
3. Hence the only easy-by-default disaster route is through a rogue deployment
4. Hence if we rule out rogue deployments and very impressive/difficult malicious labor in our scaffolds, we are safe
This seems true f...
This google search seems to turn up some interesting articles (like maybe this one, though I've just started reading it).
Paul [Christiano] called this “problems of the interior” somewhere
Since it's slightly hard to find: Paul references it here (ctrl+f for "interior") and links to this source (once again ctrl+f for "interior"). Paul also refers to it in this post. The term is actually "position of the interior" and apparently comes from military strategist Carl von Clausewitz.
Here's a fun thing I noticed:
There are 16 boolean functions of two variables. Now consider an embedding that maps each of the four pairs {(A=true, B=true), (A=true, B=false), ...} to a point in 2d space. For any such embedding, at most 14 of the 16 functions will be representable with a linear decision boundary.
For the "default" embedding (x=A, y=B), xor and its complement are the two excluded functions. If we rearrange the points such that xor is linearly represented, we always lose some other function (and its complement). In fact...
The variance of the multivariate uniform distribution is largest along the diagonal direction, which is exactly the direction we would want to represent a AND b.
The variance is actually the same in all directions. One can sanity-check by integration that the variance is 1/12 both along the axis and along the diagonal.
In fact, there's nothing special about the uniform distribution here: The variance should be independent of direction for any N-dimensional joint distribution where the N constituent distributions are ind...
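The sanity check mentioned above, done numerically rather than by integration (a quick sketch of my own):

```python
# For two independent Uniform(0,1) coordinates, the variance of the projection
# onto any unit direction is the same (1/12), in particular along an axis and
# along the diagonal.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.uniform(size=(1_000_000, 2))

for name, direction in [("axis", np.array([1.0, 0.0])),
                        ("diagonal", np.array([1.0, 1.0]) / np.sqrt(2))]:
    print(name, (samples @ direction).var())  # both close to 1/12 ≈ 0.0833
```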
POV: I'm in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don't care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values).
Would such a person sacrifice themselves for their children (in situations where doing so would be a fitness advantage)?
Agreed. To give a concrete toy example: Suppose that Luigi always outputs "A", and Waluigi is {50% A, 50% B}. If the prior is {50% luigi, 50% waluigi}, each "A" outputted is a 2:1 update towards Luigi. The probability of "B" keeps dropping, and the probability of ever seeing a "B" asymptotes to 50% (as it must).
This is the case for perfect predictors, but there could be some argument about particular kinds of imperfect predictors which supports the claim in the post.
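Running the arithmetic of the toy example above (my own quick check):

```python
# Prior 50/50 over Luigi (always outputs "A") and Waluigi (50% "A", 50% "B").
prior_luigi = 0.5
for n in range(0, 11):
    p_n_as_given_waluigi = 0.5 ** n
    # Posterior on Luigi after n "A"s in a row (each "A" is a 2:1 update):
    p_luigi = prior_luigi / (prior_luigi + (1 - prior_luigi) * p_n_as_given_waluigi)
    # Unconditional probability that at least one "B" appears in the first n tokens:
    p_ever_b = (1 - prior_luigi) * (1 - p_n_as_given_waluigi)
    print(n, round(p_luigi, 4), round(p_ever_b, 4))
# P(Luigi | n "A"s) climbs toward 1, while P(at least one "B" in n tokens)
# climbs toward 0.5 and never exceeds it.
```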
In ML terms, nearly-all the informational work of learning what “apple” means must be performed by unsupervised learning, not supervised learning. Otherwise the number of examples required would be far too large to match toddlers’ actual performance.
I'd guess the vast majority of the work (relative to the max-entropy baseline) is done by the inductive bias.
As I understand Vivek's framework, human value shards explain away the need to posit alignment to an idealized utility function. A person is not a bunch of crude-sounding subshards (e.g. "If food nearby and hunger > 15, then be more likely to go to food") and then also a sophisticated utility function (e.g. something like CEV). It's shards all the way down, and all the way up.[10]
This read to me like you were saying "In Vivek's framework, value shards explain away .." and I was confused. I now think you mean "My take on Vivek's is that value s...
"Well, what if I take the variables that I'm given in a Pearlian problem and I just forget that structure? I can just take the product of all of these variables that I'm given, and consider the space of all partitions on that product of variables that I'm given; and each one of those partitions will be its own variable.
How can a partition be a variable? Should it be "part" instead?
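For what it's worth, here is how I'd render the quoted idea concretely (the naming is mine, not from the transcript):

```python
# Given variables A and B, form their product space, and treat any partition of
# that product as a new variable whose "value" is the cell containing the
# outcome. The xor partition is an example of a variable obtained this way
# that was not among the originals.
from itertools import product

outcomes = list(product([False, True], repeat=2))  # the product of A and B

# A partition of the product space, written as a labelling of outcomes by cell.
xor_partition = {o: ("cell_odd" if o[0] != o[1] else "cell_even") for o in outcomes}

def read_variable(partition, outcome):
    """The value of the partition-variable is the cell the outcome falls in."""
    return partition[outcome]

print(read_variable(xor_partition, (True, False)))  # -> "cell_odd"
```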
ETA: Koen recommends reading Counterfactual Planning in AGI Systems before (or instead of) Corrigibility with Utility Preservation
Update: I started reading your paper "Corrigibility with Utility Preservation".[1] My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6. AFAICT, section 5 is just setting up the standard utility-maximization framework and defining "superintelligent" as "optimal utility maximizer".
Quick thoughts after reading less than half:
AFAICT,[2] this is a mathematica...
OK, below I will provide links to a few mathematically precise papers about AGI corrigibility solutions, with some comments. I do not have enough time to write short comments, so I wrote longer ones.
The list of links below is not a complete literature overview. I did a comprehensive literature search on corrigibility back in 2019, trying to find all mathematical papers of interest, but have not done so since.
I wrote some of the papers below, and have read all the rest of them. I am not linking to any papers I heard about but did not read (yet).
Math-based w...
- Try to improve my evaluation process so that I can afford to do wider searches without taking excessive risk.
Improve it with respect to what?
My attempt at a framework where "improving one's own evaluator" and "believing in adversarial examples to one's own evaluator" make sense:
Yeah, the right column should obviously be all 20s. There must be a bug in my code[1] :/
I like to think of the argmax function as something that takes in a distribution on probability distributions on the same underlying space but with different sigma algebras, and outputs a partial probability distribution that is defined on the set of all events that are in the sigma algebra of (and given positive probability by) one of the components.
Take the following hypothesis:
If I add this into the mix with some weight, then the middle column is still near...
Now, let's consider the following modification: Each hypothesis is no longer a distribution on the underlying space, but instead a distribution on some coarser partition of it. Now the argmax is still well defined
Playing around with this a bit, I notice a curious effect (ETA: the numbers here were previously wrong, fixed now):
The reason the middle column goes to zero is that hypothesis A puts 60% on the rightmost column, and hypothesis B puts 40% on the leftmost, and neither cares about the middle column specifically.
But philosophically, what d...
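To make the setup concrete (this is just my rendering of it; the names echo the example above, and the numbers come from the 60%/40% description):

```python
# Each hypothesis is a distribution on a *coarsening* of a three-cell space
# {left, middle, right}, so neither hypothesis ever assigns probability to the
# event {middle} on its own -- matching the observation that neither hypothesis
# cares about the middle column specifically.
hypothesis_A = {frozenset({"right"}): 0.6, frozenset({"left", "middle"}): 0.4}
hypothesis_B = {frozenset({"left"}): 0.4, frozenset({"middle", "right"}): 0.6}

def events_with_positive_probability(*hypotheses):
    """Events lying in the sigma algebra of (and given positive probability by)
    at least one component -- cf. the partial-distribution picture above."""
    return {event for h in hypotheses for event, p in h.items() if p > 0}

print(events_with_positive_probability(hypothesis_A, hypothesis_B))
# {middle} by itself is not among these events.
```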
A framing I wrote up for a debate about "alignment tax":
If the system is modular, such that the part of the system representing the goal is separate from the part of the system optimizing the goal, then it seems plausible that we can apply some sort of regularization to the goal to discourage it from being long term.
What kind of regularization could this be? And are you imagining an AlphaZero-style system with a hardcoded value head, or an organically learned modularity?
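To make my question concrete, here is one purely illustrative shape such a regularizer could take (my own sketch, not something the parent comment proposes): if the goal module exposes a learnable discount factor, penalize the implied planning horizon.

```python
# Illustrative sketch only: a goal/value head with a learnable discount factor,
# plus a regularizer on the implied planning horizon 1/(1 - gamma), pushing the
# learned goal toward short-term evaluation. All names are made up.
import torch
import torch.nn as nn

class RegularizedGoalHead(nn.Module):
    def __init__(self, d_model: int, horizon_penalty: float = 1e-2):
        super().__init__()
        self.value = nn.Linear(d_model, 1)
        self.gamma_logit = nn.Parameter(torch.tensor(0.0))  # sigmoid -> gamma
        self.horizon_penalty = horizon_penalty

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.value(features).squeeze(-1)

    def regularizer(self) -> torch.Tensor:
        gamma = torch.sigmoid(self.gamma_logit)
        return self.horizon_penalty * (1.0 / (1.0 - gamma))  # effective horizon

goal_head = RegularizedGoalHead(d_model=64)
dummy_features = torch.randn(8, 64)
values = goal_head(dummy_features)
loss = values.mean() + goal_head.regularizer()  # stand-in for a real task loss
```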
Pr_{π∈ξ}[U(⌈G⌉,π) ≥ U(⌈G⌉,G*)] is the probability that, for a random policy π∈ξ, that policy has worse utility than the policy G* its program dictates; in essence, how good G's policies are compared to random policy selection
What prior over policies?
given g(G|U), we can infer the probability that an agent G has a given utility function U, as Pr[U] ∝ 2^{−K(U)} / Pr_{π∈ξ}[U(⌈G⌉,π) ≥ U(⌈G⌉,G*)], where ∝ means "is proportional to" and K(U) is the Kolmogorov complexity of utility function U.
Suppose the prior over policies is max-entropy (uniform over all action seq...
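To make the quoted formula concrete, here is a toy computation (the simplifications are mine: a small finite policy set with a uniform prior, utilities that ignore the source-code argument ⌈G⌉, and hand-picked description lengths standing in for the uncomputable K(U)):

```python
# Toy instance of Pr[U] ∝ 2^{-K(U)} / Pr_{pi in xi}[U(pi) >= U(G*)].
policies = ["up", "down", "left", "right"]
agent_policy = "right"  # G*

candidate_utilities = {
    # name: (utility function over policies, complexity proxy k in bits)
    "prefers_right": (lambda pi: 1.0 if pi == "right" else 0.0, 3),
    "indifferent":   (lambda pi: 0.0, 2),
}

scores = {}
for name, (U, k) in candidate_utilities.items():
    # Pr_{pi in xi}[U(pi) >= U(G*)] under the uniform prior over policies:
    p_random_at_least_as_good = sum(U(pi) >= U(agent_policy) for pi in policies) / len(policies)
    scores[name] = 2.0 ** (-k) / p_random_at_least_as_good

total = sum(scores.values())
for name, s in scores.items():
    print(name, round(s / total, 3))
# The denominator rewards utility functions under which the agent's actual
# policy looks unusually good relative to a random policy.
```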
I have seen one person be surprised (I think twice in the same convo) about what progress had been made.
ETA: Our observations are compatible. It could be that people used to a poor and slow-moving state of interpretability are surprised by the recent uptick, but that the absolute progress over 6 years is still disappointing.
The perspective and the computations that are presented here (which in my opinion are representative of the mathematical parts of the linked posts and of various other unnamed posts) do not use any significant facts about neural networks or their architecture.
You're correct that the written portion of the Information Loss --> Basin flatness post doesn't use any non-trivial facts about NNs. The purpose of the written portion was to explain some mathematical groundwork, which is then used for the non-trivial claim. (I did not know at the time ...
Note that, for rational *altruists* (with nothing vastly better to do like alignment), voting can be huge on CDT grounds -- if you actually do the math for a swing state, the leverage per voter is really high. In fact, I think the logically counterfactual impact-per-voter tends to be lower than the impact calculated by CDT, if the election is very close.
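A back-of-the-envelope version of that CDT calculation (all the numbers below are made up for illustration; it uses the standard approximation that P(your vote is decisive) is roughly the forecast density of the vote share at 50%, divided by the number of voters):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

N = 3_000_000            # voters in a hypothetical swing state (made up)
mu, sigma = 0.50, 0.02   # forecast two-party vote share: mean 50%, sd 2 points (made up)

p_decisive = normal_pdf(0.5, mu, sigma) / N
value_of_better_outcome = 1e9  # altruistic value difference between outcomes (made up)

print(f"P(decisive) ≈ {p_decisive:.2e}")
print(f"Expected altruistic value of voting ≈ ${p_decisive * value_of_better_outcome:,.0f}")
```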
From this paper, "Theoretical work limited to ReLU-type activation functions, showed that in overparameterized networks, all global minima lie in a connected manifold (Freeman & Bruna, 2016; Nguyen, 2019)"
So for overparameterized nets, the answer is probably:
The loss is defined over all dimensions of parameter space, so it is still a function of all 3 x's; it just doesn't depend on one of them. Its thickness in that direction is infinite, not zero.
Here's what a zero-determinant Hessian corresponds to:
The basin here is not lower dimensional; it is just infinite in some dimension. The simplest way to fix this is to replace the infinity with some large value. Luckily, there is a fairly principled way to do this:
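One candidate for that replacement (my reading, and possibly not what was meant after the colon): an L2 penalty of strength λ adds λ to every Hessian eigenvalue, which puts a finite curvature floor under the exactly-flat directions.

```python
# Sketch: adding (lambda/2)*||theta||^2 to the loss adds lambda to every Hessian
# eigenvalue, so zero-eigenvalue (exactly flat) directions get a finite extent.
import numpy as np

hessian = np.diag([4.0, 1.0, 0.0])  # toy Hessian with one exactly-flat direction
lam = 1e-3                          # L2 strength; this is the "large value" knob

eigs = np.linalg.eigvalsh(hessian)
eigs_reg = np.linalg.eigvalsh(hessian + lam * np.eye(3))

# Treating basin "extent" along each eigendirection as ~ 1/sqrt(eigenvalue):
with np.errstate(divide="ignore"):
    print(1 / np.sqrt(eigs))      # [inf, 1.0, 0.5] -- infinite along the flat direction
    print(1 / np.sqrt(eigs_reg))  # finite everywhere; the cap is set by lam
```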
I will split this into a math reply, and a reply about the big picture / info loss interpretation.
Math reply:
Thanks for fleshing out the calculus rigorously; admittedly, I had not done this. Rather, I simply assumed MSE loss and proceeded largely through visual intuition.
I agree that assuming MSE, and looking at a local minimum, you have
This is still false! Edit: I am now confused, I don't know if it is false or not.
You are conflating two different quantities. Adding disa...
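As a side check (mine, and not necessarily the exact identity being debated above): at a zero-loss minimum of MSE, the loss Hessian reduces to (2/N)·JᵀJ, where J is the Jacobian of the model outputs with respect to the parameters. A finite-difference computation on a toy model confirms this:

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0])
ys = 2.0 * xs                      # chosen so that theta = (1, 2) achieves zero loss
theta0 = np.array([1.0, 2.0])      # a point on the zero-loss manifold

def outputs(theta):
    # Toy model f(theta, x) = theta1 * theta2 * x
    return theta[0] * theta[1] * xs

def loss(theta):
    return np.mean((outputs(theta) - ys) ** 2)

def numerical_hessian(f, theta, eps=1e-4):
    n = len(theta)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = (f(theta + e_i + e_j) - f(theta + e_i - e_j)
                       - f(theta - e_i + e_j) + f(theta - e_i - e_j)) / (4 * eps ** 2)
    return H

J = np.stack([theta0[1] * xs, theta0[0] * xs], axis=1)  # d(outputs)/d(theta)
print(numerical_hessian(loss, theta0))
print(2.0 / len(xs) * J.T @ J)  # matches the finite-difference Hessian
```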
Thanks for this reply, it's quite helpful.
I feel it ought to be pointed out that what is referred to here as the key result is a standard fact in differential geometry called (something like) the submersion theorem, which in turn is essentially an application of the implicit function theorem.
Ah nice, didn't know what it was called / what field it's from. I should clarify that "key result" here just meant "key result of the math so far -- pay attention", not "key result of the whole post" or "profound/original".
...The Jacobian matrix is what you call
About the contours: While the graphic shows a finite number of contours with some spacing, in reality there are infinitely many contour planes, and they completely fill space (as densely as the reals, if we ignore float precision). So at literally every point in space there is a blue contour, and a red one which exactly coincides with it.
Yep, I am assuming MSE loss generally, but as you point out, any smooth and convex loss function will be locally approximately quadratic. "Saddle points all the way down" isn't possible if a global min exists, since a saddle point implies the existence of an adjacent lower point. As for asymptotes, this is indeed possible, especially in classification tasks. I have basically ignored this and stuck to regression here.
I might return to the issue of classification / solutions at infinity in a later post, but for now I will say this: ...
Is the claim here that the 2^200 "persuasive ideas" would actually pass the scrutiny of top human researchers (for example, Paul Christiano studies one of them for a week and concludes that it is probably a full solution)? Or do you just mean that they would look promising in a shorter evaluation done for training purposes?
Though I understood what you meant, perhaps a clearer terminology is all-things-considered beliefs vs. independent impressions.
Do you want to try playing this game together sometime?