All of Oliver Sourbut's Comments + Replies

This is great, and thanks for pointing at this confusion and for raising the hypothesis that it could be a confusion of language! I also have this sense.

I'd strongly agree with separating out 'deception' per se as importantly different from the more specific phenomena. Deception is just, yes, obviously something that can and does happen.

I tend to use 'deceptive alignment' slightly more broadly - i.e. something could be deceptively aligned post-training, even if all updates after that point are 'in context' or whatever analogue is relevant at that time. Right? This would be... (read more)

2Joe Collman
I think the broader use is sensible - e.g. to include post-training. However, I'm not sure how narrow you'd want [training hacking] to be. Do you want to call it training only if NN internals get updated by default? Or just that it's training hacking if it occurs during the period we consider training? (otherwise, [deceptive alignment of a ...selection... process that could be ongoing], seems to cover all deceptive alignment - potential deletion/adjustment being a selection process). Fine if there's no bright line - I'd just be curious to know your criteria.

FWIW my take is that the evolution-ML analogy is generally an excellent analogy, with a bunch of predictive power, but worth using carefully and sparingly. Agreed that sufficient detail on e.g. DL specifics can screen off the usefulness of the analogy, but it's very unclear whether we have sufficient detail yet. The evolution analogy was originally supposed to point out that selecting a bunch for success on thing-X doesn't necessarily produce thing-X-wanters (which is obviously true, but apparently not obvious enough to always be accepted without provi... (read more)

3Alex Turner
I mean, does it matter? What if it turns out that gradient descent itself doesn't affect inductive biases as much as the parameter->function mapping? If implicit regularization (e.g. SGD) isn't an important part of the generalization story in deep learning, will you down-update on the appropriateness of the evolution/AI analogy?

The problem is that this advantage can oscillate forever.

This is a pretty standard point in RL textbooks. But the culprit is the learning rate (which you set to be 1 in the example, but you can construct a nonconverging case for any constant α)! The advantage definition itself is correct and non-oscillating, it's the estimation of the expectation using a moving average which is (sometimes) at fault.
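To make the learning-rate point concrete, here's a minimal toy sketch (my own illustration, not code from the post): estimating a fixed expectation with a moving average and a constant step size keeps oscillating, while a decaying step size converges.

```python
# Toy sketch: estimate E[r] = 0.5 for rewards alternating between 0 and 1,
# using the usual moving-average update V <- V + alpha * (r - V).
def run(rewards, alphas):
    v, history = 0.0, []
    for r, a in zip(rewards, alphas):
        v += a * (r - v)
        history.append(v)
    return history

rewards = [0.0, 1.0] * 20  # true mean is 0.5

print(run(rewards, [1.0] * 40)[-4:])   # alpha = 1: the estimate jumps 0, 1, 0, 1 forever
print(run(rewards, [0.5] * 40)[-4:])   # any constant alpha: keeps oscillating around 0.5
print(run(rewards, [1 / (t + 1) for t in range(40)])[-4:])  # decaying alpha_t: converges to ~0.5
```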

Oscillating or nonconvergent value estimation is not the cause of policy mode collapse.

2Alex Turner
The advantage is (IIUC) defined with respect to a given policy, and so the advantage can oscillate and then cause mode collapse. I agree that a constant learning rate schedule is problematic, but note that ACTDE converges even with a constant learning rate schedule. So, I would indeed say that oscillating value estimation caused mode collapse in the toy example I gave?

I like the philosophical and strategic take here: let's avoid wireheading; arbitrary reinforcement strength is risky[1]; and hopefully we can get some values-caring-about-human-stuff.

ACTDE seems like a potentially nice complement/alternative to entropy[2] regularisation for avoiding mode collapse (I haven't evaluated it deeply). I think you're misdiagnosing a few things, though.

Overall I think the section about oscillating advantage/value estimation is irrelevant (interesting, but unrelated), and I think you should point the finger less at PPO and advantage estimat... (read more)

2Alex Turner
I disagree. I don't view reward/reinforcement as indicating what is "best" (from our perspective), but as chiseling decision-making circuitry into the AI (which may then decide what is "best" from its perspective). One way of putting a related point: I think that we don't need to infinitely reinforce a line of reasoning in order to train an AI which reasons correctly.  (I want to check -- does this response make sense to you? Happy to try explaining my intuition in another way.)
1Oliver Sourbut
This is a pretty standard point in RL textbooks. But the culprit is the learning rate (which you set to be 1 in the example, but you can construct a nonconverging case for any constant α)! The advantage definition itself is correct and non-oscillating, it's the estimation of the expectation using a moving average which is (sometimes) at fault. Oscillating or nonconvergent value estimation is not the cause of policy mode collapse.

Strong agree with the need for nuance. 'Model' is another word that has been getting horribly mangled recently.

I think the more sensible uses of the word 'agent' I've come across are usually referring to the assemblage of a policy-under-training plus the rest of the shebang: learning method, exploration tricks of one kind or another, environment modelling (if any), planning algorithm (if any) etc. This seems more legit to me, though I still avoid using the word 'agent' as far as possible for similar reasons (discussed here (footnote 6) and here).

Similarly to Dan... (read more)

2Alex Turner
What other, more favorable interpretations might I consider?

Really enjoyed this post, both aesthetically (I like evolution and palaeontology, and obviously AI things!) and as a motivator for some lines of research and thought.

At one point I had a go at connecting natural selection with gradient descent, which you might find useful depending on your aims.

I also collected some cases of what I think are potentially convergent properties of 'deliberating systems', many of them natural, and others artificial. Maybe you'll find those useful, and I'd love to know to what extent you agree or disagree with the concepts there.

This was a great read. Thanks in particular for sharing some introspection on motivation and thinking processes leading to these findings!

Two thoughts:

First, I sense that you're somewhat dissatisfied with using total variation distance ('average action probability change') as a qualitative measure of the impact of an intervention on behaviour. In particular, it doesn't weight 'meaningfulness', and important changes might get washed out by lots of small changes in unimportant cells. When we visualise, I think we intuitively do something richer, but in order... (read more)
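(For reference, this is my reading of 'average action probability change': the total variation distance between the policy's action distributions before and after an intervention, averaged over cells. A minimal sketch in my own notation; the array names and shapes are illustrative, not the post's code.)

```python
import numpy as np

def avg_action_prob_change(policy_before, policy_after):
    # policy_*: arrays of shape (n_cells, n_actions), each row a distribution over actions.
    # Per-cell total variation distance, then averaged over cells.
    tv_per_cell = 0.5 * np.abs(policy_before - policy_after).sum(axis=1)
    return tv_per_cell.mean()

p = np.array([[0.7, 0.2, 0.1], [0.25, 0.25, 0.5]])
q = np.array([[0.1, 0.2, 0.7], [0.25, 0.25, 0.5]])
print(avg_action_prob_change(p, q))  # ~0.3: all of the change comes from the first cell
```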

4Alex Turner
Mostly in a later post. Ultimately, shard theory makes claims about goal/value formation in agents. In particular, some shard-theory flavored claims are:
* That agents will have multiple, contextually activated goals and values
* That we can predict what goals will be activated by considering what historical reinforcement events pertain to a given situation (e.g. is the cheese near the top-right corner, or not?)
* That the multiple goals are each themselves made out of small pieces/circuits called "subshards" which can be separately manipulated or activated or influenced (see e.g. channels 55 and 42 having different effects when intervened upon)
* So—we looked for "shards", and (I think) found them.
* That it's profitable to think of agents as having multiple contextual goals, instead of thinking of them as "optimizing for a fixed objective"
* (I would not have tried this project or its interventions if not for shard theory, and found shard theory reasoning very helpful throughout the project, and have some sense of having cut to empirical truths more quickly because of that theory. But I haven't yet done deep credit assignment on this question. I think a more careful credit assignment will come down to looking at my preregistered predictions and reasoning.)
* That we can predict what goals agents will form by considering their reinforcement schedules,
* And we should gain skill at this art, today, now, in current systems. It seems like a clear alignment win to be able to loosely predict what goals/generalization behavior will be produced by a training process.

There are probably more ties I haven't thought of. But hopefully this gives a little context!

I think Quintin[1] is maybe alluding to the fact that in the limit of infinite counterfactual exploration, then sure, the gradient in sample-based policy gradient estimation will push in that direction. But we don't ever have infinite exploration (and we certainly don't have counterfactual exploration; though we come very close in simulations with resets), so in pure non-lookahead (e.g. model-free) sample-based policy gradient estimation, an action which has never been tried cannot be reinforced (except as a side effect of generalisation by function approxi... (read more)
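A toy sketch of the sample-based point (my own illustration, not from the thread): with a tabular softmax policy and a REINFORCE-style update, the gradient estimate only ever contains log-probability terms for actions that were actually sampled, so a never-tried action gets no direct positive update; its probability only moves via the softmax normalisation (or, with function approximation, via generalisation).

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(3)  # three actions; action 2 will never be explored

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_update(logits, action, reward, lr=0.1):
    probs = softmax(logits)
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0   # gradient of log pi(action) w.r.t. the logits
    return logits + lr * reward * grad_log_pi

for _ in range(200):
    a = rng.integers(0, 2)       # explore only actions 0 and 1
    logits = reinforce_update(logits, a, reward=1.0)

print(softmax(logits))           # action 2's probability has only ever shrunk
```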

1Matthew "Vaniver" Gray
This is the bit I don't believe, actually. [Or at least don't think is relevant.]

Note that in Wei_Dai's hypothetical, the neural net architecture has a particular arrangement such that "how much it optimizes for reward" is either directly or indirectly implied by the neural network weights. [We're providing the reward as part of its observations, and so if nothing else the weights from that part of the input vector to deeper in the network will be part of this, but the actual mechanism is going to be more complicated for one that doesn't have access to that.]

Quintin seems to me to be arguing "if you actually follow the math, there isn't a gradient to that parameter," which I find surprising, and which seems easy to demonstrate by going thru the math. As far as I can tell, there is a gradient there, and it points in the direction of "care more about reward."

This doesn't mean that, by caring about reward more, it knows which actions in the environment cause more reward. There I believe the story that the RL algorithm won't be able to reinforce actions that have never been tried.

[EDIT: Maybe the argument is "but if it's never tried the action of optimizing harder for reward, then the RL algorithm won't be able to reinforce that internal action"? But that seems pretty strained and not very robust, as the first time it considers trying harder to get reward, it will likely get hooked.]

  1. Information inaccessibility is somehow a surmountable problem for AI alignment (and the genome surmounted it),
  2. The genome solves information inaccessibility in some way we cannot replicate for AI alignment, or
  3. The genome cannot directly address the vast majority of interesting human cognitive events, concepts, and properties. (The point argued by this essay)

In my opinion, either (1) or (3) would be enormous news for AI alignment

What do you mean by 'enormous news for AI alignment'? That either of these would be surprising to people in the field? Or th... (read more)

2Alex Turner
Both.

Yup! I perceive many of your points as not really grappling with the key arguments in the post, so I'll step through them. My remarks may come off as aggressive, and I do not mean them as such. I have not yet gained the skill of disagreeing frankly and bluntly without seeming chilly, so I will preface this comment with goodwill!

I think that you're saying "rough proxies" and then imagining it solved, somehow, but I don't see that step? Whenever I try to imagine a "proxy", I get stuck. What, specifically, could the proxy be? Such that it will actually reliably entangle itself with the target learned-concept (e.g. "someone's cheating me"), such that the imagined proxy explains why people care so robustly about punishing cheaters. Whenever I generate candidate proxies (e.g. detecting physiological anger, or just scanning the brain somehow), the scheme seems pretty implausible to me. Do you disagree?

I don't presently see why "a physiological response is produced" is more likely to come out true in worlds where the genome solves information inaccessibility, than in worlds where it doesn't.

Note that all of the imprinting examples rely on direct sensory observables. This is not (1): Information inaccessibility is solved by the genome -- these imprinting examples aren't inaccessible to begin with. (Except "limbic imprinting", I can't make heads or tails of that one. I couldn't quickly understand what a concrete example would be after skimming a few resources.)

My first pass is "I don't feel less confused after reading this potential explanation." More in detail -- "bounded compute" a priori predicts many possible observations; AFAICT it does not concentrate probability onto specific observed biases (like sunk cost or framing effect). Rather, "bounded compute" can, on its own, explain a vast range of behavior. Since AFAICT this explanation assigns relatively low probability to observed data, it loses tons of probability mass compared to other hypo

Another aesthetic similarity which my brain noted is between your concept of 'information loss' on inputs (for layers-which-discriminate vs layers-which-don't) and the concept of sufficient statistics.

A sufficient statistic s(x) is one for which the posterior is independent of the data x, given the statistic:

P(θ | x) = P(θ | s(x))

which has the same flavour as your condition for a layer which doesn't discriminate between inputs.

In the respective cases, the statistic s(x) and the layer's activations are 'sufficient', and induce an equivalence class between xs.
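Spelling out the parallel I have in mind in symbols (my own notation; the post's exact statement of the layer condition may differ):

```latex
% Sufficient statistic: the posterior over theta depends on the data x only
% through the statistic s(x):
P(\theta \mid x) = P(\theta \mid s(x))

% Non-discriminating layer: writing the network as f(x) = g(h(x)), downstream
% behaviour depends on the input only through the layer's activations h(x):
h(x_1) = h(x_2) \implies f(x_1) = f(x_2)
```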

1Vivek Hebbar
Yup, seems correct.

Regarding your empirical findings which may run counter to the question

  1. Is manifold dimensionality actually a good predictor of which solution will be found?

I wonder if there's a connection to asymptotic equipartitioning - it may be that the 'modal' (most 'voluminous' few) solution basins are indeed higher-rank, but that they are in practice so comparatively few as to contribute negligible overall volume?

This is a fuzzy tentative connection made mostly on the basis of aesthetics rather than a deep technical connection I'm aware of.
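A toy numerical version of the 'modal but negligible' intuition from asymptotic equipartitioning (my own illustration, not from the post):

```python
from math import comb

# For 100 biased coin flips with p = 0.9 (heads), the single most probable
# outcome (all heads) is the mode, yet it carries almost no probability mass
# compared to the 'typical' outcomes with roughly 90 heads.
n, p = 100, 0.9

prob_all_heads = p ** n
prob_typical = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(85, 96))

print(f"{prob_all_heads:.2e}")  # ~2.7e-05: the modal outcome contributes negligible mass
print(f"{prob_typical:.2f}")    # ~0.93: the mass concentrates on the 'typical' set
```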

1Vivek Hebbar
Yeah, this seems roughly correct, and similar to what I was thinking.  There is probably even a direct connection to the "asymptotic equipartitioning" math, via manifold counts containing terms like "A choose B" from permutations of neurons.

Interesting stuff! I'm still getting my head around it, but I think implicit in a lot of this is that loss is some quadratic function of 'behaviour' - is that right? If so, it could be worth spelling that out. Though maybe in a small neighbourhood of a local minimum this is approximately true anyway?
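For concreteness, the 'approximately quadratic' reading I have in mind is just the second-order Taylor expansion of the loss in behaviour space (standard result, my notation):

```latex
% Writing f for the network's 'behaviour' (its vector of outputs on the data),
% a smooth loss is locally approximately quadratic around a reference f_0:
L(f) \approx L(f_0) + \nabla L(f_0)^{\top} (f - f_0)
      + \tfrac{1}{2} (f - f_0)^{\top} \, \nabla^2 L(f_0) \, (f - f_0)
% For MSE, L(f) = \| f - y \|^2, the expansion is exact (and the Hessian is 2I).
```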

This also brings to mind the question of what happens when we're in a region with no local minimum (e.g. saddle points all the way down, or asymptoting to a lower loss, etc.)

1Vivek Hebbar
Yep, I am assuming MSE loss generally, but as you point out, any smooth and convex loss function will be locally approximately quadratic.

"Saddle points all the way down" isn't possible if a global min exists, since a saddle point implies the existence of an adjacent lower point. As for asymptotes, this is indeed possible, especially in classification tasks. I have basically ignored this and stuck to regression here.

I might return to the issue of classification / solutions at infinity in a later post, but for now I will say this: It doesn't seem that much different, especially when it comes to manifold dimension; an m-dimensional manifold in parameter space generally extends to infinity, and it corresponds to an (m-1)-dimensional manifold in angle space (you can think of it as a hypersphere of asymptote directions).

I would say the main things neglected in this post are:
  1. Manifold count (most important neglected thing)
  2. Basin width in non-infinite directions
  3. Distance from the origin

These apply to both regression and classification.

I think the gradient descent bit is spot on. That also looks like the flavour of natural selection, with non-infinitesimal (but really small) deltas. Natural selection consumes a proof that a particular change (a mutation) produces improvement (fitness) to generate/propagate/multiply that change.

I recently did some thinking about this and found an equivalence proof, under certain conditions, between the natural selection case and the gradient descent case.
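Something like the following toy sketch shows the flavour (my own illustration here, under simplifying assumptions like isotropic small mutations and keep-the-beneficial-ones selection; not the proof from the linked write-up): the average beneficial small mutation points (nearly) up the fitness gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(theta):
    return -np.sum(theta ** 2)           # smooth fitness with a peak at the origin

def selection_step(theta, sigma=1e-3, n_offspring=1000):
    # Propose many small random mutations and average the beneficial ones,
    # as a crude stand-in for one round of mutation plus selection.
    deltas = sigma * rng.standard_normal((n_offspring, theta.size))
    gains = np.array([fitness(theta + d) - fitness(theta) for d in deltas])
    return deltas[gains > 0].mean(axis=0)

theta = np.array([1.0, -2.0])
grad = -2 * theta                        # analytic gradient of the fitness
step = selection_step(theta)
cosine = step @ grad / (np.linalg.norm(step) * np.linalg.norm(grad))
print(cosine)                            # close to 1: the selected step points (nearly) up the gradient
```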

In general, I think the type signature here can indeed be soft or fuzzy or lossy and you still get consequentialism, and the 'better... (read more)

This post is thoroughly excellent, a good summary and an important service!

However, the big caveat here is that evolution does not implement Stochastic Gradient Descent.

I came here to say that in fact they are quite analogous after all.