All of TLW's Comments + Replies

You would get a 1.01 multiplier in productivity, which would make the speed of development 1.01x faster, especially the development of a Copilot-(N+1),

...assuming that Copilot-(N+1) has <1.01x the development cost of Copilot-N. I'd be interested in arguments as to why this would be the case; most programming has diminishing returns, where e.g. eking out additional performance from a program costs progressively more development time.
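One way to make the condition explicit (a sketch of the arithmetic implicit in these two comments; the notation is not from the thread): write $T_N$ for the wall-clock development time of Copilot-$N$, $m_N \approx 1.01$ for the productivity multiplier it provides, and $c_{N+1}$ for the ratio of Copilot-$(N+1)$'s development cost to Copilot-$N$'s. Then

$$T_{N+1} = \frac{c_{N+1}}{m_N}\, T_N,$$

so successive generations arrive faster, and the loop accelerates, only if $c_{N+1} < m_N$; if each generation costs even slightly more than 1.01x its predecessor to build, the 1.01 productivity multiplier is eaten by the growing cost.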

2Michaël Trazzi
Some arguments for why that might be the case:
-- the more useful it is, the more people use it, and the more telemetry data the model has access to
-- while scaling laws do not exhibit diminishing returns from scaling, most of the development time would be spent on things like infrastructure, data collection, and training, rather than on eking out additional performance
-- the higher the performance, the more people get interested in the field, and the more research is publicly accessible to improve performance by just implementing what is in the literature
(Note: this argument does not apply as a reason why one company could just make a lot of progress without ever sharing any of it.)

Ah, but perhaps your objection is that the difficulty of the AI alignment problem suggests that we do in fact need the analog of perfect zero correlation in order to succeed.

My objection is actually mostly to the example itself.

As you mention:

the idea is not to try and contain a malign AGI which is already not on our side. The plan, to the extent that there is one, is to create systems that are on our side, and apply their optimization pressure to the task of keeping the plan on-course.

Compare with the example:

Suppose we’re designing some secure electronic

... (read more)
5Abram Demski
Fair enough! I admit that John did not actually provide an argument for why alignment might be achievable by "guessing true names". I think the approach makes sense, but my argument for why this is the case does differ from John's arguments here.

Did your description to the participants state that the arguments were high-quality?

1Sam Bowman
I can look up the exact wording if it's helpful, but I assume it's clear from the basic setup that at least one of the arguments has to be misleading.

If there are high-quality arguments for multiple answers, doesn't that "just" mean that the multiple-choice question is itself low-quality?

1Sam Bowman
One of the arguments is quite misleading in most cases, so probably not high-quality by typical definitions. Unfortunately, under the time limit, our readers can't reliably tell which one is misleading. Without arguments and without the time limit, annotators get the questions right with ~90% accuracy: https://arxiv.org/abs/2112.08608

Let us make a distinction here between two cases:

  1. Observing the input and output of a blackbox X, and checking a property thereof.
  2. Whitebox knowledge of X, and checking a property thereof.

In physical systems, we do not have whitebox knowledge. We merely have a finite sample of a blackbox[1]. Sometimes said finite sample of a blackbox appears to match a fairly straightforward machine Y, but that's about the best we can say[2].
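As a minimal sketch of what a finite sample of a blackbox can and cannot establish (the functions, sample size, and names here are invented for illustration):

```python
import random

def blackbox_matches_model(blackbox, model, sample_inputs):
    """Finite-sample check: this can only *refute* "blackbox behaves like model".
    Agreement on every sampled input never establishes equivalence."""
    for x in sample_inputs:
        if blackbox(x) != model(x):
            return False   # definitely not equivalent
    return True            # merely "no counterexample found so far"

# A blackbox that deviates from the straightforward machine Y only on one rare input:
model    = lambda x: x % 7
blackbox = lambda x: x % 7 if x != 10**9 + 7 else 0

samples = [random.randrange(10**6) for _ in range(10_000)]
print(blackbox_matches_model(blackbox, model, samples))
# Almost certainly prints True, even though the two are not equivalent.
```

Passing such a check is the "appears to match a fairly straightforward machine Y" situation: it is evidence, not whitebox knowledge.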

And yes, checking if two specific Turing-complete blackboxes are equivalent is undecidable[3], even though checking if two specific Tur... (read more)

5Abram Demski
You can ensure zero mutual information by building a sufficiently thick lead wall. By convention in engineering, any number is understood as a range, based on the number of significant digits relevant to the calculation. So "zero" is best understood as "zero within some tolerance". So long as we are not facing an intelligent and resourceful adversary, there will probably be a human-achievable amount of lead which cancels the signal sufficiently.

This serves to illustrate the point that sometimes we can find ways to bound an error to within desirable tolerances, even if we do not yet know how to do such a thing in the face of the immense optimization pressure which superhuman AGI would bring to bear on a problem.

We need plans to have achievable tolerances. For example, we need to assume a realistic amount of hardware failure. We can't treat the hardware as blackboxes; we know how it operates, and we have to make use of that knowledge. But we can't pretend perfect mathematical knowledge of it, either; we have error tolerances.

So your blackbox/whitebox dichotomy doesn't fit the situation very well.

But do you really buy the whole analogy with mutual information, IE buy the claim that we can judge the viability of escaping goodhart from this one example, and only object that the judgement with respect to this example was incorrect? Perhaps we should really look at a range of examples, not just one? And judge John's point as reasonable if and only if we can find some cases where effectively perfect proxies were found?

Ah, but perhaps your objection is that the difficulty of the AI alignment problem suggests that we do in fact need the analog of perfect zero correlation in order to succeed. So John's plan sounds doomed to failure, because it relies on finding an actually-perfect proxy, when all realistic proxies are imprecise at least in their physical tolerances.

In which case, I would reply that the idea is not to try and contain a malign AGI which is alre

Your definition of L-knowledge implies there can 'only' be  total possible latent variables in the universe that are L-knowable for any given L, I believe.

This isn't strictly a problem, as you can just increase L... but your upper bound on L before the answer is trivially 'yes' is the inverse Kolmogorov complexity of the program trace + o(1). This grows slower than any computable function.

I'd be concerned that for programs of 'realistic' (read: 'fits within the universe') sizes there is no such L.

This is an interesting observation; I don't see how it addresses my point.

There is no exact solution to mutual information from two finite samples. There is no ε-approximation of mutual information from two finite samples, either.
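To illustrate, a minimal sketch of the standard plug-in estimate of mutual information from paired samples (illustrative only: the result is the mutual information of the empirical distribution, not a certificate about the true one, and it carries no ε guarantee):

```python
import numpy as np

def plugin_mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits from paired samples of two discrete
    variables. Biased, and no finite sample can certify the true value,
    let alone certify that it is exactly zero."""
    n = len(x)
    pxy, px, py = {}, {}, {}
    for xi, yi in zip(x, y):
        pxy[(xi, yi)] = pxy.get((xi, yi), 0.0) + 1.0 / n
        px[xi] = px.get(xi, 0.0) + 1.0 / n
        py[yi] = py.get(yi, 0.0) + 1.0 / n
    return sum(p * np.log2(p / (px[xi] * py[yi])) for (xi, yi), p in pxy.items())

# Two independent fair coins: the true I(X;Y) is zero, but the finite-sample
# estimate is almost always strictly positive.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, 1000)
y = rng.integers(0, 2, 1000)
print(plugin_mutual_information(x, y))
```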

=====

On the topic of said observation: beware that ε-approximations of many things are proven difficult to compute, and in some cases are even uncomputable. (The classic being Chaitin's Constant[1].)

In particular, you very often end up with Halting-problem style contradictions when computing properties of systems capable... (read more)

The Hardness of computing mutual information in general is not a very significant barrier to designing systems with (near-)zero mutual information between two components, in exactly the same way that the Hardness of computing whether a given program halts in general is not a very significant barrier to designing software which avoids infinite loops.

But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system?

You fundamentally cannot, so it's a moot point. There is no way to confirm zero mutual information[1], and even if there were, there is zero probability that the mutual information was zero[2]. Very small, perhaps. Zero, no.

I do not follow your seeming dismissal of this. You acknowledge it, and then... assert it's not a problem?

An analogy: solving the Halting problem is impossible[3]. It is someti... (read more)

There is no way to confirm zero mutual information[1], and even if there were, there is zero probability that the mutual information was zero[2]. Very small, perhaps. Zero, no.

Thanks for bringing this up; it raises a technical point which didn't make sense to include in the post but which I was hoping someone would raise in the comments.

The key point: Goodhart problems are about generalization, not approximation.

Suppose I have a proxy u′ for a true utility function u, and u′ is always within ε of u (i.e. ... (read more)
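A standard bound that makes the "uniform approximation alone is benign" half of this concrete (spelled out here with explicit notation; it is not quoted from the truncated comment): if the proxy $u'$ satisfies $|u'(x) - u(x)| \le \varepsilon$ for every $x$, and $x^\ast$ maximizes $u'$, then for any $x$,

$$u(x^\ast) \;\ge\; u'(x^\ast) - \varepsilon \;\ge\; u'(x) - \varepsilon \;\ge\; u(x) - 2\varepsilon,$$

so optimizing the proxy costs at most $2\varepsilon$ of true utility. The Goodhart failure mode needs the proxy's error to blow up outside the regime where it was fit (a generalization failure), not merely a small-but-nonzero approximation error.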

I must have missed that part; can you point more specifically to what you're referring to?

It feels wrong to refer you back to your own writing, but much of part 4 was dedicated to talking about these short-term predictors being used to combat neural latency and to do... well, short-term predictions. A flinch detector that goes off 100ms in advance is far less useful than a flinch detector that goes off 300ms in advance, but at the same time a short-term predictor that predicts too far in advance leads to feedback when used as a latency counter (as I asked ... (read more)

0Steve Byrnes
Oh, hmm. In my head, the short-term predictors in the cerebellum are for latency-reduction and discussed in the last post, and meanwhile the short-term predictors in the telencephalon (amygdala & mPFC) are for flinching and discussed here. I think the cerebellum short-term predictors and the telencephalon short-term predictors are built differently for different purposes, and once we zoom in beyond the idea of “short-term prediction” and start talking about parameter settings etc., I really don't lump them together in my mind, they're apples and oranges. In the conversation thus far, I thought you were talking about the telencephalon (amygdala & mPFC) ones. If we're talking about instability from the cerebellum instead, we can continue the Post #4 thread.

~

I think I said some things about low-pass filters up-thread and then retracted it later on, and maybe you missed that. At least for some of the amygdala things like flinching, I agree with you that low-pass filters seem unlikely to be part of the circuit (well, depending on where the frequency cutoff is, I suppose). Sorry, my bad.

~

A common trope is that the hippocampus does one-shot learning in a way that vaguely resembles a lookup table with auto-associative recall, whereas other parts of the cortex learn more generalizable patterns more slowly, including via memory recall (i.e., gradual transfer of information from hippocampus to cortex). I'm not immediately sure whether the amygdala does one-shot learning. I do recall a claim that part of PFC can do one-shot learning, but I forget which part; it might have been a different part than we're talking about. (And I'm not sure if the claim is true anyway.) Also, as I said before, with continuous-time systems, “one shot learning” is hard to pin down; if David Burns spends 3 seconds on the ladder feeling relaxed, before climbing down, that's kinda one-shot in an intuitive sense, but it still allows the timescale of synapse changes to be much slower than the time

Certainly, I'm assuming that the intended function is not in O(log n), though I think that's a very mild assumption for any realistic task.

In  time, the brain (or any realistic agent) can do  processing... but receives  sensory data.

I think the prior you're suggesting is basically a circuit size prior. How do you think it differs from that?

Realizable-speed priors are certainly correlated with circuit size priors to some extent, but there are some important differences:

  • The naive circuit size prior assumes gates take O(
... (read more)

To start with, note that if you push your speed bias far enough (e.g. a strong enough circuit depth complexity or Turing machine time complexity penalty), you just get a lookup table that memorizes everything.

This is true in the TM model[1]. This is not true in the circuit-depth complexity model. Remember that an arbitrary lookup table is O(log n) circuit depth. If the function I'm trying to memorize is f(x) = (x & 1), the fastest circuit is O(1), whereas a lookup table is O(log n).

(This gets even worse in models where lookup is [2] or&... (read more)
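A toy calculation of the depth gap being described here (assuming the lookup table is realized as a balanced tree of 2:1 multiplexers, one layer per address bit; the numbers are purely illustrative):

```python
import math

def lut_depth(num_entries: int) -> int:
    """Depth of a balanced multiplexer tree implementing an arbitrary lookup
    table with num_entries entries: one 2:1-mux layer per address bit,
    i.e. O(log n) in the table size n."""
    return math.ceil(math.log2(num_entries))

def and1_depth() -> int:
    """f(x) = x & 1 just forwards the least-significant input bit:
    constant depth, independent of input width."""
    return 1

for bits in (8, 16, 32):
    print(f"{bits} address bits: LUT depth = {lut_depth(2**bits)}, "
          f"(x & 1) depth = {and1_depth()}")
```

So a pure depth penalty pushed to its limit prefers the direct circuit over memorization for functions like x & 1, which is the asymmetry the TM-model intuition misses.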

2Evan Hubinger
Certainly, I'm assuming that the intended function is not in O(log n), though I think that's a very mild assumption for any realistic task. ---------------------------------------- I think the prior you're suggesting is basically a circuit size prior. How do you think it differs from that?

Just to make sure we're on the same page, I made up the “300ms” number, it could be something else.

Sure; the further you get away from ~300ms the less the number makes sense for e.g. predicting neuron latency, as described earlier.

Also to make sure we're on the same page, I claim that from a design perspective, fast oscillation instabilities are bad, and from an introspective perspective, fast oscillation instabilities don't happen. (I don't have goosebumps, then 150ms later I don't have goosebumps, then 150ms later I do have goosebumps, etc.)

I absolutely ... (read more)

0Steve Byrnes
I must have missed that part; can you point more specifically to what you're referring to?

I think practically anywhere in the brain, if A connects to B, then it's a safe bet that B connects to A. (Certainly for regions, and maybe even for individual neurons.) Therefore we have the setup for epileptic seizures, if excitation and inhibition are not properly balanced. Or more generically, if X% of neurons in the brain are active at time t, then we want around X% of neurons in the brain to be active at time t+1. That means that we want each upstream neuron firing event to (on average) cause exactly one net downstream neuron to fire. But individual neurons have their own inputs and outputs; by default, there seems to be a natural failure mode where the upstream neurons excite not-exactly-one downstream neuron, and we get exponential growth (or decay).

My impression is that there are lots of mechanisms to balance excitation and inhibition—probably different mechanisms in different parts of the brain—and any of those mechanisms can fail. I'm not an epilepsy expert by any means (!!), but at a glance it does seem like epilepsy has a lot of root causes and can originate in lots of different brain areas, including areas that I don't think are doing this kind of prediction, e.g. temporal lobe and dorsolateral prefrontal cortex and hippocampus.

I still think you're incorrectly mixing up the time-course of learning (changes to parameters / weights / synapse strengths) with the time-course of an output following a sudden change in input. I think they're unrelated.

To clarify our intuitions here, I propose to go to the slow-learning limit. However fast you've been imagining the parameters / weights / synapse strength changing in any given circumstance, multiply that learning rate by 0.001. And simultaneously imagine that the person experiences everything in their life with 1000× more repetitions. For example, instead of getting whacked by a golf ball once, they get whacked
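The "exactly one net downstream neuron" requirement is a criticality condition; a toy branching-process sketch (all numbers invented, purely illustrative) of how activity explodes or dies out when the net branching ratio drifts away from 1:

```python
import numpy as np

def simulate_activity(branching_ratio, steps=50, start=1000, seed=0):
    """Toy branching process: each step, the active population triggers
    Poisson(branching_ratio * active) net downstream activations.
    ratio > 1: runaway (seizure-like); ratio < 1: die-out; ratio = 1: roughly stable."""
    rng = np.random.default_rng(seed)
    active = start
    history = [active]
    for _ in range(steps):
        active = int(rng.poisson(branching_ratio * active)) if active else 0
        history.append(active)
    return history

for r in (0.95, 1.00, 1.05):
    print(f"branching ratio {r}: activity after 50 steps = {simulate_activity(r)[-1]}")
```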

Yes

Alright, so we at least agree with each other on this. Let me try to dig into this a little further...

Consider the following (very contrived) example, for a 300ms predictor trying to minimize L2[1] norm:

Context is static throughout the below.

t=0, overrider circuit forces output=1.
t=150ms, overrider circuit switches back to loopback mode.
t=450ms, overrider circuit forces output=0.
t=600ms, overrider circuit switches back to loopback mode.

t=900ms, overrider circuit forces output=1.
etc.

Do you agree that the best a slow-learning predictor that's a pure ... (read more)
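A toy reading of this schedule in code, just to pin down what "best" means here (assumptions not stated in the thread: the predictor emits a single constant, error accrues only while the overrider forces a target since loopback mode is self-consistent by construction, and the L2 norm is taken over one 900ms cycle):

```python
import numpy as np

dt = 0.010  # 10 ms steps
cycle  = [1.0] * 15    # t = 0..150ms:   overrider forces output = 1
cycle += [None] * 30   # t = 150..450ms: loopback mode, no error signal
cycle += [0.0] * 15    # t = 450..600ms: overrider forces output = 0
cycle += [None] * 30   # t = 600..900ms: loopback mode
targets = np.array([v for v in cycle if v is not None])

# The L2-optimal *constant* prediction over one cycle is the mean of the
# supervised targets; here the two override phases are equal length, so 0.5.
c = targets.mean()
l2 = np.sqrt(((targets - c) ** 2).sum() * dt)
print(f"best constant output = {c:.2f}, residual L2 over one cycle = {l2:.3f}")
```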

2Steve Byrnes
Just to make sure we're on the same page, I made up the “300ms” number, it could be something else. Also to make sure we're on the same page, I claim that from a design perspective, fast oscillation instabilities are bad, and from an introspective perspective, fast oscillation instabilities don't happen. (I don't have goosebumps, then 150ms later I don't have goosebumps, then 150ms later I do have goosebumps, etc.)

Sure. But to make sure we're on the same page, the predictor is trying to minimize L2 norm (or whatever), but that's just one component of a system, and successfully minimizing the L2 norm might or might not correspond to the larger system performing well at its task. So “zero error rate” doesn't necessarily mean “good design”.

Sorry, I'm confused. There's an I and a D? I only see a P.

It seems to me that you can start a startle reaction quickly (small fraction of a second), but you can't stop a startle quickly. Hmm, maybe the fastest thing the amygdala does is to blink (mostly <300ms), but if you're getting 3 blink-inducing stimuli a second, your brainstem is not going to keep blinking 3 times a second, instead it will just pinch the eyes shut and turn away, or something. (Source: life experience.) (Also, I can always pull out the “Did I say 300ms prediction? I meant 100ms” card…)

If the supervisor is really tracking the physiological response (sympathetic nervous system response, blink reaction, whatever), and the physiological response can't oscillate quickly (even if its rise-time by itself is fast), then likewise the supervisor can't oscillate quickly, right? Think of it like: once I start a startle-reaction, then it flips into override mode for a second, because I'm still startle-reacting until the reaction finishes playing out.

Hmm, I think I want to forbid fast updates of the adjustable parameters / weights (synapse strength or whatever), and I also want to stay very very far away from any situation where there might be fast oscillations th

I disagree; let me try to work through where we diverge.

A 300ms predictor outputting a sine wave with period 300ms into its own supervisor input has zero error, and hence will continue to do so regardless of the learning rate.

Do you at least agree that in this scheme a predictor outputting a sine wave with period 300ms has zero error while in defer-to-predictor mode?
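Spelled out, under one reading of the wiring (in defer-to-predictor mode the looped-back supervisor at time $t$ is scored against the prediction emitted at $t - \Delta$, with horizon $\Delta = 300\,\mathrm{ms}$): the error is

$$e(t) = x(t) - x(t - \Delta),$$

which vanishes identically for any output $x$ with period $\Delta$, e.g. $x(t) = \sin(2\pi t/\Delta)$, so no error signal ever pushes the weights away from such a solution, at any learning rate.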

The predictor is a parametrized function output = f(context, parameters) (where "parameters" are also called "weights"). If (by assumption) context is static, then you're running the function o

... (read more)
2Steve Byrnes
Yes

Hmm, I think we're mixing up two levels of abstraction here. At the implementation level, there are no real-valued signals, just spikes. But at the algorithm level, it's possible that the neuron operations are equivalent to some algorithm that is most simply described in a way that does not involve any spikes, and does involve lots of real-valued signals. For example, one can vaguely imagine setups where a single spike of an upstream neuron isn't sufficient to generate a spike on the downstream neuron, and you only get effects from a neuron sending a train of spikes whose effects are cumulative. In that case, the circuit would be basically incapable of "fast" dynamics (i.e. it would have implicit low-pass filters everywhere), and the algorithm is really best thought of as "doing operations" on average spike frequencies rather than on individual spikes.

Oh sorry if I was unclear. I was never talking about exponential weighted average. Let's say our trained model is f(context, θ) (where θ is the parameters a.k.a. weights). Then with static context, I was figuring we'd have a differential equation vaguely like:

$\frac{\partial \vec{\theta}}{\partial t} \propto -\nabla_{\vec{\theta}} \left(f(\text{context}, \vec{\theta}) - \text{supervisor}\right)^2$

I was figuring that (in the absence of oscillations) the solution to this differential equation might look like θ(t) asymptotically approaching a limit wherein the error is zero, and I was figuring that this asymptotic approach might look like an exponential with a timescale of a few seconds. I'm not sure if it would be literally an exponential. But probably some kind of asymptotic approach to a steady-state. And I was saying (in a confusing way) that I was imagining that this asymptotic approach would take a few seconds to get most of the way to its limit.

If we go to the Section 5.2.1.1 example of David on the ladder, the learning is happening while he has calmed down, but is still standing at the top of the ladder. I figure he probably stayed up for at least 5 or 10 seconds after calming down but before climbing d
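A minimal numerical sketch of that differential equation (assumptions: f linear in θ, static context, all numbers made up), showing the roughly exponential decay of the error toward zero when the learning rate is small:

```python
import numpy as np

# dθ/dt ∝ -∇_θ (f(context, θ) - supervisor)^2, discretized with a small step.
context = np.array([1.0, -0.5, 2.0])   # static context
supervisor = 0.7
theta = np.zeros(3)
lr = 0.01                              # small learning rate ~ slow synapse changes

errors = []
for _ in range(2000):
    err = context @ theta - supervisor
    theta -= lr * 2 * err * context    # gradient of the squared error
    errors.append(abs(err))

print(errors[0], errors[500], errors[1999])  # roughly exponential approach to zero
```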

I'm liking[1] this theory more and more.

In the static-context case, let’s first consider what happens when the switch is sitting in “defer-to-predictor mode”: Since the output is looping right back to the supervisor, there is no error in the supervised learning module. The predictions are correct. The synapses aren’t changing. Even if this situation is very common, it has no bearing on how the short-term predictor eventually winds up behaving.

One solution to a -300ms delay connected to its own input is a constant output. However, this is part of an in... (read more)

2Steve Byrnes
The predictor is a parametrized function output = f(context, parameters) (where "parameters" are also called "weights"). If (by assumption) context is static, then you're running the function on the same inputs over and over, so you have to keep getting the same answer. Unless there's an error changing the parameters / weights. But the learning rate on those parameters can be (and presumably would be) relatively low. For example, the time constant (for the exponential decay of a discrepancy between output and supervisor when in "override mode") could be many seconds. In that case I don't think you can get self-sustaining oscillations in "defer to predictor" mode.

Then maybe you'll say "What if it's static context except that there's a time input to the context as well?" But I still don't see how you would learn oscillations that aren't in the exogenous data.

There could also be a low-pass filter on the supervisor side. Hmm, actually, maybe that amounts to the same thing as the slow parameter updates I mentioned above.

I think I disagree that "perfect predictors" are what's wanted here. The input data is a mix of regular patterns and noise / one-off idiosyncratic things. You want to learn the patterns but not learn the noise. So it's good to not immediately and completely adapt to errors in the model. (Also, there's always learning-during-memory-replay for genuinely important things that happen only once and quickly.)

Alright, I see what you're saying now. Thanks for the conversation!

Why don't you think that behavior might be solely driven by the hypothalamus & brainstem?

I tend to treat hypothalamus & brainstem reactions as limited to a single rote set of (possibly-repetitive) motions driven by a single clear stimulus. The sort of thing that I could write a bit of Python-esque pseudocode for.

Withdrawal reflexes match that. Hormonal systems match that[1]. Blink reflex matches that. Suckling matches that. Pathfinding from point A to any of points B-Z in the presence of dynamic obstacles, properly orienting, then suckling? Not so ... (read more)
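For concreteness, a sketch of the kind of Python-esque pseudocode being gestured at: one clear trigger, one rote response (every trigger name, threshold, and action here is invented for illustration, not a claim about actual circuitry):

```python
def brainstem_reflex_step(sensors):
    """Toy 'hardwired reflex' controller: each rule is a single clear stimulus
    mapped to a single rote (possibly repetitive) response."""
    actions = []
    if sensors["corneal_contact"]:           # blink reflex
        actions.append("close_eyelids")
    if sensors["skin_heat_C"] > 45:          # withdrawal reflex
        actions.append("withdraw_limb")
    if sensors["blood_osmolality"] > 300:    # hormonal-style setpoint response
        actions.append("release_vasopressin")
    return actions

print(brainstem_reflex_step(
    {"corneal_contact": True, "skin_heat_C": 30, "blood_osmolality": 295}))
# -> ['close_eyelids']
```

Dynamic pathfinding to a nipple around obstacles is the sort of thing that does not reduce to a rule list like this, which is the distinction being drawn.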

4Steve Byrnes
I'm not 100% sure and didn't chase down the reference, but in context, I believe the claim “the [infant decorticate rats] appear to suckle normally and develop into healthy adult rats” should be read as “they find their way to their mother's nipple and suckle”, not just “they suckle when their mouth is already in position”.

Pathfinding to a nipple doesn't need to be “pathfinding” per se, it could potentially be as simple as moving up an odor gradient, and randomly reorienting when hitting an obstacle. I dunno, I tried watching a couple videos of neonatal mice suckling their mothers (1,2) and asking myself “could I write python-esque pseudocode that performed as well as that?” and my answer was “yeah probably, ¯\_(ツ)_/¯”. (Granted, this is not a very scientific approach.)

“Shared training data” includes not only the laws of physics but also the possession of a human brain and body. For example, I might speculate that both sharp objects and “sharp” noises are causes of unpleasantness thanks to our innate brainstem circuits, and all humans have those circuits, therefore all humans might have a shared tendency to give similar answers to the bouba/kiki thing. Or even if that specific story is wrong, I can imagine that something vaguely like that might be responsible.

Fair! There are many plausible models that the human brain isn't.

My comments above also apply here—in the event that this is true (which I'd still bet against, at least in the human case), it wouldn't impact anything of significance for the series, I think.

I haven't seen much of anything (beyond the obvious) that said sketch explicitly contradicts, I agree.

I realize now that I probably should have explained the why (as opposed to the what) of my sketch a little better[1].

Your model makes a fair bit of intuitive sense to me; your model has an immediately-ob... (read more)

3Steve Byrnes
Hmm. Why don't you think that behavior might be solely driven by the hypothalamus & brainstem?

For what it's worth, decorticate infant rats (rats whose cortex was surgically removed [yikes]) “appear to suckle normally” according to Carter, Witt, Kolb, Whishaw 1982. That’s not definitive evidence (decortication is only a subset of the hypothetical de-Learning-Subsystem-ification) but I find it suggestive, at least in conjunction with other things I know about the brainstem.

As I noted in Post #2, “even a 3-month-old infant has had 4 million waking seconds of “training data” to learn from”. That makes it hard to rule out learning, or at least it's hard in the absence of additional arguments, I think.

The cerebellum sits in the middle of the action, always saying to itself “what signal is about to appear here?”, and then it preemptively sends it. And then a fraction of a second later, it sees whether its prediction was correct, and updates its models if it wasn’t.

How does this cope with feedback loops?

Or is the implicit assumption here that the prediction lookahead is always less than the minimum feedback time delay? (If so, how does it know that?)

3Steve Byrnes
I've been assuming the latter... Seems to me that there's enough latency in the whole system that it can be usefully reduced somewhat without any risk of reducing it below zero and thus causing instability etc. I can imagine different signals being hardwired to different time-intervals (possibly as a function of age). I can also imagine the interval starts low, and creeps up, millisecond by millisecond over the weeks, as long as the predictions keep being accurate, and conversely creeps down when the predictions are inaccurate. (I think that would work in principle, right?) That's all just idle speculation. I can't immediately think of cerebellum-specific information that would shed light on this. It's a good question :)
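A toy version of the "creep the interval up while predictions stay accurate" idea floated above (every number and name is invented; the comment itself offers this only as idle speculation):

```python
def update_lookahead(lookahead_ms, prediction_error,
                     tol=0.05, step_ms=1, min_ms=0, max_ms=250):
    """Lengthen the prediction horizon while recent predictions are accurate,
    back it off when they are not (toy rule, not a model of the cerebellum)."""
    if prediction_error < tol:
        return min(max_ms, lookahead_ms + step_ms)
    return max(min_ms, lookahead_ms - step_ms)

# Accurate predictions slowly lengthen the horizon...
h = 100
for _ in range(50):
    h = update_lookahead(h, prediction_error=0.01)
print(h)  # 150
# ...and a run of inaccurate ones shortens it again.
for _ in range(20):
    h = update_lookahead(h, prediction_error=0.20)
print(h)  # 130
```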

The argument here is the same as Section 3.2.1 of the previous post: the genome cannot know exactly which neurons (if any) will store any particular learned-from-scratch pattern, and therefore cannot hardwire a connection to them.

This may be more suited as a comment on the previous post. However, it ties into this, so I'll put it here.

I understand the argument that if the upstream circuit learns entirely from scratch, you can't really have hardwired downstream predictors, for lack of anything stable to hardwire them to.

I don't see a clear argument for the ... (read more)

4Steve Byrnes
Thanks for the great comment!

That would be Post #2 :-)

I don't have strong a priori opposition to this (if I understand it correctly), although I happen to think that it's not how any part of the brain works. If it were true, it would mean that “learning from scratch” is wrong, but not in a really significant way that impacts the big-picture lessons I draw from it. There still needs to be a learning algorithm. There still need to be supervisory signals. It's still the case that the eventual meaning of any particular neuron in this learning system is unreliable from the genome's perspective. There still needs to be a separate non-learning sensory processing system if we want specific instinctual reactions with specific sensory triggers that are stable throughout life. It's still the case that 99.99…% of the bits of information in the adult network come from the learning algorithm rather than the genome. Etc. Basically everything of significance that I'll be talking about in the series would still be OK, I think. (And, again, assuming that I'm understanding you correctly.)

Thinking about this, I recall a possible example in the fruit fly, where Li et al. 2020 found that there were so-called “atypical” MBONs (= predictor output neurons) that received input not only from dopamine supervisors and from randomly-pattern-separated Kenyon Cell context signals, but also from various other signals in the brain—signals which (I think) do not pass through any randomizing pattern-separator. If so, and if those connections are playing a “context” role (as opposed to being supervisory signals or real-time hyperparameter settings—I'm not sure which it is), then they could in principle allow the MBONs to do a bit better than chance from birth.

My comments above also apply here—in the event that this is true (which I'd still bet against, at least in the human case), it wouldn't impact anything of significance for the series, I think.