All of Jon Garcia's Comments + Replies

Could part of the problem be that the actor is optimizing against a single grader's evaluations? Shouldn't it somehow take uncertainty into account?

Consider having an ensemble of graders, each learning or having been trained to evaluate plans/actions from different initializations and/or using different input information. Each grader would have a different perspective, but that means that the ensemble should converge on similar evaluations for plans that look similarly good from many points of view (like a CT image crystallizing from the combination of man... (read more)
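For concreteness, here is a minimal sketch of the ensemble idea in PyTorch. Everything in it (the plan encoding, the grader architecture, the disagreement penalty) is an illustrative assumption rather than a worked-out proposal; the point is just that aggregating across independently initialized graders and penalizing their disagreement lets uncertainty push back against plans that exploit any single grader.

```python
# Hypothetical sketch: an ensemble of graders whose disagreement is treated
# as uncertainty. Plans that only look good to one grader get penalized.
import torch
import torch.nn as nn

PLAN_DIM = 32      # illustrative: plans are encoded as fixed-size vectors
N_GRADERS = 8      # number of independently initialized graders


def make_grader() -> nn.Module:
    # Each grader is a small scoring network with its own random initialization
    # (in practice each would also be trained on different data / features).
    return nn.Sequential(nn.Linear(PLAN_DIM, 64), nn.ReLU(), nn.Linear(64, 1))


graders = [make_grader() for _ in range(N_GRADERS)]


def ensemble_score(plan: torch.Tensor, risk_aversion: float = 1.0) -> torch.Tensor:
    # Score the plan under every grader, then aggregate.
    scores = torch.stack([g(plan).squeeze(-1) for g in graders])
    mean, std = scores.mean(), scores.std()
    # Penalize disagreement: a plan that exploits one grader's blind spot
    # will have high variance across the ensemble and lose its advantage.
    return mean - risk_aversion * std


plan = torch.randn(PLAN_DIM)
print(ensemble_score(plan))
```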

Alex Turner
I think the problem is that none of the graders are actually embodying goals. If you align the agent to some ensemble of graders, you're still building a system which runs computations at cross-purposes, where part of the system (the actor) is trying to trick another part (each individual grader), which is trying not to be tricked. In this situation, I would look for a way of looking at alignment such that this unnatural problem disappears. A different design pattern must exist, insofar as people are not optimizing for the outputs of little graders in their own heads.
David Manheim
This relates closely to how to "solve" Goodhart problems in general. Multiple metrics / graders make exploitation more complex, but have other drawbacks. I discussed the different approaches in my paper here, albeit in the realm of social dynamics rather than AI safety.

Well, then computing that loss function would just take a really long time.

So, it's not impossible in principle if you trained the loss function as I suggested (loss function trained by reinforcement learning, then applied to train the actual novel-generating model), but it is a totally impractical approach.

If you really wanted to teach an AI to generate good novels, you'd probably start by training a LLM to imitate existing novels through some sort of predictive loss (e.g., categorical cross-entropy on next-token prediction) to give it a good prior. Then tr... (read more)
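A minimal sketch of that pretraining step, assuming the standard setup (next-token prediction with categorical cross-entropy). The tiny recurrent model and random "corpus" below are placeholders for a real LLM and real novels.

```python
# Minimal sketch of next-token-prediction pretraining with a categorical
# cross-entropy loss. Placeholder model and data, not a real training setup.
import torch
import torch.nn as nn

VOCAB_SIZE = 10_000
embed = nn.Embedding(VOCAB_SIZE, 256)
lstm = nn.LSTM(256, 512, batch_first=True)   # stand-in for a real LLM backbone
head = nn.Linear(512, VOCAB_SIZE)

params = list(embed.parameters()) + list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, VOCAB_SIZE, (8, 128))   # placeholder batch of token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # each position predicts the next token

hidden, _ = lstm(embed(inputs))                   # (batch, seq, 512)
logits = head(hidden)                             # (batch, seq, vocab)
loss = loss_fn(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
loss.backward()
optimizer.step()
```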

Alex Flint
Interesting - why is that?

With humans in the loop, there actually is a way to implement such a loss function. Unfortunately, computing the function takes as long as it takes for several humans to read a novel and aggregate their scores. And there's also no way to compute the gradient. So by that point, it's pretty much just a reinforcement learning signal.

However, you could use that human feedback to train a side network to predict the reward signal based on what the AI generates. This second network would then essentially compute a custom loss function (asymptotically approaching with m... (read more)
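A hedged sketch of that side network: a small reward model fit to human ratings, which can then stand in for the slow human evaluation and supply gradients. The fixed-size novel embeddings and the score format are assumptions made for illustration only.

```python
# Sketch of a learned reward model trained on human ratings of generated text.
# All shapes and names are illustrative.
import torch
import torch.nn as nn

EMBED_DIM = 512   # assume generated novels are summarized as fixed-size embeddings

reward_model = nn.Sequential(
    nn.Linear(EMBED_DIM, 256), nn.ReLU(), nn.Linear(256, 1)
)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Placeholder data: embeddings of AI-generated novels plus aggregated human scores.
novel_embeddings = torch.randn(64, EMBED_DIM)
human_scores = torch.rand(64, 1)                    # e.g., normalized reader ratings

pred = reward_model(novel_embeddings)
loss = nn.functional.mse_loss(pred, human_scores)   # learn to predict human judgment
loss.backward()
optimizer.step()
```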

Alex Flint
Interesting. I have the sense that we would have to get humans to reflect for years after reading a novel to produce a rating that, if optimized, would produce truly great novels. I think that when a novel really moves a person (or, even more importantly, moves a whole culture), it's not at all evident that this has happened until (often) years after the fact. I also have the sense that part of what makes a novel great is that a person or a culture decides to associate a certain beautiful insight with it due to the novel's role in provoking that insight. But usually the novel is only partly responsible for the insight, and in part we choose to make the novel beautiful by associating it in our culture with a beautiful thing (and this associating of beautiful things is a good and honest thing to do).

Memory ≠ Unstructured memory (and likewise, locally-random ≠ globally-random): [...]

Agreed. I didn't mean to imply that you thought otherwise.

"just" a memory system + learning algorithm—with a dismissive tone of voice on the "just": [...]

I apologize for how that came across. I had no intention of being dismissive. When I respond to a post or comment, I typically try to frame what I say for a typical reader as much as for the original poster. In this case, I had a sense that a typical reader could get the wrong impression about how the neocort... (read more)

Great summary of the argument. I definitely agree that this will be an important distinction (learning-from-scratch vs. innate circuitry) for AGI alignment, as well as for developing a useful Theory of Cognition. The core of what motivates our behavior must be innate to some extent (e.g., heuristics that evolution programmed into our hypothalamus that tell us how far from homeostasis we're veering), to act as a teaching signal to the rest of our brains (e.g., learn to generate goal states that minimize the effort required to maintain or return to homeostas... (read more)
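As a toy illustration of that division of labor (my own sketch, not anything from the post): a hard-coded "hypothalamus" scores distance from homeostatic set points, and a learned-from-scratch policy is trained against that innate signal. All numbers and shapes are made up, and there are no environment dynamics here.

```python
# Illustrative sketch: an innate, hard-coded module scores how far the system
# is from homeostatic set points, and that score is the teaching signal for an
# otherwise learned-from-scratch policy. Single placeholder step, no real environment.
import torch
import torch.nn as nn
from torch.distributions import Categorical

SET_POINTS = torch.tensor([37.0, 0.5, 0.3])   # made-up set points: temperature, energy, hydration

def innate_reward(internal_state: torch.Tensor) -> torch.Tensor:
    # "Hypothalamus": hard-coded scoring of distance from homeostasis.
    return -torch.sum((internal_state - SET_POINTS) ** 2)

policy = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 4))  # learned from scratch
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = SET_POINTS + torch.randn(3) * 0.1      # placeholder internal state
dist = Categorical(logits=policy(state))
action = dist.sample()
reward = innate_reward(state)                  # innate teaching signal
loss = -dist.log_prob(action) * reward         # REINFORCE-style update of the learned part
loss.backward()
optimizer.step()
```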

Steve Byrnes
Thanks! I'm not sure we have much disagreement here. Some relevant issues are:

* Memory ≠ Unstructured memory (and likewise, locally-random ≠ globally-random): There's certainly a neural architecture, with certain types of connections between certain macroscopic regions.
* "just" a memory system + learning algorithm—with a dismissive tone of voice on the "just": Maybe you didn't mean it that way, but for the record, I would suggest that to the extent that you feel wowed by something the neocortex does, I claim you should copy that feeling, and feel equally wowed by "what learning-from-scratch algorithms are capable of". The things that ML people are doing right now are but a tiny corner of a vast universe of possible learning-from-scratch algorithms.
* Generative models can be learned from scratch—obviously, e.g. StyleGAN. I imagine you agree, but you mentioned "generative", so I'm just throwing this in.
* Dynamics is not unrelated to neural architecture: For example, there's a kinda silly sense in which GPT-3 involves a "wave of activity" that goes from layer 1 through layers 2, 3, 4, …, ending in layer 96. I'm not saying anything profound here—it's just that GPT-3 happens to be feedforward, not recurrent. But anyway, if you think it's important that the neocortex has a bias for waves that travel in certain directions, I'd claim that such a bias can likewise be built out of a (not-perfectly-recurrent) neural architecture in the brain.
* Lottery ticket hypothesis: I was recently reading Brain From The Inside Out by Buzsáki, and Chapter 13 had a discussion which to my ears sounded like the author was proposing that we should think of brain learning as something like the lottery ticket hypothesis. E.g. "Rich brain dynamics can emerge from the genetically programmed wiring circuits and biophysical properties of neurons during development even without any sensory input or experience." Then "experience [is] a matching process" between data and this "preexisting dictionary of nonsense

Great introduction. As someone with a background in computational neuroscience, I'm really looking forward to what you have to say on all this.

By the way, you seem to use a very broad umbrella for covering "brain-like-AGI" approaches. Would you classify something like a predictive coding network as more brain-like or more prosaic? What about a neural Turing machine? In other words, do you have a distinct boundary in mind for separating the red box from the blue box, or is your classification more fuzzy (or does it not matter)?

glazgogabgolab
If I'm reading your question right, I think the answer is that the distinction depends on whether or not a given architecture has some properties Steve will mention later. Which, given Steve's work, are probably the key properties of "A learned population of Compositional Generative Models + A largely hardcoded Steering Subsystem".

Thanks!

If I'm not mistaken, the things you brought up are at too low a level to be highly relevant for safety, in my opinion. I guess this series will mostly be at Marr's "computational level" whereas you're asking "algorithm-level" questions, more or less. I'll be talking a lot about things vaguely like "what is the loss function" and much less about things vaguely like "how is the loss function minimized".

For example, I think you can train a neural Turing machine with supervised learning, or you can train a neural Turing machine with RL. That distinction... (read more)

The monotonicity principle requires it to be non-decreasing w.r.t. the manifesting of less facts. Roughly speaking, the more computations the universe runs, the better.

I think this is what I was missing. Thanks.

So, then, the monotonicity principle sets a baseline for the agent's loss function that corresponds to how much less can happen to whatever subset of the universe it cares about, getting worse as fewer opportunities remain available due to death or some other kind of stifling. Then the agent's particular value function over universe-states gets added/subtracted on top of that, correct?

Vanessa Kosoy
No, it's not a baseline, it's just an inequality. Let's do a simple example. Suppose the agent is selfish and cares only about (i) the experience of being in a red room and (ii) the experience of being in a green room. And, let's suppose these are the only two possible experiences; it can't experience going from a room in one color to a room in another color or anything like that (for example, because the agent has no memory). Denote G the program corresponding to "the agent deciding on an action after it sees a green room" and R the program corresponding to "the agent deciding on an action after it sees a red room". Then, roughly speaking[1], there are 4 possibilities:

* α∅: The universe runs neither R nor G.
* αR: The universe runs R but not G.
* αG: The universe runs G but not R.
* αRG: The universe runs both R and G.

In this case, the monotonicity principle imposes the following inequalities on the loss function L:

* L(α∅) ≥ L(αR)
* L(α∅) ≥ L(αG)
* L(αR) ≥ L(αRG)
* L(αG) ≥ L(αRG)

That is, α∅ must be the worst case and αRG must be the best case.

1. In fact, manifesting of computational facts doesn't amount to selecting a set of realized programs, because programs can be entangled with each other, but let's ignore this for simplicity's sake.
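A toy check of the inequalities above (my own illustration, not from the comment): encode each outcome as the set of programs the universe runs, and verify that a candidate loss never increases when strictly more programs are realized.

```python
# Toy monotonicity check over the subset lattice of realized programs.
from itertools import combinations

outcomes = {
    "a_none": frozenset(),          # runs neither R nor G
    "a_R": frozenset({"R"}),        # runs R but not G
    "a_G": frozenset({"G"}),        # runs G but not R
    "a_RG": frozenset({"R", "G"}),  # runs both
}

# Candidate loss: fewer realized experiences -> higher loss (worst: nothing runs).
loss = {"a_none": 3.0, "a_R": 1.0, "a_G": 2.0, "a_RG": 0.0}

def is_monotonic(outcomes, loss):
    # For every comparable pair, the outcome with fewer realized programs
    # (less manifest) must have loss >= the outcome with more.
    for a, b in combinations(outcomes, 2):
        if outcomes[a] <= outcomes[b] and loss[a] < loss[b]:
            return False
        if outcomes[b] <= outcomes[a] and loss[b] < loss[a]:
            return False
    return True

print(is_monotonic(outcomes, loss))  # True for this candidate loss
```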

Could you explain what the monotonicity principle is, without referring to any symbols or operators? I gathered that it is important, that it is problematic, that it is a necessary consequence of physicalism absent from cartesian models, and that it has something to do with the min-(across copies of an agent) max-(across destinies of the agent copy) loss. But I seem to have missed the content and context that makes sense of all of that, or even in what sense and over what space the loss function is being monotonic.

Your discussion section is good. I would l... (read more)

Vanessa Kosoy
The loss function of a physicalist agent depends on which computational facts are physically manifest (roughly speaking, which computations the universe runs), and on the computational reality itself (the outputs of computations). The monotonicity principle requires it to be non-decreasing w.r.t. the manifesting of less facts. Roughly speaking, the more computations the universe runs, the better.

This is odd, because it implies that the total destruction of the universe is always the worst possible outcome. And the creation of an additional, causally disconnected world can never be net-negative. For a monotonic agent, there can be no net-negative world[1]. In particular, for selfish monotonic agents (that only assign value to their own observations), this means death is the worst possible outcome and the creation of additional copies of the agent can never be net-negative.

Well, there are the "notation" and "notation reference" subsections; those might help.

Thank you!

1. At least, all of this is true if we ignore the dependence of the loss function on the other argument, namely the outputs of computations. But it seems like that doesn't qualitatively change the picture.
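A compact way to state this (my paraphrase, with notation that is not necessarily the paper's): writing Φ for the set of manifest computational facts and y for the outputs of computations,

```latex
% Informal paraphrase of the monotonicity principle described above.
% Notation (\Phi, y, L) is chosen here for illustration only:
% if every fact manifest in \Phi is also manifest in \Phi',
% then realizing the larger set can only be at least as good (no higher loss).
\[
  \Phi \subseteq \Phi' \;\Longrightarrow\; L(\Phi', y) \le L(\Phi, y)
  \qquad \text{for all computation outputs } y.
\]
```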

So essentially, which types of information get routed for processing to which areas during the performance of some behavioral or cognitive algorithm, and what sort of processing each module performs?

terry.stewart
That sounds right to me.  It gives what types of information are processed in each area, and it gives a very explicit statement about exactly what processing each module performs.   So I look at ACT-R as sort of a minimal set of modules, where if I could figure out how to get neurons to implement the calculations ACT-R specifies in those modules (or something close to them), then I'd have a neural system that could do a very wide variety of psychology-experiment-type-tasks.  As far as current progress goes, I'd say we have a pretty decent way to get neurons to implement the core Production system, and the Buffers surrounding it, but much less of a clear story for the other modules.
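For readers unfamiliar with ACT-R, here is a toy sketch of that control structure: a central production system matches conditions against buffer contents and fires actions that update them. This is my illustration only, not real ACT-R code (the actual software, typically written in Lisp, has far more machinery).

```python
# Toy sketch of an ACT-R-style core: a production system surrounded by buffers.

buffers = {
    "goal": {"task": "count", "current": 2, "target": 4},
    "retrieval": None,   # would hold chunks recalled from declarative memory
}

def match_count(bufs):
    # Production condition: we are counting and haven't reached the target.
    g = bufs["goal"]
    return g["task"] == "count" and g["current"] < g["target"]

def fire_count(bufs):
    # Production action: advance the count (a real model would request the
    # next number from declarative memory via the retrieval buffer).
    bufs["goal"]["current"] += 1
    print("counted to", bufs["goal"]["current"])

productions = [(match_count, fire_count)]

# Central loop: each cycle, pick one production whose condition matches the
# current buffer contents and fire it; stop when nothing matches.
for _ in range(10):
    applicable = [fire for match, fire in productions if match(buffers)]
    if not applicable:
        break
    applicable[0](buffers)
```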

So, from what I read, it looks like ACT-R is mostly about modeling which brain systems are connected to which and how fast their interactions are, not in any way how the brain systems actually do what they do. Is that fair? If so, I could see this framework helping to set useful structural priors for developing AGI (e.g., so we don't make the mistake of hooking up the declarative memory module directly to the raw sensory or motor modules), but I would expect most of the progress still to come from research in machine learning and computational neuroscience.

Abram Demski
I think that's not quite fair. ACT-R has a lot to say about what kinds of processing are happening, as well. Although, for example, it does not have a theory of vision (to my limited understanding anyway), or of how the full motor control stack works, etc. So in that sense I think you are right. What it does have more to say about is how the working memory associated with each modality works: how you process information in the various working memories, including various important cognitive mechanisms that you might not otherwise think about. In this sense, it's not just about interconnection like you said.