LBIT Proofs 4: Propositions 22-28

Diffractor

Joar Skalse

My name is pronounced "YOO-ar SKULL-se" (the "e" is not silent). I'm a PhD student at Oxford University, and I was a member of the Future of Humanity Institute before it shut down. I have worked in several different areas of AI safety research. For a few highlights, see:

Some of my recent research on the theoretical foundations of reward learning is also described in this sequence.

For a full list of all my research, see my Google Scholar.

Sequences

The Theoretical Foundations of Reward Learning

Posts

Sorted by New

10How to Contribute to Theoretical Reward Learning Research

2mo

10Other Papers About the Theory of Reward Learning

2mo

9Defining and Characterising Reward Hacking

2mo

4Misspecification in Inverse Reinforcement Learning - Part II

2mo

6STARC: A General Framework For Quantifying Differences Between Reward Functions

2mo

10Misspecification in Inverse Reinforcement Learning

2mo

9Partial Identifiability in Reward Learning

2mo

15The Theoretical Reward Learning Research Agenda: Introduction and Motivation

2mo

30Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

34My Criticism of Singular Learning Theory

Wikitag Contributions

Comments

Sorted by

Newest

The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Joar Skalse2mo10

No, that is not a misinterpretation: I do think that this research agenda has the potential to get pretty close to solving outer alignment. More specifically, if it is (practically) possible to solve outer alignment through some form of reward learning, then I think this research agenda will establish how that can be done (and prove that this method works), and if it isn't possible, then I think this research agenda will produce a precise understanding of why that isn't possible (which would in turn help to inform subsequent research). I don't think this research agenda is the only way to solve outer alignment, but I think it is the most promising way to do it.

The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Joar Skalse2mo10

I'm not sure -- what significance are you placing on the word "tackle" in this context? I would also not say that the main value proposition of this research agenda lies in identifying the ontology of the reward function --- the main questions for this area of research may even be mostly orthogonal to that question.

Using (Uninterpretable) LLMs to Generate Interpretable AI Code

Joar Skalse7mo20

I am actually currently working on developing these ideas further, and I expect to relatively soon be able to put out some material on this (modulo the fact that I have to finish my PhD thesis first).

I also think that you in practice probably would have to allow some uninterpretable components to maintain competitive performance, at least in some domains. One reason for this is of course that there simply might not be any interpretable computer program which solves the given task (*). Moreover, even if such a program does exist, it may plausibly be infeasibly difficult to find (even with the help of powerful AI systems). However, some black-box components might be acceptable (depending on how the AI is used, etc), and it seems like partial successes would be useful even if the full version of the problem isn't solved (at least under the assumption that interpretability is useful, even if the full version of interpretability isn't solved).

I also think there is good reason to believe that quite a lot of the cognition that humans are capable of can be carried out by interpretable programs. For example, any problem where you can "explain your thought process" or "justify your answer" is probably (mostly) in this category. I also don't think that operations of the form "do X, because on average, this works well" necessarily are problematic, provided that "X" itself can be understood. Humans give each other advice like this all the time. For example, consider a recommendation like "when solving a maze, it's often a good idea to start from the end". I would say that this is interpretable, even without a deeper justification for why this is a good thing to do. At the end of the day, all knowledge must (in some way) be grounded in statistical regularities. If you ask a sequence of "why"-questions, you must eventually hit a point where you are no longer able to answer. As long as the resulting model itself can be understood and reasoned about, I think we should consider this to be a success. This also means that problems that can be solved by a large ensemble of simple heuristics arguably are fine, provided that the heuristics themselves are intelligible.

(*) It is also not fully clear to me if it even makes sense to say that a task can't be solved by an interpretable program. On an intuitive level, this seems to make sense. However, I'm not able to map this statement onto any kind of formal claim. Would it imply that there are things which are outside the reach of science? I consider it to at least be a live possibility that anything can be made interpretable.

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

Joar Skalse1y10

You can imagine different types of world models, going from very simple ones to very detailed ones. In a sense, you could perhaps think of the assumption that the input distribution is i.i.d. as a "world model". However, what is imagined is generally something that is much more detailed than this. More useful safety specifications would require world models that (to some extent) describe the physics of the environment of the AI (perhaps including human behaviour, though it would probably be better if this can be avoided). More detail about what the world model would need to do, and how such a world model may be created, is discussed in Section 3.2. My personal opinion is that the creation of such a world model probably would be challenging, but not more challenging than the problems encountered in other alignment research paths (such as mechanistic interpretability, etc). Also note that you can obtain guarantees without assuming that the world model is entirely accurate. For example, consider the guarantees that are derived in cryptography, or the guarantees derived from formal verification of airplane controllers, etc. You could also monitor the environment of the AI at runtime to look for signs that the world model is inaccurate in a certain situation, and if such signs are detected, transition the AI to a safe mode where it can be disabled.

My Criticism of Singular Learning Theory

Joar Skalse1y*10

If a universality statement like the above holds for neural networks, it would tell us that most of the details of the parameter-function map are irrelevant.

I suppose this depends on what you mean by "most". DNNs and CNNs have noticeable and meaningful differences in their (macroscopic) generalisation behaviour, and these differences are due to their parameter-function map. This is also true of LSTMs vs transformers, and so on. I think it's fairly likely that these kinds of differences could have a large impact on the probability that a given type of model will learn to exhibit goal-directed behaviour in a given training setup, for example.

The ambitious statement here might be that all the relevant information you might care about (in terms of understanding universality) are already contained in the loss landscape.

Do you mean the loss landscape in the limit of infinite data, or the loss landscape for a "small" amount of data? In the former case, the loss landscape determines the parameter-function map over the data distribution. In the latter case, my guess would be that the statement probably is false (though I'm not sure).

EDIT: What I wrote here is wrong; the loss landscape does not determine the parameter-function map even in the limit of infinite data (except if we consider a binary classification problem without noise, and consider the loss for each parameter assignment and input with support under the data distribution).

My Criticism of Singular Learning Theory

Joar Skalse1y*20

I'm not sure, but I think this example is pathological.

Yes, it's artificial and cherry-picked to make a certain rhetorical point as simply as possible.

This is the more relevant and interesting kind of symmetry, and it's easier to see what this kind of symmetry has to do with functional simplicity: simpler functions have more local degeneracies.¨

This is probably true for neural networks in particular, but mathematically speaking, it completely depends on how you parameterise the functions. You can create a parameterisation in which this is not true.

You can make the same critique of Kolmogorov complexity.

Yes, I have been using "Kolmogorov complexity" in a somewhat loose way here.

Wild conjecture: [...]

Is this not satisfied trivially due to the fact that the RLCT has a certain maximum and minimum value within each model class? (If we stick to the assumption that is compact, etc.)

My Criticism of Singular Learning Theory

Joar Skalse1y10

The assumption that small neural networks are a good match for the actual data generating process of the world, is equivalent to the assumption that neural networks have an inductive bias that gives large weight to the actual data generating process of the world, if we also append the claim that neural networks have an inductive bias that gives large weight to functions which can be described by small neural networks (and this latter claim is not too difficult to justify, I think).

My Criticism of Singular Learning Theory

Joar Skalse1y10

Does this not essentially amount to just assuming that the inductive bias of neural networks in fact matches the prior that we (as humans) have about the world?

This is basically a justification of something like your point 1, but AFAICT it's closer to a proof in the SLT setting than in your setting.

I think it could probably be turned into a proof in either setting, at least if we are allowed to help ourselves to assumptions like "the ground truth function is generated by a small neural net" and "learning is done in a Bayesian way", etc.

My Criticism of Singular Learning Theory

Joar Skalse1y15

That's interesting, thank you for this!

How Smart Are Humans?

Joar Skalse2y*21

I think the broad strokes are mostly similar, but that a bunch of relevant details are different.

Yes, a large collective of near-human AI that is allowed to interact freely over a (subjectively) long period of time is presumably roughly as hard to understand and control as a Bostrom/Yudkowsky-esque God in a box. However, in this scenario, we have the option to not allow free interaction between multiple instances, while still being able to extract useful work from them. It is also probably much easier to align a system that is not of overwhelming intelligence, and this could be done before the AIs are allowed to interact. We might also be able to significantly influence their collective behaviour by controlling the initial conditions of their interactions (similarly to how institutions and cultural norms have a substantial long-term impact on the trajectory of a country, for example). It is also more plausible that humans (or human simulations or emulations) could be kept in the loop for a long time period in this scenario. Moreover, if intelligence is bottle-necked by external resources (such as memory, data, CPU cycles, etc) rather than internal algorithmic efficiency, then you can exert more control over the resulting intelligence explosion by controlling those resources. Etc etc.