johnswentworth

Here's a new Bookkeeping Theorem, which unifies all of the Bookkeeping Rules mentioned (but mostly not proven) in the post, as well as all possible other Bookkeeping Rules.

If all distributions which factor over Bayes net $G_1$ also factor over Bayes net $G_2$, then all distributions which approximately factor over $G_1$ also approximately factor over $G_2$. Quantitatively:

$$D_{KL}\!\left(P[X]\,\middle\|\,\prod_i P[X_i|X_{\text{pa}_2(i)}]\right) \le D_{KL}\!\left(P[X]\,\middle\|\,\prod_i P[X_i|X_{\text{pa}_1(i)}]\right)$$

where $\text{pa}_k(i)$ indicates the parents of variable $i$ in $G_k$.

Proof: Define the distribution $Q[X] := \prod_i P[X_i|X_{\text{pa}_1(i)}]$. Since $Q$ exactly factors over $G_1$, it also exactly factors over $G_2$. So

$$Q[X] = \prod_i Q[X_i|X_{\text{pa}_2(i)}]$$

Then by the factorization transfer rule (from the post):

$$D_{KL}\!\left(P[X]\,\middle\|\,\prod_i P[X_i|X_{\text{pa}_2(i)}]\right) \le D_{KL}(P[X]\,\|\,Q[X]) = D_{KL}\!\left(P[X]\,\middle\|\,\prod_i P[X_i|X_{\text{pa}_1(i)}]\right)$$

which completes the proof.
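
(A quick numerical sanity check of that inequality, in the simplest nontrivial case: $G_1$ is $X_1 \to X_2$ with $X_3$ unconnected, and $G_2$ adds the arrow $X_2 \to X_3$, so everything which factors over $G_1$ also factors over $G_2$. The code below is just an illustrative sketch under those assumptions, not anything from the post.)

```python
# Small numerical sanity check of the Bookkeeping Theorem's inequality.
# G1 is the net X1 -> X2 with X3 unconnected; G2 adds the arrow X2 -> X3.
# Every distribution factoring over G1 also factors over G2, so the theorem
# says the approximation error (KL) over G2 is at most the error over G1.
import numpy as np

rng = np.random.default_rng(0)

# Random joint distribution P over three binary variables (X1, X2, X3).
P = rng.random((2, 2, 2))
P /= P.sum()

def kl(p, q):
    """D_KL(p || q) for two joint distributions given as arrays."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def factorized(P, parents):
    """prod_i P[X_i | X_pa(i)] for the given parent sets, as a joint array."""
    Q = np.ones_like(P)
    for i, pa in enumerate(parents):
        keep = set(pa) | {i}
        drop = tuple(ax for ax in range(P.ndim) if ax not in keep)
        joint = P.sum(axis=drop, keepdims=True)          # marginal of X_i and its parents
        parent_marg = joint.sum(axis=i, keepdims=True)   # marginal of the parents alone
        Q = Q * (joint / parent_marg)                    # multiply in P[X_i | X_pa(i)]
    return Q

G1 = [(), (0,), ()]      # parent sets in G1
G2 = [(), (0,), (1,)]    # parent sets in G2 (supersets of G1's)

eps1 = kl(P, factorized(P, G1))  # how far P is from factoring over G1
eps2 = kl(P, factorized(P, G2))  # how far P is from factoring over G2

print(eps2, "<=", eps1)
assert eps2 <= eps1 + 1e-12
```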

I do think that "we don't have enough information to know where the bottlenecks are yet" is, in general, a reasonable counterargument to a "just focus on the bottlenecks" approach (insofar as we in fact do not yet have enough information). In this case I think we do have enough information, so that's perhaps a deeper crux.

quick analogy: if the sum of a bunch of numbers is large, there doesn’t need to be any individual number that is large; similarly, if the consequences of a sequence of actions results in a large change, no individual action needs to be “pivotal”

This feels like a pretty central cruxy point - and not just for the relevance of the pivotal act framing specifically. I think it's underlying a whole difference of worldview or problem-solving approach.

A couple other points in a similar direction:

  • A thing I noticed in our discussion on the model delta with Christiano post: your criterion for useful AI safety work seems to be roughly "this will contribute some marginal value" as opposed to "this will address a bottleneck".
  • Right at the top of this thread, you say: "I think Redwood Research’s recent work on AI control really 'hits it out of the park', and they have identified a tractable and neglected intervention that can make AI go a lot better". Note what's conspicuously missing there: tractable and neglected, but you don't claim importance.

I would say that your mindset, when approaching AI safety, seems to be an instance of the epsilon fallacy.

Sure, in principle a sum of numbers can be large without any individual number being large. In practice, the 80/20 rule is a thing, and everything has bottlenecks all the time. If work is not addressing a bottleneck, then it's approximately useless.

(Somewhat more precisely: if marginal work is not addressing something which is a bottleneck on current margins, then it's approximately useless.)

Of importance, tractability, and neglectedness, importance matters most. In practice, it is usually better to have a thousand people trying to solve a major bottleneck, each with a low chance of success, than a thousand people making tractable progress on some neglected issue which is not a bottleneck.

Note to future self: Li & Vitanyi use $\epsilon$ for the empty string, which makes this post confusing for people who are used to that notation.

You can show that, in order for an agent to persist, it needs to have the capacity to observe and learn about its environment. The math is more complex than I want to get into here...

Do you have a citation for this? I went looking for the supposed math behind that claim a couple years back, and found one section of one Friston paper which had an example system which did not obviously generalize particularly well, and also used a kinda-hand-wavy notion of "Markov blanket" that didn't make it clear what precisely was being conditioned on (a critique which I would extend to all of the examples you list). And that was it; hundreds of excited citations chained back to that one spot. If anybody's written an actual explanation and/or proof somewhere, that would be great.

This is particularly interesting if we take  and  to be two different models, and take the indices 1, 2 to be different values of another random variable  with distribution  given by . In that case, the above inequality becomes:

Note to self: this assumes P[Y] = Q[Y].

I wasn't imagining that the human knew the best answer to any given subproblem, but nonetheless that did flesh out a lot more of what it means (under your mental model) for a human to "understand a subproblem", so that was useful.

I'll try again:

I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it's missing the more central application of these techniques: situations where the AIs are ~~taking many actions~~ solving many subproblems, where humans would eventually understand ~~any particular action~~ how well the AI's plan/action solves any particular subproblem if they spent a whole lot of time investigating it, but where that amount of time taken to oversee any ~~action~~ subproblem is prohibitively large. In such cases, the point of scalable oversight is to allow them to oversee ~~actions~~ subproblems at a much lower cost in terms of human time--to push out the Pareto frontier of oversight quality vs cost.

(... and presumably an unstated piece here is that "understanding how well the AI's plan/action solves a particular subproblem" might include recursive steps like "here's a sub-sub-problem, assume the AI's actions do a decent job solving that one", where the human might not actually check the sub-sub-problem.)

Does that accurately express the intended message?

Based on this example and your other comment, it sounds like the intended claim of the post could be expressed as:

I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it's missing the more central application of these techniques: situations where the AIs are ~~taking many actions~~ solving many subproblems, where humans would eventually understand ~~any particular action~~ any particular subproblem and its solution if they spent a whole lot of time investigating it, but where that amount of time taken to oversee any ~~action~~ subproblem is prohibitively large. In such cases, the point of scalable oversight is to allow them to oversee ~~actions~~ subproblems at a much lower cost in terms of human time--to push out the Pareto frontier of oversight quality vs cost.

Does that accurately express the intended message?

... situations where the AIs are taking many actions, where humans would eventually understand any particular action if they spent a whole lot of time investigating it...

Can you give an example (toy example is fine) of:

  • an action one might want to understand
  • what plan/strategy/other context that action is a part of
  • what it would look like for a human to understand the action

?

Mostly I'm confused what it would even mean to understand an action. Like, if I imagine a maze-solving AI, and I see it turn left at a particular spot (or plan to turn left), I'm not sure what it would even look like to "understand" that left-turn separate from understanding its whole maze-plan.

One example: you know that thing where I point at a cow and say "cow", and then the toddler next to me points at another cow and is like "cow?", and I nod and smile? That's the thing we want to understand. How the heck does the toddler manage to correctly point at a second cow, on their first try, with only one example of me saying "cow"? (Note that same question still applies if they take a few tries, or have heard me use the word a few times.)

The post basically says that the toddler does a bunch of unsupervised structure learning, and then has a relatively small set of candidate targets, so when they hear the word once they can assign the word to the appropriate structure. And then we're interested in questions like "what are those structures?", and interoperability helps narrow down the possibilities for what those structures could be.

... and I don't think I've yet fully articulated the general version of the problem here, but the cow example is at least one case where "just take the magic box to be the identity function" fails to answer our question.
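
(If it helps to have a concrete toy version of the "unsupervised structure learning, then one-shot word assignment" story: the sketch below is purely illustrative, using plain clustering as a stand-in for whatever structure learning the toddler actually does; none of the specifics are from the post.)

```python
# Toy illustration: unsupervised clustering stands in for the toddler's
# structure learning, and a single labeled example ("cow") then names an
# entire pre-existing cluster, so a second cow is recognized without any
# further labels.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Unlabeled "experience": three well-separated clusters of toy feature vectors,
# standing in for cows, cars, and trees the toddler has already seen.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.concatenate([c + 0.3 * rng.standard_normal((50, 2)) for c in centers])

# Step 1: unsupervised structure learning (here, plain k-means as a stand-in).
structures = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Step 2: one labeled example. I point at a single cow and say "cow"; the word
# gets attached to whichever learned structure that example falls into.
one_cow = centers[0] + 0.3 * rng.standard_normal(2)
cow_structure = structures.predict(one_cow.reshape(1, -1))[0]

# Step 3: the toddler points at a *different* cow and gets it right, because
# the word was assigned to an already-learned cluster, not memorized pixels.
another_cow = centers[0] + 0.3 * rng.standard_normal(2)
print(structures.predict(another_cow.reshape(1, -1))[0] == cow_structure)  # True
```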
