I do think that "we don't have enough information to know where the bottlenecks are yet" is in-general a reasonable counterargument to a "just focus on the bottlenecks" approach (insofar as we in fact do not yet have enough information). In this case I think we do have enough information, so that's perhaps a deeper crux.
quick analogy: if the sum of a bunch of numbers is large, there doesn’t need to be any individual number that is large; similarly, if the consequences of a sequence of actions result in a large change, no individual action needs to be “pivotal”
This feels like a pretty central cruxy point - and not just for the relevance of the pivotal act framing specifically. I think it's underlying a whole difference of worldview or problem-solving approach.
A couple other points in a similar direction:
I would say that your mindset, when approaching AI safety, seems to embody an epsilon fallacy.
Sure, in principle a sum of numbers can be large without any individual number being large. In practice, the 80/20 rule is a thing, and everything has bottlenecks all the time. If work is not addressing a bottleneck, then it's approximately useless.
(Somewhat more precisely: if marginal work is not addressing something which is a bottleneck on current margins, then it's approximately useless.)
Of importance, tractability, and neglectedness, importance is the most important. In practice, it is usually better to have a thousand people trying to solve a major bottleneck, each with a low chance of success, than a thousand people making tractable progress on some neglected issue which is not a bottleneck.
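To make that concrete, here's a toy expected-value comparison in Python. The specific numbers (the value of the bottleneck, the per-person success odds, the value of non-bottleneck progress) are made-up assumptions purely for illustration, not claims about any real project:

```python
# Toy comparison: 1000 people attack a bottleneck (low individual success odds,
# high value if anyone succeeds) vs. 1000 people making reliable progress on a
# non-bottleneck (each contribution succeeds, but is worth little).
# All numbers below are illustrative assumptions.

n_people = 1000

# Bottleneck: worth 100 "units" if solved; each person independently has a 0.3% chance.
p_individual = 0.003
value_if_solved = 100.0
p_someone_solves = 1 - (1 - p_individual) ** n_people
ev_bottleneck = p_someone_solves * value_if_solved

# Non-bottleneck: each person reliably contributes 0.01 units of value.
ev_non_bottleneck = n_people * 0.01

print(f"P(bottleneck solved by someone) = {p_someone_solves:.2f}")   # ~0.95
print(f"EV of bottleneck strategy       = {ev_bottleneck:.1f}")      # ~95
print(f"EV of non-bottleneck strategy   = {ev_non_bottleneck:.1f}")  # 10
```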
Note to future self: Li & Vitanyi use ϵ for the empty string, which makes this post confusing for people who are used to that notation.
You can show that, in order for an agent to persist, it needs to have the capacity to observe and learn about its environment. The math is more complex than I want to get into here...
Do you have a citation for this? I went looking for the supposed math behind that claim a couple years back, and found one section of one Friston paper which had an example system which did not obviously generalize particularly well, and also used a kinda-hand-wavy notion of "Markov blanket" that didn't make it clear what precisely was being conditioned on (a critique which I would extend to all of the examples you list). And that was it; hundreds of excited citations chained back to that one spot. If anybody's written an actual explanation and/or proof somewhere, that would be great.
This is particularly interesting if we take and to be two different models, and take the indices 1, 2 to be different values of another random variable with distribution given by . In that case, the above inequality becomes:
Note to self: this assumes P[Y] = Q[Y].
I wasn't imagining that the human knew the best answer to any given subproblem, but nonetheless that did flesh out a lot more of what it means (under your mental model) for a human to "understand a subproblem", so that was useful.
I'll try again:
I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it's missing the more central application of these techniques: situations where the AIs are solving many subproblems, where humans would eventually understand how well the AI's plan/action solves any particular subproblem if they spent a whole lot of time investigating it, but where that amount of time taken to oversee any subproblem is prohibitively large. In such cases, the point of scalable oversight is to allow them to oversee subproblems at a much lower cost in terms of human time--to push out the Pareto frontier of oversight quality vs cost.
(... and presumably an unstated piece here is that "understanding how well the AI's plan/action solves a particular subproblem" might include recursive steps like "here's a sub-sub-problem, assume the AI's actions do a decent job solving that one", where the human might not actually check the sub-sub-problem.)
Does that accurately express the intended message?
Based on this example and your other comment, it sounds like the intended claim of the post could be expressed as:
I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it's missing the more central application of these techniques: situations where the AIs are solving many subproblems, where humans would eventually understand any particular subproblem and its solution if they spent a whole lot of time investigating it, but where that amount of time taken to oversee any subproblem is prohibitively large. In such cases, the point of scalable oversight is to allow them to oversee subproblems at a much lower cost in terms of human time--to push out the Pareto frontier of oversight quality vs cost.
Does that accurately express the intended message?
... situations where the AIs are taking many actions, where humans would eventually understand any particular action if they spent a whole lot of time investigating it...
Can you give an example (toy example is fine) of this?
Mostly I'm confused what it would even mean to understand an action. Like, if I imagine a maze-solving AI, and I see it turn left at a particular spot (or plan to turn left), I'm not sure what it would even look like to "understand" that left-turn separate from understanding its whole maze-plan.
One example: you know that thing where I point at a cow and say "cow", and then the toddler next to me points at another cow and is like "cow?", and I nod and smile? That's the thing we want to understand. How the heck does the toddler manage to correctly point at a second cow, on their first try, with only one example of me saying "cow"? (Note that same question still applies if they take a few tries, or have heard me use the word a few times.)
The post basically says that the toddler does a bunch of unsupervised structure learning, and then has a relatively small set of candidate targets, so when they hear the word once they can assign the word to the appropriate structure. And then we're interested in questions like "what are those structures?", and interpretability helps narrow down the possibilities for what those structures could be.
... and I don't think I've yet fully articulated the general version of the problem here, but the cow example is at least one case where "just take the magic box to be the identity function" fails to answer our question.
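For concreteness, here's a minimal sketch of the "unsupervised structure learning first, then attach the word" story, with the structure learning stood in for by a toy k-means clustering over made-up 2-D features. The features, the clustering method, and all the numbers are illustrative assumptions, not a claim about what the actual structures are:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unlabeled observations: three latent kinds of objects (say cows, cars, trees),
# each generating 2-D feature vectors around its own center. Purely illustrative.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
X = np.concatenate([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Step 1: unsupervised structure learning, stood in for here by plain k-means.
# No labels are used; the learner just discovers a small set of candidate clusters.
def kmeans(X, k, iters=50):
    cents = X[:: len(X) // k][:k].astype(float)  # crude deterministic init for the toy data
    for _ in range(iters):
        assign = np.argmin(((X[:, None, :] - cents[None, :, :]) ** 2).sum(-1), axis=1)
        cents = np.array([X[assign == j].mean(axis=0) for j in range(k)])
    return cents

cluster_centers = kmeans(X, k=3)

# Step 2: a single labeled example ("that's a cow") attaches the word to one cluster.
labeled_cow = centers[0] + rng.normal(scale=0.5, size=2)
cow_cluster = np.argmin(((cluster_centers - labeled_cow) ** 2).sum(-1))

# Step 3: a brand-new object gets called "cow" iff it lands in the same cluster;
# one example suffices because the heavy lifting happened before the label arrived.
new_object = centers[0] + rng.normal(scale=0.5, size=2)
says_cow = np.argmin(((cluster_centers - new_object) ** 2).sum(-1)) == cow_cluster
print("points at the new object and says 'cow?':", bool(says_cow))
```

The point of the sketch is just that the one-shot word learning is easy once the unsupervised step has already carved the world into a few candidate structures; the open question is what the real structures and learning process are.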
Here's a new Bookkeeping Theorem, which unifies all of the Bookkeeping Rules mentioned (but mostly not proven) in the post, as well as all possible other Bookkeeping Rules.
If all distributions which factor over Bayes net G1 also factor over Bayes net G2, then all distributions which approximately factor over G1 also approximately factor over G2. Quantitatively:
$$D_{KL}\left(P[X] \,\middle\|\, \prod_i P[X_i \mid X_{pa_1(i)}]\right) \;\ge\; D_{KL}\left(P[X] \,\middle\|\, \prod_i P[X_i \mid X_{pa_2(i)}]\right)$$
where $pa_j(i)$ denotes the parents of variable $i$ in $G_j$.
Proof: Define the distribution $Q[X] := \prod_i P[X_i \mid X_{pa_1(i)}]$. Since $Q[X]$ exactly factors over $G_1$, it also exactly factors over $G_2$: $Q[X] = \prod_i Q[X_i \mid X_{pa_2(i)}]$. So
$$D_{KL}\left(P[X] \,\middle\|\, \prod_i P[X_i \mid X_{pa_1(i)}]\right) = D_{KL}\left(P[X] \,\middle\|\, Q[X]\right) = D_{KL}\left(P[X] \,\middle\|\, \prod_i Q[X_i \mid X_{pa_2(i)}]\right)$$
Then by the factorization transfer rule (from the post):
$$\ge D_{KL}\left(P[X] \,\middle\|\, \prod_i P[X_i \mid X_{pa_2(i)}]\right)$$
which completes the proof.
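As a quick sanity check (not from the post), here's a numerical illustration with an arbitrary choice of graphs and distribution: take $G_1$ to be the empty graph on three binary variables and $G_2$ the chain $X_1 \to X_2 \to X_3$. Any distribution which factors over the empty graph is a product of independent marginals, so it also factors over the chain, and the theorem then predicts the KL divergence to the $G_1$ projection is at least the KL divergence to the $G_2$ projection:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """D_KL(p || q) for two joint distributions given as arrays of the same shape."""
    p, q = p.ravel(), q.ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# An arbitrary strictly positive joint distribution P over three binary variables.
P = rng.random((2, 2, 2))
P /= P.sum()

# Marginals.
P1, P2, P3 = P.sum((1, 2)), P.sum((0, 2)), P.sum((0, 1))
P12, P23 = P.sum(2), P.sum(0)

# Projection onto G1 (empty graph): P[X1] P[X2] P[X3].
proj_G1 = P1[:, None, None] * P2[None, :, None] * P3[None, None, :]

# Projection onto G2 (chain X1 -> X2 -> X3): P[X1] P[X2|X1] P[X3|X2] = P[X1,X2] P[X3|X2].
cond_X3_given_X2 = P23 / P2[:, None]               # indexed [x2, x3]
proj_G2 = P12[:, :, None] * cond_X3_given_X2[None, :, :]

d1 = kl(P, proj_G1)   # D_KL(P || prod_i P[Xi | pa_1(i)])
d2 = kl(P, proj_G2)   # D_KL(P || prod_i P[Xi | pa_2(i)])
print(f"KL to G1 projection = {d1:.4f}")
print(f"KL to G2 projection = {d2:.4f}")
assert d1 >= d2 - 1e-12   # the Bookkeeping Theorem's inequality
```

For this particular pair of graphs the gap works out to $I(X_1;X_2) + I(X_2;X_3)$, which is manifestly nonnegative, so the numerical check is just confirming what the theorem says it should.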