All of Bird Concept's Comments + Replies

(Nitpick: I'd find the first paragraphs would be much easier to read if they didn't have any of the bolding)

I think this is an application of a more general, very powerful principle of mechanism design: when cognitive labor is abundant, near omni-present surveillance becomes feasible. 

For domestic life, this is terrifying. 

But for some high stakes, arms race-style scenarios, it might have applications. 

Beyond what you metioned, I'm particularly interested in this being a game-changer for bilateral negotiation. Two parties make an agreement, consent to being monitored by an AI auditor, and verify that the auditor's design will communicate with the ... (read more)

2mako yass
Isn't it enough to constrain the output format to so that no steganographic leaks would be possible? Wont the counterparty usually be satisfied just with an hourly signal saying either "Something is wrong" (encompassing "Auditor saw a violation" / "no signal, the host has censored the auditor's report" / "invalid signal, the host has tampered with the auditor system" / "auditor has been blinded to the host's operations, or has ascertained that there are operations which the auditor cannot see") or "Auditor confirms that all systems are nominal and without violation."? The host can remain in control of their facilities, as long as the auditor is running on tamperproof hardware. It's difficult to prove that a physical device can't be tampered with, it may be possible to take some components of the auditor even further and run them in a zero knowledge virtual machine, which provides a cryptographic guarantee that the program wasn't tampered with, so long as you can make it lithe enough to fit (zero knowledge virtual machines currently run at a 10,000x slowdown, though I don't think specialized hardware for them is available yet, crypto may drive that work), though a ZKVM wont provide a guarantee that the inputs to the system aren't being controlled, the auditor is monitoring inputs of such complexity — either footage of the real world or logs of a large training run — that it may be able to prove algorithmically to itself that the sensory inputs weren't tampered with either and the algorithm does have a view into the real world (I'm contending that even large state actors could not create Descarte's evil demon).

Sidenote: I'm a bit confused by the name. The all caps makes it seem like an acronym. But it seems to not be? 

Reply1111

So, when a human lies over the course of an interaction, they'd be holding a hidden state in mind throughout. However, an LLM wouldn't carry any cognitive latent state over between telling the lie, and then responding to the elicitation question. I guess it feels more like "I just woke up from amnesia, and seems I have just told a lie. Okay, now what do I do..."

Stating this to:

  1. Verify that indeed this is how the paper works, and there's no particular way of passing latent state that I missed, and
  2. Any thoughts on how this affects the results and approach?
1JanB
  Yes, this is how the paper works.   Not really. I find the simulator framing is useful to think about this.

Curated.

There are really many things I found outstanding about this post. The key one, however, is that after reading this, I feel less confused when thinking about transformer language models. The post had that taste of deconfusion where many of the arguments are elegant, and simple; like suddenly tilting a bewildering shape into place. I particularly enjoyed the discussion of ways agency does and does not manifest within a simulator (multiple agents, irrational agents, non-agentic processes), the formulation of the prediction orthogonality thesis, ways i... (read more)

3janus
Thank you for this lovely comment. I'm pleasantly surprised that people were able to get so much out of it. As I wrote in the post, I wasn't sure if I'd ever get around to publishing the rest of the sequence, but the reception so far has caused me to bump up the priority of that.

If someone asks what the rock is optimizing, I’ll say “the actions” - i.e. the rock “wants” to do whatever it is that the rock in fact does.

This argument does not seem to me like it captures the reason a rock is not an optimiser? 

I would hand wave and say something like: 

"If you place a human into a messy room, you'll sometimes find that the room is cleaner afterwards. If you place a kid in front of a bowl of sweets, you'll soon find the sweets gone. These and other examples are pretty surprising state transitions, that would be highly unlikely i... (read more)

3johnswentworth
Exactly! That's an optimization-at-a-distance style intuition. The optimizer (e.g. human) optimizes things outside of itself, at some distance from itself. A rock can arguably be interpreted as optimizing itself, but that's not an interesting kind of "optimization", and the rock doesn't optimize anything outside itself. Throw it in a room, the room stays basically the same.

An update on this: sadly I underestimated how busy I would be after posting this bounty. I spent 2h reading this and Thomas post the other day, but didn't not manage to get into the headspace of evaluating the bounty (i.e. making my own interpretation of John's post, and then deciding whether Thomas' distillation captured that). So I will not be evaluating this. (Still happy to pay if someone else I trust claim Thomas' distillation was sufficient.) My apologies to John and Thomas about that.

Cool, I'll add $500 to the distillation bounty then, to be paid out to anyone you think did a fine job of distilling the thing :)  (Note: this should not be read as my monetary valuation for a day of John work!)

(Also, a cooler pay-out would be basis points, or less, of Wentworth impact equity)

2johnswentworth
Needing to judge submissions is the main reason I didn't offer a bounty myself. Read the distillation, and see if you yourself understand it. If "Coherence of Distributed Decisions With Different Inputs Implies Conditioning" makes sense as a description of the idea, then you've probably understood it. If you don't understand it after reading an attempted distillation, then it wasn't distilled well enough.

How long would it have taken you to do the distillation step yourself for this one? I'd be happy to post a bounty, but price depends a bit on that.

2johnswentworth
Short answer: about one full day. Longer answer: normally something like this would sit in my notebook for a while, only informing my own thinking. It would get written up as a post mainly if it were adjacent to something which came up in conversation (either on LW or in person). I would have the idea in my head from the conversation, already be thinking about how best to explain it, chew on it overnight, and then if I'm itching to produce something in the morning I'd bang out the post in about 3-4 hours. Alternative paths: I might need this idea as background for something else I'm writing up, or I might just be in a post-writing mood and not have anything more ready-to-go. In either of those cases, I'd be starting more from scratch, and it would take about a full day.

Jaan/Holden convo link is broken :(

Curated. 

I think this post strikes a really cool balance between discussing some foundational questions about the notion of agency and its importance, as well as posing a concrete puzzle that caused some interesting comments.

For me, Life is a domain that makes it natural to have reductionist intuitions. Compared to say neural networks, I find there are fewer biological metaphors or higher-level abstractions where you might sneak in mysterious answers that purport to solve the deeper questions. I'll consider this post next time I want to introduce some... (read more)

Here are prediction questions for the predictions that TurnTrout himself provided in the concluding post of the Reframing Impact sequence

1%
2%
3%
4%
5%
6%
7%
8%
9%
10%
11%
12%
13%
14%
15%
16%
17%
18%
19%
20%
21%
22%
23%
24%
25%
26%
27%
28%
29%
30%
31%
32%
33%
34%
35%
36%
37%
38%
39%
40%
41%
42%
43%
44%
45%
46%
47%
48%
49%
50%
51%
52%
53%
54%
55%
56%
57%
58%
59%
60%
61%
62%
63%
64%
65%
66%
67%
68%
69%
70%
71%
72%
73%
74%
75%
76%
77%
78%
79%
80%
81%
82%
83%
84%
85%
86%
87%
88%
89%
90%
91%
92%
93%
94%
95%
96%
97%
98%
99%
1%
Attainable Utility theory describes how people feel impacted
99%
1%
2%
3%
4%
5%
6%
7%
8%
9%
10%
11%
12%
13%
14%
15%
16%
17%
18%
19%
20%
21%
22%
23%
24%
25%
26%
27%
28%
29%
30%
31%
32%
33%
34%
35%
36%
37%
38%
39%
40%
41%
42%
43%
44%
45%
46%
47%
48%
49%
50%
51%
52%
53%
54%
55%
56%
57%
58%
59%
60%
61%
62%
63%
64%
65%
66%
67%
68%
69%
70%
71%
72%
73%
74%
75%
76%
77%
78%
79%
80%
81%
82%
83%
84%
85%
86%
87%
88%
89%
90%
91%
92%
93%
94%
95%
96%
97%
98%
99%
1%
Agents trained by powerful RL algorithms on arbitrary reward signals generally try to take over the world.
99%
1%
2%
3%
4%
5%
6%
7%
8%
9%
10%
11%
12%
13%
14%
15%
16%
17%
18%
19%
20%
21%
22%
23%
24%
25%
26%
27%
28%
29%
30%
31%
32%
33%
34%
35%
36%
37%
38%
39%
40%
41%
42%
43%
44%
45%
46%
47%
48%
49%
50%
51%
52%
53%
54%
55%
56%
57%
58%
59%
60%
61%
62%
63%
64%
65%
66%
67%
68%
69%
70%
71%
72%
73%
74%
75%
76%
77%
78%
79%
Alex Turner (70%),niplav (75%)
80%
81%
82%
83%
84%
85%
86%
87%
88%
89%
90%
91%
92%
93%
94%
95%
96%
97%
98%
99%
1%
The catastrophic convergence conjecture is true. That is, unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives.
99%
1%
2%
3%
4%
5%
6%
7%
8%
9%
10%
11%
12%
13%
14%
15%
16%
17%
18%
19%
20%
21%
22%
23%
24%
25%
26%
27%
28%
29%
30%
31%
32%
33%
34%
35%
36%
37%
38%
39%
40%
41%
42%
43%
44%
45%
46%
47%
48%
49%
50%
51%
52%
53%
54%
55%
56%
57%
58%
59%
60%
61%
62%
63%
64%
65%
66%
67%
68%
69%
niplav (67%)
70%
71%
72%
73%
74%
75%
76%
77%
78%
79%
80%
81%
82%
83%
84%
85%
86%
87%
88%
89%
90%
91%
92%
93%
94%
95%
96%
97%
98%
99%
1%
AUP_conceptual prevents catastrophe, assuming the catastrophic convergence conjecture.
99%
1%
2%
3%
4%
5%
6%
7%
8%
9%
10%
11%
12%
13%
14%
15%
16%
17%
18%
19%
20%
21%
22%
23%
24%
25%
26%
27%
28%
29%
30%
31%
32%
33%
34%
35%
36%
37%
38%
39%
40%
41%
42%
43%
44%
45%
46%
47%
48%
49%
50%
51%
52%
53%
54%
55%
56%
57%
58%
59%
60%
61%
62%
63%
64%
65%
66%
67%
68%
69%
70%
71%
72%
73%
74%
75%
76%
77%
78%
79%
niplav (72%)
80%
81%
82%
83%
84%
85%
86%
87%
88%
89%
90%
91%
92%
93%
94%
95%
96%
97%
98%
99%
1%
Some version of Attainable Utility Preservation solves side effect problems for an extremely wide class of real-world tasks and for subhuman agents.
99%
1%
2%
3%
4%
5%
6%
7%
8%
9%
10%
11%
12%
13%
14%
15%
16%
17%
18%
19%
20%
21%
22%
23%
24%
25%
26%
27%
28%
29%
30%
31%
32%
33%
34%
35%
36%
37%
38%
39%
40%
41%
42%
43%
44%
45%
46%
47%
48%
49%
50%
51%
52%
53%
54%
55%
56%
57%
58%
59%
niplav (55%)
60%
61%
62%
63%
64%
65%
66%
67%
68%
69%
70%
71%
72%
73%
74%
75%
76%
77%
78%
79%
80%
81%
82%
83%
84%
85%
86%
87%
88%
89%
90%
91%
92%
93%
94%
95%
96%
97%
98%
99%
1%
For the superhuman case, penalizing the agent for increasing its own Attainable Utility (AU) is better than penalizing the agent for increasing other AUs.
99%
... (read more)
1%
2%
3%
4%
5%
6%
7%
8%
9%
10%
11%
12%
13%
14%
15%
16%
17%
18%
19%
MinusGix (11%),tenthkrige (15%),Sapphire Alysa Star (15%),Zvi (17%)
20%
21%
22%
23%
24%
25%
26%
27%
28%
29%
30%
31%
32%
33%
34%
35%
36%
37%
38%
39%
40%
41%
42%
43%
44%
45%
46%
47%
48%
49%
philh (40%),Alex Turner (45%)
50%
51%
52%
53%
54%
55%
56%
57%
58%
59%
ejacob (50%)
60%
61%
62%
63%
64%
65%
66%
67%
68%
69%
70%
71%
72%
73%
74%
75%
76%
77%
78%
79%
XxX (73%),Garrett Baker (74%)
80%
81%
82%
83%
84%
85%
86%
87%
88%
89%
90%
91%
92%
93%
94%
95%
96%
97%
98%
99%
1%
The strategy-stealing assumption is "a good enough approximation that we can basically act as if it’s true". That is, for any strategy an unaligned AI could use to influence the long-run future, there is an analogous strategy that a similarly-sized group of humans can use in order to capture a similar amount of flexible influence over the future. By “flexible” is meant that humans can decide later what to do with that influence  (which is important since humans don’t yet know what we want in the long run).
99%

(You can find a list of all 2019 Review poll questions here.)