An important thing that the AGI alignment field never understood:
Reflective stability. Everyone thinks it's about, like, getting guarantees, or something. Or about rationality and optimality and decision theory, or something. Or about how we should understand ideal agency, or something.
But what I think people haven't understood is
There's exceptions and nuances and possible escape routes. And the older Yudkowsky-led research about decision theory and tiling and reflective probability is relevant. But this basic argument is in some sense simpler (less advanced, but also more radical ("at the root")) than those essays. The response to the failure of those essays can't just be to "try something else about alignment"; the basic problem is still there and has to be addressed.
(related elaboration: https://tsvibt.blogspot.com/2023/01/a-strong-mind-continues-its-trajectory.html https://tsvibt.blogspot.com/2023/01/the-voyage-of-novelty.html )
Probabilities on summary events like this are mostly pretty pointless. You're throwing together a bunch of different questions, about which you have very different knowledge states (including how much and how often you should update about them).
IDK if this is a crux for me thinking this is very relevant to stuff on my perspective, but:
The training procedure you propose doesn't seem to actually incentivize indifference. First, a toy model where I agree it does incentivize that:
On the first time step, the agent gets a choice: choose a number 1--N. If the agent says k, then the agent has nothing at all to do for the first k steps, after which some game G starts. (Each play of G is i.i.d., not related to k.)
So this agent is indeed incentivized to pick k uniformly at random from 1--N. Now consider:
The agent is in a rich world. There are many complex multi-step plans to incentivize agent to learn problem-solving. Each episode, at time N, the agent gets to choose: end now, or play 10 more steps.
Does this incentivize random choice at time N? No. It incentivizes the agent to choose randomly End or Continue at the very beginning of the episode, and then carefully plan and execute behavior that acheives the most reward assuming a run of length N or N+10 respectively.
Wait, but isn't this success? Didn't we make the agent have no trajectory length preference?
No. Suppose:
Same as before, but now there's a little guy standing by the End/Continue button. Sometimes he likes to press button randomly.
Do we kill the guy? Yes we certainly do, he will mess up our careful plans.
I'm not sure I understand your question at all, sorry. I'll say my interpretation and then answer that. You might be asking:
Is the point of the essay summed up by saying: " "Thing=Nexus" is not mechanistic/physicalist, but it's still useful; in general, explanations can be non-mechanistic etc., but still be useful, perhaps by giving a functional definition of something."?
My answer is no, that doesn't sum up the essay. The essay makes these claims:
I did fail to list "functional" in my list of "foundational directions", so thanks for bringing it up. What I say about foundational directions would also apply to "functional".
As you can see, the failures lie on a spectrum, and they're model-dependent to boot.
And we can go further and say that the failures lie in a high-dimensional space, and that the apparent tradeoff is more a matter of finding the directions in which to pull the rope sideways. Propagating constraints between concepts and propositions is a way to go that seems hopeworthy to me. One wants to notice commonalities in how each of one's plans are doomed, and then address the common blockers / missing ideas. In other words, recurse to the "abstract" as much as is called for, even if you get really abstract; but treat [abstracting more than what you can directly see/feel as being demanded by your thinking] as a risky investment with opportunity cost.
Thanks for writing this and engaging in the comments. "Humans/humanity offer the only real GI data, so far" is a basic piece of my worldview and it's nice to have a reference post explaining something like that.
I roughly agree. As I mentioned to Adele, I think you could get sort of lame edge cases where the LLM kinda helped find a new concept. The thing that would make me think the end is substantially nigher is if you get a model that's making new concepts of comparable quality at a comparable rate to a human scientist in a domain in need of concepts.
if you nail some Chris Olah style transparency work
Yeah that seems right. I'm not sure what you mean by "about language". Sorta plausibly you could learn a little something new about some non-language domain that the LLM has seen a bunch of data about, if you got interpretability going pretty well. In other words, I would guess that LLMs already do lots of interesting compression in a different way than humans do it, and maybe you could extract some of that. My quasi-prediction would be that those concepts
Ok yeah I agree with this. Related: https://tsvibt.blogspot.com/2023/09/the-cosmopolitan-leviathan-enthymeme.html#pointing-at-reality-through-novelty
And an excerpt from a work in progress:
For example, I reach out and pick up some blueberries. This is some kind of expression of my values, but how so? Where are the values?
Are the values in my hands? Are they entirely in my hands, or not at all in my hands? The circuits that control my hands do what they do with regard to blueberries by virtue of my hands being the way they are. If my hands were different, e.g. really small or polydactylous, my hand-controller circuits would be different and would behave differently when getting blueberries. And the deeper circuits that coordinate visual recognition of blueberries, and the deeper circuits that coordinate the whole blueberry-getting system and correct errors based on blueberrywise success or failure, would also be different. Are the values in my visual cortext? The deeper circuits require some interface with my visual cortex, to do blueberry find-and-pick-upping. And having served that role, my visual cortex is specially trained for that task, and it will even promote blueberries in my visual field to my attention more readily than yours will to you. And my spatial memory has a nearest-blueberries slot, like those people who always know which direction is north.
It may be objected that the proximal hand-controllers and the blueberry visual circuits are downstream of other deeper circuits, and since they are downstream, they can be excluded from constituting the value. But that's not so clear. To like blueberries, I have to know what blueberries are, and to know what blueberries are I have to interact with them. The fact that I value blueberries relies on me being able to refer to blueberries. Certainly, if my hands were different but comparably versatile, then I would learn to use them to refer to blueberries about as well as my real hands do. But the reference to (and hence the value of) blueberries must pass through something playing the role that hands play. The hands, or something else, must play that role in constituting the fact that I value blueberries.
In general, values are founded on reference. The context that makes a value a value has to provide reference.
The situation is like how an abstract concept, once gained, doesn't overwrite and obselete what was abstracted from. Maxwell's equations don't annihilate Faraday's experiments in their detail. The experiments are unified in idea--metaphorically, the field structures are a "cross-section" of the messy detailed structure of any given experiment. The abstract concepts, to say something about a specific concrete experimental situation, have to be paired with specific concrete calculations and referential connections. The concrete situations are still there, even if we now, with our new abstract concepts, want to describe them differently.
If so, then values are essentially diasystemic.
Reference goes through unfolding.
To refer to something in reality is to be brought (or rather, bringable) to the thing. To be brought to a thing is to go to where the thing really is, through whatever medium is between the mind and where the thing really is. The "really is" calls on future novelty. See "pointing at reality through novelty".
In other words, reference is open--maybe radically open. It's supposed to incorporate whatever novelty the mind encounters--maybe deeply.
An open element can't be strongly endosystemic.
An open element will potentially relate to (radical, diasystemic) novelty, so its way of relating to other elements can't be fully stereotyped by preexisting elements with their preexisting manifest relations.
It's definitely like symbol grounding, though symbol grounding is usually IIUC about "giving meaning to symbols", which I think has the emphasis on epistemic signifying?
Well, this would be the lone crux. The rest of the stuff you wrote is about non-exploding AI, right? And is therefore irrelevant to the thing about everyone dying, except insofar as controlled non-exploding AI can help prevent uncontrolled exploding AI from killing everyone?