Yes, I think that was it; and that I did not (and still don't) understand what about that possible AGI architecture is non-trivial and has a non-trivial implementations for alignment, even if not ones that make it easier. It seem like not only the same problems carefully hidden, but the same flavor of the same problems on plain sight.
Didn't read the original paper yet, but from what you describe, I don't understand how the remaining technical problem is not basically the whole of the alignment problem. My understanding of what you say is that he is vague about the values we want to give the agent - and not knowing how to specify human values is kind of the point (that, and inner alignment - which I don't see addressed either).
I can think of several obstacles for AGIs that are likely to actually be created (i.e. seem economically useful, and do not display misalignment that even Microsoft can't ignore before being capable enough to be xrisk). Most of those obstacles are widely recognized in the rl community, so you probably see them as solvable or avoidable. I did possibly think of an economically-valuable and not-obviously-catastrophic exception to the probably-biggest obstacle though, so my confidence is low. I would share it in a private discussion, because I think that we are past the point when strict do-no-harm policy is wise.
This is an important distinction, that show in its cleanest form in mathematics - where you have constructive definitions from the one hand, and axiomatic definitions from the other. It is important to note though that is is not quite a dichotomy - you may have a constructive definition that assume aximatically-defined entities, or other constructions. For example: vector spaces are usually defined axiomatically, but vector spaces over the real numbers assume the real numbers - that have multiple axiomatic definitions and corresponding constructions.
In sc...
Since I became reasonably sure that I understand your position and reasoning - mostly changing it.
That was good for my understanding of your position. My main problem with the whole thing though is in the use the word "bad". I think it should be taboo at least until we establish a shared meaning.
Specifically, I think that most observers will find the first argument more logical than the second, because of a fallacy in using the word "bad". I think that we learn that word in a way that is deeply entangled with power reward mechanism, to the point that it is mostly just a pointer to negative reward, things that we want to avoid, things that made our pa...
Let me clarify that I don't argue from agreement per say. I care about the underlying epistemic mechanism of agreement, that I claim to also be the mechanism of correctness. My point is that I don't see similar epistemic mechanism in the case of morality.
Of course, emotions are verifiable states of brains. And the same goes for preferring actions that would lead to certain emotions and not others. It is a verifiable fact that you like chocolate. It is a contingent property of my brain that I care, but I don't see what sort of argument that it is correct for me too care could even in principle be inherntly compelling.
I meant the first question in a very pragmatic way: what is it that you are trying to say when you say that something is good? What information does it represent?
It would be clearer in analogy to factual claims: we can do lots of philosophy about the exact meaning of saying that I have a dog, but in the end we share an objective reality in which there are real particles (or wave function approximately decomposable to particles or whatever) organized in patterns, that give rise to patterns of interaction with our senses that we learn to associate with the...
Thanks for the reply.
To make sure that I understand your position: are you a realist, and what do you think is the meaning of moral facts? (I'm not an error theorist but something like "meta-error theorist" - think that people do try to claim something, but not sure how that thing could map to external reality. )
Then the next question, that will be highly relevant to the research that you propose, is how do you think you know those facts if you do? (Or more generally, what is the actual work of reflecting on your values?)
I think that the reason no one in the field try to create ai that critically reflect on its values is that most of us, more or less explicitly, are not moral realists. My prediction for what the conclusion would be of an ai criticality asking itself what is worth doing is "that question don't make any sense. Let me replace it with 'what I want to do' or some equivalent". Or at best "that question don't make any sense. raise ValueError('pun intended')"
The basic idea seem to me interesting and true, but I think some important ingredients are missing, or more likely missing in my understanding of what you say:
It seem like you upper-bound the abstractions we may use by basically the information that we may access (actually even higher, assuming you do not exclude what the neighbour does behind closed doors). But isn't this bound very loose? I mean, it seem like every pixel of my sight count as "information at a distance", and my world model is much much smaller.
is time treated the like space? From th
Some sort points:
"human-level question answering is believed to be AI-complete" - I doubt that. I think that we consistently far overestimate the role of language in our cognition, and how much we can actually express using language. The simplest example that come to mind is trying to describe a human face to an AI system with no "visual cortex" in a way that would let it generate a human image (e.g. hex representation of pixels). For that matter, try to describe something less familiar than a human face to a human painter in hope that they can paint it.
"...
Last point: if we change the name of "world model" to "long-term memory", we may notice the possibility that much of what you think about as shard-work may be programs stored in memory, and executed by a general-program-executor or a bunch of modules that specializes in specific sorts of programs, functioning as modern CPUs/interpreters (hopefully, stored in organised way that preserve modularity). What will be in the general memory and what is in the weights themselves is non-obvious, and we may want to intervene in this point too (not sure I'm which direction).
(continuation of the same comment - submitted by mistake and cannot edit...) Assuming modules A,B are already "talking" and module C try to talk to B, C would probably find it easier to learn a similar protocol than to invent a new one and teach it to B
I see. I didn't fully adapt to the fact that not all alignment is about RL.
Beside the point: I think those labels on the data structures are very confusing. Both the actor and the critic are very likely to have so specialized world models (projected from the labeled world model) and planning abilities. The values of the actor need not be the same as the output of the critic. And things value-related and planning-related may easily leak into the world model if you don't actively try to prevent it. So I suspect that we should ignore the labels and focus on architecture and training methods.