I know very little about this area, but I suspect that a writeup like this classic explanation of Gödel incompleteness might be a step in the right direction: Gödel incompleteness.
I meant this:
Shard Question: How does the human brain ensure alignment with its values, and how can we use that information to ensure the alignment of an AI with its designers' values?
which does indeed beg the question in the standard sense of the phrase.
My point is that there is very much no alignment between different values! They are independent at best and outright contradictory in many cases. The apparent coherence of values is an illusion, a post-hoc rationalization. The difference in values sometimes leads to catastrophic Fantasia-like outcomes on the margins (e.g. people w...
Value extrapolation is thus necessary for AI alignment. It is also almost sufficient, since it allows AIs to draw correct conclusions from imperfectly defined human data.
I am missing something... The idea of correctly extrapolating human values is basically the definition of Eliezer's original proposal, CEV. In fact, it's right there in the name. What is the progress over the last decade?
I'm confused... What you call the "Pure Reality" view seems to work just fine, no? (I think you had a different name for it, pure counterfactuals or something.) What do you need counterfactuals/Augmented Reality for? Presumably for making decisions thanks to "having a choice" in this framework, right? In the pure-reality framework, in the "student and the test" example, one would dispassionately calculate what kind of student algorithm passes the test, without talking about making a decision to study or not to study. Same with Newcomb's, of course: one just looks at what kind of agent ends up with a given payoff. So... why pick the AR view over the PR view, what's the benefit?
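To make the PR view concrete, here is a minimal sketch of what I mean for Newcomb's: no "decisions" anywhere, just enumerate the candidate agent algorithms and tally the payoff each one ends up with. The agent names and payoff numbers below are my own illustration, not anything canonical.

```python
# A minimal sketch of the "Pure Reality" view of Newcomb's problem:
# no "choices", just a dispassionate tally of which agent algorithm
# ends up with which payoff. The agents and payoff numbers are my
# own illustration, not anyone's canonical formalization.

def one_boxer():
    return "one-box"

def two_boxer():
    return "two-box"

def newcomb_payoff(agent):
    # Omega predicts by simply running the (deterministic) agent.
    prediction = agent()
    opaque_box = 1_000_000 if prediction == "one-box" else 0
    action = agent()
    if action == "one-box":
        return opaque_box
    # two-boxing: the opaque box plus the transparent $1000
    return opaque_box + 1_000

for agent in (one_boxer, two_boxer):
    print(agent.__name__, newcomb_payoff(agent))
# one_boxer 1000000
# two_boxer 1000
```

The "conclusion" that one-boxing algorithms do better just falls out of the tally; at no point does anyone choose anything.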
First, I really like this shift in thinking, partly because it moves the needle toward an anti-realist position, where you don't even need to postulate an external world (you probably don't see it that way, despite saying "Everything is a subjective preference evaluation").
Second, I wonder if you need an even stronger restriction, not just computable, but efficiently computable, given that it's the agent that is doing the computation, not some theoretical AIXI. This would probably also change "too easily" in "those e...
My answer is a rather standard compatibilist one: the algorithm in your brain produces the sensation of free will as an artifact of an optimization process.
There is nothing you can do about it (you are executing an algorithm, after all), but your subjective perception of free will may change as you interact with other algorithms, like me or Jessica or whoever. There aren't really any objective intentional "decisions", only our perception of them. The decision theories are therefore just byproducts of all these algorithms executing. It...
According to this SSC book review, "the secret of our success" is the ability to learn culture + the accumulated culture itself, which seems a bit broader than the ability to learn language + the language that you describe.
Right, that's the question. Sure, it is easy to state that "the metric must be a faithful representation of the target", but it never is, is it? From the point of view of double inversion, optimizing the target is a hard inverse problem because, as in your pizza example, the true "values" (pizza is a preference against the background of an otherwise balanced diet) are not easily observable. What would a double inverse be in this case? Maybe something like trying various amounts of pizza and getting feedback on enjoyment? That would match the long-division pattern. I'm not sure.
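For concreteness, here is a toy sketch of the guess-and-check loop I have in mind, with a completely made-up enjoyment function standing in for the unobservable "true values"; the bisection-style refinement is only meant to echo the long-division pattern, not to be an actual proposal.

```python
# Toy sketch of the guess-and-check ("long division") pattern:
# the true enjoyment function is hidden from the optimizer, which
# only gets feedback on the amounts it actually tries.
# The enjoyment function below is entirely made up for illustration.

def hidden_enjoyment(slices):
    # Pizza is great against the background of a balanced diet,
    # but enjoyment drops off past some point (peak at 3 slices here).
    return slices * (6 - slices)

def probe_for_optimum(lo=0.0, hi=10.0, steps=20):
    # Guess an amount, compare feedback on nearby amounts, refine the
    # guess -- analogous to guessing a digit in long division and checking it.
    for _ in range(steps):
        mid = (lo + hi) / 2
        if hidden_enjoyment(mid + 0.01) > hidden_enjoyment(mid - 0.01):
            lo = mid   # enjoyment still rising: try more
        else:
            hi = mid   # past the peak: try less
    return (lo + hi) / 2

print(round(probe_for_optimum(), 2))  # converges near 3 slices
```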
I agree that 4 needs to be taken seriously, as 1 and 2 are hard to succeed at without making a lot of progress on 4, and 3 is just a catch-all for every other approach. It is also the hardest, as it probably requires breaking a lot of new ground, so people tend to work on what appears solvable. I thought some people are working on it though, no? There is also a chance of proving that "An actual grounded definition of human preferences" is impossible in a self-consistent way, and we would have to figure out what to do in this case. The latter feels like a real possibility to me,
I still don't understand the whole deal about counterfactuals, exemplified by "If Oswald had not shot Kennedy, then someone else would have". Maybe MIRI means something else by counterfactuals?
If it's the counterfactual conditionals, then the approach is pretty simple, as discussed with jessicata elsewhere: there is the macrostate of the world (i.e. a state known to a specific observer, which consists of many possible substates, or microstates), one of these micr
...When you think about the problem this way, there are no counterfactuals, only state evolution. It can be applied to the past, the present, or the future.
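A toy sketch of what I mean, with invented microstates and an invented evolution rule: the observer's macrostate is just the set of microstates they cannot distinguish, and the Oswald counterfactual becomes "restrict to the compatible microstates and evolve each one forward".

```python
# Toy sketch of the macrostate/microstate picture of counterfactuals.
# A macrostate is the set of microstates the observer cannot tell apart;
# "if Oswald had not shot" just means: keep the compatible microstates
# and run the ordinary state evolution forward on each of them.
# The microstates and the evolution rule are invented for illustration.

from fractions import Fraction

# Each microstate records who, if anyone, was set up to shoot.
macrostate = [
    {"oswald_shoots": True,  "backup_shooter": False},
    {"oswald_shoots": True,  "backup_shooter": True},
    {"oswald_shoots": False, "backup_shooter": True},
    {"oswald_shoots": False, "backup_shooter": False},
]

def evolve(micro):
    # Ordinary forward evolution: Kennedy is shot iff some shooter acts.
    return micro["oswald_shoots"] or micro["backup_shooter"]

def probability(condition, outcome, states):
    compatible = [s for s in states if condition(s)]
    hits = [s for s in compatible if outcome(evolve(s))]
    return Fraction(len(hits), len(compatible))

# "If Oswald had not shot Kennedy, someone else would have" becomes:
p = probability(lambda s: not s["oswald_shoots"],
                lambda shot: shot,
                macrostate)
print(p)  # 1/2 under this (made-up) uniform macrostate
```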
This doesn't give very useful answers when the state evolution is nearly deterministic, such as an agent made of computer code.
For example, consider an agent trying to decide whether to turn left or turn right. Suppose for the sake of argument that it actually turns left, if you run physics forward. Also suppose that the logical uncertainty has figured that out, so that the best-estimate macrosta...
Well written. Do you have a few examples of pivoting when it becomes apparent that the daily grind no longer optimizes for solving the problem?