How do you formalize the definition of a decision-theoretically fair problem, even when abstracting away the definition of an agent as well as embedded agency?
I've failed to find anything in our literature.
It's simple to define a fair environment, given those abstractions: a function E from an array of actions to an array of payoffs, with no reference to any other details of the non-embedded agents that took those actions and received those payoffs.
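To make that concrete, here's a minimal, purely illustrative sketch of the type I have in mind (the Prisoner's Dilemma payoffs are just an example instance):

```python
from typing import Callable, Sequence

# A fair environment maps an action profile to a payoff profile,
# with no dependence on anything about the agents except their actions.
Action = int
Payoff = float
FairEnvironment = Callable[[Sequence[Action]], Sequence[Payoff]]

# Example instance: a two-player Prisoner's Dilemma
# (0 = cooperate, 1 = defect).
def prisoners_dilemma(actions: Sequence[Action]) -> Sequence[Payoff]:
    table = {(0, 0): (2.0, 2.0), (0, 1): (0.0, 3.0),
             (1, 0): (3.0, 0.0), (1, 1): (1.0, 1.0)}
    return table[tuple(actions)]

print(prisoners_dilemma([0, 1]))  # -> (0.0, 3.0)
```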
However, fair problems are more than just fair environments: we want a definition of a fair problem (an...
With the sudden simultaneous exits of Mira Murati, Barret Zoph, and Bob McGrew, I thought I'd update my tally of the departures from OpenAI, collated with how quickly the ex-employee had signed the loyalty letter to Sam Altman last November.
The letter was leaked at 505 signatures, 667 signatures, and finally 702 signatures; in the end, it was reported that 737 of 770 employees signed. Since then, I've been able to verify 56 departures of people who were full-time employees (as far as I can tell, contractors were not allowed to sign, but all FTEs were...
I have to further compliment my past self: this section aged extremely well, prefiguring the Shoggoth-with-a-smiley-face analogies several years in advance.
...GPT-3 is trained simply to predict continuations of text. So what would it actually optimize for, if it had a pretty good model of the world including itself and the ability to make plans in that world?
One might hope that, because it's learning to imitate humans in an unsupervised way, it would end up fairly human, or at least act that way. I very much doubt this, for the following reason:
- Two hum
The chess example is meant to make specific points about RL*F concealing a capability that remains (or is even amplified); I'm not trying to claim that the "put up a good fight but lose" criterion is analogous to current RL*F criteria. (Though it does rhyme qualitatively with "be helpful and harmless".)
I agree that "helpful-only" RL*F would result in a model that scores higher on capabilities evals than the base model, possibly much higher. I'm frankly a bit worried about even training that model.
I agree with "When you say 'there's a good chance AGI is near', the general public will hear 'AGI is near'".
However, the general public isn't everyone, and the people who can distinguish between the two claims are the most important to reach (per capita, and possibly in sum).
So we'll do better by saying what we actually believe, while taking into account that some audiences will round probabilities off (and seeking ways to be rounded closer to the truth while still communicating accurately to anyone who does understand probabilistic claims). The marginal gain from rounding ourselves off at the start isn't worth the marginal loss from looking transparently overconfident to those who can tell the difference.
I reached this via Joachim pointing it out as an example of someone urging epistemic defection around AI alignment, and I have to agree with him there. I think the higher difficulty posed by communicating "we think there's a substantial probability that AGI happens in the next 10 years" vs "AGI is near" is worth it even from a PR perspective, because pretending you know the day and the hour smells like bullshit to the most important people who need convincing that AI alignment is nontrivial.
I can imagine this coming from the equivalent of "adapt someone else's StackOverflow code" level capability, which is still pretty impressive.
In my opinion, the scariest thing I've seen so far is coding Game Of Life Pong, which doesn't seem to resemble any code GPT-4 would have had in its training data. Stitching those things together means coding for real for real.
Kudos for talking about learning empathy in a way that seems meaningfully different and less immediately broken than adjacent proposals.
I think what you should expect from this approach, should it in fact succeed, is not nothing, but still something more alien than the way we empathize with lower animals, let alone higher animals. Consider the empathy we have towards cats... and the way it is complicated by their desire to be a predator, and specifically to enjoy causing fear/suffering. Our empathy with cats doesn't lead us to abandon our empathy for their...
This is honestly some of the most significant alignment work I've seen in recent years (for reasons I plan to post on shortly), thank you for going to such lengths!
Typo: "Thoughout this process test loss remains low - even a partial memorising solution still performs extremely badly on unseen data!", 'low' should be 'high' (and 'throughout' is misspelled too).
There are some posts with perennial value, and some which depend heavily on their surrounding context. This post is of the latter type. I think it was pretty worthwhile in its day (and in particular, the analogy between GPT upgrades and developmental stages is one I still find interesting), but I leave it to you whether the book should include time capsules like this.
It's also worth noting that, in the recent discussions, Eliezer has pointed to the GPT architecture as an example that scaling up has worked better than expected, but he diverges from the thes...
One tiny note: I was among the people on AAMLS; I did leave MIRI the next year; and my reasons for doing so are not in any way an indictment of MIRI. (I was having some me-problems.)
I still endorse MIRI as, in some sense, being the adults in the AI Safety room, which has... disconcerting effects on my own level of optimism.
I'd additionally expect the death of pseudonymity on the Internet, as AIs will find it easy to detect similar writing style and correlated posting behavior. What at present takes detective work will in the future be cheaply automated, and we will finally be completely in Zuckerberg's desired world where nobody can maintain a second identity online.
Oh, and this is going to be retroactive, so be ready for the consequences of everything you've ever said online.
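As a toy illustration of how cheap that detective work already is, here's a hedged sketch using off-the-shelf tools (character n-gram similarity is a standard stylometry baseline; the corpora here are hypothetical placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpora: posts from a public account and a pseudonym.
known_posts = ["...text from the public account..."]
pseudonym_posts = ["...text from the pseudonymous account..."]

# Character n-grams capture punctuation and spelling habits that
# survive topic changes.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
vectors = vectorizer.fit_transform([" ".join(known_posts),
                                    " ".join(pseudonym_posts)])

# High cosine similarity suggests (but doesn't prove) a shared author.
print(cosine_similarity(vectors[0], vectors[1]))
```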
If this post is selected, I'd like to see the follow-up made into an addendum; I think it adds a very important piece, and it should have been nominated itself.
I think this post and Evan's summary of Chris Olah's views are essential, both in their own right and as mutual foils to MIRI's research agenda. We see related concepts (mesa-optimization originally came out of Paul's talk of daemons in Solomonoff induction, if I remember right) but very different strategies for achieving both inner and outer alignment. (The crux of the disagreement seems to be the probability of success from adapting current methods.)
Strongly recommended for inclusion.
It's hard to know how to judge a post that deems itself superseded by a post from a later year, but I lean toward taking Daniel at his word and hoping we survive until the 2021 Review comes around.
The content here is very valuable, even if the genre of "I talked a lot with X and here's my articulation of X's model" comes across to me as a weird sort of intellectual ghostwriting. I can't think of a way around that, though.
This reminds me of That Alien Message, but as a parable about mesa-alignment rather than outer alignment. It reads well, and helps make the concepts more salient. Recommended.
Remind me which bookies count and which don't, in the context of the proofs of properties?
If any computable bookie is allowed, a non-Bayesian is in trouble against a much larger bookie who can just (maybe through its own logical induction) discover who the bettor is and how to exploit them.
[EDIT: First version of this comment included "why do convergence bettors count if they don't know the bettor will oscillate", but then I realized the answer while Abram was composing his response, so I edited that part out. Editing it back in so that Abram's reply has context.]
It's a good question!
For me, the most general answer is the framework of logical induction, where the bookies are allowed so long as they have poly-time computable strategies. In this case, a bookie doesn't have to be guaranteed to make money in order to count; rather, if it makes arbitrarily much money, then there's a problem. So convergence traders are at risk of being stuck with a losing ticket, but their existence forces convergence anyway.
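For reference, the exploitation condition from the logical induction paper (Garrabrant et al. 2016), stated roughly from memory: a trader $T$ exploits the market if the value of its holdings over time is unbounded above but bounded below,

$$\sup_n \mathrm{Value}_n(T) = \infty \quad \text{while} \quad \inf_n \mathrm{Value}_n(T) > -\infty,$$

and the market satisfies the logical induction criterion iff no efficiently computable trader exploits it.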
If we don't care about logical uncertainty, the right condition is instead that the bookie knows the agent's beli...
(sorry, couldn't resist)
This is the first post in an Alignment Forum sequence explaining the approaches both MIRI and OpenAI staff believe are the most promising means of auditing the cognition of very complex machine learning models. We will be discussing each approach in turn, with a focus on how they differ from one another.
The goal of this series is to provide a more complete picture of the various options for auditing AI systems than has been provided so far by any single person or organization. The hope is that it will help people make better-i...
I'm imagining a tiny AI Safety organization, circa 2010, that focused on how to achieve probable alignment for scaled-up versions of that year's state-of-the-art AI designs. It's interesting to ask whether that organization would have achieved more or less than MIRI has, in terms of generalizable work and in terms of field-building.
Certainly it would have resulted in a lot of work that was initially successful but ultimately dead-end. But maybe early concrete results would have attracted more talent/attention/respect/funding, and the org could have thrown ...
That's not a nitpick at all!
Upon reflection, the structured sentences, thematically resolved paragraphs, and even JSX code can be done without a lot of real lookahead. And there's some evidence it's not doing lookahead: its difficulty completing rhymes when writing poetry, for instance.
(Hmm, what's the simplest game that requires lookahead that we could try to teach to GPT-3, such that it couldn't just memorize moves?)
Thinking about this more, I think that since planning depends on causal modeling, I'd expect the latter to get good before the former. But I probably overstated the case for its current planning capabilities, and I'll edit accordingly. Thanks!
The outer optimizer is the more obvious thing: it's straightforward to say there's a big difference in dealing with a superhuman Oracle AI with only the goal of answering each question accurately, versus one whose goals are only slightly different from that in some way. Inner optimizers are an illustration of another failure mode.
I've been using computable to mean a total function (each instance is computable in finite time).
I'm thinking of an agent outside a universe about to take an action, and each action will cause that universe to run a particular TM. (You could maybe frame this as "the agent chooses the tape for the TM to run on".) For me, this is analogous to acting in the world and causing the world to shift toward some outcomes over others.
By asserting that U should be the computable one, I'm asserting that "how much do I like this outcome"...
Let's talk first about non-embedded agents.
Say that I'm given the specification of a Turing machine, and I have a computable utility mapping from output states (including "does not halt") to [0,1]. We presumably agree that this is possible.
I agree that it's impossible to make a computable mapping from Turing machines to outcomes, so therefore I cannot have a computable utility function from TMs to the reals which assigns the same value to any two TMs with identical output.
But I can have a logical inductor which, for each TM, produces a ...
I think that computable is obviously too strong a condition for classical utility; enumerable is better.
Imagine you're about to see the source code of a machine that's running, and if the machine eventually halts then 2 utilons will be generated. That's a simpler problem to reason about than the procrastination paradox, and your utility function is enumerable but not computable. (Likewise, logical inductors obviously don't make PA approximately computable, but their properties are what you'd want the definition of approximately enu...
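A minimal sketch of what enumerability buys you in that halting example (the staged approximation is my illustration, not anything from the literature): you can approximate the utility from below by running the machine for more and more steps.

```python
def utility_lower_bound(machine_halts_within, n_steps: int) -> float:
    """Stage-n approximation to the utility '2 if the machine halts, else 0'.

    machine_halts_within(n) reports whether the machine halts in <= n steps,
    which is decidable for each fixed n by just running it.
    """
    return 2.0 if machine_halts_within(n_steps) else 0.0

# Toy machine that halts exactly at step 1000:
print(utility_lower_bound(lambda n: n >= 1000, 10))     # 0.0 so far
print(utility_lower_bound(lambda n: n >= 1000, 10**6))  # 2.0

# The approximations are monotonically nondecreasing in n_steps and converge
# to the true utility; that's what makes the function enumerable
# (semicomputable from below) even though no single stage decides halting.
```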
If the listener is running a computable logical uncertainty algorithm, then for a difficult proposition it hasn't made much sense of, the listener might say "70% likely it's a theorem and X will say it, 20% likely it's not a theorem and X won't say it, 5% PA is inconsistent and X will say both, 5% X isn't naming all and only theorems of PA".
Conditioned on PA being consistent and on X naming all and only theorems of PA, and on the listener's logical uncertainty being well-calibrated, you'd expect that in 78% of s...
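(Checking the arithmetic on that 78%, which I take to be the renormalized weight of the first branch after conditioning away the two 5% branches:

$$\frac{0.70}{0.70 + 0.20} \approx 0.78.)$$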
As stated, I think this has a bigger vulnerability: B and B* just always answer the question with "yes."
Remember that this is also used to advance the argument. If A thinks B has such a strategy, A can ask the question in such a way that B's "yes" helps A's argument. But sure, there is something weird here.
> the dishonest team might want to call one as soon as they think the chance of them convincing a judge is below 50%, because that's the worst-case win-rate from blind guessing
I also think this is a fatal flaw with the existing two-person-team proposal; you need a system that gives you only an epsilon chance of winning if you use it spuriously.
I have what looks to me like an improvement, but there's still a vulnerability:
A challenges B by giving a yes-no question as well as a previous round to ask it. B answers, B* answers based on B'...
I'm surprised nobody has yet replied that the two examples are both products of significant optimizers with relevant optimization targets, and that the naive definition seems to work with one modification:
A system is downstream from an optimizer of some objective function to the extent that that objective function attains much higher values than would be attained if the system didn't exist, or were doing some other random thing.
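One way to cash that out as a formula (my gloss, not standard notation):

$$\mathrm{Downstream}(S; f) \;\propto\; \mathbb{E}[f \mid S \text{ present}] - \mathbb{E}[f \mid S \text{ absent or randomized}].$$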
Insofar as the AI Alignment Forum is part of the Best-of-2018 Review, this post deserves to be included. It's the friendliest explanation of MIRI's research agenda (as of 2018) that currently exists.
Shapley Values [thanks Zack for reminding me of the name] are akin to credit assignment: you have a bunch of agents coordinating to achieve something, and then you want to assign payouts fairly based on how much each contribution mattered to the final outcome.
And the way you do this is, for each agent you look at how good the outcome would have been if everybody except that agent had coordinated, and then you credit each agent proportionally to how much the overall performance would have fallen off without them.
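A sketch of that leave-one-out scheme (note: exact Shapley values instead average each agent's marginal contribution over all orderings; this is the cheaper approximation the paragraph describes):

```python
from typing import Callable, FrozenSet, Sequence, TypeVar

Agent = TypeVar("Agent")

def leave_one_out_credit(agents: Sequence[Agent],
                         outcome: Callable[[FrozenSet[Agent]], float]) -> dict:
    """Credit each agent by how much the outcome drops without them."""
    full = outcome(frozenset(agents))
    drops = {a: full - outcome(frozenset(agents) - {a}) for a in agents}
    total = sum(drops.values())
    # Normalize so credits sum to the full outcome (assumes total != 0).
    return {a: full * d / total for a, d in drops.items()}

# Toy usage: the outcome is just the number of participating agents,
# so each agent gets equal credit.
print(leave_one_out_credit(["a", "b", "c"], lambda s: float(len(s))))
```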
So what about doing the same here- send rewar...
I can't for the life of me remember what this is called
(Best wishes, Less Wrong Reference Desk)
Good comment. I disagree with this bit:
> I would, for instance, predict that if *Superintelligence* were published during the era of GOFAI, all else equal it would've made a bigger splash because AI researchers then were more receptive to abstract theorizing.
And then it would probably have been seen as outmoded and thrown away completely when AI capabilities research progressed into realms that vastly surpassed GOFAI. I don't know that there's an easy way to get capabilities researchers to think seriously about safety concerns that haven't manifested on a sufficient scale yet.
There's the additional objection of "if you're doing this, why not just have the AI ask HCH what to do?"
Overall, I'm hoping that it could be easier for an AI to robustly conclude that a certain plan only changes a human's HCH via certain informational content, than for the AI to reliably calculate the human's HCH. But I don't have strong arguments for this intuition.
Yup, this isn't robust to extremely capable systems; it's a quantitative shift in how promising it looks to the agent to learn about external affairs, not a qualitative one.
(In the example with the agent doing engineering in a sandbox that doesn't include humans or general computing devices, there could be a strong internal gradient to learn obvious details about the things immediately outside its sandbox, and a weaker gradient for learning more distant or subtle things before you know the nearby obvious ones.)
A whitelisting variant would be way more reliable than a blacklisting one, clearly.
Nice! One thing that might be useful for context: what's the theoretically correct amount of time that you would expect an algorithm to spend on the right vs. the left if the session gets interrupted each time it goes 1 unit to the right? (I feel like there should be a pretty straightforward way to calculate the heuristic version where the movement is just Brownian motion that gets interrupted early if it hits +1.)
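Here's a quick Monte Carlo sketch of that heuristic version (my setup, for illustration: standard Brownian motion started at 0, absorbed at +1, with an arbitrary finite horizon standing in for session length):

```python
import random

def time_split(dt: float = 1e-3, horizon: float = 10.0) -> tuple:
    """One Brownian path from 0, absorbed at +1.

    Returns (time spent right of 0, time spent left of 0)."""
    x, right, left = 0.0, 0.0, 0.0
    for _ in range(int(horizon / dt)):
        x += random.gauss(0.0, dt ** 0.5)  # BM increment: std sqrt(dt)
        if x >= 1.0:                       # session interrupted at +1
            break
        if x > 0:
            right += dt
        else:
            left += dt
    return right, left

# Average the split over many interrupted sessions.
R = L = 0.0
for _ in range(200):
    r, l = time_split()
    R += r
    L += l
print(R / (R + L))  # fraction of time spent on the right
```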
My current candidate definitions, with some significant issues in the footnotes:
A fair environment is a probabilistic function $F(x_1, \ldots, x_N) = [X_1, \ldots, X_N]$ from an array of actions to an array of payoffs.
An agent A is a random variable
$$A(F, A_1, \ldots, A_{i-1}, A_i = A, A_{i+1}, \ldots, A_N)$$
which takes in a fair environment F[1] and a list of agents (including itself), and outputs a mixed strategy over its available actions in F. [2]
A fair agent is one whose mixed strategy is a function of subjective probabilities[3] that...
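If it helps, here's the same candidate definition rendered as illustrative Python type signatures (names are mine, not standard):

```python
from typing import Callable, Sequence

Action = int
Payoff = float
MixedStrategy = Sequence[float]  # one probability per available action

# A fair environment: action profile -> payoff profile (possibly stochastic).
FairEnvironment = Callable[[Sequence[Action]], Sequence[Payoff]]

# An agent takes the fair environment and the whole list of agents
# (including itself, at index i) and outputs a mixed strategy.
Agent = Callable[[FairEnvironment, Sequence["Agent"], int], MixedStrategy]
```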