orthonormal - AI Alignment Forum

orthonormal's Shortform

orthonormal20d*10

[EDIT: Never mind, this is just Kleene's second recursion theorem!]

Quick question about Kleene's recursion theorem:

Let's say F is a computable function from ℕ^N to ℕ. Is there a single computable function X from ℕ^(N-1) to ℕ such that

X(y_2,..., y_N) = F(X, y_2,..., y_N) for all y_2,...,y_N in ℕ

(taking the X within F as the binary code of X in a fixed encoding) or do there need to be additional conditions?
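
For anyone who wants to see the fixed point concretely, here is a minimal Python sketch of the usual quine-style construction. It reads the question as asking for a program X with X(y_2, ..., y_N) = F(code of X, y_2, ..., y_N). Two assumptions are mine: "codes" are Python source strings rather than natural numbers, and the names `TEMPLATE`, `fixed_point`, and `load` are hypothetical helpers. It illustrates the idea behind the theorem and is not a proof.

```python
# Sketch only: programs are coded as Python source strings, not integers.
# TEMPLATE, fixed_point, and load are hypothetical names for illustration.

TEMPLATE = '''F_SRC = {f_src!r}
TEMPLATE = {template!r}
exec(F_SRC)  # defines F(code, *ys)
def X(*ys):
    # Rebuild the source text of this program, then hand it to F.
    my_code = TEMPLATE.format(f_src=F_SRC, template=TEMPLATE)
    return F(my_code, *ys)
'''


def fixed_point(f_src: str) -> str:
    """Return the source of a program X with X(*ys) == F(source of X, *ys)."""
    return TEMPLATE.format(f_src=f_src, template=TEMPLATE)


def load(program_src: str):
    """Execute a program's source and return the function X it defines."""
    namespace = {}
    exec(program_src, namespace)
    return namespace["X"]


# Example F: add the length of the supplied code to the input y.
F_SRC = "def F(code, y):\n    return len(code) + y\n"
x_src = fixed_point(F_SRC)
X = load(x_src)
print(X(3) == len(x_src) + 3)  # checks that X was handed its own source
```

The same `fixed_point` works for any `F` given as source, which is the "single X for all y" shape the question asks about; totality of X still depends on F halting.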

orthonormal's Shortform

orthonormal1mo10

My current candidate definitions, with some significant issues in the footnotes:

A fair environment is a probabilistic function from an array of actions to an array of payoffs.

An agent $A$ is a random variable

$A (F, A_{1}, \ldots, A_{i - 1}, A_{i} = A, A_{i + 1}, \ldots, A_{N})$

which takes in a fair environment $F$^[1] and a list of agents (including itself), and outputs a mixed strategy over its available actions in $F$.^[2]

A fair agent is one whose mixed strategy is a function of subjective probabilities^[3] that it assigns to [the actions of some finite collection of agents in fair environments, where any agents not appearing in the original problem must themselves be fair].

Formally, if $A$ is a fair agent with a subjective probability estimator $P$, then $A$'s mixed strategy in a fair environment $F$,

$A (F, A_{1}, \ldots, A_{i - 1}, A_{i} = A, A_{i + 1}, \ldots, A_{N})$

should depend only on a finite collection of $A$'s subjective probabilities about outcomes

$\{P (F_{k} (A_{1}, \ldots, A_{N}, B_{1}, \ldots, B_{M})) = [X_{1}, \ldots, X_{N + M}]\}_{k = 1}^{K}$

for a set of fair environments $F_{1}, \ldots, F_{K}$ and an additional set of fair^[4] agents^[5] $B_{1}, \ldots, B_{M}$ if needed (note that not all agents need to appear in all environments).

A fair problem is a fair environment with one designated player, where all other agents are fair agents.

^[1] I might need to require every $F$ to have a default action $d_{F}$, so that I don't need to worry about axiom-of-choice issues when defining an agent over the space of all fair environments.
^[2] I specified a probabilistic environment and mixed strategies because I think there should be a unique fixed point for agents, such that this is well-defined for any fair environment $F$. (By analogy to reflective oracles.) But I might be wrong, or I might need further restrictions on $F$.
^[3] Grossly underspecified. What kinds of properties are required for subjective probabilities here? You can obviously cheat by writing BlueEyedBot into your probability estimator.
^[4] This is an infinite recursion, of course. It works if we require each $B_{m}$ to have a strictly lower complexity in some sense than $A$ (e.g. the rank of an agent is the largest number $K$ of environments it can reason about when making any decision, and each $B_{m}$ needs to be lower-rank than $A$), but I worry that's too strong of a restriction and would exclude some well-definable and interesting agents.
^[5] Does the fairness requirement on the $B_{m}$ suffice to avert the MetaBlueEyedBot problem in general? I'm unsure.

orthonormal's Shortform

orthonormal1mo52

How do you formalize the definition of a decision-theoretically fair problem, even when abstracting away the definition of an agent as well as embedded agency?

I've failed to find anything in our literature.

It's simple to define a fair environment, given those abstractions: a function E from an array of actions to an array of payoffs, with no reference to any other details of the non-embedded agents that took those actions and received those payoffs.
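
To make that definition concrete, here is a minimal Python sketch of a fair environment for a two-player Prisoner's Dilemma. The payoff numbers and the names `PD_TABLE` and `prisoners_dilemma` are my own assumptions for illustration. The point is only the signature: the function receives the actions and nothing else about whoever chose them.

```python
from typing import Dict, Sequence, Tuple

Action = str  # "C" (cooperate) or "D" (defect)
Payoffs = Tuple[int, int]

# Assumed payoff values; any table with the usual PD ordering would do.
PD_TABLE: Dict[Tuple[Action, Action], Payoffs] = {
    ("C", "C"): (2, 2),
    ("C", "D"): (0, 3),
    ("D", "C"): (3, 0),
    ("D", "D"): (1, 1),
}


def prisoners_dilemma(actions: Sequence[Action]) -> Payoffs:
    """Fair environment: payoffs are a function of the action array alone."""
    first, second = actions
    return PD_TABLE[(first, second)]


# Any two pairs of agents that end up playing the same actions get the
# same payoffs, because the environment never sees the agents themselves.
print(prisoners_dilemma(["C", "D"]))
```

An unfair environment, by contrast, would need an extra argument such as the agent's source code or eye color, which this signature has no room for.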

However, fair problems are more than just fair environments: we want a definition of a fair problem (and fair agents) under which, among other things:

The classic Newcomb's Problem against Omega, with certainty or with 1% random noise: fair
Omega puts $1M in the box iff it predicts that the player consciously endorses one-boxing, regardless of what it predicts the player will actually do (e.g. misunderstand the instructions and take a different action than they endorsed): unfair
Prisoner's Dilemma between two agents who base their actions on not only each other's predicted actions in the current environment, but also their predicted actions in other defined-as-fair dilemmas: fair
- For example, PrudentBot will cooperate with you if it deduces that you will cooperate with it and also that you would defect against DefectBot, because it wants to exploit CooperateBots.
Prisoner's Dilemma between two agents who base their actions on each other's predicted actions in defined-as-unfair dilemmas: unfair
- It would let us smuggle in unfairness from other dilemmas; e.g. if BlueEyedBot only tries Löbian cooperation against agents with blue eyes, and MetaBlueEyedBot only tries Löbian cooperation against agents that predictably cooperate with BlueEyedBot, then the Prisoner's Dilemma against MetaBlueEyedBot should count as unfair.
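
As a worked example of the first item in the list above, here is a rough Python sketch of Newcomb's Problem with prediction noise. It assumes Omega predicts the action the player actually takes and that the prediction is flipped with probability `noise`. The $1,000 transparent-box amount is the conventional figure and is my assumption (only the $1M appears above), and `expected_payoff` is a hypothetical helper. The payoff depends on the player only through the action taken, which is what the list treats as fair.

```python
from fractions import Fraction

BIG = 1_000_000  # opaque box, as in the list above
SMALL = 1_000    # transparent box; assumed conventional amount


def expected_payoff(action: str, noise: Fraction) -> Fraction:
    """Expected dollars for "one-box" or "two-box" against a noisy Omega."""
    if action not in ("one-box", "two-box"):
        raise ValueError(f"unknown action: {action}")
    # Probability that Omega predicts one-boxing, given the actual action.
    p_predicted_one_box = 1 - noise if action == "one-box" else noise
    guaranteed = SMALL if action == "two-box" else 0
    return guaranteed + p_predicted_one_box * BIG


one_percent = Fraction(1, 100)
print(expected_payoff("one-box", one_percent))  # 990000
print(expected_payoff("two-box", one_percent))  # 11000
```

The unfair variant in the second item could not be written with this signature, since its payoff would need access to what the player endorses rather than what the player does.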

Modal combat doesn't need to worry about this, because all the agents in it are fair-by-construction.

Yeah, I know, it's about a decade late to be asking this question.

orthonormal's Shortform

orthonormal7mo322

With the sudden simultaneous exits of Mira Murati, Barret Zoph, and Bob McGrew, I thought I'd update my tally of the departures from OpenAI, collated with how quickly the ex-employee had signed the loyalty letter to Sam Altman last November.

The letter was leaked at 505 signatures, 667 signatures, and finally 702 signatures; in the end, it was reported that 737 of 770 employees signed. Since then, I've been able to verify 56 departures of people who were full-time employees (as far as I can tell, contractors were not allowed to sign, but all FTEs were).

I still think I'm missing some, so these are lower bounds (modulo any mistakes I've made).

Headline numbers:

Attrition for the 505 OpenAI employees who signed before the letter was first leaked: at least 24/505 = 4.8%
Attrition for the next 197 to sign (it was leaked again at 667 signatures, and one last time at 702): at least 13/197 = 6.6%
Attrition for the (reported) 68 who had not signed by the last leak: at least 19/68 = 27.9%.
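
For anyone who wants to re-run the arithmetic as the tallies change, here is a small Python sketch that recomputes the three headline rates from the counts above. The cohort labels and the names `COHORTS` and `attrition_pct` are mine; the numbers are the ones stated in this comment.

```python
# (label, verified departures, cohort size), taken from the figures above.
COHORTS = [
    ("signed by the 505 leak", 24, 505),
    ("signed between 506 and 702", 13, 197),
    ("not signed as of the 702 leak", 19, 68),
]


def attrition_pct(departed: int, size: int) -> float:
    """Lower-bound attrition as a percentage, rounded to one decimal place."""
    return round(100 * departed / size, 1)


for label, departed, size in COHORTS:
    print(f"{label}: at least {departed}/{size} = {attrition_pct(departed, size)}%")

total_departed = sum(departed for _, departed, _ in COHORTS)
total_staff = sum(size for _, _, size in COHORTS)
print(f"total: {total_departed} departures out of {total_staff} employees")
```

The totals line is a consistency check: the three cohorts should add up to the 56 verified departures and the reported 770 employees.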

Reportedly, 737 out of the 770 signed in the end, and many of the Superalignment team chose not to sign at all.

Below are my current tallies of some notable subsets. Please comment with any corrections!

People from the Superalignment team who never signed as of the 702 leak (including some policy/governance people who seem to have been closely connected) and are now gone:

Carroll Wainwright
Collin Burns
Cullen O'Keefe
Daniel Kokotajlo
Jan Leike (though he did separately Tweet that the board should resign)
Jeffrey Wu
Jonathan Uesato
Leopold Aschenbrenner
Mati Roy
William Saunders
Yuri Burda

People from the Superalignment team (and close collaborators) who did sign before the final leak but are now gone:

Jan Hendrik Kirchner (signed between 668 and 702)
Steven Bills (signed between 668 and 702)
John Schulman (signed between 506 and 667)
Sherry Lachman (signed between 506 and 667)
Ilya Sutskever (signed by 505)
Pavel Izmailov (signed by 505)
Ryan Lowe (signed by 505)
Todor Markov (signed by 505)

Others who didn't sign as of the 702 leak (some of whom may have just been AFK for the wrong weekend, though I doubt that was true of Karpathy) and are now gone:

Andrei Alexandru (Research Engineer)
Andrej Karpathy (Co-Founder)
Austin Wiseman (Finance/Accounting)
Girish Sastry (Policy)
Jay Joshi (Recruiting)
Katarina Slama (Member of Technical Staff)
Lucas Negritto (Member of Technical Staff, then Developer Community Ambassador)
Zarina Stanik (Marketing)

Notable other ex-employees:

Barret Zoph (VP of Research, Post-Training; signed by 505)
Bob McGrew (Chief Research Officer; signed by 505)
Chris Clark (Head of Nonprofit and Strategic Initiatives; signed by 505)
Diane Yoon (VP of People; signed by 505)
Gretchen Krueger (Policy; signed by 505; posted a significant Twitter thread at the time she left)
Mira Murati (CTO; signed by 505)

Developmental Stages of GPTs

orthonormal10mo42

I have to further compliment my past self: this section aged extremely well, prefiguring the Shoggoth-with-a-smiley-face analogies several years in advance.

GPT-3 is trained simply to predict continuations of text. So what would it actually optimize for, if it had a pretty good model of the world including itself and the ability to make plans in that world?
One might hope that because it's learning to imitate humans in an unsupervised way, that it would end up fairly human, or at least act in that way. I very much doubt this, for the following reason:
Two humans are fairly similar to each other, because they have very similar architectures and are learning to succeed in the same environment.
Two convergently evolved species will be similar in some ways but not others, because they have different architectures but the same environmental pressures.
A mimic species will be similar in some ways but not others to the species it mimics, because even if they share recent ancestry, the environmental pressures on the poisonous one are different from the environmental pressures on the mimic.
What we have with the GPTs is the first deep learning architecture we've found that scales this well in the domain (so, probably not that much like our particular architecture), learning to mimic humans rather than growing in an environment with similar pressures. Why should we expect it to be anything but very alien under the hood, or to continue acting human once its actions take us outside of the training distribution?
Moreover, there may be much more going on under the hood than we realize; it may take much more general cognitive power to learn and imitate the patterns of humans, than it requires us to execute those patterns.

Run evals on base models too!

orthonormal1y30

The chess example is meant to make specific points about RL*F concealing a capability that remains (or is even amplified); I'm not trying to claim that the "put up a good fight but lose" criterion is analogous to current RL*F criteria. (Though it does rhyme qualitatively with "be helpful and harmless".)

I agree that "helpful-only" RL*F would result in a model that scores higher on capabilities evals than the base model, possibly much higher. I'm frankly a bit worried about even training that model.

Run evals on base models too!

orthonormal1y10

Thank you! I'd forgotten about that.

How to talk about reasons why AGI might not be near?

orthonormal2y22

I agree with "When you say 'there's a good chance AGI is near', the general public will hear 'AGI is near'".

However, the general public isn't everyone, and the people who can distinguish between the two claims are the most important to reach (per capita, and possibly in sum).

So we'll do better by saying what we actually believe, while taking into account that some audiences will round probabilities off (and seeking ways to be rounded closer to the truth while still communicating accurately to anyone who does understand probabilistic claims). The marginal gain by rounding ourselves off at the start isn't worth the marginal loss by looking transparently overconfident to those who can tell the difference.

How to talk about reasons why AGI might not be near?

orthonormal2y22

I reached this via Joachim pointing it out as an example of someone urging epistemic defection around AI alignment, and I have to agree with him there. I think the higher difficulty posed by communicating "we think there's a substantial probability that AGI happens in the next 10 years" vs "AGI is near" is worth it even from a PR perspective, because pretending you know the day and the hour smells like bullshit to the most important people who need convincing that AI alignment is nontrivial.

What 2026 looks like

orthonormal2y10

GPT-4 is good enough to identify you if you're a prolific writer.
