AI ALIGNMENT FORUM
AF

All of Nisan's Comments + Replies

Nisan's Shortform

Nisan9mo150

On 2018-04-09, OpenAI said^[1]:

OpenAI’s mission is to ensure that artificial general intelligence (AGI) [...] benefits all of humanity.

In contrast, in 2023, OpenAI said^[2]:

[...] OpenAI’s mission: to build artificial general intelligence (AGI) that is safe and benefits all of humanity.

Archived ↩︎
This archived snapshot is from 2023-05-17, but the document didn't get much attention until November that year. ↩︎

The Waluigi Effect (mega-post)

Nisan1y10

The subject of this post appears in the "Did you know..." section of Wikipedia's front page(archived) right now.

Modern Transformers are AGI, and Human-Level

Nisan1y1214

I'm saying "transformers" every time I am tempted to write "LLMs" because many modern LLMs also do image processing, so the term "LLM" is not quite right.

"Transformer"'s not quite right either because you can train a transformer on a narrow task. How about foundation model: "models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks".

Modern Transformers are AGI, and Human-Level

Nisan1y97

I agree 100%. It would be interesting to explore how the term "AGI" has evolved, maybe starting with Goertzel and Pennachin 2007 who define it as:

a software program that can solve a variety of complex problems in a variety of different domains, and that controls itself autonomously, with its own thoughts, worries, feelings, strengths, weaknesses and predispositions

On the other hand, Stuart Russell testified that AGI means

machines that match or exceed human capabilities in every relevant dimension

so the experts seem to disagree. (On the other hand, ... (read more)

Google Gemini Announced

Nisan1y53

I wonder why Gemini used RLHF instead of Direct Preference Optimization (DPO). DPO was written up 6 months ago; it's simpler and apparently more compute-efficient than RLHF.

Is the Gemini org structure so sclerotic that it couldn't switch to a more efficient training algorithm partway through a project?
Is DPO inferior to RLHF in some way? Lower quality, less efficient, more sensitive to hyperparameters?
Maybe they did use DPO, even though they claimed it was RLHF in their technical report?

My take on higher-order game theory

Nisan1y10

Thanks! For convex sets of distributions: If you weaken the definition of fixed point to $g^{C} (S) \leq S$ , then the set ${S \in C X : g^{C} (S) = S}$ has a least element which really is a least fixed point.

Nisan's Shortform

Nisan2y30

Conception is a startup trying to do in vitro gametogenesis for humans!

Shah and Yudkowsky on alignment failures

Nisan3y90

CFAR used to have an awesome class called "Be specific!" that was mostly about concreteness. Exercises included:

Rationalist taboo
A group version of rationalist taboo where an instructor holds an everyday object and asks the class to describe it in concrete terms.
The Monday-Tuesday game
A role-playing game where the instructor plays a management consultant whose advice is impressive-sounding but contentless bullshit, and where the class has to force the consultant to be specific and concrete enough to be either wrong or trivial.
People were encouraged t

... (read more)

My take on higher-order game theory

Nisan3y10

Yep, I skimmed it by looking at the colorful plots that look like Ising models and reading the captions. Those are always fun.

My take on higher-order game theory

Nisan3y10

No, I just took a look. The spin glass stuff looks interesting!

1romeostevensit3y

Are we talking about the same thing? https://www.sciencedirect.com/science/article/am/pii/S0370157317301424

My take on higher-order game theory

Nisan3y10

I think you're saying $A_{ω + 1} := [Δ A_{ω} \to Δ A_{0}]$ , right? In that case, since $A_{0}$ embeds into $A_{ω}$ , we'd have $A_{ω + 1}$ embedding into $A_{ω}$ . So not really a step up.

If you want to play ordinal games, you could drop the requirement that agents are computable / Scott-continuous. Then you get the whole ordinal hierarchy. But then we aren't guaranteed equilibria in games between agents of the same order.

I suppose you could have a hybrid approach: Order $ω + 1$ is allowed to be discontinuous in its order- $ω$ beliefs, but higher orders have to be continuous? Maybe that would get you to $ω 2$ .... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

Nisan3y130

I apologize, I shouldn't have leapt to that conclusion.

Eliezer Yudkowsky3y70

Apology accepted.

Yudkowsky and Christiano discuss "Takeoff Speeds"

Nisan3y60

it legitimately takes the whole 4 years after that to develop real AGI that ends the world. FINE. SO WHAT. EVERYONE STILL DIES.

By Gricean implicature, "everyone still dies" is relevant to the post's thesis. Which implies that the post's thesis is that humanity will not go extinct. But the post is about the rate of AI progress, not human extinction.

This seems like a bucket error, where "will takeoff be fast or slow?" and "will AI cause human extinction?" are put in the same bucket.

Yudkowsky and Christiano discuss "Takeoff Speeds"

Nisan3y130

The central hypothesis of "takeoff speeds" is that at the time of serious AGI being developed, it is perfectly anti-Thielian in that it is devoid of secrets

No, the slow takeoff model just precludes there being one big secret that unlocks both 30%/year growth and dyson spheres. It's totally compatible with a bunch of medium-sized $1B secrets that different actors discover, adding up to hyperbolic economic growth in the years leading up to "rising out of the atmosphere".

Rounding off the slow takeoff hypothesis to "lots and lots of little innovations addin... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

Nisan3y10

"Takeoff Speeds" has become kinda "required reading" in discussions on takeoff speeds. It seems like Eliezer hadn't read it until September of this year? He may have other "required reading" from the past four years to catch up on.

Ajeya Cotra's "2020 Draft Report on Biological Anchors" is probably the most detailed public model of AI timelines.
Paul Christiano's "What failure looks like" is a slightly more concrete illustration of Paul's model of the future. Unfortunately, it avoids talking about human extinction in a way that gives the false impression

... (read more)

[This comment is no longer endorsed by its author]Reply

Eliezer Yudkowsky3y130

I read "Takeoff Speeds" at the time. I did not liveblog my reaction to it at the time. I've read the first two other items.

I flag your weirdly uncharitable inference.

Confusions re: Higher-Level Game Theory

Nisan4y30

I feel excited about this framework! Several thoughts:

I especially like the metathreat hierarchy. It makes sense because if you completely curry it, each agent sees the foe's action, policy, metapolicy, etc., which are all generically independent pieces of information. But it gets weird when an agent sees an action that's not compatible with the foe's policy.

You hinted briefly at using hemicontinuous maps of sets instead of or in addition to probability distributions, and I think that's a big part of what makes this framework exciting. Maybe if one takes a... (read more)

Reflexive Oracles and superrationality: prisoner's dilemma

Nisan4y10

See also this comment from 2013 that has the computable version of NicerBot.

Challenge: know everything that the best go bot knows about go

Nisan4y20

Or maybe it means we train the professional in the principles and heuristics that the bot knows. The question is if we can compress the bot's knowledge into, say, a 1-year training program for professionals.

There are reasons to be optimistic: We can discard information that isn't knowledge (lossy compression). And we can teach the professional in human concepts (lossless compression).

Challenge: know everything that the best go bot knows about go

Nisan4y60

This sounds like a great goal, if you mean "know" in a lazy sense; I'm imagining a question-answering system that will correctly explain any game, move, position, or principle as the bot understands it. I don't believe I could know all at once everything that a good bot knows about go. That's too much knowledge.

2Adam Shimi4y

That's basically what Paul's universality (my distillation post for another angle) is aiming for: having a question-answering overseer which can tell you everything you want to know about what the system knows and what it will do. You still probably need to be able to ask a relevant question, which I think is what you're pointing at.

1DanielFilan4y

Maybe it nearly suffices to get a go professional to know everything about go that the bot does? I bet they could.

1DanielFilan4y

Good point!

My research methodology

Nisan4y40

Red-penning is a general problem-solving method that's kinda similar to this research methodology.

5Rohin Shah4y

These are both cases of counterexample-guided techniques. The basic idea is to solve "exists x: forall y: P(x, y)" statements according to the following algorithm: 1. Choose some initial x, and initialize a set Y = {}. 2. Solve "exists y: not P(x, y)". If unsolvable, you're done. If not, take the discovered y and put it in Y. 3. Solve "exists x: forall y in Y: P(x, y)" and set the solution as your new x. 4. Go to step 2. The reason this is so nice is because you've taken a claim with two quantifiers and written an algorithm that must only ever solve claims with one quantifier. (For step 3, you inline the "forall y in Y" part, because Y is a small finite set.) The methodology laid out in this post is a counterexample-guided approach to solve the claim "exists alignment proposal: forall plausible worlds: The alignment proposal is safe in the world" Examples from programming languages include CEGIS (counterexample guided inductive synthesis) and CEGAR (counterexample guided abstraction refinement).

Some AI research areas and their relevance to existential safety

Nisan4y70

I'd believe the claim if I thought that alignment was easy enough that AI products that pass internal product review and which don't immediately trigger lawsuits would be aligned enough to not end the world through alignment failure. But I don't think that's the case, unfortunately.

It seems like we'll have to put special effort into both single/single alignment and multi/single "alignment", because the free market might not give it to us.

Some AI research areas and their relevance to existential safety

Nisan4y110

I'd like more discussion of the claim that alignment research is unhelpful-at-best for existential safety because of it accelerating deployment. It seems to me that alignment research has a couple paths to positive impact which might balance the risk:

Tech companies will be incentivized to deploy AI with slipshod alignment, which might then take actions that no one wants and which pose existential risk. (Concretely, I'm thinking of out with a whimper and out with a bang scenarios.) But the existence of better alignment techniques might legitimize governa

... (read more)

Nisan4y70

It seems like we'll have to put special effort into both single/single alignment and multi/single "alignment", because the free market might not give it to us.

Learning the prior

Nisan5y30

In this case humans are doing the job of transferring from $D$ to $D^{*}$ , and the training algorithm just has to generalize from a representative sample of $D^{*}$ to the test set.

2ESRogs5y

Thank you, this was helpful. I hadn't understood what was meant by "the generalization is now coming entirely from human beliefs", but now it seems clear. (And in retrospect obvious if I'd just read/thought more carefully.)

Book report: Theory of Games and Economic Behavior (von Neumann & Morgenstern)

Nisan5y10

Thanks for the references! I now know that I'm interested specifically in cooperative game theory, and I see that Shoham & Leyton-Brown has a chapter on "coalitional game theory", so I'll take a look.

1Vojtech Kovarik5y

Related to that: An interesting take (not only) on cooperative game theory is Schelling's The Strategy of Conflict (from 1960, resp. second edition from 1980, but I am not aware of sufficient follow-up research on the ideas presented there). And there might be some useful references in CLR's sequence on Cooperation, Conflict, and Transformative AI.

Why you should minimax in two-player zero-sum games

Nisan5y10

A proof of the lemma $V (a, b^{*}) \leq v^{*}$ :

$V (a, b^{*}) = V (a, {argmin}_{b} max a^{'} V (a^{'}, b))$

$\leq max a^{'} V (a^{'}, {argmin}_{b} max a^{'} V (a^{'}, b))$

$= min b max a^{'} V (a^{'}, b)$

$= v^{*}$

Multi-agent safety

Nisan5y10

Ah, ok. When you said "obedience" I imagined too little agency — an agent that wouldn't stop to ask clarifying questions. But I think we're on the same page regarding the flavor of the objective.

Multi-agent safety

Nisan5y20

Might not intent alignment (doing what a human wants it to do, being helpful) be a better target than obedience (doing what a human told it to do)?

2Richard Ngo5y

I should clarify that when I think about obedience, I'm thinking obedience to the spirit of an instruction, not just the wording of it. Given this, the two seem fairly similar, and I'm open to arguments about whether it's better to talk in terms of one or the other. I guess I favour "obedience" because it has fewer connotations of agency - if you're "doing what a human wants you to do", then you might run off and do things before receiving any instructions. (Also because it's shorter and pithier - "the goal of doing what humans want" is a bit of a mouthful).

Vanessa Kosoy's Shortform

Nisan5y30

My takeaway from this is that if we're doing policy selection in an environment that contains predictors, instead of applying the counterfactual belief that the predictor is always right, we can assume that we get rewarded if the predictor is wrong, and then take maximin.

How would you handle Agent Simulates Predictor? Is that what TRL is for?

2Vanessa Kosoy5y

That's about right. The key point is, "applying the counterfactual belief that the predictor is always right" is not really well-defined (that's why people have been struggling with TDT/UDT/FDT for so long) while the thing I'm doing is perfectly well-defined. I describe agents that are able to learn which predictors exist in their environment and respond rationally ("rationally" according to the FDT philosophy). TRL is for many things to do with rational use of computational resources, such as (i) doing multi-level modelling in order to make optimal use of "thinking time" and "interacting with environment time" (i.e. simultaneously optimize sample and computational complexity) (ii) recursive self-improvement (iii) defending from non-Cartesian daemons (iv) preventing thought crimes. But, yes, it also provides a solution to ASP. TRL agents can learn whether it's better to be predictable or predicting.

An environment for studying counterfactuals

Nisan6y10

The observation can provide all sorts of information about the universe, including whether exploration occurs. The exact set of possible observations depends on the decision problem.

$E$ and $O$ can have any relationship, but the most interesting case is when one can infer $E$ from $O$ with certainty.

Beliefs at different timescales

Nisan6y10

Thanks, I made this change to the post.

Beliefs at different timescales

Nisan6y20

Yeah, I think the fact that Elo only models the macrostate makes this an imperfect analogy. I think a better analogy would involve a hybrid model, which assigns a probability to a chess game based on whether each move is plausible (using a policy network), and whether the higher-rated player won.

I don't think the distinction between near-exact and nonexact models is essential here. I bet we could introduce extra entropy into the short-term gas model and the rollout would still be superior for predicting the microstate than the Boltzmann distribution.

Beliefs at different timescales

Nisan6y10

The sum isn't over $i$ , though, it's over all possible tuples of length $n - 1$ . Any ideas for how to make that more clear?

2Rohin Shah6y

I find the current notation fine, but if you want to make it more explicit, you could do ∑xk+2∑xk+3⋯∑xk+nPn(…)

2Diffractor6y

My initial inclination is to introduce Xn as the space of events on turn n, and define Xa:b:=b∏i=aXi and then you can express it as ∑σ∈Xk+2:k+nPn(xk+1,σ|x0...xk) .

EDT solves 5 and 10 with conditional oracles

Nisan6y20

I'm having trouble following this step of the proof of Theorem 4: "Obviously, the first conditional probability is 1". Since the COD isn't necessarily reflective, couldn't the conditional be anything?

1Jessica Taylor6y

By definition UO()=FiveTenO(⌈COEDTO(⌈U⌉)⌉)=COEDTO(⌈U⌉), regardless of O. (The subscript Qji to P only affects the distribution of O) EDIT: clarified notation in the post

History of the Development of Logical Induction

Nisan7y30

The linchpin discovery is probably February 2016.

1Scott Garrabrant7y

fixed

Counterfactuals, thick and thin

Nisan7y20

Ok. I think that's the way I should have written it, then.

Counterfactuals, thick and thin

Nisan7y20

Oh, interesting. Would your interpretation be different if the guess occurred well after the coinflip (but before we get to see the coinflip)?

2David Simmons7y

Sure, in that case there is a 0% counterfactual chance of heads, your words aren't going to flip the coin.

On the Role of Counterfactuals in Learning

Nisan7y20

What predictions can we get out of this model? If humans use counterfactual reasoning to initialize MCMC, does that imply that humans' implicit world models don't match their explicit counterfactual reasoning?

An environment for studying counterfactuals

Nisan7y10

I agree exploration is a hack. I think exploration vs. other sources of non-dogmatism is orthogonal to the question of counterfactuals, so I'm happy to rely on exploration for now.

Mechanistic Transparency for Machine Learning

Nisan7y30

"Programmatically Interpretable Reinforcement Learning" (Verma et al.) seems related. It would be great to see modular, understandable glosses of neural networks.

Logical counterfactuals and differential privacy

Nisan7y10

This doesn't quite work. The theorem and examples only work if you maximize the unconditional mutual information, $H (X; Y)$ , not $H (X; Y | A)$ . And the choice of $X$ is doing a lot of work — it's not enough to make it "sufficiently rich".