How do you formalize the definition of a decision-theoretically fair problem, even when abstracting away the definition of an agent as well as embedded agency?
I've failed to find anything in our literature.
It's simple to define a fair environment, given those abstractions: a function E from an array of actions to an array of payoffs, with no reference to any other details of the non-embedded agents that took those actions and received those payoffs.
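To make that concrete, here's a minimal, purely illustrative sketch of the type I have in mind (the Prisoner's Dilemma payoffs are just an example instance):

```python
from typing import Callable, Sequence

# A fair environment maps an action profile to a payoff profile,
# with no dependence on anything about the agents except their actions.
Action = int
Payoff = float
FairEnvironment = Callable[[Sequence[Action]], Sequence[Payoff]]

# Example instance: a two-player Prisoner's Dilemma
# (0 = cooperate, 1 = defect).
def prisoners_dilemma(actions: Sequence[Action]) -> Sequence[Payoff]:
    table = {(0, 0): (2.0, 2.0), (0, 1): (0.0, 3.0),
             (1, 0): (3.0, 0.0), (1, 1): (1.0, 1.0)}
    return table[tuple(actions)]

print(prisoners_dilemma([0, 1]))  # -> (0.0, 3.0)
```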
However, fair problems are more than just fair environments: we want a definition of a fair problem (an...
With the sudden simultaneous exits of Mira Murati, Barret Zoph, and Bob McGrew, I thought I'd update my tally of the departures from OpenAI, collated with how quickly the ex-employee had signed the loyalty letter to Sam Altman last November.
The letter was leaked at 505 signatures, 667 signatures, and finally 702 signatures; in the end, it was reported that 737 of 770 employees signed. Since then, I've been able to verify 56 departures of people who were full-time employees (as far as I can tell, contractors were not allowed to sign, but all FTEs were...
I have to further compliment my past self: this section aged extremely well, prefiguring the Shoggoth-with-a-smiley-face analogies several years in advance.
...GPT-3 is trained simply to predict continuations of text. So what would it actually optimize for, if it had a pretty good model of the world including itself and the ability to make plans in that world?
One might hope that, because it's learning to imitate humans in an unsupervised way, it would end up fairly human, or at least act that way. I very much doubt this, for the following reason:
- Two hum
The chess example is meant to make specific points about RL*F concealing a capability that remains (or is even amplified); I'm not trying to claim that the "put up a good fight but lose" criterion is analogous to current RL*F criteria. (Though it does rhyme qualitatively with "be helpful and harmless".)
I agree that "helpful-only" RL*F would result in a model that scores higher on capabilities evals than the base model, possibly much higher. I'm frankly a bit worried about even training that model.
I agree with "When you say 'there's a good chance AGI is near', the general public will hear 'AGI is near'".
However, the general public isn't everyone, and the people who can distinguish between the two claims are the most important to reach (per capita, and possibly in sum).
So we'll do better by saying what we actually believe, while taking into account that some audiences will round probabilities off (and seeking ways to be rounded closer to the truth while still communicating accurately to anyone who does understand probabilistic claims). The marginal gain from rounding ourselves off at the start isn't worth the marginal loss from looking transparently overconfident to those who can tell the difference.
I reached this via Joachim pointing it out as an example of someone urging epistemic defection around AI alignment, and I have to agree with him there. I think the higher difficulty posed by communicating "we think there's a substantial probability that AGI happens in the next 10 years" vs "AGI is near" is worth it even from a PR perspective, because pretending you know the day and the hour smells like bullshit to the most important people who need convincing that AI alignment is nontrivial.
I can imagine this coming from the equivalent of "adapt someone else's StackOverflow code" level capability, which is still pretty impressive.
In my opinion, the scariest thing I've seen so far is coding Game Of Life Pong, which doesn't seem to resemble any code GPT-4 would have had in its training data. Stitching those things together means coding for real for real.
Kudos for talking about learning empathy in a way that seems meaningfully different and less immediately broken than adjacent proposals.
I think what you should expect from this approach, should it in fact succeed, is not nothing, but still something more alien than the way we empathize with lower animals, let alone higher animals. Consider the empathy we have towards cats... and the way it is complicated by their desire to be a predator, and specifically to enjoy causing fear/suffering. Our empathy with cats doesn't lead us to abandon our empathy for their...
This is honestly some of the most significant alignment work I've seen in recent years (for reasons I plan to post on shortly), thank you for going to such lengths!
Typo: "Thoughout this process test loss remains low - even a partial memorising solution still performs extremely badly on unseen data!", 'low' should be 'high' (and 'throughout' is misspelled too).
There are some posts with perennial value, and some which depend heavily on their surrounding context. This post is of the latter type. I think it was pretty worthwhile in its day (and in particular, the analogy between GPT upgrades and developmental stages is one I still find interesting), but I leave it to you whether the book should include time capsules like this.
It's also worth noting that, in the recent discussions, Eliezer has pointed to the GPT architecture as an example that scaling up has worked better than expected, but he diverges from the thes...
One tiny note: I was among the people on AAMLS; I did leave MIRI the next year; and my reasons for doing so are not in any way an indictment of MIRI. (I was having some me-problems.)
I still endorse MIRI as, in some sense, being the adults in the AI Safety room, which has... disconcerting effects on my own level of optimism.
I'd additionally expect the death of pseudonymity on the Internet, as AIs will find it easy to detect similar writing style and correlated posting behavior. What at present takes detective work will in the future be cheaply automated, and we will finally be completely in Zuckerberg's desired world where nobody can maintain a second identity online.
Oh, and this is going to be retroactive, so be ready for the consequences of everything you've ever said online.
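As a toy illustration of how cheap that detective work already is, here's a hedged sketch using off-the-shelf tools (character n-gram similarity is a standard stylometry baseline; the corpora here are hypothetical placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpora: posts from a public account and a pseudonym.
known_posts = ["...text from the public account..."]
pseudonym_posts = ["...text from the pseudonymous account..."]

# Character n-grams capture punctuation and spelling habits that
# survive topic changes.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
vectors = vectorizer.fit_transform([" ".join(known_posts),
                                    " ".join(pseudonym_posts)])

# High cosine similarity suggests (but doesn't prove) a shared author.
print(cosine_similarity(vectors[0], vectors[1]))
```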
If this post is selected, I'd like to see the follow-up made into an addendum; I think it adds a very important piece, and it should have been nominated itself.
I think this post and Evan's summary of Chris Olah's views are essential, both in their own right and as mutual foils to MIRI's research agenda. We see related concepts (mesa-optimization originally came out of Paul's talk of daemons in Solomonoff induction, if I remember right) but very different strategies for achieving both inner and outer alignment. (The crux of the disagreement seems to be the probability of success from adapting current methods.)
Strongly recommended for inclusion.
It's hard to know how to judge a post that deems itself superseded by a post from a later year, but I lean toward taking Daniel at his word and hoping we survive until the 2021 Review comes around.
The content here is very valuable, even if the genre of "I talked a lot with X and here's my articulation of X's model" comes across to me as a weird sort of intellectual ghostwriting. I can't think of a way around that, though.
This reminds me of That Alien Message, but as a parable about mesa-alignment rather than outer alignment. It reads well, and helps make the concepts more salient. Recommended.
Remind me which bookies count and which don't, in the context of the proofs of properties?
If any computable bookie is allowed, a non-Bayesian is in trouble against a much larger bookie who can just (maybe through its own logical induction) discover who the bettor is and how to exploit them.
[EDIT: First version of this comment included "why do convergence bettors count if they don't know the bettor will oscillate", but then I realized the answer while Abram was composing his response, so I edited that part out. Editing it back in so that Abram's reply has context.]
It's a good question!
For me, the most general answer is the framework of logical induction, where the bookies are allowed so long as they have poly-time computable strategies. In this case, a bookie doesn't have to be guaranteed to make money in order to count; rather, if it makes arbitrarily much money, then there's a problem. So convergence traders are at risk of being stuck with a losing ticket, but their existence forces convergence anyway.
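For reference, the exploitation condition from the logical induction paper (Garrabrant et al. 2016), stated roughly from memory: a trader $T$ exploits the market if the value of its holdings over time is unbounded above but bounded below,

$$\sup_n \mathrm{Value}_n(T) = \infty \quad \text{while} \quad \inf_n \mathrm{Value}_n(T) > -\infty,$$

and the market satisfies the logical induction criterion iff no efficiently computable trader exploits it.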
If we don't care about logical uncertainty, the right condition is instead that the bookie knows the agent's beli...
(sorry, couldn't resist)
This is the first post in an Alignment Forum sequence explaining the approaches both MIRI and OpenAI staff believe are the most promising means of auditing the cognition of very complex machine learning models. We will be discussing each approach in turn, with a focus on how they differ from one another.
The goal of this series is to provide a more complete picture of the various options for auditing AI systems than has been provided so far by any single person or organization. The hope is that it will help people make better-i...
I'm imagining a tiny AI Safety organization, circa 2010, that focused on how to achieve probable alignment for scaled-up versions of that year's state-of-the-art AI designs. It's interesting to ask whether that organization would have achieved more or less than MIRI has, in terms of generalizable work and in terms of field-building.
Certainly it would have resulted in a lot of work that was initially successful but ultimately dead-end. But maybe early concrete results would have attracted more talent/attention/respect/funding, and the org could have thrown ...
That's not a nitpick at all!
Upon reflection, the structured sentences, thematically resolved paragraphs, and even JSX code can be done without a lot of real lookahead. And there's some evidence it's not doing lookahead: its difficulty completing rhymes when writing poetry, for instance.
(Hmm, what's the simplest game that requires lookahead that we could try to teach to GPT-3, such that it couldn't just memorize moves?)
Thinking about this more, I think that since planning depends on causal modeling, I'd expect the latter to get good before the former. But I probably overstated the case for its current planning capabilities, and I'll edit accordingly. Thanks!
The outer optimizer is the more obvious thing: it's straightforward to say there's a big difference in dealing with a superhuman Oracle AI with only the goal of answering each question accurately, versus one whose goals are only slightly different from that in some way. Inner optimizers are an illustration of another failure mode.
I've been using computable to mean a total function (each instance is computable in finite time).
I'm thinking of an agent outside a universe about to take an action, and each action will cause that universe to run a particular TM. (You could maybe frame this as "the agent chooses the tape for the TM to run on".) For me, this is analogous to acting in the world and causing the world to shift toward some outcomes over others.
By asserting that U should be the computable one, I'm asserting that "how much do I like this outcome"...
Let's talk first about non-embedded agents.
Say that I'm given the specification of a Turing machine, and I have a computable utility mapping from output states (including "does not halt") to [0,1]. We presumably agree that this is possible.
I agree that it's impossible to make a computable mapping from Turing machines to outcomes, so therefore I cannot have a computable utility function from TMs to the reals which assigns the same value to any two TMs with identical output.
But I can have a logical inductor which, for each TM, produces a ...
I think that computable is obviously too strong a condition for classical utility; enumerable is better.
Imagine you're about to see the source code of a machine that's running, and if the machine eventually halts then 2 utilons will be generated. That's a simpler problem to reason about than the procrastination paradox, and your utility function is enumerable but not computable. (Likewise, logical inductors obviously don't make PA approximately computable, but their properties are what you'd want the definition of approximately enu...
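A minimal sketch of what enumerability buys you in that halting example (the staged approximation is my illustration, not anything from the literature): you can approximate the utility from below by running the machine for more and more steps.

```python
def utility_lower_bound(machine_halts_within, n_steps: int) -> float:
    """Stage-n approximation to the utility '2 if the machine halts, else 0'.

    machine_halts_within(n) reports whether the machine halts in <= n steps,
    which is decidable for each fixed n by just running it.
    """
    return 2.0 if machine_halts_within(n_steps) else 0.0

# Toy machine that halts exactly at step 1000:
print(utility_lower_bound(lambda n: n >= 1000, 10))     # 0.0 so far
print(utility_lower_bound(lambda n: n >= 1000, 10**6))  # 2.0

# The approximations are monotonically nondecreasing in n_steps and converge
# to the true utility; that's what makes the function enumerable
# (semicomputable from below) even though no single stage decides halting.
```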
If the listener is running a computable logical uncertainty algorithm, then for a difficult proposition it hasn't made much sense of, the listener might say "70% likely it's a theorem and X will say it, 20% likely it's not a theorem and X won't say it, 5% PA is inconsistent and X will say both, 5% X isn't naming all and only theorems of PA".
Conditioned on PA being consistent and on X naming all and only theorems of PA, and on the listener's logical uncertainty being well-calibrated, you'd expect that in 78% of s...
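(Checking the arithmetic on that 78%, which I take to be the renormalized weight of the first branch after conditioning away the two 5% branches:

$$\frac{0.70}{0.70 + 0.20} \approx 0.78.)$$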
As stated, I think this has a bigger vulnerability: B and B* just always answer the question with "yes."
Remember that this is also used to advance the argument. If A thinks B has such a strategy, A can ask the question in such a way that B's "yes" helps A's argument. But sure, there is something weird here.
> the dishonest team might want to call one as soon as they think the chance of them convincing a judge is below 50%, because that's the worst-case win-rate from blind guessing
I also think this is a fatal flaw with the existing two-person-team proposal; you need a system that gives you only an epsilon chance of winning if you use it spuriously.
I have what looks to me like an improvement, but there's still a vulnerability:
A challenges B by giving a yes-no question as well as a previous round to ask it. B answers, B* answers based on B'...
I'm surprised nobody has yet replied that the two examples are both products of significant optimizers with relevant optimization targets, and that the naive definition seems to work with one modification:
A system is downstream from an optimizer of some objective function to the extent that that objective function attains much higher values than would be attained if the system didn't exist, or were doing some other random thing.
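One way to cash that out as a formula (my gloss, not standard notation):

$$\mathrm{Downstream}(S; f) \;\propto\; \mathbb{E}[f \mid S \text{ present}] - \mathbb{E}[f \mid S \text{ absent or randomized}].$$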
Insofar as the AI Alignment Forum is part of the Best-of-2018 Review, this post deserves to be included. It's the friendliest explanation of MIRI's research agenda (as of 2018) that currently exists.
Shapley Values [thanks Zack for reminding me of the name] are akin to credit assignment: you have a bunch of agents coordinating to achieve something, and then you want to assign payouts fairly based on how much each contribution mattered to the final outcome.
And the way you do this is, for each agent you look at how good the outcome would have been if everybody except that agent had coordinated, and then you credit each agent proportionally to how much the overall performance would have fallen off without them.
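A sketch of that leave-one-out scheme (note: exact Shapley values instead average each agent's marginal contribution over all orderings; this is the cheaper approximation the paragraph describes):

```python
from typing import Callable, FrozenSet, Sequence, TypeVar

Agent = TypeVar("Agent")

def leave_one_out_credit(agents: Sequence[Agent],
                         outcome: Callable[[FrozenSet[Agent]], float]) -> dict:
    """Credit each agent by how much the outcome drops without them."""
    full = outcome(frozenset(agents))
    drops = {a: full - outcome(frozenset(agents) - {a}) for a in agents}
    total = sum(drops.values())
    # Normalize so credits sum to the full outcome (assumes total != 0).
    return {a: full * d / total for a, d in drops.items()}

# Toy usage: the outcome is just the number of participating agents,
# so each agent gets equal credit.
print(leave_one_out_credit(["a", "b", "c"], lambda s: float(len(s))))
```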
So what about doing the same here- send rewar...
I can't for the life of me remember what this is called
(Best wishes, Less Wrong Reference Desk)
Good comment. I disagree with this bit:
> I would, for instance, predict that if *Superintelligence* were published during the era of GOFAI, all else equal it would've made a bigger splash because AI researchers then were more receptive to abstract theorizing.
And then it would probably have been seen as outmoded and thrown away completely when AI capabilities research progressed into realms that vastly surpassed GOFAI. I don't know that there's an easy way to get capabilities researchers to think seriously about safety concerns that haven't manifested on a sufficient scale yet.
There's the additional objection of "if you're doing this, why not just have the AI ask HCH what to do?"
Overall, I'm hoping that it could be easier for an AI to robustly conclude that a certain plan only changes a human's HCH via certain informational content, than for the AI to reliably calculate the human's HCH. But I don't have strong arguments for this intuition.
Yup, this isn't robust to extremely capable systems; it's a quantitative shift in how promising it looks to the agent to learn about external affairs, not a qualitative one.
(In the example with the agent doing engineering in a sandbox that doesn't include humans or general computing devices, there could be a strong internal gradient to learn obvious details about the things immediately outside its sandbox, and a weaker gradient for learning more distant or subtle things before you know the nearby obvious ones.)
A whitelisting variant would be way more reliable than a blacklisting one, clearly.
Nice! One thing that might be useful for context: what's the theoretically correct amount of time that you would expect an algorithm to spend on the right vs. the left if the session gets interrupted each time it goes 1 unit to the right? (I feel like there should be a pretty straightforward way to calculate the heuristic version where the movement is just Brownian motion that gets interrupted early if it hits +1.)
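Here's a quick Monte Carlo sketch of that heuristic version (my setup, for illustration: standard Brownian motion started at 0, absorbed at +1, with an arbitrary finite horizon standing in for session length):

```python
import random

def time_split(dt: float = 1e-3, horizon: float = 10.0) -> tuple:
    """One Brownian path from 0, absorbed at +1.

    Returns (time spent right of 0, time spent left of 0)."""
    x, right, left = 0.0, 0.0, 0.0
    for _ in range(int(horizon / dt)):
        x += random.gauss(0.0, dt ** 0.5)  # BM increment: std sqrt(dt)
        if x >= 1.0:                       # session interrupted at +1
            break
        if x > 0:
            right += dt
        else:
            left += dt
    return right, left

# Average the split over many interrupted sessions.
R = L = 0.0
for _ in range(200):
    r, l = time_split()
    R += r
    L += l
print(R / (R + L))  # fraction of time spent on the right
```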
My current candidate definitions, with some significant issues in the footnotes:
A fair environment is a probabilistic function $F(x_1, \ldots, x_N) = [X_1, \ldots, X_N]$ from an array of actions to an array of payoffs.
An agent A is a random variable
$$A(F, A_1, \ldots, A_{i-1}, A_i = A, A_{i+1}, \ldots, A_N)$$
which takes in a fair environment F[1] and a list of agents (including itself), and outputs a mixed strategy over its available actions in F. [2]
A fair agent is one whose mixed strategy is a function of subjective probabilities[3] that...
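If it helps, here's the same candidate definition rendered as illustrative Python type signatures (names are mine, not standard):

```python
from typing import Callable, Sequence

Action = int
Payoff = float
MixedStrategy = Sequence[float]  # one probability per available action

# A fair environment: action profile -> payoff profile (possibly stochastic).
FairEnvironment = Callable[[Sequence[Action]], Sequence[Payoff]]

# An agent takes the fair environment and the whole list of agents
# (including itself, at index i) and outputs a mixed strategy.
Agent = Callable[[FairEnvironment, Sequence["Agent"], int], MixedStrategy]
```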