As someone who expects LLMs to be a dead end, I nonetheless think this post makes a valid point and does so using reasonable and easy to understand arguments. I voted +1.
I don't find this framing compelling. Particularly with respect to this part:
Obedience — AI that obeys the intention of a human user can be asked to help build unsafe AGI, such as by serving as a coding assistant. (Note: this used to be considered extremely sci-fi, and now it's standard practice.)
I grant the point that an AI that does what the user wants can still be dangerous (in fact it could outright destroy the world). But I'd describe that situation as "we successfully aligned AI and things went wrong anyway" rather than "we failed to align AI". I grant t...
The post defending the claim is Reward is not the optimization target. Iirc, TurnTrout has described it as one of his most important posts on LW.
I know he's talking about alignment, and I'm criticizing that extremely strong claim. This is the main thing I wanted to criticize in my comment! I think the reasoning he presents is not well supported by his publicly available arguments.
Ok, I don't disagree with this. I certainly didn't develop a gears-level understanding of why [building a brain-like thing with gradient descent on giant matrices] is doomed after reading the 2021 conversations. But that doesn't seem very informative either way; I didn't spend that much time trying to grok his arguments.
I also don't really get your position. You say that,
[Eliezer] confidently dismisses ANNs
but you haven't shown this!
In Surface Analogies and Deep Causes, I read him as saying that neural networks don't automatically yield intelligence just because they share surface similarities with the brain. This is clearly true; at the very least, using token prediction (which is a task for which (a) lots of training data exist and (b) competence in many different domains is helpful) is a second requirement. If you took the network of GPT-4 and trained it
This document doesn't look to me like something a lot of people would try to write. Maybe it was one of the most important things to write, but not obviously so. Among the steps (1) get the idea to write out all reasons for pessimism, (2) resolve to try, (3) not give up halfway through, and (4) be capable, I would not guess that 4 is the strongest filter.
Yes, but I didn't mean to ask whether it's relevant; I meant to ask whether it's accurate. Does the output of language models, in fact, feel like this? It seemed like something relevant to ask you since you've seen lots of text completions.
And if it does, what is the reason for not having long timelines? If neural networks only solved the easy part of the problem, that implies that they're a much smaller step toward AGI than many argued recently.
I think what you get is a person talking with no inhibitions whatsoever. Language models don’t match that.
What do you imagine a language model with no inhibitions would look like? Because if I try to imagine it, then "something that outputs reasonable-sounding text until sooner or later it fails hard" seems to be a decent fit. Of course, I haven't thought much about the generator/assessor distinction.
I mean, surely "inhibitions" of the language model don't map onto human inhibitions, right? Like, a language model without the assessor module (or a much worse a...
(Extremely speculative comment, please tell me if this is nonsense.)
If it makes sense to differentiate the "Thought Generator" and "Thought Assessor" as two separate modules, is it possible to draw a parallel to language models, which seem to have a strong ability to generate sentences but lack the ability to assess whether they are good?
My first reaction to this is "obviously not since the architecture is completely different, so why would they map onto each other?", but a possible answer could be "well if the brain has them as separate modules, it could mean t...
I don't completely get this.
Let's call the short-term predictor (in the long-term predictor circuit) $P$, so if $P$ tries to predict [what $P$ predicts in 0.3s], then the correct prediction would be to immediately predict the output at whatever point in the future the process terminates (the next ground-truth injection?). In particular, $P$ would always predict the same thing until the ground truth comes in. But if I understand correctly, this is not what's going on.
So, second try: $P$ is really still only trying to predict 0.3s into the future, making it less of a "long t...
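To make sure I'm parsing the scheme correctly, here's a toy numerical sketch of my first reading (everything here, including the variable names and the learning-rate value, is my own assumption rather than anything from the post): the short-term predictor's training target at each step is its own output 0.3s later, with ground truth injected only at the end. Under that reading, the fixed point is "predict the eventual ground truth immediately", which is exactly what confused me.

```python
# Toy sketch (my assumptions, not the post's): a short-term predictor whose target
# at step t is its own prediction at step t+1, with ground truth injected only at
# the final step. Repeated updates propagate the ground truth backward, so the
# fixed point is "predict the eventual ground truth right away".
import numpy as np

T = 10                  # number of 0.3s steps until the ground-truth injection
ground_truth = 1.0
pred = np.zeros(T + 1)  # pred[t] = predictor's output at step t
pred[T] = ground_truth  # ground truth arrives at step T
alpha = 0.5             # learning rate (arbitrary)

for sweep in range(100):
    for t in range(T):
        target = pred[t + 1]                  # "what I will predict 0.3s from now"
        pred[t] += alpha * (target - pred[t])

print(pred)  # every entry converges to 1.0
```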
Author here. One thing I think I've done wrong in the post is to equate black-box-search-in-large-parametrized-space with all of machine learning. I've now added this paragraph at the end of chapter 1:
...Admittedly, the inner alignment model is not maximally general. In this post, we've looked at black box search, where we have a parametrized model and do SGD to update the parameters. This describes most of what Machine Learning is up to in 2020, but it does not describe what the field did pre-2000 and, in the event of a paradigm shift similar to the deep l
I strongly believe that (1) well-being is objective, (2) well-being is quantifiable, and (3) Open Individualism is true (i.e., the concept of identity isn't well-defined, and you're subjectively no less continuous with the future self of any other person than with your own future self).
If (1-3) are all true, then utilitronium is the optimal outcome for everyone even if they're entirely selfish. Furthermore, I expect an AGI to figure this out, and to the extent that it's aligned, it should communicate that if it's asked. (I don't think an AGI will therefore deci...
I don't have any reason why this couldn't happen. My position is something like "morality is real, probably precisely quantifiable; seems plausible that in the scenario of humans with autonomy and aligned AI, this could lead to an asymmetry where more people tend toward utilitronium over time". (Hence why I replied: you didn't seem to consider that possibility.) I could make up some mechanisms for this, but probably you don't need me for that. Also seems plausible that this doesn't happen. If it doesn't happen, maybe the people who get to decide what happe...
This comment seems to be consistent with the assumption that the outcome 1 year after the singularity is locked in forever. But the future we're discussing here is one where humans retain autonomy (?), and in that case, they're allowed to change their mind over time, especially if humanity has access to a superintelligent aligned AI. I think a future where we begin with highly suboptimal personal utopias and gradually transition into utilitronium is among the more plausible outcomes. Compared with other outcomes where Not Everyone Dies, anyway. Your credence may differ if you're a moral relativist.
But the future we’re discussing here is one where humans retain autonomy (?), and in that case, they’re allowed to change their mind over time, especially if humanity has access to a superintelligent aligned AI.
What if the humans ask the aligned AI to help them be more moral, and part of what they mean by "more moral" is having fewer doubts about their current moral beliefs? This is what a "status game" view of morality seems to predict, for the humans whose status games aren't based on "doing philosophy", which seems to be most of them.
1: To me, it made it more entertaining and thus easier to read. (No idea about non-anecdotal data, would also be interested.)
3: Also no data; I strongly suspect the metric is generally good because... actually I think it's just because the people I find worth listening to are overwhelmingly not condescending. This post seems highly unusual in several ways.
Is Humbali right that generic uncertainty about maybe being wrong, without other extra premises, should increase the entropy of one's probability distribution over AGI, thereby moving out its median further away in time?
My answer to this:
First, no update whatsoever should take place because a probability distribution already expresses uncertainty, and there's no mechanism by which the uncertainty increased. Adele Lopez independently (and earlier) came up with the same answer.
Second, if there were an update -- say EY learned "one of the steps us
The total absence of obvious output of this kind from the rest of the "AI safety" field even in 2020 causes me to regard them as having less actual ability to think in even a shallowly adversarial security mindset, than I associate with savvier science fiction authors. Go read fantasy novels about demons and telepathy, if you want a better appreciation of the convergent incentives of agents facing mindreaders than the "AI safety" field outside myself is currently giving you.
While this may be a fair criticism, I feel like someone ought to point out...
isn't trying to do anything like "sketch a probability distribution over the dynamics of an AI project that is nearing AGI". This includes all technical MIRI papers I'm familiar with.
I think this specific scenario sketch is, from a mainstream AI safety perspective, a case where we've already failed - i.e. we've invented a useless corrigibility intervention that we confidently but wrongly think is scalable.
...And if you try training the AI out of that habit in a domain of lower complexity and intelligence, it is predicted by me that generalizing that trained AI
Survey on model updates from reading this post. Figuring out to what extent this post has led people to update may inform whether future discussions are valuable.
Results: (just posting them here, doesn't really need its own post)
The question was to rate agreement on the 1=Paul to 9=Eliezer axis before and after reading this post.
Data points: 35
Mean:
Median:
Anonymous Comments:
Agreement more on need for actions tha
I'll take a shot at this. Let $A$ and $B$ be the sets of actions of Alice and Bob. Let $n$ (where 'n' means 'nice') be the function that orders $B$ by how good the choices are for Alice, assuming that Alice gets to choose second. Similarly, let $s$ (where 's' means 'selfish') be the function that orders $B$ by how good the choices are for Bob, assuming that Alice gets to choose second. Choose some function $d$ measuring similarity between two orderings of a finite set (it should range over $[0,1]$); the alignment of Bob with Alice is then $d(n, s)$.
Example: in...
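Here's a minimal code sketch of the construction, with a toy payoff matrix and a Kendall-tau-style similarity function; the payoff numbers, the action names, and the choice of similarity measure are all placeholder assumptions of mine, not part of the proposal itself.

```python
# Toy sketch of the proposed metric (illustrative only). Bob moves first, Alice
# best-responds; order Bob's actions once by Alice's resulting payoff ("nice") and
# once by Bob's ("selfish"), then score how similar the two orderings are.
from itertools import combinations

# payoff[bob_action][alice_action] = (alice_payoff, bob_payoff) -- hypothetical numbers
payoff = {
    "cooperate": {"left": (3, 3), "right": (1, 2)},
    "defect":    {"left": (0, 4), "right": (2, 1)},
    "idle":      {"left": (1, 1), "right": (1, 0)},
}

def best_response_payoffs(bob_action):
    # Alice picks the response maximizing her own payoff.
    return max(payoff[bob_action].values(), key=lambda p: p[0])

bob_actions = list(payoff)
nice    = sorted(bob_actions, key=lambda b: best_response_payoffs(b)[0])  # ordered by Alice's payoff
selfish = sorted(bob_actions, key=lambda b: best_response_payoffs(b)[1])  # ordered by Bob's payoff

def similarity(order1, order2):
    # Fraction of pairs ranked in the same relative order (Kendall-tau-style), in [0, 1].
    pairs = list(combinations(order1, 2))
    agree = sum((order1.index(a) < order1.index(b)) == (order2.index(a) < order2.index(b))
                for a, b in pairs)
    return agree / len(pairs)

print("alignment of Bob with Alice:", similarity(nice, selfish))
```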
I think the Go example really gets to the heart of why I think Debate doesn't cut it.
Your comment is an argument against using Debate to settle moral questions. However, what if Debate is trained on Physics and/or math questions, with the eventual goal of asking "what is a provably secure alignment proposal?"
In the ball-attached-to-a-pole example, the honest debater has assigned probabilities that are indistinguishable from what you would do if you knew nothing except that the claim is false. (I.e., assign probabilities that doubt each component equally.) I'm curious how difficult it is to find the flaw in this argument structure. Have you done anything like showing these transcripts to other experts and seeing if they will be able to answer it?
If I had to summarize this finding in one sentence, it would be "it seems like an expert can generally find a set of ...
Fantastic sequence! Certainly, for anyone other than you, the deconfusion/time investment ratio of reading this is excellent. You really succeeded in making the core insights accessible. I'd even say it compares favorably to the recommended sequences in the Alignment Forum in that regard.
I've never read the "Towards a new Impact Measure" post, but I assume doing so is redundant now since this sequence is the 'updated' version.
(This sequence inspired me to re-read Reinforcement Learning: An Introduction, hence the break.)
I realize that impact measures always lead to a tradeoff between safety and performance competitiveness. But this setting seems to sacrifice quite a lot of performance. Is this real or am I missing something?
Namely, whenever there's an action $a$ which doesn't change the state and leads to 1 reward, and a sequence of actions $a_1, \dots, a_n$ such that $a_n$ has reward $R$ with $R \gg n$ (and $a_1, \dots, a_{n-1}$ all have 0 reward), then it's conceivable that the agent would c...
Many thanks for taking the time to find errors.
I've fixed #1-#3. Arguments about the universal prior are definitely not something I want to get into with this post, so for #2 I've just made a vague statement that misalignment can arise for other reasons and linked to Paul's post.
I'm hesitant to change #4 before I fully understand why.
...I'm not exactly sure what you're trying to say here. The way I would describe this is that internalization requires an expensive duplication where the objective is represented separately from the world model despite the world
An early punchline in this sequence was "Impact is a thing that depends on the goals of agents; it's not about objective changes in the world." At that point, I thought "well, in that case, impact measures require agents to learn those goals, which means it requires value learning." Looking back at the sequence now, I realize that the "How agents impact each other" part of the sequence was primarily about explaining why we don't need to do that and the previous post was declaring victory on that front, but it took me seeing the formalism here to really get...
I was initially writing a comment about how AUP doesn't seem to work in every case because there are actions that are catastrophic without raising its power (such as killing someone), but then I checked the post again and realized that it disincentivizes changes of power in both directions. This rules out the failure modes I had in mind. (It wouldn't press a button that blows up the earth...)
It does seem that AUP will make it so an agent doesn't want to be shut off, though. If it's shut off, its power goes way down (to zero if...
The technical appendix felt like it was more difficult than previous posts, but I had the advantage of having tried to read the paper from the preceding post yesterday and managed to reconstruct the graph & gamma correctly.
The early part is slightly confusing, though. I thought AU was a thing that belongs to the goal of an agent, but the picture made it look as if it's part of the object ("how fertile is the soil?"). Is the idea here that the soil-AU is slang for "AU of goal 'plant stuff here'"?
I did interpret the firs...
Thoughts after reading and thinking about this post
The thing that's bugging me here is that Power and Instrumental convergence seem to be almost the same.
In particular, it seems like Power asks [a state]: "how good are you across all policies" and Instrumental Convergence asks: "for how many policies are you the best?". In an analogy to tournaments where policies are players, power cares about the average performance of a player across all tournaments, and instrumental convergence about how many first places that player got. In tha...
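To pin down the analogy for myself, here's a toy numerical sketch. This implements my tournament framing above, not the paper's formal definitions, and the score matrix is just random placeholder values.

```python
# Toy illustration (my own framing): score[state][policy] = how well that policy does
# starting from that state. "Power" of a state ~ its average score over policies;
# "instrumental convergence" toward a state ~ for how many policies it is the best.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random((5, 20))          # 5 states x 20 policies, hypothetical values

power = scores.mean(axis=1)           # average performance across all policies
best_state = scores.argmax(axis=0)    # which state each policy "likes" best
convergence = np.bincount(best_state, minlength=5) / scores.shape[1]

for s in range(5):
    print(f"state {s}: power={power[s]:.2f}, fraction of first places={convergence[s]:.2f}")
```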
Thoughts I have at this point in the sequence
In addition, current RL is episodic, so we should only expect that RL agents are goal-directed over the current episode and not in the long-term.
Is this true? Since ML generally doesn't choose an algorithm directly but runs a search over a parameter space, it seems speculative to assume that the resulting model, if it is a mesa-optimizer and goal-directed, only cares about its episode. If it learned that optimizing for X is good for reward, it seems at least conceivable that it won't understand that it shouldn't care about instances of X that appear in future episodes.
I might be confused here, but it seems to me that it's easy to interpret the arguments in this post as evidence in the wrong direction.
I see the following three questions as relevant:
1. How much sets human brains apart from other brains?
2. How much does the thing that humans have and animals don't matter?
3. How much does better architecture matter for AI?
Questions #2 and #3 seem positively correlated – if the thing that humans have is important, it's evidence that architectural changes matter a lot. However, holding #2 constant, #1 an...
Ex 5 (fixed version)
Let $T$ denote the triangle. For each $n$, construct a 2-d simplex graph with nodes in $T$, where the color of a point corresponds to the place in the disk that $f$ carries that point to, then choose $x_n$ to be a point within a trichromatic triangle in the graph. Then $(x_n)_{n \in \mathbb{N}}$ is a bounded sequence having a limit point $x^*$. Let $c$ be the center of the disc; suppose that $f(x^*) \neq c$. Then there is at least one region of the disc that $f(x^*)$ doesn't touch. Let $d$ be the distance to the furthest side, that is, let ...
I'm late, but I'm quite proud of this proof for #4:
Call the large triangle a graph and the triangles simply triangles. First, note that for any size, there is a graph where the top node is colored red, the remaining nodes on the right diagonal are colored green, and all nodes not on the right diagonal are colored blue. This graph meets the conditions, and has exactly one trichromatic triangle, namely the one at the top (no other triangle contains a red node). It is trivial to see that this graph can be changed into an arbitrary graph by re-col...
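As a sanity check on the parity claim (purely my own illustration; the grid encoding and parameters are arbitrary), here's a small script that starts from the canonical coloring above, recolors interior nodes at random, and confirms that the number of trichromatic triangles stays odd throughout:

```python
# Toy check of the parity argument: canonical coloring (top node 'R', rest of the
# right diagonal 'G', everything else 'B') has exactly one trichromatic triangle;
# recoloring interior nodes keeps the Sperner boundary intact, and the count stays odd.
import random

n = 6  # side length of the big triangle (rows 0..n)
nodes = [(i, j) for i in range(n + 1) for j in range(i + 1)]

def small_triangles():
    up = [((i, j), (i + 1, j), (i + 1, j + 1)) for i in range(n) for j in range(i + 1)]
    down = [((i, j), (i, j + 1), (i + 1, j + 1)) for i in range(1, n) for j in range(i)]
    return up + down

def is_interior(node):
    i, j = node
    return 0 < j < i and i < n  # not on the left edge, right edge, or bottom row

color = {}
for (i, j) in nodes:
    if (i, j) == (0, 0):
        color[(i, j)] = "R"
    elif j == i:
        color[(i, j)] = "G"
    else:
        color[(i, j)] = "B"

def trichromatic_count():
    return sum(len({color[a], color[b], color[c]}) == 3 for a, b, c in small_triangles())

assert trichromatic_count() == 1
for step in range(200):
    node = random.choice([v for v in nodes if is_interior(v)])
    color[node] = random.choice("RGB")    # interior recoloring preserves the boundary conditions
    assert trichromatic_count() % 2 == 1  # parity is invariant, as the proof claims

print("parity stayed odd through all recolorings")
```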
Ex 1
Let $n \in \mathbb{N}$ and let $x_0, \dots, x_n$ be the nodes of the path. Given an edge $e_k = (x_{k-1}, x_k)$, let $f_k$ denote the map that maps the color of the left to that of the right node. Given a $k$, let $F_k = f_k \circ \dots \circ f_1$. Let $b$ denote the color blue and $g$ the color green. Let $B_k$ be 1 if edge $e_k$ is bichromatic, and 0 otherwise. Then we need to show that $\sum_{k=1}^{n} B_k$ is odd. We'll show that $F_k$ swaps the two colors iff $\sum_{i=1}^{k} B_i$ is odd, which is a strictly stronger statement than the contrapositive.
For $k = 1$, the LHS is equivalent to $f_1$ swapping the colors, and indeed $f_1$ equals the swap map if $e_1$ is bichromatic, and o...
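And a quick brute-force check of the claim itself (my own sanity check, independent of the proof above): on a path whose endpoints are colored differently, the number of bichromatic edges always comes out odd.

```python
# Brute-force check: random blue/green paths with differently colored endpoints
# always have an odd number of bichromatic edges.
import random

for trial in range(1000):
    n = random.randint(1, 30)
    colors = ["blue"] + [random.choice(["blue", "green"]) for _ in range(n - 1)] + ["green"]
    bichromatic = sum(colors[i] != colors[i + 1] for i in range(len(colors) - 1))
    assert bichromatic % 2 == 1
print("all trials: odd number of bichromatic edges")
```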
Ex 4
Given a computable function $f : \mathbb{N} \times \mathbb{N} \to \mathbb{N}$, define a function $g$ by the rule $g(n) = f(n, n) + 1$. Then $g$ is computable, however $g \neq f(n, \cdot)$ for every $n$, because for any $n$, we have that $g(n) = f(n, n) + 1$ and $f(n, \cdot)(n) = f(n, n)$.
Ex 5:
We show the contrapositive: given a function $\mathrm{halt}$, we construct a surjective function from $\mathbb{N}$ onto the relevant set of computable functions as follows: enumerate all Turing machines, such that each $n \in \mathbb{N}$ corresponds to a string. Given an $n$, if the string does not decode to a Turing machine, set the output to some fixed default function. If it does, let $M_n$ denote that Turing machine. Let ...
Ex 1
Exercise 1: Let $f : S \to \{0,1\}^S$ and let $g : S \to \{0,1\}$ be defined by $g(x) = 1 - f(x)(x)$. Suppose that $g \in \operatorname{im}(f)$, then let $x_0$ be an element such that $f(x_0) = g$. Then by definition, $g(x_0) = 1 - f(x_0)(x_0)$ and $f(x_0)(x_0) = g(x_0)$. So $g(x_0) = 1 - g(x_0)$, a contradiction. Hence $g \notin \operatorname{im}(f)$, so that $f$ is not surjective.
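Not that it needs verification, but here's a tiny brute-force illustration of the same diagonal trick over a finite set (my own toy, with functions $S \to \{0,1\}$ standing in for subsets):

```python
# Finite sanity check of the diagonal argument: for every f mapping a small set S
# into functions S -> {0,1}, the flipped diagonal g is never in the image of f.
from itertools import product

S = [0, 1, 2]
for rows in product(product([0, 1], repeat=len(S)), repeat=len(S)):
    f = {x: dict(zip(S, rows[x])) for x in S}   # f(x) as a dict S -> {0,1}
    g = {x: 1 - f[x][x] for x in S}             # g(x) = 1 - f(x)(x)
    assert all(f[x] != g for x in S)            # g is not f(x) for any x
print("the diagonal function is never in the image")
```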
Ex 2
Exercise 2: Since $T$ is nonempty, it contains at least one element $t_0$. Let $h : T \to T$ be a function without a fixed point, then $h(t_0) \neq t_0$, so that $t_0$ and $h(t_0)$ are two different elements in $T$ (this is the only thing we shall use the function $h$ for).
Let $f : S \to T^S$ for $S$ nonempty. Suppose by contradiction that $f$ is surject...
I really don't think this is a reasonable measure for ability to do long term tasks, but I don't have the time or energy to fight this battle, so I'll just register my prediction that this paper is not going to age well.