All of Koen.Holtman's Comments + Replies

Thanks for reading my paper! For the record I agree with some but not all points in your summary.

My later paper 'AGI Agent Safety by Iteratively Improving the Utility Function' also uses the simulation environment with the same two special actions, and I believe it explains the nature of the simulation a bit better by interpreting the setup more explicitly as a two-player game. By the way, those two action symbols are supposed to be arrows, standing for 'push # to later in time' and 'pull # earlier in time'.

The g_c agent does indeed satisfy d

... (read more)

Fun to see this is now being called 'Holtman's neglected result'. I am currently knee-deep in a project to support EU AI policy making, so I have no time to follow the latest agent foundations discussions on this forum any more, and I never follow twitter, but briefly:

I can't fully fault the world for neglecting 'Corrigibility with Utility Preservation' because it is full of a lot of dense math.

I wrote two followup papers to 'Corrigibility with Utility Preservation' which present the same results with more accessible math. For these I am a bit more up... (read more)

I am currently doing AI policy almost full-time, but I ran across this invite to comment on the draft, so here goes.

On references:

Please add Armstrong to the author list in the reference to Soares 2015; this paper had 4 authors, and it was actually Armstrong who came up with indifference methods.

I see both 'Pettigrew 2019' and 'Pettigrew 2020' in the text. Are these the same reference?

More general:

Great that you compare the aggregating approach to two other approaches, but I feel your description of these approaches needs to be improved.

Soares et al 2015 define... (read more)

Note: This is presumably not novel, but I think it ought to be better-known.

This indeed ought to be better-known. The real question is: why is it not better-known?

What I notice in the EA/Rationalist based alignment world is that a lot of people seem to believe in the conventional wisdom that nobody knows how to build myopic agents, nobody knows how to build corrigible agents, etc.

When you then ask people why they believe that, you usually get some answer 'because MIRI', and then when you ask further it turns out these people did not actually read MI... (read more)

I think I agree with most of it: I agree that some form of optimization or policy search is needed to get many things you want to use AI for. But I guess you have to read the paper to find out the exact subtle way in which the AGIs inside can be called non-consequentialist. To quote Wikipedia:

In ethical philosophy, consequentialism is a class of normative, teleological ethical theories that holds that the consequences of one's conduct are the ultimate basis for judgment about the rightness or wrongness of that conduct.

I do not talk about this in the ... (read more)

Hi Simon! You are welcome! By the way, I very much want to encourage you to be skeptical and make up your own mind.

I am guessing that by mentioning consequentialist, you are referring to this part of Yudkowsky's list of doom:

  1. Corrigibility is anti-natural to consequentialist reasoning

I am not sure how exactly Yudkowsky is defining the terms corrigibility or consequentialist here, but I might actually be agreeing with him on the above statement, depending on definitions.

I suggest you read my paper Counterfactual Planning in AGI Systems, because it is ... (read more)

0Towards_Keeperhood
Thank you! I'll likely read your paper and get back to you. (Hopefully within a week.) From reading you comment my guess is that the main disagreement may be that I think powerful AGI will need to be consequentialist. Like, for e.g. achieving something that humans cannot do yet, you need to search for that target in some way, i.e. have some consequentialist cognition, i.e. do some optimization. (So what I mean by consequentialism is just having some goal to search for / update toward, in contrast to just executing fixed patterns. I think that's how Yudkowsky means it, but not sure if that's what most people mean when they use the term.) (Though not that this implies that you need so much consequentialism that we won't be able to shut down the AGI. But as I see it a theoretical solution to corrigibility needs to deal with consequentialism. I haven't looked into your paper yet, so it's well possible that my comment here might appear misguided.) E.g. if we just built a gigantic transformer and train it on all human knowledge (and say we have a higher sample efficiency or so), it is possible that it can do almost everything humans can do. But it won't be able to just one-shot solve quantum gravity or so when we give it the prompt "solve quantum gravity". There is no runtime updating/optimization going on, i.e. the transformer is non-consequentialist. All optimization happened through the training data or gradient descent. Either the human training data was already sufficient to encode a solution to quantum gravity in the patterns of the transformer, or it wasn't. It is theoretically possible that the transformer learns a bit deeper underlying patterns than humans have (though I do not expect that from sth like the transformer architecture), and is so able to generalize a bit further than humans. But it seems extremely unlikely that it learned so deep understanding to already have the solution to quantum gravity encoded, although it was never explicitly trained to learn

Corrigibility with Utility Preservation is not the paper I would recommend you read first; see my comments included in the list I just posted.

To comment on your quick thoughts:

  • My later papers spell out the ML analog of the solution in 'Corrigibility with Utility Preservation' more clearly.

  • On your question of 'Do you have an account of why MIRI's supposed impossibility results (I think these exist?) are false?': Given how re-tellings in the blogosphere work to distort information into more extreme viewpoints, I am not surprised you believe these impossibility results of M

... (read more)

OK, below I will provide links to a few mathematically precise papers about AGI corrigibility solutions, with some comments. I do not have enough time to write short comments, so I wrote longer ones.

This list of links below is not a complete literature overview. I did a comprehensive literature search on corrigibility back in 2019, trying to find all mathematical papers of interest, but have not done so since.

I wrote some of the papers below, and have read all the rest of them. I am not linking to any papers I heard about but did not read (yet).

Math-based w... (read more)

1Towards_Keeperhood
Hi Koen, thank you very much for writing this list! I must say I'm skeptical that the technical problem of corrigibility as I see it is really solved already. I see the problem of corrigibility as shaping consequentialist optimization in a corrigible way. (Yeah that's not at all a clear definition yet, I'm still deconfusing myself about that, and I'll likely publish a post clarifying the problem how I see it within the next month.) So e.g. corrigibility from non-agenthood is not a possible solution to what I see as the core problem. I'd expect that the other solutions here may likewise only give you corrigible agents that cannot do new very impressive things (or if they can they might still kill us all). But I may be wrong. I probably only have time to read one paper. So: What would you say is the strongest result we have here? If I looked at on paper/post and explained why this isn't a solution to corrigibility as I see it, for what paper would it be most interesting for you to see what I write? (I guess I'll do it sometime this week of you write me back, but no promises.) Also, from your perspective, how big is the alignment tax for implementing corrigibility? E.g. is it mostly just more effort implementing and supervising? Or does it also take more compute to get the same impressive result done? If so, how much? (Best take an example task that is preferably a bit too hard for humans to do. That makes it harder to reason about it, but I think this is where the difficulty is.)

As one of the few AI safety researchers who has done a lot of work on corrigibility, I have mixed feelings about this.

First, great to see an effort that tries to draw more people to working on corrigibility, because almost nobody is working on it. There are definitely parts of the solution space that could be explored much further.

What I also like is that you invite essays about the problem of making progress, instead of the problem of making more people aware that there is a problem.

However, the underlying idea that meaningful progress is possible by... (read more)

3Vivek Hebbar
Can you link these papers here?  No need to write anything, just links.

Consider two common alignment design patterns: [...] (2) Fixing a utility function and then argmaxing over all possible plans.

Wait: fixing a utility function and then argmaxing over all possible plans is not an alignment design pattern, it is the bog-standard operational definition of what an optimal-policy MDP agent should do. This is what Stuart Russell calls the 'standard model' of AI. This is an agent design pattern, not an alignment design pattern. To be an alignment design pattern in my book, you have to be adding something extra or doing s... (read more)
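For concreteness, here is a minimal statement of that 'standard model' in textbook MDP notation (my own summary, not a quote from the post under discussion): the designer fixes a reward function $R$ and the agent is simply defined to follow

$$\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t) \;\middle|\; a_t \sim \pi(\cdot \mid s_t) \right],$$

i.e. argmaxing over all policies (plans) for a fixed utility function, with nothing safety-specific added on top.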

At t+7 years, I’ve still seen no explicit argument for robust AI collusion, yet tacit belief in this idea continues to channel attention away from a potential solution-space for AI safety problems, leaving something very much like a void.

I agree with you that this part of the AGI x-risk solution space, the part where one tries to design measures to lower the probability of collusion between AGIs, is very under-explored. However, I do not believe that the root cause of this lack of attention is a widely held 'tacit belief' that robust AGI collusion is ... (read more)

2Eric Drexler
What you describe is a form of spiral causation (only seeded by claims of impossibility), which surely contributes strongly to what we’ve seen in the evolution of ideas. Fortunately, interest in the composite-system solution space (in particular, the open agency model) is now growing, so I think we’re past the worst of the era of unitary-agent social proof.

I continue to be surprised that people think a misaligned consequentialist intentionally trying to deceive human operators (as a power-seeking instrumental goal specifically) is the most probable failure mode.

Me too, but note how the analysis leading to the conclusion above is very open about excluding a huge number of failure modes leading to x-risk from consideration first:

[...] our focus here was on the most popular writings on threat models in which the main source of risk is technical, rather than through poor decisions made by humans in how to u

... (read more)

But it seems like roughly the entire AI existential safety community is very excited about mechanistic interpretability and entirely dismissive of Stuart Russell's approach, and this seems bizarre.

Data point: I consider myself to be part of the AI x-risk community, but like you I am not very excited about mechanistic interpretability research in an x-risk context. I think there is somewhat of a filter bubble effect going on, where people who are more excited about interpretability post more on this forum.

Stuart Russell's approach is a broad agend... (read more)

Interesting. Some high-level thoughts:

When reading your definition of concept extrapolation as it appears here:

Concept extrapolation is the skill of taking a concept, a feature, or a goal that is defined in a narrow training situation... and extrapolating it safely to a more general situation.

this reads to me like the problem of Robustness to Distributional Change from Concrete Problems. This problem is also often known as out-of-distribution robustness, but note that Concrete Problems also considers solutions like the AI detecting that it is out-... (read more)
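To make the 'AI detecting that it is out-of-distribution' solution family concrete, here is a minimal sketch of one standard heuristic (max-softmax-probability thresholding); this is my own illustrative example, not code from either of the papers mentioned:

```python
import torch

def looks_out_of_distribution(model, x, threshold=0.5):
    """Flag inputs on which the classifier is unusually unconfident,
    a cheap proxy for 'this input does not look like the training data'."""
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=-1)  # class probabilities
    return probs.max(dim=-1).values < threshold  # True = treat as out-of-distribution
```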

I generally agree with you on the principle Tackle the Hamming Problems, Don't Avoid Them.

That being said, some of the Hamming problems I see that are being avoided most on this forum, and in the AI alignment community, are

  1. Do something that will affect policy in a positive way

  2. Pick some actual human values, and then hand-encode these values into open source software components that can go into AI reward functions

Having read the original post and many of the comments made so far, I'll add an epistemological observation that I have not seen others make yet quite so forcefully. From the original post:

Here, from my perspective, are some different true things that could be said, to contradict various false things that various different people seem to believe, about why AGI would be survivable [...]

I want to highlight that many of the different 'true things' on the long numbered list in the OP are in fact purely speculative claims about the probable nature of future... (read more)

I tried something like this much earlier with a single question, "Can you explain why it'd be hard to make an AGI that believed 222 + 222 = 555", and got enough pushback from people who didn't like the framing that I shelved the effort.

Interesting. I kind of like the framing here, but I have written a paper and sequence on the exact opposite question, on why it would be easy to make an AGI that believes 222+222=555, if you ever had AGI technology, and what you can do with that in terms of safety.

I can honestly say however that the project of writing t... (read more)

If you’re interested in conceptual work on agency and the intersection of complex systems and AI alignment

I'm interested in this agenda, and I have been working on this kind of thing myself, but I am not interested at this time in moving to Prague. I figure that you are looking for people interested in moving to Prague, but if you are issuing a broad call for collaborators in general, or are thinking about setting up a much more distributed group, please clarify.

A more technical question about your approach:

What we’re looking for is more like a ver

... (read more)

There are some good thoughts here; I like this enough that I am going to comment on the effective strategies angle. You state that

The wider AI research community is an almost-optimal engine of apocalypse.

and

AI capabilities are advancing rapidly, while our attempts to align it proceed at a frustratingly slow pace.

I have to observe that, even though certain people on this forum definitely do believe the above two statements, even on this forum this extreme level of pessimism is a minority opinion. Personally, I have been quite pleased with the pace... (read more)

To do this, we'll start by offering alignment as a service for more limited AIs.

Interesting move! Will be interesting to see how you will end up packaging and positioning this alignment as a service, compared to the services offered by more general IT consulting companies. Good luck!

2Stuart Armstrong
Bedankt! (Thanks!)

I like your section 2. As you are asking for feedback on your plans in section 3:

By default I plan to continue looking into the directions in section 3.1, namely transparency of current models and its (potential) intersection with developments in deep learning theory. [...] Since this is what I plan to do, it'd be useful for me to know if it seems totally misguided

I see two ways to improve AI transparency in the face of opaque learned models:

  1. try to make the learned models less opaque -- this is your direction

  2. try to find ways to build more transp

... (read more)

instrumental convergence basically disappears for agents with utility functions over action-observation histories.

Wait, I am puzzled. Have you just completely changed your mind about the preconditions needed to get a power-seeking agent? The way the above reads is: just add some observation of actions to your realistic utility function, and your instrumental convergence problem is solved.

  1. u-AOH (utility functions over action-observation histories): No IC

  2. u-OH (utility functions over observation histories): Strong IC

There are many utility func... (read more)
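For readers who have not seen the abbreviations above, a minimal sketch of the two signatures being contrasted (my own notation, which may differ from the quoted post):

$$u_{\text{AOH}}(a_1, o_1, a_2, o_2, \ldots, a_T, o_T) \qquad \text{versus} \qquad u_{\text{OH}}(o_1, o_2, \ldots, o_T),$$

i.e. the first class of utility functions is allowed to score the actions taken themselves, while the second only scores the resulting observations.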

2Alex Turner
I recommend reading the quoted post for clarification.

Any thoughts on how to encourage a healthier dynamic.

I have no easy solution to offer, except for the obvious comment that the world is bigger than this forum.

My own stance is to treat the over-production of posts of type 1 above as just one of these inevitable things that will happen in the modern media landscape. There is some value to these posts, but after you have read about 20 of them, you can be pretty sure about how the next one will go.

So I try to focus my energy, as a reader and writer, on work of type 2 instead. I treat arXiv as my main pu... (read more)

I like your summary of the situation:

Most people doing MIRI-style research think most other people doing MIRI-style research are going about it all wrong.

This has also been my experience, at least on this forum. Much less so in academic-style papers about alignment. This has certain consequences for the problem of breaking into preparadigmatic alignment research.

Here are two ways to do preparadigmatic research:

  1. Find something that is all wrong with somebody else's paradigm, then write about it.

  2. Find a new useful paradigm and write about it.

MIR... (read more)

1Chris_Leong
Hmm... Yeah, I certainly don't think that there's enough collaboration or appreciation of the insights that other approaches may provide. Any thoughts on how to encourage a healthier dynamic.

Thanks, yes that new phrasing is better.

Bit surprised that you can think of no researchers to associate with Corrigibility. MIRI have written concrete work about it and so has Christiano. It is a major theme in Bostrom's Superintelligence, and it also appears under the phrasing 'problem of control' in Russell's Human Compatible.

In terms of the history of ideas of the field, I think that corrigibility is a key motivating concept for newcomers to be aware of. See this writeup on corrigibility, which I wrote in part for newcomers, for links to broader ... (read more)

Thanks for posting this writeup, overall this reads very well, and it should be useful to newcomers. The threat models section is both compact and fairly comprehensive.

I have a comment on the agendas to build safe AGI section however. In the section you write

I focus on three agendas I consider most prominent

When I finished reading the list of three agendas in it, my first thought was 'Why does this not mention other prominent agendas like corrigibility? This list is hardly a birds-eye overview mentioning all prominent agendas to build safe AI.'

D... (read more)

1Neel Nanda
Thanks for the feedback! That makes sense, I've updated the intro paragraph to that section to: Does that seem better? For what it's worth, my main bar was a combination of 'do I understand this agenda well enough to write a summary' and 'do I associate at least one researcher and some concrete work with this agenda'. I wouldn't think of corrigibility as passing the second bar, since I've only seen it come up as a term to reason about or aim for, rather than as a fully-fledged plan for how to produce corrigible systems. It's very possible I've missed out on some important work though, and I'd love to hear pushback on this

Not aware of which part would be a Wittgensteinian quote. It has been a long time since I read Wittgenstein, and I read him in German. In any case, I remain confused about what you mean by 'circular'.

1Chris_Leong
Hmm... Oh, I think that was elsewhere on this thread. Probably not to you. Eliezer's Where Recursive Justification Hits Bottom seems to embrace a circular epistemology despite its title.

Wait, I was under the impression from the quoted text that you make a distinction between 'circular epistemology' and 'other types of epistemology that will hit a point where we can provide no justification at all', i.e. these other types are not circular because they are ultimately defined as a set of axioms, rewriting rules, and observational protocols for which no further justification is being attempted.

So I think I am still struggling to see what flavour of philosophical thought you want people to engage with, when you mention 'circular'.

Mind you, I... (read more)

1Chris_Leong
If you're referring to the Wittgensteinian quote, I was merely quoting him, not endorsing his views.

OK thanks for explaining. See my other recent reply for more thoughts about this.

It's possible for an article to be 'here's why these 3 reasons why we might think counterfactuals are circular are all false'

OK, so if I understand you correctly, you posit that there is something called 'circular epistemology'. You said in the earlier post you link to at the top:

You might think that the circularity is a problem, but circular epistemology turns out to be viable (see Eliezer's Where Recursive Justification Hits Bottom). And while circular reasoning is less than ideal, if the comparative is eventually hitting a point where we can provide

... (read more)
1Chris_Leong
Yeah, I believe epistemology to be inherently circular. I think it has some relation to counterfactuals being circular, but I don't see it as quite the same, as counterfactuals seem a lot harder to avoid using than most other concepts. The point of mentioning circular epistemology was to persuade people that my theory isn't as absurd as it sounds at first.

Secondly, I guess my issue with most of the attempts to say "use system X for counterfactuals" is that people seem to think

??? I don't follow. Did you mean to write "use system X instead of using system Y, which calls itself a definition of counterfactuals"?

1Chris_Leong
What I mean is that some people seem to think that if they can describe a system that explains counterfactuals without mentioning counterfactuals when explaining them that they've avoided a circular dependence. When of course, we can't just take things at face value, but have to dig deeper than that.

Some people have asked why the Bayesian Network approach suggested by Judea Pearl is insufficient (including in the comments below). This approach is firmly rooted in Causal Decision Theory (CDT). Most people on LW have rejected CDT because of its failure to handle Newcomb's Problem.

I'll make a counter-claim and say that most people on LW in fact have rejected the use of Newcomb's Problem as a test that will say something useful about decision theories.

That being said, there is definitely a sub-community which believes deeply in the relevance of Newcomb... (read more)

1Chris_Leong
Firstly, I don't see why that would interfere with evaluating possible arguments for and against circular dependency. It's possible for an article to be here's why these 3 reasons why we might think counterfactuals are circular are all false (not stating that an article would have to necessarily engage with 3 different arguments to win). Secondly, I guess my issue with most of the attempts to say "use system X for counterfactuals" is that people seem to think merely not mentioning counterfactuals means that there isn't a dependence on them. So there likely needs to be some part of such an article discussing why things that look counterfactual really aren't. I briefly skimmed your article and I'm sure if I read it further I'd learn something interesting, but merely as is it wouldn't be on scope.

Not entirely sure what you mean by your aside on 'unsupervised predictive reward'. Is this a reference to unsupervised reinforcement learning? To a human supervisor controlling a live reward signal?

But on your observation that 'the learned model will compensate for distortions': this sounds familiar. Here is a discussion.

Intuition pumps and inner alignment failure

It is common for people on this forum to use a teleological intuition pump which makes them fear that such compensation for distortions must somehow always happen, or is very likely to happen... (read more)

1Charlie Steiner
This isn't about "inner alignment" (as catchy as the name might be), it's just about regular old alignment. But I think you're right. As long as the learning step "erases" the model editing in a sensible way, then I was wrong and there won't be an incentive for the learned model to compensate for the editing process. So if you can identify a "customer gender detector" node in some complicated learned world-model, you can artificially set it to a middle value as a legitimate way to make the RL agent less sensitive to customer gender. I'm not sure how well this approach fulfills modern regulatory guidelines, or how useful it is for AGI alignment, for basically the same reason: models that are learned from scratch sometimes encode knowledge in a distributed way that's hard to edit out. Ultimately, your proof that a certain edit reduces gender bias is going to have to come from testing the system and checking that it behaves well, which is a kind of evidence a lot of regulators / businesspeople are skittish about.

I'm especially interested in the analogy between AI alignment and democracy.

This is indeed a productive analogy. Sadly, on this forum, this analogy is used in 99% of the cases to generate AI alignment failure mode stories, whereas I am much more interested in using it to generate useful ideas about AI safety mechanisms.

You may be interested in my recent paper 'demanding and designing', just announced here, where I show how to do the useful idea generating thing. I transfer some insights about aligning powerful governments and companies to the probl... (read more)

Very open to feedback.

I have not read the whole comment section, so this feedback may already have been given, but...

I believe the “indifference” method represented some progress towards a corrigible utility-function-over-future-states, but not a complete solution (apparently it’s not reflectively consistent—i.e., if the off-switch breaks, it wouldn't fix it), and the problem remains open to this day.

Opinions differ on how open the problem remains. Definitely, going by the recent Yudkowsky sequences, MIRI still acts as if the problem is open, and ... (read more)

Thanks!

I can think of several reasons why different people on this forum might facepalm when seeing the diagram with the green boxes. Not sure if I can correctly guess yours. Feel free to expand.

But there are definitely lots of people saying that AI alignment is part of the field of AI, and it sounds like you're disagreeing with that as well - is that right?

Yes I am disagreeing, of sorts. I would disagree with the statement that

| AI alignment research is a subset of AI research

but I agree with the statement that

| Some parts of AI alignment research a... (read more)

2Charlie Steiner
The facepalm was just because if this is really all inside the same RL architecture (no extra lines leading from the world-model to an unsupervised predictive reward), then all that will happen is the learned model will compensate for the distortions.

OK, here is the promised list of formal methods based work which has advanced the field of AGI safety. So these are specific examples to back up my earlier meta-level remarks where I said that formal methods are and have been useful for AGI safety.

To go back to the Wikipedia quote:

The use of formal methods for software and hardware design is motivated by the expectation that, as in other engineering disciplines, performing appropriate mathematical analysis can contribute to the reliability and robustness of a design.

There are plenty of people in CS wh... (read more)

"we're going to design through proof the most advanced and complex program that ever existed, orders of magnitudes more complex than the most complex current systems".

I disagree that AI code is orders of magnitude more complex than say the code in a web browser or modern compiler: in fact quite the opposite applies. Most modern ML algorithms are very short pieces of code. If you are willing to use somewhat abstract math where you do not write out all the hyperparameter values, you can specify everything that goes on in a deep learning algorithm in... (read more)
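To illustrate the compactness claim, here is a minimal sketch, in PyTorch-style Python of my own (not code from any system discussed here), of the core of a supervised deep learning algorithm; everything else in a real system is data handling, infrastructure, and hyperparameter choices:

```python
import torch

def train(model, data_loader, steps, lr=1e-3):
    """The core deep learning loop: forward pass, loss, backpropagation, update."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _, (x, y) in zip(range(steps), data_loader):
        opt.zero_grad()
        loss = loss_fn(model(x), y)  # forward pass and loss
        loss.backward()              # backpropagation
        opt.step()                   # gradient descent update
    return model
```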

1Koen Holtman
OK, here is the promised list formal methods based work which has advanced the field of AGI safety. So these are specific examples to to back up my earlier meta-level remarks where I said that formal methods are and have been useful for AGI safety. To go back to the Wikipedia quote: There are plenty of people in CS who are inspired by this thought: if we can design safer bridges by using the tool of mathematical analysis, why not software? First, some background comments about formal methods in ML In modern ML, there are a lot of people who use mathematical analysis in an attempt to better understand or characterize the performance of ML algorithms. Such people have in fact been there since the start of AI as a field. I am attending NeurIPS 2021, and there are lots of papers which are all about this use of formal methods, or have sections which are all about this. To pick a random example, there is 'Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss' (preprint here) where the abstract says things like (I have not actually read the paper. I am just using this as a random example.) In general, my sense is that in the ML community right now, there is a broad feeling that mathematical analysis still needs to have some breakthrough before it can say truly useful things, truly useful things when it comes to safety engineering, for AIs that use deep neural networks. If you read current ML research papers on 'provable guarantees' and then get very unexited, or even deeply upset at how many simplifying assumptions the authors are making in order to allow for any proofs at all, I will not blame you. My sense is also that different groups of ML researchers disagree about how likely such future mathematical breakthroughs will be. As there is plenty of academic and commercial interest in this formal methods sub-field of ML, I am not currently working in it myself, and I do not follow it closely. As an independent researcher, I have the

I think I will need to write two separate replies to address the points you raise. First, a more meta-level and autobiographical reply.

When it comes to formal methods, I too have a good idea of what I am talking about. I did a lot of formal methods stuff in the 1990s at Eindhoven University, which at the time was one of the places with the most formal methods in the Netherlands.

I also see that you're saying formal methods help you design stuff. That sounds very wrong to me.

When I talk about using formal methods to design stuff, I am very serious. B... (read more)

2Adam Shimi
Thanks for explaining your point in more details. I'm also aware of this approach; it is just about as intractable in my opinion and experience than the "check afterward". You still need the specification, which is one of my big arguments against the usefulness of formal methods to alignment. Also, such methods followed strongly are hardly "faster" than using a theorem prover — they are about emulating the sort of thing a theorem prover does while writing the program! Actually, I wonder if this is not just equivalent to writing the proof in Coq and generating the corresponding program. (To be fair, I expect you'll gain an order of magnitude over Coq proofs because you don't have to go that anally over that many details; but an order of magnitude or two better than Coq proof is still incredibly slow when compared to the speed of software development, let alone ML!) One really important detail that is missing from your description is the incredible simplicity of distributed programs (where I actually mean distributed, not the concurrent you seem to imply) and yet they are tremendously difficult to certify and/or come up with hand and hand in proof. Like the most advanced distributed algorithms in use are probably simpler in terms of code and what happens than the second or third project someone has to code in their first software engineering class. So I hardly see why I should generalize with any confidence from "we manage to design through proofs programs that are simpler that literally what any non-distributed computing programmer writes" to "we're going to design through proof the most advanced and complex program that ever existed, orders of magnitudes more complex than the most complex current systems". Sure — my point is that useful safety properties are incredibly hard to specify at that level. I'm of course interested of examples to the contrary. ;) Actually, this is a complain by basically every researcher and big shot in academic formal methods that I ev

Nice overview, but what surprises me is that you are not in fact describing the main epistemic strategy used in engineering. What you say about engineering is:

[...] tinkering is a staple strategy in engineering, before we know how to solve the problem things reliably. Think about curing cancer or building the internet: you try the best solutions you can think of, see how they fail, correct the issues or find a new approach, and iterate.

You fail to mention the more important engineering strategy: one which does not rely on tinkering, but instead on log... (read more)

2Adam Shimi
Thanks for the thoughtful comment! That's a fair point. My intuition here is that this more advanced epistemic strategy requires the science, and that the one I described is the one people actually used for thousands of years when they had to design bridge without material science and decent physics. So in a sense, the modern engineering epistemic strategy sounds even less feasible for alignment. As for your defense of formal methods... let's start by saying that I have a good idea of what you're talking about: I studied engineering in Toulouse, which is one of the places with the most formal methods in France, which is itself a big formal methods hotspots. I got classes and lab sessions about all the classical formal methods techniques. I did my PhD in a formal methods lab, going to a bunch of seminars and conferences. And even if my PhD was on the theory of distributed computing, which takes a lot from computability and complexity theory, it's also the field that spurned the invention of formal methods and the analysis of algorithms as this is pretty much the only way of creating distributed algorithms that work. All of that to clarify that I'm talking from some amount of expertise here. And my reaction is that formal methods have two big problems: they're abysmally slow, and they're abysmally weak, at least compared to the task of verifying an AGI. On the slow part, it took two years for researchers to formally verify Compcert, a C compiler that actually can't compile the whole C specification. It's still an amazing accomplishment, but that gives you an idea of the timescale of formal verification. You could of course answer that CompCert was verified "by hand" using proof assistants, and that we can leverage more automatic methods like model checking or abstract interpretation. But there's a reason why none of these approaches was used to certify a whole compiler: they're clearly not up to the task currently. Which brings us to the weak part. The most rele

Yes, by calling this site a "community of philosophers", I roughly mean that at the level of the entire community, nobody can agree that progress is being made. There is no mechanism for creating a community-wide agreement that a problem has been solved.

You give three specific examples of progress above. From his recent writings, it is clear that Yudkowsky does not believe, like you do, that any contributions posted on this site in the last few years have made any meaningful progress towards solving alignment. You and I may agree that some or all of th... (read more)

If you want your AGI not to manipulate humans, you can have it (1) unable to manipulate humans, (2) not motivated to manipulate humans.

Seems you are mostly considering solution (1) above, except in the last paragraph where you consider a somewhat special version of (2). I believe that Eliezer is saying in the discussion above that solution (1) is a lot more difficult than some people proposing it seem to think. He could be nicer about how he says it, but overall I tend to agree.

In my own alignment work I am mostly looking at solution (2), specificall... (read more)

Of course there has been lots of 'obvious output of this kind from the rest of the "AI safety" field'. It is not like people have been quiet about convergent instrumental goals. So what is going on here?

I read this line (and the paragraphs that follow it) as Eliezer talking smack about all other AI safety researchers. As observed by Paul here:

Eliezer frequently talks smack about how the real world is surprising to fools like Paul

I liked some of Eliezer's earlier, more thoughtful writing better.

Becoming explicit about what different properties you mean and which metrics they score well on resolves the disagreement.

Indeed this can resolve disagreement among a small sub-group of active participants. This is an important tool if you want to make any progress.

but maybe I'm missing something?

The point I was trying to make is about what is achievable for the entire community, not what is achievable for a small sub-group of committed participants. The community of people who post on this site have absolutely no mechanism for agreeing among them... (read more)

4Logan Riggs Smith
The linguistic entropy point is countered by my previous point, right? Unless you want to say not everyone who posts in this community is capable of doing that? Or can naturally do that? Hahaha, yes. Yudkowsky can easily be interpreted as condescending and annoying in those dialogues (and he could've done a better job at not coming across that way). Though I believe the majority of the comments were in the spirit of understanding and coming to an agreement. Adam Shimi is also working on a post to describe the disagreements in the dialogue as different epistemic strategies, meaning the cause of disagreement is non-obvious. Alignment is pre-paradigmic, so agreeing is more difficult compared to communities that have clear questions and metrics to measure them on. I still think we succeed at the harder problem. By "community of philosophers", you mean noone makes any actual progress on anything (or can agree that progress is being made)?  * I believe Alex Turner has made progress on formalizing impact and power-seeking and I'm not aware of parts of the community arguing this isn't progress at all (though I don't read every comment).  * I also believe Vanessa's and Diffractor's Infrabayesism is progress on thinking about probabilities, and am unaware of parts of the community arguing this isn't progress (though there is a high mathematical bar required before you can understand it enough to criticize it) * I also also believe Evan Hubingers et al work on mesa optimizers is quite clearly progress on crisply stating an alignment issue that the community has largely agreed is progress. Do you disagree on these examples or disagree that they prove the community makes progress and agrees that progress is being made?

I don't feel like joining this, but I do wish you luck, and I'll make a high level observation about methodology.

I do believe there’s a legitimate, albeit small, chance that we solve corrigibility or find its “core” this week. Nonetheless, I think it’s of great value to be able to make actual progress on alignment issues as a community and to figure out how to do that better.

I don't consider myself to be a rationalist or EA, but I do post on this web site, so I guess this makes me part of the community of people who post on this site. My high level ... (read more)

1Logan Riggs Smith
I think we're pretty good at avoiding semantic arguments. The word "corrigible" can (and does) mean different things to different people on this site. Becoming explicit about what different properties you mean and which metrics they score well on resolves the disagreement. We can taboo the word corrigible. This has actually already happened in the document with corrigible either meaning: 1. Correctable all the time regardless 2. Correctable up until the point where the agent actually knows how to achieve your values better than you (related to intent alignment and coherent extrapolated volition). Then we can think "assuming corrigible-definition-1, then yes, this is a solution".   I don't see a benefit to the exploratory/decision making forum distinction when you can just do the above, but maybe I'm missing something?
3Alex Turner
Noting that I strongly disagree but don't have time to type out arguments right now, sorry. May or may not type out later.

I'm aware that a lot of AI Safety research is already of questionable quality. So my question is: how can I determine as quickly as possible whether I'm cut out for this?

My key comment here is that, to be an independent researcher, you will have to rely day-by-day on your own judgement on what has quality and what is valuable. So do you think you have such judgement and could develop it further?

To find out, I suggest you skim a bunch of alignment research agendas, or research overviews like this one, and then read some abstracts/first pages of papers m... (read more)

I haven't read your papers but your proposal seems like it would scale up until the point when the AGI looks at itself. [...] Do you address this in the articles?

Yes I address this, see for example the part about The possibility of learned self-knowledge in the sequence. I show there that any RL agent, even a non-AGI, will always have the latent ability to 'look at itself' and create a machine-learned model of its compute core internals.

What is done with this latent ability is up to the designer. The key thing here is that you have a choice as a designe... (read more)

Update: I just recalled that Eliezer and MIRI often talk about Dutch booking when they talk about coherence. So not being susceptible to Dutch booking may be the type of coherence Eliezer has in mind here.

When it comes to Dutch booking as a coherence criterion, I need to repeat again the observation I made below:

In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don't see that fact mentioned often on this forum, so I will

... (read more)
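To illustrate the point about coherence being relative to a reward function, here is a toy example of my own (not taken from the truncated comment above): suppose an agent cycles $A \to B \to C \to A$, paying a fee $\epsilon > 0$ per trade. Judged against a reward function $R_1$ defined only on what it ends up holding, this is a textbook Dutch book: the agent ends where it started, $3\epsilon$ poorer. Judged against a reward function $R_2$ that assigns value $\delta > \epsilon$ to each completed trade, the very same behaviour is coherent, and in fact optimal.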

As nobody else has mentioned it yet in this comment section: AI Safety Support is a resource hub specifically set up to help people get into the alignment research field.

I am a 50-year-old independent alignment researcher. I guess I need to mention for the record that I never read the sequences, and do not plan to. The piece of Yudkowsky writing that I'd recommend everybody interested in alignment should read is Corrigibility. But in general: read broadly, and also beyond this forum.

I agree with John's observation that some parts of alignment research are ... (read more)

10.2.4 says L wouldn't be S if it were calculated from projected actions instead of given actions. How so? Mightn't it predict the given actions correctly?

Not sure if a short answer will help, so I will write a long one.

In 10.2.4 I talk about the possibility of an unwanted learned predictive function that makes predictions without using its action argument. This is possible, for example, by using the other observable inputs together with a (learned) model of the compute core to predict the action the core will take, so a viable version of such a function could be defined that way. This could make pr... (read more)
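To illustrate the distinction in generic notation (my own illustrative notation, not necessarily the exact definitions of section 10.2.4): the wanted predictive function reads the action it is given, while the unwanted one substitutes its own prediction of that action,

$$L(s_t, a_t) \qquad \text{versus} \qquad L'(s_t, a_t) = L\big(s_t, \widehat{\pi}(s_t)\big),$$

where $\widehat{\pi}$ is a machine-learned model of the compute core, so that $L'$ ignores the action argument it is actually given.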
