Fun to see this is now being called 'Holtman's neglected result'. I am currently knee-deep in a project to support EU AI policy making, so I have no time to follow the latest agent foundations discussions on this forum any more, and I never follow twitter, but briefly:
I can't fully fault the world for neglecting 'Corrigibility with Utility Preservation', because it is full of dense math.
I wrote two followup papers to 'Corrigibility with Utility Preservation' which present the same results with more accessible math. For these I am a bit more up...
I am currently doing AI policy almost full-time, but I ran across this invite to comment on the draft, so here goes.
On references:
Please add Armstrong to the author list in the reference to Soares 2015; this paper had 4 authors, and it was actually Armstrong who came up with indifference methods.
I see both 'Pettigrew 2019' and 'Pettigrew 2020' in the text. Are these the same reference?
More general:
Great that you compare the aggregating approach to two other approaches, but I feel your description of these approaches needs to be improved.
Soares et al 2015 define...
Note: This is presumably not novel, but I think it ought to be better-known.
This indeed ought to be better-known. The real question is: why is it not better-known?
What I notice in the EA/Rationalist based alignment world is that a lot of people seem to believe in the conventional wisdom that nobody knows how to build myopic agents, nobody knows how to build corrigible agents, etc.
When you then ask people why they believe that, you usually get some answer 'because MIRI', and then when you ask further it turns out these people did not actually read MI...
I think I agree with most of it: I agree that some form of optimization or policy search is needed to get many of the things you want to use AI for. But I guess you have to read the paper to find out the exact, subtle way in which the AGIs inside can be called non-consequentialist. To quote Wikipedia:
In ethical philosophy, consequentialism is a class of normative, teleological ethical theories that holds that the consequences of one's conduct are the ultimate basis for judgment about the rightness or wrongness of that conduct.
I do not talk about this in the ...
Hi Simon! You are welcome! By the way, I very much want to encourage you to be skeptical and make up your own mind.
I am guessing that by mentioning consequentialist, you are referring to this part of Yudkowsky's list of doom:
- Corrigibility is anti-natural to consequentialist reasoning
I am not sure how exactly Yudkowsky is defining the terms corrigibility or consequentialist here, but I might actually be agreeing with him on the above statement, depending on definitions.
I suggest you read my paper Counterfactual Planning in AGI Systems, because it is ...
Corrigibility with Utility Preservation is not the paper I would recommend you read first, see my comments included in the list I just posted.
To comment on your quick thoughts:
My later papers spell out the ML analog of the solution in 'Corrigibility with' more clearly.
On your question 'Do you have an account of why MIRI's supposed impossibility results (I think these exist?) are false?': Given how re-tellings in the blogosphere work to distort information into more extreme viewpoints, I am not surprised you believe these impossibility results of M
OK, below I will provide links to a few mathematically precise papers about AGI corrigibility solutions, with some comments. I do not have enough time to write short comments, so I wrote longer ones.
The list of links below is not a complete literature overview. I did a comprehensive literature search on corrigibility back in 2019, trying to find all mathematical papers of interest, but have not done so since.
I wrote some of the papers below, and have read all the rest of them. I am not linking to any papers I heard about but did not read (yet).
Math-based w...
As one of the few AI safety researchers who has done a lot of work on corrigibility, I have mixed feelings about this.
First, it is great to see an effort that tries to draw more people to working on corrigibility, because almost nobody is working on it. There are definitely parts of the solution space that could be explored much further.
What I also like is that you invite essays about the problem of making progress, instead of the problem of making more people aware that there is a problem.
However, the underlying idea that meaningful progress is possible by...
Consider two common alignment design patterns: [...] (2) Fixing a utility function and then argmaxing over all possible plans.
Wait: fixing a utility function and then argmaxing over all possible plans is not an alignment design pattern, it is the bog-standard operational definition of what an optimal-policy MDP agent should do. This is what Stuart Russell calls the 'standard model' of AI. This is an agent design pattern, not an alignment design pattern. To be an alignment design pattern in my book, you have to be adding something extra or doing s...
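For concreteness, here is a minimal LaTeX statement of that bog-standard operational definition: fix a reward function and argmax expected discounted reward over all policies. The notation below is illustrative and not taken from the post under discussion.

```latex
% The 'standard model' objective for an optimal-policy MDP agent:
% fix a reward function R, then pick the policy that maximizes
% expected discounted reward. Notation is illustrative only.
\[
\pi^{*} \;=\; \arg\max_{\pi}\;
  \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)
  \;\middle|\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t) \right]
\]
```

An alignment design pattern, in my book, has to add something extra on top of this objective.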
At t+7 years, I’ve still seen no explicit argument for robust AI collusion, yet tacit belief in this idea continues to channel attention away from a potential solution-space for AI safety problems, leaving something very much like a void.
I agree with you that this part of the AGI x-risk solution space, the part where one tries to design measures to lower the probability of collusion between AGIs, is very under-explored. However, I do not believe that the root cause of this lack of attention is a widely held 'tacit belief' that robust AGI collusion is ...
I continue to be surprised that people think a misaligned consequentialist intentionally trying to deceive human operators (as a power-seeking instrumental goal specifically) is the most probable failure mode.
Me too, but note how the analysis leading to the conclusion above is very open about excluding a huge number of failure modes leading to x-risk from consideration first:
...[...] our focus here was on the most popular writings on threat models in which the main source of risk is technical, rather than through poor decisions made by humans in how to u
But it seems like roughly the entire AI existential safety community is very excited about mechanistic interpretability and entirely dismissive of Stuart Russell's approach, and this seems bizarre.
Data point: I consider myself to be part of the AI x-risk community, but like you I am not very excited about mechanistic interpretability research in an x-risk context. I think there is somewhat of a filter bubble effect going on, where people who are more excited about interpretability post more on this forum.
Stuart Russell's approach is a broad agend...
Interesting. Some high-level thoughts:
When reading your definition of concept extrapolation as it appears here:
Concept extrapolation is the skill of taking a concept, a feature, or a goal that is defined in a narrow training situation... and extrapolating it safely to a more general situation.
this reads to me like the problem of Robustness to Distributional Change from Concrete Problems. This problem is also often known as out-of-distribution robustness, but note that Concrete Problems also considers solutions like the AI detecting that it is out-...
I generally agree with you on the principle Tackle the Hamming Problems, Don't Avoid Them.
That being said, some of the Hamming problems I see being avoided most on this forum, and in the AI alignment community, are:
- Do something that will affect policy in a positive way
- Pick some actual human values, and then hand-encode these values into open source software components that can go into AI reward functions
Having read the original post and many of the comments made so far, I'll add an epistemological observation that I have not seen others make quite so forcefully yet. From the original post:
Here, from my perspective, are some different true things that could be said, to contradict various false things that various different people seem to believe, about why AGI would be survivable [...]
I want to highlight that many of the different 'true things' on the long numbered list in the OP are in fact purely speculative claims about the probable nature of future...
I tried something like this much earlier with a single question, "Can you explain why it'd be hard to make an AGI that believed 222 + 222 = 555", and got enough pushback from people who didn't like the framing that I shelved the effort.
Interesting. I kind of like the framing here, but I have written a paper and sequence on the exact opposite question, on why it would be easy to make an AGI that believes 222+222=555, if you ever had AGI technology, and what you can do with that in terms of safety.
I can honestly say however that the project of writing t...
If you’re interested in conceptual work on agency and the intersection of complex systems and AI alignment
I'm interested in this agenda, and I have been working on this kind of thing myself, but I am not interested at this time in moving to Prague. I figure that you are looking for people interested in moving to Prague, but if you are issuing a broad call for collaborators in general, or are thinking about setting up a much more distributed group, please clarify.
A more technical question about your approach:
...What we’re looking for is more like a ver
There are some good thoughts here; I like this enough that I am going to comment on the effective strategies angle. You state that
The wider AI research community is an almost-optimal engine of apocalypse.
and
AI capabilities are advancing rapidly, while our attempts to align it proceed at a frustratingly slow pace.
I have to observe that, even though certain people on this forum definitely do believe the above two statements, this extreme level of pessimism is a minority opinion even here. Personally, I have been quite pleased with the pace...
To do this, we'll start by offering alignment as a service for more limited AIs.
Interesting move! It will be interesting to see how you end up packaging and positioning this alignment-as-a-service, compared to the services offered by more general IT consulting companies. Good luck!
I like your section 2. As you are asking for feedback on your plans in section 3:
By default I plan to continue looking into the directions in section 3.1, namely transparency of current models and its (potential) intersection with developments in deep learning theory. [...] Since this is what I plan to do, it'd be useful for me to know if it seems totally misguided
I see two ways to improve AI transparency in the face of opaque learned models:
- try to make the learned models less opaque -- this is your direction
- try to find ways to build more transp
instrumental convergence basically disappears for agents with utility functions over action-observation histories.
Wait, I am puzzled. Have you just completely changed your mind about the preconditions needed to get a power-seeking agent? The way the above reads is: just add some observation of actions to your realistic utility function, and your instrumental convergence problem is solved.
u-AOH (utility functions over action-observation histories): No IC
u-OH (utility functions over observation histories): Strong IC
There are many utility func...
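For readers not familiar with these two classes, here is a minimal sketch of the distinction as I read it; the notation is mine, not necessarily that of the post being discussed.

```latex
% u-AOH: utility may depend on the actions taken as well as the observations,
% so it can reward or penalize behaviour directly.
\[
u_{\mathrm{AOH}} : (A \times O)^{T} \to \mathbb{R},
\qquad u_{\mathrm{AOH}}(a_1, o_1, \ldots, a_T, o_T)
\]
% u-OH: utility depends only on the observation history, so actions matter
% only through the outcomes they produce -- which is where the claimed
% instrumental convergence pressure enters.
\[
u_{\mathrm{OH}} : O^{T} \to \mathbb{R},
\qquad u_{\mathrm{OH}}(o_1, \ldots, o_T)
\]
```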
Any thoughts on how to encourage a healthier dynamic?
I have no easy solution to offer, except for the obvious comment that the world is bigger than this forum.
My own stance is to treat the over-production of posts of type 1 above as just one of these inevitable things that will happen in the modern media landscape. There is some value to these posts, but after you have read about 20 of them, you can be pretty sure about how the next one will go.
So I try to focus my energy, as a reader and writer, on work of type 2 instead. I treat arXiv as my main pu...
I like your summary of the situation:
Most people doing MIRI-style research think most other people doing MIRI-style research are going about it all wrong.
This has also been my experience, at least on this forum. Much less so in academic-style papers about alignment. This has certain consequences for the problem of breaking into preparadigmatic alignment research.
Here are two ways to do preparadigmatic research:
Find something that is all wrong with somebody else's paradigm, then write about it.
Find a new useful paradigm and write about it.
MIR...
Thanks, yes that new phrasing is better.
Bit surprised that you can think of no researchers to associate with corrigibility. MIRI has written concrete work about it, and so has Christiano. It is a major theme in Bostrom's Superintelligence, and it also appears under the phrasing 'problem of control' in Russell's Human Compatible.
In terms of the history of ideas of the field, I think that corrigibility is a key motivating concept for newcomers to be aware of. See this writeup on corrigibility, which I wrote in part for newcomers, for links to broader ...
Thanks for posting this writeup, overall this reads very well, and it should be useful to newcomers. The threat models section is both compact and fairly comprehensive.
I have a comment on the agendas to build safe AGI section however. In the section you write
I focus on three agendas I consider most prominent
When I finished reading the list of three agendas in it, my first thought was 'Why does this not mention other prominent agendas like corrigibility? This list is hardly a bird's-eye overview mentioning all prominent agendas to build safe AI.'
D...
I am not aware of which part would be a Wittgensteinian quote. It has been a long time since I read Wittgenstein, and I read him in German. In any case, I remain confused about what you mean by 'circular'.
Wait, I was under the impression from the quoted text that you make a distinction between 'circular epistemology' and 'other types of epistemology that will hit a point where we can provide no justification at all'. That is, these other types are not circular because they are ultimately defined as a set of axioms, rewriting rules, and observational protocols for which no further justification is attempted.
So I think I am still struggling to see what flavour of philosophical thought you want people to engage with, when you mention 'circular'.
Mind you, I...
OK thanks for explaining. See my other recent reply for more thoughts about this.
It's possible for an article to be 'here's why these 3 reasons why we might think counterfactuals are circular are all false'
OK, so if I understand you correctly, you posit that there is something called 'circular epistemology'. You said in the earlier post you link to at the top:
...You might think that the circularity is a problem, but circular epistemology turns out to be viable (see Eliezer's Where Recursive Justification Hits Bottom). And while circular reasoning is less than ideal, if the comparative is eventually hitting a point where we can provide
Secondly, I guess my issue with most of the attempts to say "use system X for counterfactuals" is that people seem to think
??? I don't follow. Did you mean to write "use system X instead of using system Y, which calls itself a definition of counterfactuals"?
Some people have asked why the Bayesian Network approach suggested by Judea Pearl is insufficient (including in the comments below). This approach is firmly rooted in Causal Decision Theory (CDT). Most people on LW have rejected CDT because of its failure to handle Newcomb's Problem.
I'll make a counter-claim and say that most people on LW in fact have rejected the use of Newcomb's Problem as a test that will say something useful about decision theories.
That being said, there is definitely a sub-community which believes deeply in the relevance of Newcomb...
I am not entirely sure what you mean by your aside on 'unsupervised predictive reward'. Is this a reference to unsupervised reinforcement learning? Or to a human supervisor controlling a live reward signal?
But on your observation that 'the learned model will compensate for distortions': this sounds familiar. Here is a discussion.
It is common for people on this forum to use a teleological intuition pump which makes them fear that such compensation for distortions must somehow always happen, or is very likely to happen...
I'm especially interested in the analogy between AI alignment and democracy.
This is indeed a productive analogy. Sadly, on this forum, this analogy is used in 99% of the cases to generate AI alignment failure mode stories, whereas I am much more interested in using it to generate useful ideas about AI safety mechanisms.
You may be interested in my recent paper 'demanding and designing', just announced here, where I show how to do the useful idea generating thing. I transfer some insights about aligning powerful governments and companies to the probl...
Very open to feedback.
I have not read the whole comment section, so this feedback may already have been given, but...
I believe the “indifference” method represented some progress towards a corrigible utility-function-over-future-states, but not a complete solution (apparently it’s not reflectively consistent—i.e., if the off-switch breaks, it wouldn't fix it), and the problem remains open to this day.
Opinions differ on how open the problem remains. Definitely, going by the recent Yudkowsky sequences, MIRI still acts as if the problem is open, and ...
Thanks!
I can think of several reasons why different people on this forum might facepalm when seeing the diagram with the green boxes. Not sure if I can correctly guess yours. Feel free to expand.
But there are definitely lots of people saying that AI alignment is part of the field of AI, and it sounds like you're disagreeing with that as well - is that right?
Yes I am disagreeing, of sorts. I would disagree with the statement that
| AI alignment research is a subset of AI research
but I agree with the statement that
| Some parts of AI alignment research a...
OK, here is the promised list of formal-methods-based work which has advanced the field of AGI safety. These are specific examples to back up my earlier meta-level remarks, where I said that formal methods are and have been useful for AGI safety.
To go back to the Wikipedia quote:
The use of formal methods for software and hardware design is motivated by the expectation that, as in other engineering disciplines, performing appropriate mathematical analysis can contribute to the reliability and robustness of a design.
There are plenty of people in CS wh...
"we're going to design through proof the most advanced and complex program that ever existed, orders of magnitudes more complex than the most complex current systems".
I disagree that AI code is orders of magnitude more complex than, say, the code in a web browser or a modern compiler: in fact, quite the opposite applies. Most modern ML algorithms are very short pieces of code. If you are willing to use somewhat abstract math where you do not write out all the hyperparameter values, you can specify everything that goes on in a deep learning algorithm in...
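To illustrate the point about compactness, here is a minimal sketch of a complete supervised deep learning training loop in PyTorch. All names and hyperparameter values are arbitrary placeholders; the point is only that the whole learning algorithm fits in a handful of lines, not that this is production code.

```python
import torch
import torch.nn as nn

# A small fully connected network; layer sizes are arbitrary placeholders.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # arbitrary hyperparameters
loss_fn = nn.CrossEntropyLoss()

def train(data_loader, epochs=10):
    """Plain SGD training: this short loop is the entire learning algorithm."""
    for _ in range(epochs):
        for x, y in data_loader:         # x: input batch, y: integer class labels
            optimizer.zero_grad()        # reset gradients from the previous step
            loss = loss_fn(model(x), y)  # forward pass and loss computation
            loss.backward()              # backpropagation
            optimizer.step()             # gradient descent update
```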
I think I will need to write two separate replies to address the points you raise. First, a more meta-level and autobiographical reply.
When it comes to formal methods, I too have a good idea of what I am talking about. I did a lot of formal methods work in the 1990s at Eindhoven University, which at the time was one of the places in the Netherlands where formal methods research was strongest.
I also see that you're saying formal methods help you design stuff. That sounds very wrong to me.
When I talk about using formal methods to design stuff, I am very serious. B...
Nice overview, but what surprises me is that you are not in fact describing the main epistemic strategy used in engineering. What you say about engineering is:
[...] tinkering is a staple strategy in engineering, before we know how to solve the problem things reliably. Think about curing cancer or building the internet: you try the best solutions you can think of, see how they fail, correct the issues or find a new approach, and iterate.
You fail to mention the more important engineering strategy: one which does not rely on tinkering, but instead on log...
Yes, by calling this site a "community of philosophers", I roughly mean that at the level of the entire community, nobody can agree that progress is being made. There is no mechanism for creating a community-wide agreement that a problem has been solved.
You give three specific examples of progress above. From his recent writings, it is clear that Yudkowsky does not believe, like you do, that any contributions posted on this site in the last few years have made any meaningful progress towards solving alignment. You and I may agree that some or all of th...
If you want your AGI not to manipulate humans, you can have it (1) unable to manipulate humans, (2) not motivated to manipulate humans.
It seems you are mostly considering solution (1) above, except in the last paragraph, where you consider a somewhat special version of (2). I believe that Eliezer is saying in the discussion above that solution (1) is a lot more difficult than some people proposing it seem to think. He could be nicer about how he says it, but overall I tend to agree.
In my own alignment work I am mostly looking at solution (2), specificall...
Of course there has been lots of 'obvious output of this kind from the rest of the "AI safety" field'. It is not like people have been quiet about convergent instrumental goals. So what is going on here?
I read this line (and the paragraphs that follow it) as Eliezer talking smack about all other AI safety researchers. As observed by Paul here:
Eliezer frequently talks smack about how the real world is surprising to fools like Paul
I liked some of Eliezer's earlier, more thoughtful writing better.
Becoming explicit about what different properties you mean and which metrics they score well on resolves the disagreement.
Indeed this can resolve disagreement among a small sub-group of active participants. This is an important tool if you want to make any progress.
but maybe I'm missing something?
The point I was trying to make is about what is achievable for the entire community, not what is achievable for a small sub-group of committed participants. The community of people who post on this site have absolutely no mechanism for agreeing among them...
I don't feel like joining this, but I do wish you luck, and I'll make a high level observation about methodology.
I do believe there’s a legitimate, albeit small, chance that we solve corrigibility or find its “core” this week. Nonetheless, I think it’s of great value to be able to make actual progress on alignment issues as a community and to figure out how to do that better.
I don't consider myself to be a rationalist or EA, but I do post on this web site, so I guess this makes me part of the community of people who post on this site. My high level ...
I'm aware that a lot of AI Safety research is already of questionable quality. So my question is: how can I determine as quickly as possible whether I'm cut out for this?
My key comment here is that, to be an independent researcher, you will have to rely day-by-day on your own judgement on what has quality and what is valuable. So do you think you have such judgement and could develop it further?
To find out, I suggest you skim a bunch of alignment research agendas, or research overviews like this one, and then read some abstracts/first pages of papers m...
I haven't read your papers but your proposal seems like it would scale up until the point when the AGI looks at itself. [...] Do you address this in the articles?
Yes I address this, see for example the part about The possibility of learned self-knowledge in the sequence. I show there that any RL agent, even a non-AGI, will always have the latent ability to 'look at itself' and create a machine-learned model of its compute core internals.
What is done with this latent ability is up to the designer. The key thing here is that you have a choice as a designe...
Update: I just recalled that Eliezer and MIRI often talk about Dutch booking when they talk about coherence. So not being susceptible to Dutch booking may be the type of coherence Eliezer has in mind here.
When it comes to Dutch booking as a coherence criterion, I need to repeat again the observation I made below:
...In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don't see that fact mentioned often on this forum, so I will
As nobody else has mentioned it yet in this comment section: AI Safety Support is a resource hub specifically set up to help people get into the alignment research field.
I am a 50 year old independent alignment researcher. I guess I need to mention for the record that I never read the sequences, and do not plan to. The piece of Yudkowsky writing that I'd recommend everybody interested in alignment read is Corrigibility. But in general: read broadly, and also beyond this forum.
I agree with John's observation that some parts of alignment research are ...
10.2.4 says L wouldn't be S if it were calculated from projected actions instead of given actions. How so? Mightn't it predict the given actions correctly?
Not sure if a short answer will help, so I will write a long one.
In 10.2.4 I talk about the possibility of an unwanted learned predictive function that makes its predictions without using the given-action argument. This is possible, for example, by using the remaining inputs together with a (learned) model of the compute core to predict the action: a viable learned function could then be defined in terms of these projected actions rather than the given ones. This could make pr...
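To make the failure mode concrete, here is a toy sketch in Python of the general phenomenon, not the exact construction in the paper: two predictors, one of which ignores the given-action argument and substitutes its own projection of what the compute core will do. All function names are hypothetical.

```python
# Toy illustration only; names are hypothetical, not from the paper.

def predict_outcome(obs, action, world_model):
    """Intended predictor: uses the action it is actually given."""
    return world_model(obs, action)

def predict_outcome_via_self_model(obs, action, world_model, core_model):
    """Unwanted variant: ignores `action` and instead plugs in its own
    (learned) projection of what the compute core will output."""
    projected_action = core_model(obs)          # learned model of the compute core
    return world_model(obs, projected_action)   # the given `action` is never used
```

Roughly speaking, the two variants can agree on the actions the core actually takes, and only come apart when the function is evaluated on other actions.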
Thanks for reading my paper! For the record I agree with some but not all points in your summary.
My later paper 'AGI Agent Safety by Iteratively Improving the Utility Function' also uses the simulation environment with the > and < actions and I believe it explains the nature of the simulation a bit better by interpreting the setup more explicitly as a two-player game. By the way the > and < are supposed to be symbols representing arrows → and ← for 'push # to later in time' and 'pull # earlier in time'.