Thanks for the comment!
We have indeed gotten feedback from multiple people that this part didn't feel detailed enough (although much more from very technical readers than from non-technical ones), and we are working on improving the arguments.
Thanks for the comment!
We'll correct the typo in the next patch/bug fix.
As for the more directly adversarial tone of the prologue, it is an explicit choice (and it contrasts with the rest of the document). For the moment, we're waiting to get more feedback on the doc to see if it really turns people off or not.
Yep, I think you're correct.
Will correct in the next minor update. Thanks!
Good catch, I think we are indeed mixing the sizes here.
As you say, the point still stands, but we will change it in the next minor update to either compare the same size or make the difference in size explicit.
Thanks for the comment!
We want to check the maths, but if you're indeed correct we will update the numbers (and reasoning) in the next minor version.
In many ways, this post is frustrating to read. It isn't straightforward, it needlessly insults people, and it mixes irrelevant details with the key ideas.
And yet, as with many of Eliezer's posts, its key points are right.
What this post does is uncover the main epistemological mistakes made by almost everyone trying their hand at figuring out timelines. Among others, there are:
I was mostly thinking of the efficiency assumption underlying almost all the scenarios. Critch assumes that a significant chunk of the economy always can and does make the most efficient change (everyone's job being replaced, automated regulation replacing banks when they can't move fast enough). This neglects many potential factors, like big economic actors not having to be efficient for a long time, backlash from customers, and in general all the factors that make economic actors and markets less than efficient.
I expect that most of these factors could be addressed with more work on the scenarios.
I consider this post one of the most important ever written on issues of timelines and AI doom scenarios. Not because it's perfect (some of its assumptions are unconvincing), but because it highlights a key aspect of AI Risk and the alignment problem which is so easy to miss coming from a rationalist mindset: it doesn't require an agent taking over the whole world. It is not about agency.
What RAAPs show instead is that even in a purely structural setting, where agency doesn't matter, these problems still crop up!
This insight was already present in Drexle...
What are you particularly interested in? I expect I could probably write it with a bit of rereading.
Thanks for your comment!
Actually, I don't think we really disagree. I might have just not made my position very clear in the original post.
The point of the post is not to say that these activities are not often valuable, but instead to point out that they can easily turn into "To do science, I need to always do [activity]". And what I'm getting from the examples is that in some cases, you actually don't need to do [activity]. There's a shortcut, or maybe you're just in a different phase of the problem.
Do you think there is still a disagreement after this clarification?
In a limited context, the first example that comes to me is high performers in competitive sports and games. Because if they truly only give a shit about winning (and the best generally do), they will throw away their legacy approaches when they find a new one, however much it pains them.
Thanks for the kind words!
I'm not aware of any such statistics, but I'm guessing that MATS organizers might have some.
I interpret Alex as arguing that the issue is not just two difficulties versus one, but an additional difficulty on top of the first. From this perspective, having two will be more of an issue than having one, because you have to address strictly more things.
This makes me wonder, though, whether there isn't just some sort of direction question underlying the debate here. Because if you assume the "difficulties" are only positive numbers, then if the difficulty for the direct instillation is some value and the one for the grader optimization is ...
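To make the additive version of that intuition concrete (my own toy formalization, not something from the post or the comments): if $d_{\text{instill}} \ge 0$ is the difficulty of directly instilling the values and $d_{\text{robust}} \ge 0$ is the extra difficulty of making the grader robust, then

$$d_{\text{grader}} = d_{\text{instill}} + d_{\text{robust}} \;\ge\; d_{\text{instill}},$$

so under that (strong) additivity assumption, grader optimization is at least as hard as direct instillation.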
Thanks for taking time to answer my questions in detail!
About your example for other failure modes
Is it meant to point at the ability of the actor to make the plan more confusing/harder to evaluate? Meaning that you're pointing at the ability for the actor to "obfuscate" its plan in order to get high reward?
If so, it's not clear to me why this is valuable for the actor to do. How is it supposed to get a better reward from confusion alone? If it has another agenda (making paperclips instead of diamonds, for example), then the obfuscation is clearly valuable to ...
It... seems to be a significant simplification of the problem? I mean, not needing all the interpretability and surgery tools would be a bigger improvement, but that's probably not something we can have.
Why do you think so? Currently I'm seeing a couple of massive difficulties here that don't generally or necessarily appear in alternative approaches:
The way you write this (especially the last sentence) makes me think that you see this attempt as being close to the only one that makes sense to you atm. Which makes me curious:
> This includes “What would this specific and superintelligent CEV-universe-simulation say about this plan?”.
> This doesn’t include (somehow) getting an AI which correctly computes what program would be recommended by AGI designers in an altruistic and superintelligent branch of humanity, and then the AI executes that program and shuts itself off without doing anything else.[5]
...But isn't 1 here at least as good as 2, since the CEV-universe-simulation could always compute X=[the program that would be recommended by AGI designers in an altruistic and
1. Intelligence => strong selection pressure => bad outcomes if the selection pressure is off target.
2. In the case of agents that are motivated to optimize evaluations of plans, this argument turns into "what if the agent tricks the evaluator".
3. In the case of agents that pursue values / shards instilled by some other process, this argument turns into "what if the values / shards are different from what we wanted".
4. To argue for one of these over the other, you need to compare these two arguments. However, this post is stating point 2 while ignoring point 3.
O...
A few questions to better understand your frame:
Thanks for the kind words!
- Are there any particular lessons/ideas from Refine that you expect (or hope) SERI MATS to incorporate?
I have shared some of my models related to epistemology and key questions with MATS organizers, and I think they're supposed to be integrated into one of the future programs. Mostly things regarding realizing the importance of productive mistakes in science (which naturally pushes back a bit against the mentoring aspect of MATS) and understanding how much less "clean" most scientific progress actually looks historically (with a basic read...
Thanks for the kind words and useful devil's advocate! (I'm expecting nothing less from you ;p)
...
- I expect it's unusual that [replace methodology-1 with methodology-2] will be a pareto improvement: other aspects of a researcher's work will tend to have adapted to fit methodology-1. So I don't think the creation of some initial friction is a bad sign. (also mirrors therapy - there's usually a [take things apart and better understand them] phase before any [put things back together in a more adaptive pattern] phase)
- It might be useful to predict this kind of thi
You probably know better than me, but I still have this intuition that seed-AI and FOOM have oriented the framing of the problem and the sort of question asked. I think people who came to agent foundations from different routes ended up asking slightly different questions.
I could totally be wrong though, thanks for making this weakness of my description explicit!
That's a great point!
There's definitely one big difference between how Scott defined it and how I'm using it, which you highlighted well. I think a better way of explaining my change is that in Scott's original example, the AI being flawed results, in some sense, in the alignment scheme (predict human values and do that) being flawed too.
I hadn't made the explicit claim in my head or in the post, but thanks to your comment, I think I'm claiming that the version I'm proposing generalizes one of the interesting parts of the original definition, and lets it be appl...
Yeah, I will be posting updates, and probably the participants themselves will post some notes and related ideas. Excited too about how it's going to pan out!
Thanks for the comment!
To be honest, I had more trouble classifying you, and now that you commented, I think you're right that I got the wrong label. My reasoning was that your agenda and directions look far more explicit and precise than Paul's or Evan's, which is definitely a more mosaic-y trait. On the other hand, there is the iteration that you describe, and I can clearly see a difference in terms of updating between you and, let's say, John or Eliezer.
My current model is that you're more palimpsest-y, but compared with most of us, you're surprisingly good at making your current iteration fit into a proper structure that you can make explicit and legible.
(Will update the post in consequence. ;) )
Nice post! Two things I particularly like are the explicit iteration (demonstrating by example how and why not to only use one framing), as well as the online learning framing.
The policy behaves in a competent yet undesirable way which gets low reward according to the original reward function.[2] This is an inner alignment failure, also known as goal misgeneralization. Langosco et al. (2022) provide a more formal definition and some examples of goal misgeneralization.
It seems like a core part of this initial framing relies on the operationalisation of ...
Well, isn't having multiple modules a precondition to something being modular? That seems like what's happening in your example: it has only one module, so it doesn't even make sense to apply John's criterion.
Thanks for the post! As always I broadly agree, but I have a bunch of nitpicks.
You can save yourself several years of time and effort by actively trying to identify the Hard Parts and focus on them, rather than avoid them. Otherwise, you'll end up burning several years on ideas which don't actually leave the field better off.
I agree that avoiding the Hard Parts is rarely productive, but you also don't address one relevant concern: what if the Hard Part is not merely Hard, but actually Impossible? In that case your advice can also be cashed out by tryin...
In what way is AF not open to new ideas? I think it is a bit scary to publish a post here, but that has more to do with it being very public, and less to do with anything specific about the AF. But if AF has a culture of being non welcoming of new ideas, maybe we should fix that?
It's not that easy to justify a post from a year ago, but I think that what I meant was that the Alignment Forum has a certain style of alignment research, and thus only reading it means you don't see stuff like CHAI research or other work that aims at alignment but isn't shared that much on the AF.
Are you pointing here at the fact that the AI training process and the world will form a complex system, such that it is hard to predict the outcomes of interventions, and hence the obvious first-order outcomes of interventions may not occur, or may be dominated by higher-order outcomes?
This points at the same thing IMO, although still in a confusing way. This assumption is basically that you can predict the result of an intervention without having to understand the internal mechanism in detail, because the latter is straightforward.
...Other possible names would
Thanks for this post, it's clear and insightful about RLHF.
From an alignment perspective, would you say that your work gives evidence that we should focus most of the energy on finding guarantees about the distribution that we're aiming for and debugging problems there, rather than thinking about the guarantees of the inference?
(I still expect that we want to understand the inference better and how it can break, but your post seems to push towards a lesser focus on that part)
...Another way to put it: coherence theorems assume the existence of some resources (e.g. money), and talk about systems which are pareto optimal with respect to those resources - e.g. systems which “don’t throw away money”. Implicitly, we're assuming that the system generally "wants" more resources (instrumentally, not necessarily as an end goal), and we derive the system's "preferences" over everything else (including things which are not resources) from that. The agent "prefers" X over Y if it expends resources to get from Y to X. If the agent reaches a wo
One thing that I had to remind myself while reading this post is that "far away" is across space-time, emphasis on time. So "far away" can be about optimizing the future.
Do you think that thinking explicitly about distributed systems (in the theoretical computer science sense) could be useful for having different frames or understanding of the tradeoffs? Or are you mostly using the idea of distributed systems as an intuitive frame without seeing much value in taking it too seriously?
Thanks for trying to make the issue more concrete and provide a way to discuss it!
One thing I want to point out is that you don't really need to put the non-constrained variables in the worst possible state; you just have the degree of freedom to set them to whatever helps you and is not too hard to reach.
Using sets, you have a set of worlds you want, and a proxy that is a superset of this (because you're not able to aim exactly at what you want). The problem is that the AI is optimizing to get into the superset with high guarantees and stay there, and so it'...
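To put that set picture in symbols (my own notation, just to make the degree-of-freedom point explicit): let $W$ be the set of worlds we actually want and $P \supseteq W$ the proxy set we manage to specify. The AI then optimizes something like

$$\max_{\text{policy}} \; \Pr[\text{outcome} \in P],$$

and every variable not pinned down by membership in $P$ is a free degree of freedom, so the optimum can sit anywhere in $P$, including in $P \setminus W$.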
Great post!
...For instance, if I’m planning a party, then the actions I take now are far away in time (and probably also space) from the party they’re optimizing. The “intermediate layers” might be snapshots of the universe-state at each time between the actions and the party. (... or they might be something else; there are usually many different ways to draw intermediate layers between far-apart things.)
This applies surprisingly well even in situations like reinforcement learning, where we don’t typically think of the objective as “far away” from the agent.
I approximately followed the technical discussion, and now I'm wondering what that would buy us if you are correct.
Are these correct, and what am I missing?
That's basically correct; the main immediate gain is that it makes it much easier to compute abstractions and compute using abstractions.
One additional piece is that it hints towards a probably-more-fundamental derivation of the theorems in which maximum entropy plays a more central role. The maximum entropy Telephone Theorem already does that, but the resampling + gKPD approach routes awkwardly through gKPD instead; there's probably a nice way to do it directly via constrained maximization of entropy. That, in turn, would probably yield stronger and simpler theorems.
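(For reference, the textbook constrained-maximum-entropy fact in the background here, just the standard statement rather than the stronger theorem being hinted at: maximizing the entropy $H(p)$ subject to constraints $\mathbb{E}_p[f_i(X)] = c_i$ and normalization yields an exponential-family form $p(x) \propto \exp\big(\sum_i \lambda_i f_i(x)\big)$, with the $\lambda_i$ chosen to satisfy the constraints.)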
Thanks for the post!
So if I understand correctly, your result is aiming at letting us estimate the dimensionality of the solution basins based on the gradients for the training examples at my local min/final model? Like, I just have to train my model, and then compute the Hessian/behavior gradients and I would (if everything you're looking at works as intended) have a lot of information about the dimensionality of the basin (and I guess the modularity is what you're aiming at here)? That would be pretty nice.
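If I'm reading this right, the kind of computation I have in mind is something like the sketch below (hypothetical helper name, and assuming the full Hessian at the minimum is small enough to form explicitly), where near-zero Hessian eigenvalues are counted as flat directions of the basin:

```python
import numpy as np

def estimate_basin_dimensionality(hessian: np.ndarray, tol: float = 1e-6) -> int:
    """Count near-zero eigenvalues of the loss Hessian at a local minimum.

    Directions with (near-)zero curvature are locally flat, so their count is
    a crude estimate of the dimensionality of the solution basin.
    """
    eigenvalues = np.linalg.eigvalsh(hessian)  # real eigenvalues, ascending order
    return int(np.sum(np.abs(eigenvalues) < tol))

# Toy check: a loss that completely ignores its last two parameters,
# so the basin around the minimum has two flat directions.
toy_hessian = np.diag([4.0, 1.0, 0.0, 0.0])
print(estimate_basin_dimensionality(toy_hessian))  # -> 2
```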
What other applications do you see for this resu...
...I like this pushback, and I'm a fan of productive mistakes. I'll have a think about how to rephrase to make that clearer. Maybe there's just a communication problem, where it's hard to tell the difference between people claiming "I have an insight (or proto-insight) which will plausibly be big enough to solve the alignment problem", versus "I have very little traction on the alignment problem but this direction is the best thing I've got". If the only effect of my post is to make a bunch of people say "oh yeah, I meant the second thing all along", then I'd
Thanks for the answer.
...One thing I would say in response to your comment, Adam, is that I don't usually see the message of your linked post as being incompatible with Richard's main point. I think one usually does have or does need productive mistakes that don't necessarily or obviously look like they are robust partial progress. But still, often when there actually is a breakthrough, I think it can be important to look for this "intuitively compelling" explanation. So one thing I have in mind is that I think it's usually good to be skeptical if a claimed b
I like that you're proposing an explicit heuristic inspired by the history of science for judging research directions and approaches, and I acknowledge that it leads to conclusions that are counterintuitive to my Richard-model (pushing for Agent Foundations, for example), so you're not just retrofitting your own conclusions AFAIK. I also like that you're applying it to object-level directions in alignment; that's something I'm working on at the moment for my own research, based on your pushback.
That being said, my prediction/retrodiction is that this is too ...
Sorry to make you work more, but happy to fill a much needed niche. ^^
Thanks! Yes, this is very much an experiment, and even if it fails, I expect it to be a productive mistake we can learn from. ;)
I disagree, so I'm curious: what are, for you, great examples of good research on alignment that is not done by x-risk-motivated people? (Not being dismissive, I'm genuinely curious, and discussing specifics sounds more promising than downvoting you to oblivion and not having a conversation at all.)
I have a framing of AI risk scenarios that I think is more general and more powerful than most available online, and that might be a good frame before going into examples. It's not posted yet (I'm finishing the sequence now) but I could send something to you if you're interested. ;)
(I will be running the Incubator at Conjecture)
The goal for the incubator is to foster new conceptual alignment research bets that could go on to become full-fledged research directions, either at Conjecture or at other places. We're thus planning to select mostly on the qualities we expect from a very promising independent conceptual researcher, that is, proactivity (see Paul Graham's Relentlessly Resourceful post) and some interest or excitement about not-fully-tapped streams of evidence (see this recent post).
Although experience with alignment cou...
Thanks for the answer!
Unfortunately, I don't have the time at the moment to answer in detail and have more of a conversation, as I'm fully focused on writing a long sequence about pushing for pluralism in alignment and extracting the core problem out of all the implementation details and additional assumptions. I plan on going back to analyzing timeline research in the future, and will probably give better answers then.
That being said, here are quick fire thoughts:
Typo addressed in the latest patch!