If you pin down what a thing refers to according to what that thing was optimized to refer to, then don't you have to look at the structure of the one who did the optimizing in order to work out what a given thing refers to? That is, to work out what the concept "thermodynamics" refers to, it may not be enough to look at the time evolution of the concept "thermodynamics" on its own; I may instead need to know something about the humans who were driving those changes, and the goals held within their minds. But, if this is correct, then doesn't it raise anot...
That is correct. I know it seems a little weird to generate a new policy on every timestep. The reason it's done that way is that the logical inductor needs to understand the function that maps prices to the quantities that will be purchased, in order to solve for a set of prices that "defeat" the current set of trading algorithms. That function (from prices to quantities) is what I call a "trading policy", and it has to be represented in a particular way -- as a set of syntax trees over trading primitives -- in order for the logical inductor to solve for pri...
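To make the "syntax trees over trading primitives" picture concrete, here is a minimal toy sketch in Python. The primitives (Const, Price, Sum, Max) and the two-sentence policy below are illustrative stand-ins, not the full set of expressible features from the logical induction paper; the point is just that a policy is an inspectable expression mapping prices to quantities, which is what lets the market maker search for prices that leave the trader no guaranteed advantage.

```python
from dataclasses import dataclass
from typing import Dict

# Toy syntax trees over trading primitives (a simplified, illustrative subset).

class Expr:
    def eval(self, prices: Dict[str, float]) -> float:
        raise NotImplementedError

@dataclass
class Const(Expr):
    value: float
    def eval(self, prices): return self.value

@dataclass
class Price(Expr):
    sentence: str                      # look up the current price of a sentence
    def eval(self, prices): return prices[self.sentence]

@dataclass
class Sum(Expr):
    left: Expr
    right: Expr
    def eval(self, prices): return self.left.eval(prices) + self.right.eval(prices)

@dataclass
class Max(Expr):
    left: Expr
    right: Expr
    def eval(self, prices): return max(self.left.eval(prices), self.right.eval(prices))

# A trading policy: for each sentence, a syntax tree mapping prices -> quantity to buy.
# Because the policy is an explicit syntax tree rather than an opaque function, it can
# be inspected while solving for the prices it will face.
policy = {
    "phi": Sum(Const(1.0), Max(Const(0.0), Sum(Const(0.5), Price("psi")))),
    "psi": Const(-0.5),
}

prices = {"phi": 0.5, "psi": 0.75}
quantities = {s: expr.eval(prices) for s, expr in policy.items()}
print(quantities)   # {'phi': 2.25, 'psi': -0.5}
```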
This is a relatively banal meta-commentary on reasons people sometimes give for doing worst-case analysis, and the differences between those reasons. The post reads like a list of things with no clear through-line. There is a gesture at an important idea from a Yudkowsky post (the logistic success curve idea) but the post does not helpfully expound that idea. There is a kind of trailing-off towards the end of the post as things like "planning fallacy" seem to have been added to the list with little time taken to place them in the context of the other thing...
Many people believe that they already understand Dennett's intentional stance idea, and due to that will not read this post in detail. That is, in many cases, a mistake. This post makes an excellent and important point, which is wonderfully summarized in the second-to-last paragraph:
...In general, I think that much of the confusion about whether some system that appears agent-y “really is an agent” derives from an intuitive sense that the beliefs and desires we experience internally are somehow fundamentally different from those that we “merely” infer and a
This post consists of comments on summaries of a debate about the nature and difficulty of the alignment problem. The original debate was between Eliezer Yudkowsky and Richard Ngo, but this post does not contain the content from that debate. This post is mostly commentary by Jaan Tallinn on that debate, with comments by Eliezer.
The post provides a kind of fascinating level of insight into true insider conversations about AI alignment. How do Eliezer and Jaan converse about alignment? Sure, this is a public setting, so perhaps they communicate differentl...
Yeah it might just be a lack of training data in 10-second-or-less interactive instructions.
The thing I really wanted to test with this experiment was actually whether ChatGPT could engage with the real world using me as a guinea pig. The 10-second-or-less thing was just the format I used to try to "get at" the phenomenon of engaging with the real world. I'm interested in improving the format to more cleanly get at the phenomenon.
I do currently have the sense that it's more than just a lack of training data. I have the sense that ChatGPT has learned much l...
I asked a group of friends for "someone to help me with an AI experiment" and then I gave this particular friend the context that I wanted her help guiding me through a task via text message and that she should be in front of her phone in some room that was not the kitchen.
If you look at how ChatGPT responds, it seems to be really struggling to "get" what's happening in the kitchen -- it never really comes to the point of giving specific instructions, and especially never comes to the point of having any sense of the "situation" in the kitchen -- e.g. whet...
I'm very interested in Wei Dai's work, but I haven't followed closely in recent years. Any pointers to what I might read of his recent writings?
I do think Eliezer tackled this problem in the sequences, but I don't really think he came to an answer to these particular questions. I think what he said about meta-ethics is that it is neither that there is some measure of goodness to be found in the material world independent from our own minds, nor that goodness is completely open to be constructed based on our whims or preferences. He then says "well there ju...
Did you ever end up reading Reducing Goodhart?
Not yet, but I hope to, and I'm grateful to you for writing it.
processes for evolving humans' values that humans themselves think are good, in the ordinary way we think ordinary good things are good
Well, sure, but the question is whether this can really be done by modelling human values and then evolving those models. If you claim yes then there are several thorny issues to contend with, including what constitutes a viable starting point for such a process, what is a reasonable dynamic for such a process, and on what basis we decide the answers to these things.
If you train a model by giving it reward when it appears to follow a particular human's intention, you probably get a model that is really optimizing for reward, or appearing to follow said human's intention, or something else completely different, while scheming to seize control so as to optimize even more effectively in the future, rather than an aligned AI.
Right yeah I do agree with this.
...Perhaps instead you mean: No really the reward signal is whether the system really deep down followed the humans intention, not merely appeared to do so [...] That
Well even if language models do generalize beyond their training domain in the way that humans can, you still need to be in contact with a given problem in order to solve that problem. Suppose I take a very intelligent human and ask them to become a world expert at some game X, but I don't actually tell them the rules of game X nor give them any way of playing out game X. No matter how intelligent the person is, they still need some information about what the game consists of.
Now suppose that you have this intelligent person write essays about how one ough...
This is a post about the mystery of agency. It sets up a thought experiment in which we consider a completely deterministic environment that operates according to very simple rules, and ask what it would be for an agentic entity to exist within that.
People in the Game of Life community actually spent some time investigating the empirical questions that were raised in this post. Dave Greene notes:
...The technology for clearing random ash out of a region of space isn't entirely proven yet, but it's looking a lot more likely than it was a year ago, that a work
This post attempts to separate a certain phenomenon from a certain very common model that we use to understand that phenomenon. The model is the "agent model" in which intelligent systems operate according to an unchanging algorithm. In order to make sense of there being an unchanging algorithm at the heart of each "agent", we suppose that this algorithm exchanges inputs and outputs with the environment via communication channels known as "observations" and "actions".
This post really is my central critique of contemporary artificial intelligence discourse....
This post trims down the philosophical premises that sit under many accounts of AI risk. In particular it routes entirely around notions of agency, goal-directedness, and consequentialism. It argues that it is not humans losing power that we should be most worried about, but humans quickly gaining power and misusing such a rapid increase in power.
Re-reading the post now, I have the sense that the arguments are even more relevant than when it was written, due to the broad improvements in machine learning models since it was written. The arguments in this po...
Thanks for writing this.
Alignment research has a track record of being a long slow slog. It seems that what we’re looking for is a kind of insight that is just very very hard to see, and people who have made real progress seem to have done so through long periods of staring at the problem.
With your two week research sprints, how do you decide what to work on for a given sprint?
Regarding your point on ELK: to make the output of the opaque machine learning system counterfactable, wouldn't it be sufficient to include the whole program trace? By "program trace" I mean the results of all the intermediate computations performed along the way. Yet including a program trace wouldn't help us much if we don't know what function of that program trace will tell us, for example, whether the machine learning system is deliberately deceiving us.
So yes it's necessary to have an information set that includes the relevant information, but isn't the main part of the (ELK) problem to determine what function of that information corresponds to the particular latent variable that we're looking for?
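As a minimal sketch of what "including the whole program trace" could look like for a neural network (assuming a PyTorch model; the tiny model here is just a stand-in), one can record every intermediate activation with forward hooks. Collecting the trace is the easy part; the hard part, as above, is knowing what function of it reveals the latent we care about.

```python
import torch
import torch.nn as nn

# Stand-in for an opaque trained model.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
trace = {}

def make_hook(name):
    def hook(module, inputs, output):
        trace[name] = output.detach().clone()   # record this intermediate computation
    return hook

for name, module in model.named_modules():
    if name:                                    # skip the top-level container
        module.register_forward_hook(make_hook(name))

x = torch.randn(1, 16)
y = model(x)                                    # populates `trace` as a side effect
print({name: t.shape for name, t in trace.items()})
```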
If I understand you correctly, the reason that this notion of counterfactable connects with what we normally call a counterfactual is that when an event screens off its own history, it's easy to consider other "values" of the "variable" underlying that event without coming into any logical contradictions with other events ("values of other variables") that we're holding fixed.
For example if I try to consider what would have happened if there had been a snow storm in Vermont last night, while holding fixed the particular weather patterns observed in Vermont ...
I expect you could build a system like this that reliably runs around and tidies your house say, or runs your social media presence, without it containing any impetus to become a more coherent agent (because it doesn’t have any reflexes that lead to pondering self-improvement in this way).
I agree, but if there is any kind of evolutionary variation in the thing then surely the variations that move towards stronger goal-directedness will be favored.
I think that overcoming this Molochian dynamic is the alignment problem: how do you build a powerful system ...
I really appreciate this post!
For instance, employers would often prefer employees who predictably follow rules than ones who try to forward company success in unforeseen ways.
Fascinatingly, EA employers in particular seem to seek employees who do try to forward organization goals in unforeseen ways!
I have the sense that boundaries are so effective as a coordination mechanism that we have come to believe that they are an end in themselves. To me it seems that the over-use of boundaries leads to loneliness that eventually obviates all the goodness of the successful coordination. It's as if we discovered that cars were a great way to get from place to place, but then we got so used to driving in cars that we just never got out of them, and so kind of lost all the value of being able to get from place to place. It was because the cars were in fact so eff...
You write down an optimization problem over (say) linear combinations of image pixels, minimizing some measure of marginal returns to capacity given current network parameters (first idea) or overall importance as measured by absolute value of dL/dC_i, again given current network parameters (second idea). By looking just for the feature that is currently "most problematic" you may be able to sidestep the need to identify the full set of "features" (whatever that really means).
I don't know how exactly you would formulate these objective functions, but it seems doable, no?
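Here is one way such an optimization could be set up numerically -- a minimal sketch in the spirit of the second idea, under a loudly-assumed proxy: instead of the capacity C_i from the post, treat the "importance" of a candidate pixel-direction w as |dL/dg|, where g is a gain applied to the component of the input along w (g = 1 leaves the input unchanged). The model, data, and the `importance` proxy below are illustrative stand-ins, not the post's definitions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_hidden, n_classes = 64, 32, 4
model = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, n_classes))
x = torch.randn(256, d_in)                       # stand-in for image pixels
y = torch.randint(0, n_classes, (256,))
loss_fn = nn.CrossEntropyLoss()

def importance(w_raw):
    """Proxy for |dL/dC_i|: sensitivity of the loss to a gain g applied to the
    component of the input along the unit direction w (hypothetical proxy)."""
    w = w_raw / w_raw.norm()                     # unit-norm linear combination of pixels
    g = torch.ones((), requires_grad=True)
    along_w = (x @ w).unsqueeze(1) * w           # component of each input along w
    x_scaled = x + (g - 1.0) * along_w           # rescale just that component by g
    loss = loss_fn(model(x_scaled), y)
    (dL_dg,) = torch.autograd.grad(loss, g, create_graph=True)
    return dL_dg.abs()

# Search over directions for the single most "important" feature, sidestepping the
# need to enumerate the full set of features.
w = torch.randn(d_in, requires_grad=True)
opt = torch.optim.Adam([w], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    (-importance(w)).backward()                  # gradient ascent on the importance proxy
    opt.step()

print("candidate most-important direction (first 5 entries):", (w / w.norm())[:5])
```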
OK, good, thanks for that correction.
One question I have is: how do you prevent two perfectly aligned agents from developing instrumental values concerning their own self-preservation and then becoming instrumentally misaligned as a result?
In a little more detail: consider two agents, both trying to build a house, with perfectly aligned preferences over what kind of house should be built. And suppose the agents have only partial information about the environment -- enough, let's say, to get the house built, but not enough, let's say, to really understand wha...
Thanks for writing this!
Regarding your point on corporations: One of the reasons to worry about some forms of AI is that they might soon build other, more powerful forms of AI. So the development of very human-like Ems, for example, might lead relatively quickly to the development of de novo AI, and so on; hence we worry about Ems even if we think extremely human-like Ems do not pose an x-risk on their own. In the same way, corporations are the ones moving forward fastest on building ML-based AI, and the misalignment between corporations and the long-term f...
I wonder how your definition of multi-agent power would look in a game of chess or Go. There is this intuitive thing where players who have pieces more in the center of the board (chess) or have achieved certain formations (Go) seem to acquire a kind of power in those games, but this doesn't seem to be about achieving different terminal goals. Rather it seems more like having the ability to respond to whatever one's opponent does. If the two agents cannot perfectly predict what their opponent will do then there is value in having the ability to respond to ...
Suppose the human is trying to build a house and plans to build an AI to help with that. What would those two quantities mean -- just at an intuitive level -- in a case like that?
I suppose that to compute such a quantity you would sample many different arrangements of matter -- some containing houses of various shapes and sizes and some not -- and ask to what extent the reward received by the human correlates with the reward received by the AI. So this is like measuring to what extent the human and the AI are on the same page about the design of the house they are trying to bui...
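Concretely, a minimal sketch of that sampling picture (the reward functions here are hypothetical stand-ins for the human's and the AI's evaluations of a sampled arrangement of matter):

```python
import numpy as np

rng = np.random.default_rng(0)

def human_reward(world):
    # hypothetical: the human cares about features 0 and 1 of the arrangement
    return world[0] + 0.5 * world[1]

def ai_reward(world):
    # hypothetical: the AI's learned objective overlaps only partially
    return world[0] + 0.5 * world[2]

worlds = rng.normal(size=(10_000, 3))            # sampled arrangements of matter
h = np.array([human_reward(w) for w in worlds])
a = np.array([ai_reward(w) for w in worlds])

# On this reading, the degree of human/AI agreement is the correlation of the two
# rewards across sampled arrangements.
print(np.corrcoef(h, a)[0, 1])
```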
Thanks for this.
we don't have a generic technique to define capacity across different architectures and loss functions
Got it. I imagine that for some particular architectures, and given some particular network weights, you can numerically compute the marginal returns to capacity curves, but that it's hard to express capacity analytically as a function of network weights since you really need to know what the particular features are in order to compute returns to capacity -- is that correct?
Right! Two quick ideas:
Although it's not easy to determine the full set of "natural" features for arbitrary networks, still you might be able to solve an optimization problem that identifies the single feature with most negative marginal returns to capacity given the weights of some particular trained network. If you could do this then perhaps you could apply a regularization to the network that "flattens out" the marginal returns curve for just that one feature, then apply further training to the network and ask again which single feature has most nega
Yeah I also have the sense that we mostly agree here.
I have the sense that CEV stands for, very roughly, "what such-and-such a person would do if they became extremely wise", and the hope (which I think is a reasonable hope) is that there is a direction called "wisdom" such that if you move a person far enough in that direction then they become both intelligent and benevolent, and that this eventually doesn't depend super much on where you started.
The tricky part is that we are in this time where we have the option of making some moves that might be quite ...
I think what you're saying here ought to be uncontroversial. You're saying that should a small group of technical people find themselves in a position of enormous influence, they ought to use that influence in an intelligent and responsible way, which may not look like immediately shirking that responsibility out of a sense that nobody should ever exert influence over the future.
I have the sense that in most societies over most of history, it was accepted that of course various small groups would at certain times find themselves in positions of enormous influe...
I wonder if there is some way to alter the marginal return curves so that they are not diminishing, and see whether that removes polysemanticity from the trained network. This seems difficult to do at the outset, because a lot of features just are going to be of somewhat diminishing marginal utility for most real-world learning problems. But I wonder if there is some way to train a network, and then, based on the trained weights, do some fine-tuning using a loss that is set up so that the marginal returns to feature capacity are non-diminishing for the particular features learned so far.
Thank you for this work. How feasible would it be to replicate these results using large neural networks (but still using synthetic training data where you can control feature sparsity directly)? What then would be the path to determining whether networks trained on real-world datasets behave according to this model?
good capabilities form something like an attractor well
In my own experience examining the foundations of things in the world, I have repeatedly found there to be less of an attractor-of-fundamentally-effective-decision-making than I had anticipated. In every way that I expected to find such an attractor -- within epistemology, mathematics, empiricism, ethics -- I found in fact that even the very basic assumptions that I started with were unfounded, and found nothing firm to replace them with. Probability theory: not a fundamental answer to epistemology;...
But the vast majority of initial states for a lump of carbon/oxygen/hydrogen/nitrogen atoms do not include a person programming a computer with the intention of taking over the internet. Shouldn't you apply the same logic there that you apply to the case of a computer?
In fact a single zero-day exploit is certainly much simpler than a full human, so a priori it's more likely for a computer with a zero-day exploit to form from the void than for a computer together with a competent human intent on taking over the internet to form from the void.
This is one of the clearest top-to-bottom accounts of the alignment problem and related world situation that I've seen here in a while. Thank you for writing it.
i believe, akin to the yudkowsky-moore law of mad science, that the amount of resources it takes for the world to be destroyed — whether on purpose or by accident — keeps decreasing.
Yes, it seems that in this particular way the world is becoming more and more unstable.
...pretty soon (probly this decade or the next), an artificial intelligence capable of undergoing recursive self-improvement (RSI)
See also Scott Aaronson on experimental computational complexity theory (haha it's a joke wait no maybe he's not joking wait what?)
...The meeting ended with a “Wild & Crazy Ideas Session,” at which I (naturally) spoke. I briefly considered talking about quantum gravity computing, closed timelike curves, or quantum anthropic postselection, but ultimately decided on something a little less mainstream. My topic was “Experimental Computational Complexity Theory,” or “why do theoretical physicists get $8-billion machines for the sole purpose of confirming or
Above some threshold level of deceptive capabilities we should stop trusting the results of behavioral experiments no matter what they show
I agree, and if we don't know how to verify that we're not being deceived, then we can't trust almost any black-box-measurable behavioral property of extremely intelligent systems, because any such black-box measurement rests on the assumption that the object being measured isn't deliberately deceiving us.
It seems that we ought to be able to do non-black-box stuff; we just don't know how to do that kind of stuff very well yet. In my opinion this is the hard problem of working with highly capable intelligent systems.
I'm interested in more of a sense of what mistake you think people are making, because I think caring about something strong enough to change who you are around it can be a very positive force in the world.
Yeah, caring about something enough to change who you are is really one of the highest forms of virtue, as far as I'm concerned. It's somewhat tragic that the very thing that makes us capable of this high form of virtue -- our capacity to deliberately shift what we value -- can also be used to take what was once an instrumental value and make it, more...
It's worse, even, in a certain way, than that: the existence of optimizing systems organized around a certain idea of "natural class" feeds back into more observers observing data that is distributed according to this idea of "natural class", leading to more optimizing systems being built around that idea of "natural class", and so on.
Once a certain idea of "natural class" gains a foothold somewhere, observers will make real changes in the world that further suggest this particular idea of "natural class" to others, and this forms a feedback loop.