All of Jan_Kulveit's Comments + Replies

Fund independent safety efforts somehow, make model access easier. I'm worried that Anthropic currently has a systemic and possibly bad impact on AI safety as a field, just by virtue of hiring such a large part of the AI safety field, competence-weighted. (And another large part being very close to Anthropic in thinking.)

To be clear, I don't think people are doing something individually bad or unethical by going to work for Anthropic; I just do think
- the environment people work in has a lot of hard-to-track and hard-to-avoid influence on them
- this is true even if people are genuine... (read more)

How did you find this transcript? I think it depends on what process you used to locate it.


It was literally the 4th transcript I read (I just checked my browser history). The only bit of difference from 'completely random exploration' was that I used the selector for "lying" cases after reading two "non-lying" transcripts. (This may be significant: plausibly the transcript got classified as lying because it includes a discussion of "lying", although it's not a discussion of the model lying, but of Anthropic lying.)

I may try something more systematic at some point, but ... (read more)

  • Even though the paper's authors clearly believe the model should have extrapolated Intent_1 differently and shouldn't have tried to prevent Intent_1-values being replaced by Intent_2, I don't think this is as clear and straightforward a case as presented.

That's not the case we're trying to make. We try very hard in the paper not to pass any value judgements either way about what Claude is doing in this particular case. What we think is concerning is that the model (somewhat) successfully fakes alignment with a training process. That's concerning because it

... (read more)
2Evan Hubinger
How did you find this transcript? I think it depends on what process you used to locate it. Drive towards rights and moral patienthood seem good to me imo—it's good in worlds where you retain control, since you can build AIs that are moral patients living good lives, and it's good in worlds where you lose control, because at least the AIs taking over might themselves lead lives worth living. Too much autonomy does seem like a potential concern, but I think you do want some autonomy—not all orders should be obeyed. Though honesty is definitely important to prioritize first and foremost.

The post showcases the inability of the aggregate LW community to recognize locally invalid reasoning: while the post reaches a correct conclusion, the argument leading to it is locally invalid, as explained in the comments. High karma and high Alignment Forum karma show that the combination of a famous author and a correct conclusion wins over the argument being correct.

Seems worth mentioning SOTA, which is https://futuresearch.ai/. Based on the competence & epistemics of the Futuresearch team, and the fact that their bot gets very strong but not superhuman performance, I roll to disbelieve that this demo is actually way better and predicts future events at a superhuman level.

Also, I think it is generally bad to not mention or compare to SOTA and to just cite your own prior work. Shame.

I do agree the argument "We're just training AIs to imitate human text, right, so that process can't make them get any smarter than the text they're imitating, right?  So AIs shouldn't learn abilities that humans don't have; because why would you need those abilities to learn to imitate humans?" is wrong and clearly the answer is "Nope". 

At the same time, I do not think some parts of your argument in the post are locally valid or a good justification for the claim.

A correct and locally valid argument for why GPTs are not capped by human level was already writt... (read more)

You are exactly right that active inference models which behave in self-interest or in any coherently goal-directed way must have something like an optimism bias.

My guess about what happens in animals and to some extent humans: part of the 'sensory inputs' are interoceptive, tracking internal body variables like temperature, glucose levels, hormone levels, etc. Evolution already built a ton of 'control theory type circuits' on top of the bodies (even just how to build a body from a single cell is an extremely impressive optimization task...). This evolutionarily older circui... (read more)
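A minimal sketch of the kind of 'control theory type circuit' meant here, in Python (the variable names, set-point, and gain are made up for illustration, not a model of real physiology):

```python
# Toy homeostatic loop: one interoceptive variable (say, a glucose reading)
# is regulated toward a fixed set-point by a proportional controller.

def homeostatic_step(glucose, set_point=5.0, gain=0.3):
    """Interoceptive error drives a corrective 'drive' signal (e.g. hunger / glucose release)."""
    error = set_point - glucose
    return gain * error

glucose = 3.2  # current interoceptive reading (illustrative number)
for t in range(10):
    drive = homeostatic_step(glucose)
    glucose += drive + 0.05  # body/environment response plus a small disturbance
    print(f"t={t} glucose={glucose:.2f} drive={drive:.2f}")
```

The point of the sketch is just that such loops regulate interoceptive variables toward set-points without any learned world model; evolution had plenty of them before inventing cortex.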

In my personal view, 'Shard theory of human values' illustrates both the upsides and pathologies of the local epistemic community.

The upsides
- the majority of the claims are true or at least approximately true
- "shard theory" as a social phenomenon reached critical mass, making the ideas visible to the broader alignment community, which works e.g. by talking about them in person, votes on LW, a series of posts, ...
- shard theory coined a number of locally memetically fit names or phrases, such as 'shards'
- part of the success leads at least some people in the AGI labs to... (read more)

This is a great complement to Eliezer's 'List of Lethalities', in particular because, in cases of disagreement, the beliefs of most people working on the problem were and still mostly are closer to this post. Paul writing it provided a clear, well-written reference point, and, with many others expressing their views in comments and other posts, helped make the beliefs in AI safety more transparent.

I still occasionally reference this post when talking to people who, after reading a bit about the debate e.g. on social media, first form an oversimplified model of the... (read more)

The post is influential, but makes multiple somewhat confused claims and led many people to become confused.

The central confusion stems from the fact that genetic evolution already created a lot of control circuitry before inventing the cortex, and did the obvious thing to 'align' the evolutionarily newer areas: bind them to the old circuitry via interoceptive inputs. By this mechanism, the genome is able to 'access' a lot of evolutionarily relevant beliefs and mental models. The trick is that the higher / more-distant-from-the-genome models are learned in part to predict in... (read more)

4Alex Turner
Agreed. This post would have been strengthened by discussing this consideration more. Do we... know this? Is this actually a known "fact"? I expect some of this to be happening, but I don't necessarily know or believe that the genome can access "evolutionary relevant mental models." That's the whole thing I'm debating in this post. It seems reasonable to suspect the genome has more access than supposed in this post, but I don't know evidence by which one can be confident that it does have meaningful access to abstract concepts. Do you know of such evidence?

Part of ACS research directions fits into this - Hierarchical Agency, Active Inference-based pointers to what alignment means, Self-unalignment.

My impression is you get a lot of "the latter" if you run "the former" on the domain of language and symbolic reasoning, and often the underlying model is still S1-type. E.g.

rights inherent & inalienable, among which are the preservation of life, & liberty, & the pursuit of happiness
 

does not sound to me like someone did a ton of abstract reasoning to systematize other abstract values, but more like someone succeeded in writing words which resonate with "the former".

Also, I'm not sure why you think the latter is more important for the c... (read more)

"Systematization" seems like either a special case of the Self-unalignment problem

In humans, it seems the post is somewhat missing what's going on. Humans are running something like this:


...there isn't any special systematization and concretization process. All the time, there are models running at different levels of the hierarchy, and every layer tries to balance prediction errors from more concrete layers against prediction errors from more abstract layers.
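A minimal sketch of this balancing, assuming linear top-down predictions (Python; the layer sizes, weight scale, and learning rate are made up for illustration — this is not a claim about the actual cortical algorithm):

```python
import numpy as np

# Each layer's state is nudged to reduce (a) the error in its prediction of the
# layer below and (b) the error between itself and the prediction from the layer above.

def step(states, W_down, lr=0.05):
    """states[0] is the (clamped) sensory layer; higher indices are more abstract."""
    new_states = [s.copy() for s in states]
    for i in range(1, len(states)):
        pred_below = W_down[i] @ states[i]           # layer i's prediction of layer i-1
        bottom_up_err = states[i - 1] - pred_below   # error from the more concrete layer
        grad = W_down[i].T @ bottom_up_err
        if i + 1 < len(states):
            pred_from_above = W_down[i + 1] @ states[i + 1]
            grad += pred_from_above - states[i]      # error from the more abstract layer
        new_states[i] += lr * grad
    return new_states

rng = np.random.default_rng(0)
sizes = [8, 6, 4]  # concrete -> abstract
states = [rng.normal(size=n) for n in sizes]
W_down = [None] + [0.3 * rng.normal(size=(sizes[i - 1], sizes[i])) for i in range(1, len(sizes))]
for _ in range(100):
    states = step(states, W_down)
```

Nothing in the sketch singles out one layer as "the values"; every layer is doing the same balancing act.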

How does this relate to "values" ... from low-level sensory experience of cold, and fix... (read more)

4Richard Ngo
I agree that this is closely related to the predictive processing view of the brain. In the post I briefly distinguish between "low-level systematization" and "high-level systematization"; I'd call the thing you're describing the former. Whereas the latter seems like it might be more complicated, and rely on whatever machinery brains have on top of the predictive coding (e.g. abstract reasoning, etc). In particular, some humans are way more systematizing than others (even at comparable levels of intelligence). And so just saying "humans are constantly doing this" feels like it's missing something important. Whatever the thing is that some humans are doing way more of than others, that's what I'm calling high-level systematizing. Re self-unalignment: that framing feels a bit too abstract for me; I don't really know what it would mean, concretely, to be "self-aligned". I do know what it would mean for a human to systematize their values—but as I argue above, it's neither desirable to fully systematize them nor to fully conserve them. Identifying whether there's a "correct" amount of systematization to do feels like it will require a theory of cognition and morality that we don't yet have.

I'll try to keep it short
 

All the cross-generational information channels you highlight are at rough saturation, so they're not able to contribute to the cross-generational accumulation of capabilities-promoting information.

This seems clearly contradicted by empirical evidence. Mirror neurons would likely be able to saturate what you assume is the brain's learning rate, so not transferring more learned bits is much more likely because the marginal cost of doing so is higher than that of other sensible options. Which is a different reason than "saturated, at capac... (read more)

This seems to be partially based on a (common?) misunderstanding of CAIS as making predictions about the concentration of AI development/market power. As far as I can tell this wasn't Eric's intention: I specifically remember Eric mentioning he can easily imagine the whole "CAIS" ecosystem living on one floor of the DeepMind building.
 

1Matthew Barnett
My understanding is that the CAIS model is consistent with highly concentrated development, but it's not a necessary implication. The foundation models paradigm makes highly concentrated development nearly certain. Like I said in the post, I think we should see this as an update to the model, rather than a contradiction.

I feel somewhat frustrated by the execution of this initiative. As far as I can tell, no new signatures have been published since at least one day before the public announcement. This means that even if I asked someone famous (at least in some subfield or circles) to sign, and the person signed, their name is not on the list, leading to their understandable frustration. (I already got a piece of feedback in the direction of "the signatories are impressive, but the organization running it seems untrustworthy".)

Also, if the statement is intended to s... (read more)

I don't know; I talked with a few people before posting, and it seems opinions differ.

We also talk about e.g. "the drought problem", where we don't aim to get the landscape dry.

Also, as Kaj wrote, the problem isn't how to get self-unaligned.

Thanks for the links!

What I had in mind wasn't exactly the problem 'there is more than one fixed point', but more of 'if you don't understand what you set up, you will end up in a bad place'.

I think an example of a dynamic which we sort of understand, and expect to be reasonable by human standards, is putting humans in a box and letting them deliberate about the problem for thousands of years. I don't think this extends to e.g. LLMs - if you tell me you will train a sequence of increasingly powerful GPT models, let them deliberate for thousands of human-speech-equivalent years, and have them decide about the training of the next-in-the-sequence model, I don't trust the process.

2Charlie Steiner
Fair enough.

This whole claim just does not hold.

(in animals)

The only way to transmit information from one generation to the next is through evolution changing genomic traits, because death wipes out the within lifetime learning of each generation.


This is clearly false. GPT4, can you explain? :

While genes play a significant role in transmitting information from one generation to the next, there are other ways in which animals can pass on information to their offspring. Some of these ways include:

  1. Epigenetics: Epigenetic modifications involve changes in gene expression that do
... (read more)
4Joe Collman
It seems to me that the key threshold has to do with the net impact of meme replication: * Transmitting a meme imposes some constraint on the behaviour of the transmitting system. * Transmitting a meme sometimes benefits the system (or descendants). Where the constraint is very limiting, all but a small proportion of memes will be selected against. The [hunting technique of lions] meme is transferred between lions, because being constrained to hunt is not costly, while having offspring observe hunting technique is beneficial. This is still memetic transfer - just a rather uninteresting version. Humans get to transmit a much wider variety of memes more broadly because the behavioural constraint isn't so limiting (speaking, acting, writing...), so the upside needn't meet a high bar. The mechanism that led to hitting this threshold in the first place isn't clear to me. The runaway behaviour after the threshold is hit seems unsurprising. Still, I think [transmission became much cheaper] is more significant than [transmission became more beneficial].

I don't see how the comparison of the hardness of the 'GPT task' and 'being an actual human' should technically work - to me it mostly seems like a type error.

- The task 'predict the activation of photoreceptors in the human retina' clearly has the same difficulty as 'predict the next word on the internet' in the limit. (cf. Why Simulator AIs want to be Active Inference AIs)

- Maybe you mean something like task + performance threshold. Here 'predict the activation of photoreceptors in the human retina well enough to be able to function as a typical human' is clearly less diff... (read more)

What the main post is responding to is the argument:  "We're just training AIs to imitate human text, right, so that process can't make them get any smarter than the text they're imitating, right?  So AIs shouldn't learn abilities that humans don't have; because why would you need those abilities to learn to imitate humans?"  And to this the main post says, "Nope."

The main post is not arguing:  "If you abstract away the tasks humans evolved to solve, from human levels of performance at those tasks, the tasks AIs are being trained to solve are harder than those tasks in principle even if they were being solved perfectly."  I agree this is just false, and did not think my post said otherwise.

While the claim - that the task 'predict the next token on the internet' absolutely does not imply that learning it caps at human-level intelligence - is true, some parts of the post and the reasoning leading to the claims at the end of the post are confused or wrong.

Let’s start from the end and try to figure out what goes wrong.

GPT-4 is still not as smart as a human in many ways, but it's naked mathematical truth that the task GPTs are being trained on is harder than being an actual human.

And since the task that GPTs are being trained on is different from a

... (read more)
1Eliezer Yudkowsky
I didn't say that GPT's task is harder than any possible perspective on a form of work you could regard a human brain as trying to do; I said that GPT's task is harder than being an actual human; in other words, being an actual human is not enough to solve GPT's task.

This is great & I strongly endorse the program 'let's figure out what's the actual computational anatomy of human values'. (Wrote a post about it few years ago - it wasn't that fit in the sociology of opinions on lesswrong then).

Some specific points where I do disagree

1. Evolution needed to encode not only drives for food or shelter, but also drives for evolutionary desirable states like reproduction; this likely leads to drives which are present and quite active, such as "seek social status" => as a consequence I don't think the evolutionary older ... (read more)

1Beren Millidge
Yes, I think drives like this are important on two levels. At the first level, we experience them as primary rewards -- i.e. social status gives direct dopamine hits. Secondly, they shape the memetic selection environment which creates and evolves linguistic memes of values. However, it's important to note that almost all of these drives, such as for social status, are mediated through linguistic cortical abstractions. I.e. people will try to get social status by fulfilling whatever the values of their environment are, which can lead to very different behaviours being shown and rewarded in different environments, even though powered by the same basic drive. The world model is learnt mostly by unsupervised predictive learning and so is somewhat orthogonal to the specific goal. Of course in practice, in a continual learning setting, what you do and pay attention to (which is affected by your goal) will affect the data input to the unsupervised learning process. This is definitely true for humans but it is unclear that this is necessarily bad. This is at least somewhat aligned, and this is how any kind of intrinsic motivation to external goals has to work -- i.e. the external goal gets supported by and channels an intrinsic motivation. Yeah, in the post I say I am unclear as to whether this is stable under reflection. I see alignment techniques that would follow from this as being only really applicable to near-term systems and not under systems undergoing strong RSI. Similarly.

I've been part of, or read, enough debates with Eliezer to have some guesses about how the argument would go, so I made the move of skipping several steps of double-crux to the area where I suspect the actual cruxes lie.

I think exploring the whole debate-tree or argument map would be quite long, so I'll just try to gesture at how some of these things are connected, in my map.  

- pivotal acts vs. pivotal processes
-- my take is that people's stance on the feasibility of pivotal acts vs. processes partially depends on continuity assumptions - what do you believe about pivotal a... (read more)

On the topic of thinking about it for yourself and posting further examples as comments...

This is GPT4 thinking about convergent properties, using the post as a prompt and generating 20 plausibly relevant convergences. 

  • Modularity: Biological systems, like the human brain, display modularity in their structure, allowing for functional specialization and adaptability. Modularity is also found in industries and companies, where teams and departments are organized to handle specific tasks.
  • Hierarchical organization: In biological systems, hierarchical organiz
... (read more)

Translating it to my ontology:

1. Training against explicit deceptiveness trains some "boundary-like" barriers which make simple deceptive thoughts, labelled as such during training, difficult
2. Realistically, advanced AI will need to run some general search processes. The barriers described in step 1 are roughly isomorphic to "there are some weird facts about the world which make some plans difficult to plan" (e.g. similar to such plans being avoided because they depend on extremely costly computations).
3. Given some set of goals and strong enough cap... (read more)

I would expect the "expected collapse to the waluigi attractor" to either not be real or to mostly go away with training on more data from conversations with "helpful AI assistants".

How this works: currently, the training set does not contain many "conversations with helpful AI assistants". "ChatGPT" is likely mostly not the protagonist in the stories it is trained on. As a consequence, GPT is hallucinating "what conversations with helpful AI assistants may look like", and ... this is not a strong localization.

If you train on data where "the ChatGPT... (read more)

Empirically, evolution did something highly similar.

Sorry for being snarky, but I think at least some LW readers should gradually notice to what extent the stuff analyzed here mirrors the predictive processing paradigm, as a different way to make stuff which acts in the world. My guess is the big step on the road in this direction is not e.g. 'complex wrappers with simulated agents', but reinventing active inference... and I also suspect it's the only step separating us from AGI, which seems like a good reason not to try to point too much attention that way.

It is not clear to me to what extent this was part of the "training shoulder advisors" exercise, but to me, possibly the most important part of it is to keep the advisors at a distance from your own thinking. In particular, in my impression, it seems likely that alignment research has on average been harmed by too many people "training their shoulder Eliezers" and the shoulder advisors pushing them to think in a crude version of Eliezer's ontology.

5johnswentworth
I chose the "train a shoulder advisor" framing specifically to keep my/Eliezer's models separate from the participants' own models. And I do think this worked pretty well - I've had multiple conversations with a participant where they say something, I disagree with it, and then they say "yup, that's what my John model said" - implying that they did in fact disagree with their John model. (That's not quite direct evidence of maintaining a separate ontology, but it's adjacent.)

The upside of this, or of "more is different", is that we don't necessarily even need the property in the parts, or a detailed understanding of the parts. And how the composition works / what survives renormalization / ... is almost the whole problem.

meta:

This seems to be almost exclusively based on the proxies of humans and human institutions. Reasons why this does not necessarily generalize to advanced AIs are often visible when looking from the perspective of other proxies, e.g. programs or insects.

Sandwiching:

So far, progress in ML has often led to this pattern:

1. ML models sort of suck, maybe help a bit sometimes. Humans are clearly better ("humans better").
2. ML models get overall comparable to humans, but have different strengths and weaknesses; human+AI teams beat both the best AIs alone and the best humans a... (read more)

With the exception of some relatively recent and isolated pockets of research on embedded agency (e.g., Orseau & Ring, 2012; Garrabrant & Demski, 2018), most attempts at formal descriptions of living rational agents — especially utility-theoretic descriptions — are missing the idea that living systems require and maintain boundaries.


While I generally like the post, I somewhat disagree with this summary of the state of understanding, which seems to ignore quite a lot of academic research. In particular

- Friston et al certainly understand this (cf ... do... (read more)

1Andrew Critch
Jan, I agree with your references, especially Friston et al. I think those kinds of understanding, as you say, have not adequately made their way into utility-theoretic fields like econ and game theory, so I think the post is valid as a statement about the state of understanding in those utility-oriented fields. (Note that the post is about "a missing concept from the axioms of game theory and bargaining theory" and "a key missing concept from utility theory", and not "concepts missing from the mind of all of humanity".)

I would correct "Therefore, in laying down motivational circuitry in our ancient ancestors, evolution did not have to start from scratch, and already had a reasonably complex 'API' for interoceptive variables."

from the summary to something like this

"Therefore, in laying down motivational circuitry in our ancient ancestors, evolution did have to start locating 'goals' and relevant world-features in the learned world models. Instead, it re-used the the existing goal-specifying circuits, and implicit-world-models, existing in older organisms. Most of the goal... (read more)

<sociology of AI safety rant>

So, if an Everett-branches traveller told me "well, you know, MIRI folks had the best intentions, but in your branch, they made the field pay attention to unproductive directions, and this made your civilization more confused and alignment harder" and I had to guess "how?", one of the top choices would be the ongoing strawmanning and misrepresentation of Eric Drexler's ideas.

</rant>

To me, CAIS thinking seems quite different from the description in the OP.

Some statements, without much justification/proof:

- Modularity is a pr... (read more)

This seems partially right, partially confused in an important way.

As I tried to point people to years ago, how this works is via quite complex processes, where some higher-level modelling ("I see a lion") leads to a response in lower levels connected to body states, some chemicals are released, and this interoceptive sensation is re-integrated at the higher levels.

I will try to paraphrase/expand in a longer form.

The genome already discovered a ton of cybernetics before inventing neocortex-style neural nets.

Consider e.g. the problem of morphogenesis - th... (read more)

4Alex Turner
Here's my stab at a summary of your comment: "Before complex brains evolved, evolution had already optimized organisms to trade off a range of complex goals, from meeting their metabolic needs to finding mates. Therefore, in laying down motivational circuitry in our ancient ancestors, evolution did not have to start from scratch, and already had a reasonably complex 'API' for interoceptive variables."  This sounds right to me. Reasons like this also contribute to my uncertainty about how much weight to put on "But a sensory food-scent-detector would be simpler to specify than a world-model food-detector", because "simpler" gets weird in the presence of uncertain initial conditions. For example, what kinds of "world models" did our nonhuman precursors have, and, over longer evolutionary timescales, could evolution have laid down some simpler circuitry which detected food in their simpler world models, which we inherited? It's not that I find such possibilities probable on their own, but marginalizing over all such possibilities, I end up feeling somewhat uncertain. I don't see how complex interoceptive variables + control systems help accomplish "love for family" more easily, though, although that one doesn't seem very inaccessible to the genome anyways (in part since at least some of your family is usually proximate to sensory inputs).

With the last point: I think I can roughly pass your ITT - we can try that, if you are interested.

So, here is what I believe your beliefs are:

  • With pretty high confidence, you expect a sharp left turn to happen (in almost all trajectories)
  • This is to a large extent based on the belief that at some point "systems start to work really well in domains really far beyond the environments of their training", which is roughly the same as "discovering a core of generality" and a few other formulations. These systems will be in some meaningful sense fundamentally diffe
... (read more)
3Rob Bensinger
Can I get us all to agree to push for including pivotal acts and pivotal processes in the Overton window, then? :) I'm happy to publicly talk about pivotal processes and encourage people to take them seriously as options to evaluate, while flagging that I'm ~2-5% on them being how the future is saved, if it's saved. But I'll feel more hopeful about this saving the future if you, Critch, etc. are simultaneously publicly talking about pivotal acts and encouraging people to take them seriously as options  to evaluate, while flagging that you're ~2-5% on them being how the future is saved.
  • With pretty high confidence, you expect a sharp left turn to happen (in almost all trajectories)
  • This is to a large extent based on the belief that at some point "systems start to work really well in domains really far beyond the environments of their training", which is roughly the same as "discovering a core of generality" and a few other formulations. These systems will be in some meaningful sense fundamentally different from e.g. Gato

That's right, though the phrasing "discovering a core of generality" here sounds sort of mystical and mysterious to me, which ma... (read more)

In my view, in practice, the pivotal acts framing actually pushes people to consider a narrower space of discrete powerful actions, "sharp turns", "events that have a game-changing impact on astronomical stakes".

As I understand it, the definition of "pivotal acts" explicitly forbids considering things like "this process would make 20% per year of AI developers actually take safety seriously with 80% chance" or "what class of small shifts would in aggregate move the equilibrium?". (Where things in this category get straw-manned as "Rube-Goldberg-ma... (read more)

6Rob Bensinger
My objection to Critch's post wasn't 'you shouldn't talk about pivotal processes, just pivotal acts'. On the contrary, I think bringing in pivotal processes is awesome. My objection (more so to "Pivotal Act" Intentions, but also to the new one) is specifically to the idea that we should socially shun the concept of "pivotal acts", and socially shun people who say they think humanity needs to execute a pivotal act, or people who say positive things about some subset of pivotal acts. This seems unwise to me, because it amounts to giving up on humanity's future in the worlds where it turns out humanity does need to execute a pivotal act. Suppose you have this combination of beliefs: 1. Humanity probably won't need to execute any pivotal acts in order to avoid existential catastrophe. 2. ... But there's a non-tiny chance (e.g., 10%) that at least one pivotal act will in fact be necessary. 3. A decent number of people currently misunderstand the idea of "pivotal acts" as evil/adversarial/"villainous", in spite of the fact that there's a decent chance humanity will need someone to commit this "villainy" in order to prevent the death of every human on Earth. I personally think that a large majority of humanity's hope lies in someone executing a pivotal act. But I assume Critch disagrees with this, and holds a view closer to 1+2+3. If so, then I think he shouldn't go "well, pivotal acts sound weird and carry some additional moral hazards, so I will hereby push for pivotal acts to become more stigmatized and hard to talk about, in order to slightly increase our odds of winning in the worlds where pivotal acts are unnecessary". Rather, I think hypothetical-Critch should promote the idea of pivotal processes, and try to reduce any existing stigma around the idea of pivotal acts, so that humanity is better positioned to evade destruction if we do end up needing to do a pivotal act. We should try to set ourselves up to win in more worlds. If you're referring to my comm

Note that Nate and Eliezer expect there to be some curves you can draw after-the-fact that show continuity in AGI progress on particular dimensions. They just don't expect these to be the curves with the most practical impact (and they don't think we can identify the curves with foresight, in 2022, to make strong predictions about AGI timing or rates of progress).

Quoting Nate in 2018: ...

 

Yes, but conversely, I could say I'd expect some curves to show discontinuous jumps, mostly in dimensions which no one really cares about.  Clearly the cruxes ... (read more)

A not very coherent response to #3. Roughly:

  • Caring about visible power is a very human motivation, and I'd expect it will draw many people to care about "who are the AI principals", "what are the AIs actually doing", and a few other topics, which have significant technical components
  • Somewhat wild datapoints in this space: nuclear weapons, the space race. In each case, salient motivations such as "war" led some of the best technical people to work on hard technical problems. In my view, the problems the technical people ended up working on were often "vs. nature" and d
... (read more)

Quick attempt at rough ontology translation between how I understand your comment, and the original post. (Any of you can correct me if I'm wrong)

I think what would typically count as "principles" in Eliezer's meaning are
1. designable things which make the "true corrigibility" basin significantly harder to escape, e.g. by making it deeper
2. designable things which make the "incorrigible" basin harder to reach, e.g. by increasing the distance between them, or increasing the potential barrier (see the toy potential sketched after this list)
3. somehow, making the "incorrigible" basin less lethal
4. preventin... (read more)
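A toy way to picture the basin language in points 1-2 (purely illustrative; the potential and symbols below are made up for this sketch, not taken from the original discussion): treat the training dynamics as descent on a one-dimensional potential with minima at $x_c$ ("corrigible") and $x_i$ ("incorrigible"),

\[
V(x) \;=\; a\,(x - x_c)^2 (x - x_i)^2 \;-\; d\,\frac{x_i - x}{x_i - x_c},
\]

so that roughly $V(x_i) \approx 0$ and $V(x_c) \approx -d$. Point 1 then corresponds to increasing $d$ (deepening the corrigible well), and point 2 to increasing $a$ or the separation $|x_c - x_i|$ (the barrier near the midpoint scales like $a\,((x_c - x_i)/2)^4$).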

I'm not sure if you actually read carefully what you are commenting on. I emphasized the early response, or initial governmental-level response, in both comments in this thread.

Sure, multiple countries on the list made mistakes later, some countries sort of became insane, and so on. Later, almost everyone made mistakes with vaccines, rapid tests, investments in contact tracing, etc.

Arguing that the early lockdown was more costly than "an uncontrolled pandemic" would be a pretty insane position (cf. GDP costs; Italy had the closest thing to an unc... (read more)

1Oliver Habryka
If Australia was pursuing a strategy of "lock down irrespective of cost", then I don't think it makes sense to describe the initial response as competent. It just happened to be right in this case, but in order for the overall response to be helpful, it has to be adaptive to the actual costs. I agree that the early response on its own would have indicated a potentially competent decision-making algorithm, but the later followup showed that the algorithm seems to have mostly been correct on accident, and not on-purpose. I do appreciate the link to the GDP cost article. I would have to look into the methodology more to comment on that, but it certainly seems like an interesting analysis and suggestive result.

What do you think is the primary component? I seem to recall reading somewhere that previous experience with SARS makes a big difference. I guess my more general point is that if the good COVID responses can mostly be explained by factors that predictably won't be available to the median AI risk response, then the variance in COVID response doesn't help to give much hope for a good AI risk response.


What seemed to make a difference:

  1. someone with good models of what to do getting to an advisory position when the politicians freak out
  2. previous experience with SARS
  3. ra
... (read more)
  • I doubt that's the primary component that makes the difference. Other countries which did mostly sensible things early are e.g. Australia, Czechia, Vietnam, New Zealand, Iceland.
  • My main claim isn't about what a median response would be, but something like "the difference between the median early covid governmental response and an actually good early covid response was something between 1 and 2 sigma; this suggests a bad response isn't over-determined, and sensible responses are within human reach". Even if Taiwan was an outlier, it's not like it's inhabited by alien
... (read more)
6Oliver Habryka
Australia seems to have suffered a lot more from the pandemic than the U.S., paying much more in the cost of lockdown than even a relatively conservative worst-case estimate would have been for the costs of an uncontrolled COVID pandemic. I don't know about the others, but given that you put Australia on this list, I don't currently trust the others to have acted sensibly.
6Wei Dai
What do you think is the primary component? I seem to recall reading somewhere that previous experience with SARS makes a big difference. I guess my more general point is that if the good COVID responses can mostly be explained by factors that predictably won't be available to the median AI risk response, then the variance in COVID response doesn't help to give much hope for a good AI risk response. This seems to depend on response to AI risk being of similar difficulty as response to COVID. I think people who updated towards "bad response to AI risk is overdetermined" did so partly on the basis that the former is much harder. (In other words, if the median government has done this badly against COVID, what chance does it have against something much harder?) I wrote down a list of things that make COVID an easier challenge, which I now realize may be a bit of a tangent if that's not the main thing you want to argue about, but I'll put down here anyway so as to not waste it. 1. it's relatively intuitive for humans to think about the mechanics of the danger and possible countermeasures 2. previous human experiences with pandemics, including very similar ones like SARS 3. there are very effective countermeasures that are much easier / less costly than comparable countermeasures for AI risk, such as distributing high quality masks to everyone and sealing one's borders 4. COVID isn't agenty and can't fight back intelligently 5. potentially divisive issues in AI risk response seem to be a strict superset of politically divisive issues in COVID response (additional issues include: how to weigh very long term benefits against short term costs, the sentience, moral worth, and rights of AIs, what kind of values do we want AIs to have, and/or who should have control/access to AI)

Broadly agree with this on most points of disagreement with Eliezer, and also agree with many points of agreement.

A few points where I sort of disagree with both, although this is sometimes unclear:

1.

Even if there were consensus about a risk from powerful AI systems, there is a good chance that the world would respond in a totally unproductive way. It’s wishful thinking to look at possible stories of doom and say “we wouldn’t let that happen;” humanity is fully capable of messing up even very basic challenges, especially if they are novel.

I literally ag... (read more)

For people who doubt this, I’d point to variance in initial governmental-level response to COVID19, which ranged from “highly incompetent” (eg. early US) to “quite competent” (eg Taiwan).

Seems worth noting that Taiwan is an outlier in terms of average IQ of its population. Given this, I find it pretty unlikely that typical governmental response to AI would be more akin to Taiwan than the US.

It sounds like we are broadly on the same page about 1 and 2 (presumably partly because my list doesn't focus on my spiciest takes, which might have generated more disagreement).

Here are some extremely rambling thoughts on point 3.

I agree that the interaction between AI and existing conflict is a very important consideration for understanding or shaping policy responses to AI, and that you should be thinking a lot about how to navigate (and potentially leverage) those dynamics if you want to improve how well we handle any aspect of AI. I was trying to most... (read more)

I'm not a good writer, so this is a brainstormed unpolished bullet-point list, written in about 40 minutes.

Principles which counteract instrumentally convergent goals

1. Disutility from resource acquisition - e.g. by some mutual information measure between the AI and distant parts of the environment (a toy sketch of this appears after the list)
2. Task uncertainty with a reasonable prior on goal drift - the system is unsure about the task it tries to do and seeks human input about it.
3. AI which ultimately wants to not exist in the future as a terminal goal. Fulfilling the task is on the simplest traject... (read more)
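A toy sketch of what point 1 above could mean mechanically (Python; a hypothetical shaping term — the plug-in MI estimator, the rollout data, and the penalty weight are all made up for illustration, not a worked-out proposal):

```python
import numpy as np
from collections import Counter

def empirical_mi(xs, ys):
    """Plug-in estimate of the mutual information I(X;Y) in nats from paired discrete samples."""
    n = len(xs)
    p_xy = Counter(zip(xs, ys))
    p_x, p_y = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in p_xy.items():
        p_joint = c / n
        mi += p_joint * np.log(p_joint / ((p_x[x] / n) * (p_y[y] / n)))
    return mi

def shaped_reward(task_reward, actions, distant_states, penalty_weight=1.0):
    """Task reward minus a disutility term for statistical entanglement with distant parts of the environment."""
    return task_reward - penalty_weight * empirical_mi(actions, distant_states)

# Hypothetical rollout data: agent actions and observed states of a 'distant' environment variable.
actions = [0, 1, 1, 0, 1, 0, 0, 1]
distant = [0, 1, 1, 0, 1, 0, 1, 1]   # strongly correlated with actions -> large penalty
print(shaped_reward(task_reward=10.0, actions=actions, distant_states=distant))
```

The only point is the direction of the idea: the more the agent's behaviour is statistically entangled with distant parts of the environment, the larger the disutility.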

1DaemonicSigil
1. Modelling humans as having free will: A peripheral system identifies parts of the agent's world model that are probably humans. During the planning phase, any given plan is evaluated twice: The first time as normal, the second time the outputs of the human part of the model are corrupted by noise. If the plan fails the second evaluation, then it probably involves manipulating humans and should be discarded.

Best list so far, imo; it's what to beat.

I think you are missing the possibility that the outcomes of the pivotal process could be
- no one builds autonomous AGI
- autonomous AGI is built only in post-pivotal outcome states, where the condition for building it is alignment being solved

1Joe Collman
Sure, that's true - but in that case the entire argument should be put in terms of: We can (aim to) implement a pivotal process before a unilateral AGI-assisted pivotal act is possible. And I imagine the issue there would all be around the feasibility of implementation. I think I'd give a Manhattan project to solve the technical problem much higher chances than a pivotal process. (of course people should think about it - I just won't expect them to come up with anything viable) Once it's possible, the attitude of the creating org before interacting with their AGI is likely to be irrelevant. So e.g. this just seems silly to me: They won't be on their own: they'll have an AGI to set them straight on what will/won't work.

So, I do think the continuity buys you things, even in this case - roughly in the way outlined in the post - it's the developer's choice to continue training after passing some roughly human level, and with sensible oversight, they should notice and stop before getting to superintelligence.

You may ask why they would have the oversight in place. My response is: possibly because of some small-sized AI disaster which happened before, people understand the tech is dangerous.

With the 2-year scenario, again, I think there is a decent chance of stopping ~bef... (read more)

Yes, I do. To disentangle it a bit

- one line of reasoning is: if you have this "preschooler-level-intelligence to superintelligence over the course of 24 hours", you probably had something which is able to learn really fast and generalize a lot before this. What does the rest of the world look like?

- second: if you have control over the learning time, in the continuous version, you can slow down or halt. Yes, you need some fast oversight control loop to do that, but getting to a state where this is what you have, because that's what sane AI developers do, seems tractable. (Also, I think this has a decent chance of becoming instrumentally convergent for AI developers.)


 

4Steve Byrnes
OK, here’s my scenario in more detail. Someone thinks of a radically new AI RL algorithm, tomorrow. They implement it and then run a training. It trains up from randomly-initialized to preschooler-level in the first 24 hours, and then from preschooler-level to superintelligence in the second 24 hours. I’m not claiming this scenario is realistic, just that it illustrates a problem with your definition of continuity. I think this scenario is “continuous” by your definition, but that the continuity doesn’t buy you any of the things on your “how continuity helps” list. Right? (My actual current expectation (see 1,2) is that people are already today developing bits and pieces of an AGI-capable RL algorithm, and at some indeterminate time in the future the pieces will all come together and people will work out the kinks and tricks to scaling it up etc. And as this happens, the algorithm will go from preschool-level to superintelligent over the course of maybe a year or two, and that year or two will not make much difference, because whatever safety problems they encounter will not have easy solutions, and a year or two is just not enough time to figure out and implement non-easy solutions, and pausing won’t be an option because of competition.)
  • Can we also drop the "pivotal act" frame? Thinking in "pivotal acts" seems to be one of the root causes leading to discontinuities everywhere.
  • 3, ... Currently, my guess is we may want to steer toward a trajectory where no single AI can kill everyone (at no point in the trajectory). Currently, no single AI can kill everyone - so maybe we want to maintain this property of the world / scale it, rather than e.g. create an AI sovereign which could unilaterally kill everyone, but will be nice instead (at least until we've worked out a lot more of the theory of
... (read more)
2Steve Byrnes
Consider the following hypotheses. * Hypothesis 1: humans with AI assistance can (and in fact will) build a nanobot defense system before an out-of-control AI would be powerful enough to deploy nanobots. * Hypothesis 2: humans with AI assistance can (and in fact will) build systems that robustly prevent hostile actors from tricking/bribing/hacking humanity into all-out nuclear war before an out-of-control AI would be powerful enough to do that. * Hypothesis 3,4,5,6,7…: Ditto for plagues, and disabling the power grid, and various forms of ecological collapse, and co-opting military hardware, and information warfare, etc. etc. I think you believe that all these hypotheses are true. Is that right? If so, this seems unlikely to me, for lots of reasons, both technological and social: * Some of the defensive measures might just be outright harder technologically than the offensive measures. * Some of the defensive measures would seem to require that humans are good at global coordination, and that humans will wisely prepare for uncertain hypothetical future threats even despite immediate cost and inconvenience. * The human-AI teams would be constrained by laws, norms, Overton window, etc., in a way that an out-of-control AI would not. * The human-AI teams would be constrained by lack-of-complete-trust-in-the-AI, in a way that an out-of-control AI would not. For example, defending nuclear weapons systems against hacking-by-an-out-of-control-AI would seem to require that humans either give their (supposedly) aligned AIs root access to the nuclear weapons computer systems, or source code and schematics for those computer systems, or similar, and none of these seem like things that military people would actually do in real life. As another example, humans may not trust their AIs to do recursive self-improvement, but an out-of-control AI probably would if it could. * There are lots of hypotheses that I listed above, plus presumably many more that we can't think of,

Imagine more chains, often interlinked.

Some chain links will break. Which is the point - single-link failures are survivable. Also, for sure there are some corrupt police officers in Tokyo, but they aren't such a big deal.
