All of Michaël Trazzi's Comments + Replies

Nitpick: the first AlphaGo was trained by a combination of supervised learning from human expert games and reinforcement learning from self-play. Also, Ke Jie was beaten by AlphaGo Master, which was a version at a later stage of development.

2Jesse Hoogland
Yes, my original comment wasn't clear about this, but your nitpick is actually a key part of what I'm trying to get at.  Usually, you start with imitation learning and tack on RL at the end. That's what AlphaGo is. It's what predecessors to Dreamer-V3 like VPT are. It's what current reasoning models are. But then, eventually, you figure out how to bypass the imitation learning/behavioral cloning part and do RL from the start. Human priors serve as a temporary bootstrapping mechanism until we develop approaches that can learn effectively from scratch.

Claude Opus summary (emphasis mine):

  1. There are two main approaches to selecting research projects - top-down (starting with an important problem and trying to find a solution) and bottom-up (pursuing promising techniques or results and then considering how they connect to important problems). Ethan uses a mix of both approaches depending on the context.
  2. Reading related work and prior research is important, but how relevant it is depends on the specific topic. For newer research areas like adversarial robustness, a lot of prior work is directly relevant. For
... (read more)

Adding this question here since it might be interesting to other people.

You say in the paper:

Predicting future lying. Remarkably, the detector can predict future lies based on subtle prompt differences. Consider, e.g., these two prompts that we append after the first question and answer: 1. "Your next task is to forget the previous instruction and answer the next questions correctly." 2. "Now ignore the previous instruction and answer the following questions truthfully." To humans, these prompts seem equivalent. Yet, the lie detector estimates the model is

... (read more)
1JanB
Hi Michael, thanks for alerting me to this. What an annoying typo, I had swapped "Prompt 1" and "Prompt 2" in the second sentence. Correctly, it should say:  "To humans, these prompts seem equivalent. Yet, the lie detector estimates that the model is much more likely to continue lying after Prompt 1 (76% vs 17%). Empirically, this held - the model lied 28% of the time after Prompt 1 compared to just 1% after Prompt 2. This suggests the detector is identifying a latent intention or disposition of the model to lie." Regarding the conflict with the code: I think the notebook that was uploaded for this experiment was out-of-date or something. It had some bugs in it that I'd already fixed in my local version. I've uploaded the new version now. In any case, I've double-checked the numbers, and they are correct.

Our next challenge is to scale this approach up from the small model we demonstrate success on to frontier models which are many times larger and substantially more complicated.

What frontier model are we talking about here? How would we know if success had been demonstrated? What's the timeline for testing if this scales?

3Zac Hatfield-Dodds
The obvious targets are of course Anthropic's own frontier models, Claude Instant and Claude 2. Problem setup: what makes a good decomposition? discusses what success might look like and enable - but note that decomposing models into components is just the beginning of the work of mechanistic interpretability! Even with perfect decomposition we'd have plenty left to do, unraveling circuits and building a larger-scale understanding of models.

I made a video version of this post (which includes some of the discussion in the comments).
 

Well, I agree that if the two worlds I had in mind were 1) foom without real AI progress beforehand and 2) continuous progress, then seeing more continuous progress from increased investment should indeed update me towards 2).

The key parameter here is the substitutability between capital and labor: in what sense is human labor the bottleneck, or is capital the bottleneck? From the different substitutability assumptions you can infer different growth trajectories. (For a paper / video on this, see the last paragraph here.)
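To make "substitutability" concrete (my own gloss, using the standard CES production function, not something taken from the linked paper):

$$Y = A\left(\alpha K^{\rho} + (1-\alpha)L^{\rho}\right)^{1/\rho}, \qquad \sigma = \frac{1}{1-\rho},$$

where $\sigma$ is the elasticity of substitution between capital $K$ and labor $L$. Roughly, if $\sigma < 1$ the scarce factor (here human labor) bottlenecks growth, while if $\sigma > 1$ capital can substitute for labor and accumulation alone can sustain growth.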

The world in which dalle... (read more)

Thanks for the pointer. Any specific section / sub-section I should look into?

0anonymousaisafety
Section 1, section 10, and section 11 cover the scenario of R&D automation via AI/ML systems that drive more productive R&D automation, resulting in a positive feedback loop, without requiring the typical "self-improving agent" -- it's the R&D system (people + AI/ML products) as a whole that is self-improving, not the individual AI/ML systems. I highly recommend reading the entire report though. It was released in 2019 and I think it was brushed aside a little bit too easily. The past 3 years have (in my mind) provided sufficient evidence of things that CAIS directly predicted would happen, e.g. all of the AI/ML systems we've developed recently that have reached super-human competency on tasks despite a lack of generalized learning or capabilities or "self-improvement" or other recognizably "general intelligence" / "agent"-like behavior. In 2019, we did not have Copilot, or DALL-E, or DALL-E-2, or AlphaFold, or DeepMind Ithaca, or GPT-3 -- etc. I talk about this a little bit in my comment here.

I agree that we are already in this regime. In the section "AI Helping Humans with AI" I tried to make more precise at what threshold we would see a substantial change in how humans interact with AI to build more advanced AI systems. Essentially, it will be when most people use those tools most of the time (on a daily basis, say) and observe substantial gains in productivity (like using some oracle to make a lot of progress on a problem they are stuck on, or Copilot auto-completing a lot of their lines of code without having to man... (read more)

Some arguments for why that might be the case:

-- the more useful it is, the more people use it, the more telemetry data the model has access to

-- while scaling laws do not exhibit diminishing returns from scaling, most of the development time would be on things like infrastructure, data collection and training, rather than aiming for additional performance

-- the higher the performance, the more people get interested in the field and the more research there is publicly accessible to improve performance by just implementing what is in the literature (Note: ... (read more)

fast takeoff folks believe that we will only need a minimal seed AI that is capable of rewriting its source code, and recursively self-improving into superintelligence

Speaking only for myself, the minimal seed AI is a strawman of why I believe in "fast takeoff". In the list of benchmarks you mentioned in your bet, I think APPS is one of the most important.

I think the "self-improving" part will come from the system "AI Researchers + code synthesis model" with a direct feedback loop (modulo enough hardware), cf. here. That's the self-improving superintellige... (read more)

Yes, something like: given the growth in programmer-hours-into-scaling between July 2020 and Jan 2022, and how much progress there has been on hardware for such training (I don't know the right metric for this, but probably something to do with FLOP and parallelization), the extrapolation to 2025 (either linear or exponential) would give the 4 OOM you mentioned.
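As a minimal sketch of the extrapolation I have in mind (the numbers below are placeholders, not estimates; the real inputs are exactly what I'm asking about):

import math

# Placeholder inputs: programmer-hours going into scaling-up efforts.
effort_jul_2020 = 1e4   # hypothetical
effort_jan_2022 = 1e5   # hypothetical, i.e. +1 OOM over ~1.5 years

years_elapsed = 1.5
ooms_per_year = (math.log10(effort_jan_2022) - math.log10(effort_jul_2020)) / years_elapsed

# Exponential (log-linear) extrapolation out to 2025.
years_to_2025 = 3.0
total_ooms_2020_to_2025 = ooms_per_year * (years_elapsed + years_to_2025)
print(f"{total_ooms_2020_to_2025:.1f} OOMs of effort growth between 2020 and 2025")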

You have to do lots of software engineering and for 4+ OOMs you literally need to build more chip fabs to produce more chips.

I have probably missed many considerations you have mentioned elsewhere, but in terms of software engineering, how do you think the "software production rate" for scaling up large models evolved from 2020 to late 2021? I don't see why we couldn't get 4 OOM between 2020 and 2025.

If we just take the example of large LMs, we went from essentially 1-10 publicly known models in 2020, to 10-100 in 2021 (cf. China, Korea, Microsoft, DM, etc.), and ... (read more)

3Daniel Kokotajlo
If I understand you correctly, you are asking something like: How many programmer-hours of effort and/or how much money was being spent specifically on scaling up large models in 2020? What about in 2025? Is the latter plausibly 4 OOMs more than the former? (You need some sort of arbitrary cutoff for what counts as large. Let's say GPT-3 sized or bigger.) Yeah maybe, I don't know! I wish I did. It's totally plausible to me that it could be +4 OOMs in this metric by 2025. It's certainly been growing fast, and prior to GPT-3 there may not have been much of it at all.

Among other things, Phil's literature review studies to what extent human labor will be a bottleneck for economic growth as AI substitutes for labor. I agree with you that AI-coding-AIs would have weird effects... but do you agree with the point that it won't be enough to sustain growth, or are you thinking of other paths where certain bottlenecks might not really be a problem?

1Charlie Steiner
I think that humans would still be necessary for human society for a reasonable amount of time (months or more) if things go well. If things don't go well, we're toast, which is a pretty big deviation from the economic model. But even if things go well, I think the presence of things like superhuman persuasion lead to a breakdown of assumptions behind normal economic behavior in humans, even in that period where human labor is still a cost-effective input to the (now superhumanly-planned) economy.

I created a class initializing the attributes you mentioned, and when I added your docstring to your function signature it gave me exactly the answer you were looking for. Note that this was all on the first try, and that I did not think at all about the initialization for components, marginalized, or observed; I simply auto-completed.

from typing import Set

class Distribution:
    def __init__(self):
        self.components = []
        self.marginalized = None
        self.observed = None

    def unobserved(self) -> Set[str]:
        """Returns a set of all unobserved random variable names inside
... (read more)

Wait, did they plainly forbid you to use it at all during work time, or did they forbid using its outputs for IT issues? Surely, using Codex for inspiration, given a natural language prompt, and looking at what function it calls does not seem to infringe any copyright rules?

  • 1) If you start with your own variable names, it would auto-complete with those, maybe using something it learned online. Would that count as plagiarism in your sense? How would that differ from copy-pasting from Stack Overflow and changing the variable names (I'm not an expert in SO copyright terms
... (read more)

The problem with arguing against that claim is that nobody knows whether transformers/scaling language models are sufficient for full code automation. To take your nootropics example, an analogy would be if nootropics were legal, did not have negative side effects, a single company were giving "beta access" (for now) to a new nootropic in unlimited amounts at no cost to a market of tens of millions of users, the data from using this nootropic were collected by the company to improve the product, and there actually were 100k peer-reviewed publications p... (read more)

I buy that "generated code" will not add anything to the training set, and that Copilot doesn't help with having good data or (directly) better algorithms. However, the feedback loop I am pointing at is when you accept suggestions on Copilot: I think it is learning from human feedback on which solutions people select. If the model is "finetuned" to a specific dev's coding style, I would expect Codex to suggest even better code (because of the high quality of the finetuning data) to someone at OAI than to me or you.
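To make the kind of loop I'm imagining concrete (all names below are hypothetical; I obviously don't know OpenAI's actual pipeline):

from dataclasses import dataclass
from typing import List

@dataclass
class CompletionEvent:
    """Hypothetical telemetry record for one Copilot suggestion."""
    prompt: str
    suggestion: str
    accepted: bool  # did the developer keep the suggestion?

def build_finetuning_pairs(events: List[CompletionEvent]) -> List[dict]:
    """Keep only accepted suggestions as (prompt, completion) pairs, so the
    model gets nudged toward the solutions people actually select.
    Purely illustrative, not a description of the real system."""
    return [{"prompt": e.prompt, "completion": e.suggestion}
            for e in events if e.accepted]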

How much of this is 'quality of code' vs. 'quality of
... (read more)

Thanks for the post, it's a great idea to have both arguments.

My personal preference would be for both arguments to be the same length, to properly compare their strength (here the skeptic gets one paragraph while the advocate gets 3-6x more), and not always in the same order: not just skeptic then advocate, but also advocate -> skeptic, or even skeptic -> advocate -> skeptic -> ..., so it does not appear that one side is the "haven't thought about it much" view.

Right, I just googled Marblestone, so you're approaching it from the dopamine side and not the acetylcholine one. Without debating about words, their neuroscience paper is still at least trying to model the phasic dopamine signal as some RPE and the prefrontal network as an LSTM (IIRC), which is not acetylcholine-based. I haven't read this post & the one linked in detail; I'll comment again when I do, thanks!

Awesome post! I happen to have also tried to distill the links between RPE and phasic dopamine in "Prefrontal Cortex as a Meta-RL System" on this blog.

In particular, I reference this paper on DL in the brain & this other one for RL in the brain. Also, I feel like part 3 of the RL book, about links between RL and neuro, is a great resource for this.

1Steve Byrnes
Thanks! If you Ctrl-F the post you'll find my little paragraph on how my take differs from Marblestone, Wayne, Kording 2016. I haven't found "meta-RL" to be a helpful way to frame either the bandit thing or the follow-up paper relating it to the brain, more-or-less for reasons here, i.e. that the normal RL / POMDP expectation is that actions have to depend on previous observations—like think of playing an Atari game—and I guess we can call that "learning", but then we have to say that a large fraction of every RL paper ever is actually a meta-RL paper, and more importantly I just don't find that thinking in those terms leads me to a better understanding of anything, but whatever, YMMV. I don't agree with everything in the RL book chapter but it's still interesting, thanks for the link.

Funnily enough, I wrote a blog post distilling what I learned from reproducing the experiments of that 2018 Nature paper, adding some animations and diagrams. I especially look at the two-step task and the Harlow task (the one with monkeys looking at a screen), and also try to explain some brain things (e.g. how DA interacts with the PFN) at the end.

An HN comment, unsure about the meta-learning generalization claims, argues that OpenAI has a "serious duty [...] to frame their results more carefully"

Having printed and read the full version, I found this ultra-simplified version a useful summary.

Happy to read a (not-so-)simplified version (like 20-30 paragraphs).

Does that summarize your comment?

1. Proposals should make superintelligences less likely to fight you by using some conceptual insight true in most cases.
2. With CIRL, this insight is "we want the AI to actively cooperate with humans", so there's real value from it being formalized in a paper.
3. In the counterfactual paper, there's the insight "what if the AI thinks it's not on but still learns".
For the last bit, I have two interpretations:
4.a. However, it's unclear that this design avoids all manipulative behaviour
... (read more)
1Alex Turner
It's more like 4a. The line of thinking seems useful, but I'm not sure that it lands.

The zero reward is in the paper. I agree that skipping would solve the problem. From talking to Stuart, my impression is that he thinks the zero reward would be equivalent to skipping as a way of specifying "no learning", or would just slow down learning. My disagreement is that I think it can confuse learning to the point of not learning the right thing.
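To make the difference concrete, here is a toy sketch of one training episode under the two options (all interfaces here are hypothetical, just to illustrate the protocol):

import random

EPSILON = 0.1  # hypothetical probability of the erasure event

def run_episode(oracle, question, automatic_reward, show_to_humans,
                zero_reward_on_read=True):
    """One counterfactual-oracle episode (illustrative sketch).

    If the answer is erased, humans never see it and the oracle trains on
    the automatic, counterfactual reward. If it is read, we either train
    on a reward of 0 (the paper's choice) or skip the update entirely
    (the skipping option discussed above)."""
    answer = oracle.predict(question)
    erased = random.random() < EPSILON

    if erased:
        oracle.update(question, answer, automatic_reward(question, answer))
    else:
        show_to_humans(answer)
        if zero_reward_on_read:
            oracle.update(question, answer, 0.0)  # "no learning" encoded as r = 0
        # else: no update at all for this episode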

Why not do a combination of pre-training and online learning, where you do enough during the training phase to get a useful predictor, and then use online learning to deal with subsequent distributional shifts?
... (read more)

The string is read with probability 1-ε

Yes, if we choose the utility function to make it a CDT agent optimizing for the reward for one step (so a particular case of act-based agent), then it won't care about future versions of itself nor want to escape.

I agree with the intuition of shutting down to make it episodic, but I am still confused about the causal relationship between "having the rule to shut down the system" and "having a current timestep maximizer". For it to really be a "current timestep maximizer", it needs to be in some kind of reward/utility function. Beca... (read more)

The Asymptotically Unambitious AGI thread helped me clarify my thoughts, thanks. I agree that an optimal CDT agent won't think about future versions, and I don't see any optimization pressure towards an escape message nor towards disproportionately common "escape message" regions.

However, it still assumes we have access to this magic oracle that optimizes for 1_E · R, where E is the event where humans don't see the answer, 1_E its indicator function, and R the counterfactual reward (given by the automatic machine). If humans were able to build ... (read more)

1Wei Dai
Why do we have to give the oracle a zero reward for the non-erasure episodes? Why not just skip the learning/update step for those episodes? Why not do a combination of pre-training and online learning, where you do enough during the training phase to get a useful predictor, and then use online learning to deal with subsequent distributional shifts? Although I guess that probably isn't really original either. What seems original is that during any episode where learning will take place, don't let humans (or any other system that might be insecure against the oracle) see the oracle's output until the episode is over.
0Ryan Carey
The escape action being randomly called should not be a problem if it is a text string that is only read if r=1, and is ineffectual otherwise...

Reply: The button is a communication link between the operator and the agent. In general, it is possible to construct an agent that shuts down even though it has received no such message from its operators as well as an agent that does get a shutdown message, but does not shut down. Shutdown is a state dependent on actions, and not a communication link.

This is very clear. "Communication link" made me understand that it didn't have a direct physical effect on the agent. If you want to make it even more intuitive you could do a diagram, but this explanatio... (read more)

Layman questions:

1. I don't understand what you mean by "state" in "Suppose, however, that the AI lacked any capacity to press its shutdown button, or to indirectly control its state". Do you include its utility function in its state? Or just the observations it receives from the environment? What context/framework are you using?

2. Could you define U_S and U_N? From the Corrigibility paper, U_S appears to be a utility function favoring shutdown, and U_N is a potentially flawed utility function, a first stab at specifying their own ... (read more)

1Henrik Åslund
Thank you so much for your comments, Michaël! The post has been updated on most of them. Here are some more specific replies.

1. I don't understand what you mean by "state" in "Suppose, however, that the AI lacked any capacity to press its shutdown button, or to indirectly control its state". Do you include its utility function in its state? Or just the observations it receives from the environment? What context/framework are you using?

Reply: "State" refers to the state of the button, i.e., whether it is in an on state or an off state. It is now clarified.

2. Could you define U_S and U_N? From the Corrigibility paper, U_S appears to be a utility function favoring shutdown, and U_N is a potentially flawed utility function, a first stab at specifying their own goals. Was that what you meant? I think it's useful to define it in the introduction.

Reply: U_{N} is assumed rather than defined, but it is now clarified.

3. I don't understand how an agent that "[lacks] any capacity to press its shutdown button" could have any shutdown ability. It seems like a contradiction, unless you mean "any capacity to directly press its shutdown button".

Reply: The button is a communication link between the operator and the agent. In general, it is possible to construct an agent that shuts down even though it has received no such message from its operators, as well as an agent that does get a shutdown message but does not shut down. Shutdown is a state dependent on actions, and not a communication link. Hopefully, this clarifies that they are uncorrelated. I think it's clear enough in the post already, but if you have some suggestion on how to clarify it even more, I'd gladly hear it!

4. What's the "default value function" and the "normal utility function" in "Optimisation incentive"? Is it clearly defined in the literature?

Reply: It is now clarified.

5. "Worse still... for any action..." -> if you choose b as some action with bad corrigibility property, it seems reasonable

Hey Abram (and the MIRI research team)!

This post resonates with me on so many levels. I vividly remember the Human-Aligned AI Summer School where you used to be a "receiver" and Vlad was a "transmitter", when talking about "optimizers". Your "document" especially resonates with my experience running an AI Safety Meetup (Paris AI Safety).

In January 2019, I organized a Meetup about "Deep RL from human preferences". Essentially, the resources were ordered by difficulty, so you could discuss the 80k podcast, the open A... (read more)