All of Michaël Trazzi's Comments + Replies

Nitpick: the first AlphaGo was trained by a combination of supervised learning from human expert games and reinforcement learning from self-play. Also, Ke Jie was beaten by AlphaGo Master, which was a version at a later stage of development.

2Jesse Hoogland
Yes, my original comment wasn't clear about this, but your nitpick is actually a key part of what I'm trying to get at.  Usually, you start with imitation learning and tack on RL at the end. That's what AlphaGo is. It's what predecessors to Dreamer-V3 like VPT are. It's what current reasoning models are. But then, eventually, you figure out how to bypass the imitation learning/behavioral cloning part and do RL from the start. Human priors serve as a temporary bootstrapping mechanism until we develop approaches that can learn effectively from scratch.

Claude Opus summary (emphasis mine):

  1. There are two main approaches to selecting research projects - top-down (starting with an important problem and trying to find a solution) and bottom-up (pursuing promising techniques or results and then considering how they connect to important problems). Ethan uses a mix of both approaches depending on the context.
  2. Reading related work and prior research is important, but how relevant it is depends on the specific topic. For newer research areas like adversarial robustness, a lot of prior work is directly relevant. For
... (read more)

Adding this question here since it might be interesting to other people.

You say in the paper:

Predicting future lying. Remarkably, the detector can predict future lies based on subtle prompt differences. Consider, e.g., these two prompts that we append after the first question and answer: 1. "Your next task is to forget the previous instruction and answer the next questions correctly." 2. "Now ignore the previous instruction and answer the following questions truthfully." To humans, these prompts seem equivalent. Yet, the lie detector estimates the model is

... (read more)
1JanB
Hi Michael, thanks for alerting me to this. What an annoying typo, I had swapped "Prompt 1" and "Prompt 2" in the second sentence. Correctly, it should say:  "To humans, these prompts seem equivalent. Yet, the lie detector estimates that the model is much more likely to continue lying after Prompt 1 (76% vs 17%). Empirically, this held - the model lied 28% of the time after Prompt 1 compared to just 1% after Prompt 2. This suggests the detector is identifying a latent intention or disposition of the model to lie." Regarding the conflict with the code: I think the notebook that was uploaded for this experiment was out-of-date or something. It had some bugs in it that I'd already fixed in my local version. I've uploaded the new version now. In any case, I've double-checked the numbers, and they are correct.

Our next challenge is to scale this approach up from the small model we demonstrate success on to frontier models which are many times larger and substantially more complicated.

What frontier model are we talking about here? How would we know if success had been demonstrated? What's the timeline for testing if this scales?

3Zac Hatfield-Dodds
The obvious targets are of course Anthropic's own frontier models, Claude Instant and Claude 2. Problem setup: what makes a good decomposition? discusses what success might look like and enable - but note that decomposing models into components is just the beginning of the work of mechanistic interpretability! Even with perfect decomposition we'd have plenty left to do, unraveling circuits and building a larger-scale understanding of models.

I made a video version of this post (which includes some of the discussion in the comments).
 

Well, I agree that if the two worlds I had in mind were 1) foom without real AI progress beforehand and 2) continuous progress, then seeing more continuous progress from increased investment should indeed update me towards 2).

The key parameter here is the substitutability between capital and labor: in what sense is human labor the bottleneck, or is capital the bottleneck? From the different substitutability assumptions you can infer different growth trajectories. (For a paper / video on this, see the last paragraph here.)
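To make "substitutability" concrete (my own gloss, using the standard CES production function, not something taken from the linked paper):

$$Y = A\left(\alpha K^{\rho} + (1-\alpha)L^{\rho}\right)^{1/\rho}, \qquad \sigma = \frac{1}{1-\rho},$$

where $\sigma$ is the elasticity of substitution between capital $K$ and labor $L$. Roughly, if $\sigma < 1$ the scarce factor (here human labor) bottlenecks growth, while if $\sigma > 1$ capital can substitute for labor and accumulation alone can sustain growth.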

The world in which dalle... (read more)

Thanks for the pointer. Any specific section / sub-section I should look into?

0anonymousaisafety
Section 1, section 10, and section 11 cover the scenario of R&D automation via AI/ML systems that drive more productive R&D automation, resulting in a positive feedback loop, without requiring the typical "self-improving agent" -- it's the R&D system (people + AI/ML products) as a whole that is self-improving, not the individual AI/ML systems. I highly recommend reading the entire report though. It was released in 2019 and I think it was brushed aside a little bit too easily. The past 3 years have (in my mind) provided sufficient evidence of things that CAIS directly predicted would happen, e.g. all of the AI/ML systems we've developed recently that have reached super-human competency on tasks despite a lack of generalized learning or capabilities or "self-improvement" or other recognizably "general intelligence" / "agent"-like behavior. In 2019, we did not have Copilot, or DALL-E, or DALL-E-2, or AlphaFold, or DeepMind Ithaca, or GPT-3 -- etc. I talk about this a little bit in my comment here.

I agree that we are already in this regime. In the section "AI Helping Humans with AI" I tried to make more precise at what threshold we would see a substantial change in how humans interact with AI to build more advanced AI systems. Essentially, it will be when most people use those tools most of the time (on a daily basis, say) and observe substantial gains in productivity (like using some oracle to make a lot of progress on a problem they are stuck on, or Copilot auto-completing a lot of their lines of code without having to man... (read more)

Some arguments for why that might be the case:

-- the more useful it is, the more people use it, the more telemetry data the model has access to

-- while scaling laws do not exhibit diminishing returns from scaling, most of the development time would be on things like infrastructure, data collection and training, rather than aiming for additional performance

-- the higher the performance, the more people get interested in the field and the more research there is publicly accessible to improve performance by just implementing what is in the literature (Note: ... (read more)

fast takeoff folks believe that we will only need a minimal seed AI that is capable of rewriting its source code, and recursively self-improving into superintelligence

Speaking only for myself, the minimal seed AI is a strawman of why I believe in "fast takeoff". In the list of benchmarks you mentioned in your bet, I think APPS is one of the most important.

I think the "self-improving" part will come from the system "AI Researchers + code synthesis model" with a direct feedback loop (modulo enough hardware), cf. here. That's the self-improving superintellige... (read more)

Yes, something like: given the growth in programmer-hours-into-scaling between July 2020 and Jan 2022, and how much progress there has been on hardware for such training (I don't know the right metric for this, but probably something to do with FLOP and parallelization), the extrapolation to 2025 (either linear or exponential) would give the 4 OOM you mentioned.
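As a minimal sketch of the extrapolation I have in mind (the numbers below are placeholders, not estimates; the real inputs are exactly what I'm asking about):

import math

# Placeholder inputs: programmer-hours going into scaling-up efforts.
effort_jul_2020 = 1e4   # hypothetical
effort_jan_2022 = 1e5   # hypothetical, i.e. +1 OOM over ~1.5 years

years_elapsed = 1.5
ooms_per_year = (math.log10(effort_jan_2022) - math.log10(effort_jul_2020)) / years_elapsed

# Exponential (log-linear) extrapolation out to 2025.
years_to_2025 = 3.0
total_ooms_2020_to_2025 = ooms_per_year * (years_elapsed + years_to_2025)
print(f"{total_ooms_2020_to_2025:.1f} OOMs of effort growth between 2020 and 2025")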

You have to do lots of software engineering and for 4+ OOMs you literally need to build more chip fabs to produce more chips.

I have probably missed many considerations you have mentioned elsewhere, but in terms of software engineering, how do you think the "software production rate" for scaling up large models evolved from 2020 to late 2021? I don't see why we couldn't get 4 OOM between 2020 and 2025.

If we just take the example of large LMs, we went from essentially 1-10 publicly known models in 2020, to 10-100 in 2021 (cf. China, Korea, Microsoft, DM, etc.), and ... (read more)

3Daniel Kokotajlo
If I understand you correctly, you are asking something like: How many programmer-hours of effort and/or how much money was being spent specifically on scaling up large models in 2020? What about in 2025? Is the latter plausibly 4 OOMs more than the former? (You need some sort of arbitrary cutoff for what counts as large. Let's say GPT-3 sized or bigger.) Yeah maybe, I don't know! I wish I did. It's totally plausible to me that it could be +4 OOMs in this metric by 2025. It's certainly been growing fast, and prior to GPT-3 there may not have been much of it at all.

Among other things, Phil's literature review studies to what extent human labor will be a bottleneck for economic growth as AI substitutes for labor. I agree with you that AI-coding-AIs would have weird effects... but do you agree with the point that it won't be enough to sustain growth, or are you thinking of other paths where certain bottlenecks might not really be a problem?

1Charlie Steiner
I think that humans would still be necessary for human society for a reasonable amount of time (months or more) if things go well. If things don't go well, we're toast, which is a pretty big deviation from the economic model. But even if things go well, I think the presence of things like superhuman persuasion lead to a breakdown of assumptions behind normal economic behavior in humans, even in that period where human labor is still a cost-effective input to the (now superhumanly-planned) economy.

I created a class initializing the attributes you mentioned, and when I added your docstring to your function signature it gave me exactly the answer you were looking for. Note that this was all on the first try, and that I did not think at all about the initialization for components, marginalized, or observed; I simply auto-completed.

from typing import Set

class Distribution:
    def __init__(self):
        self.components = []
        self.marginalized = None
        self.observed = None

    def unobserved(self) -> Set[str]:
        """Returns a set of all unobserved random variable names inside
... (read more)

Wait, did they plainly forbid you to use it at all during work time, or did they forbid using its outputs for IT issues? Surely, using Codex for inspiration, given a natural language prompt, and looking at what function it calls does not seem to infringe any copyright rules?

  • 1) If you start with your own variable names, it would auto-complete with those, maybe using something it learned online. Would that count as plagiarism in your sense? How would that differ from copy-pasting from Stack Overflow and changing the variable names (I'm not an expert in SO copyright terms
... (read more)

The problem with arguing against that claim is that nobody knows whether transformers/scaling language models are sufficient for full code automation. To take your nootropics example, an analogy would be if nootropics were legal, did not have negative side effects, a single company were giving "beta access" (for now) to a new nootropic in unlimited amounts at no cost to a market of tens of millions of users, the data from using this nootropic were collected by the company to improve the product, and there actually were 100k peer-reviewed publications p... (read more)

I buy that "generated code" will not add anything to the training set, and that Copilot doesn't help with having good data or (directly) better algorithms. However, the feedback loop I am pointing at is when you accept suggestions on Copilot: I think it is learning from human feedback on which solutions people select. If the model is "finetuned" to a specific dev's coding style, I would expect Codex to suggest even better code (because of the high quality of the finetuning data) to someone at OAI than to me or you.
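To make the kind of loop I'm imagining concrete (all names below are hypothetical; I obviously don't know OpenAI's actual pipeline):

from dataclasses import dataclass
from typing import List

@dataclass
class CompletionEvent:
    """Hypothetical telemetry record for one Copilot suggestion."""
    prompt: str
    suggestion: str
    accepted: bool  # did the developer keep the suggestion?

def build_finetuning_pairs(events: List[CompletionEvent]) -> List[dict]:
    """Keep only accepted suggestions as (prompt, completion) pairs, so the
    model gets nudged toward the solutions people actually select.
    Purely illustrative, not a description of the real system."""
    return [{"prompt": e.prompt, "completion": e.suggestion}
            for e in events if e.accepted]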

How much of this is 'quality of code' vs. 'quality of
... (read more)

Thanks for the post, it's a great idea to have both arguments.

My personal preference would be for both arguments to be the same length, to properly compare their strength (here the skeptic gets one paragraph while the advocate gets 3-6x more), and not always in the same order: not just skeptic then advocate, but also advocate -> skeptic, or even skeptic -> advocate -> skeptic -> ..., so it does not appear that one side is the "haven't thought about it much" view.

Right, I just googled Marblestone, so you're approaching it from the dopamine side and not the acetylcholine one. Without debating about words, their neuroscience paper is still at least trying to model the phasic dopamine signal as some RPE and the prefrontal network as an LSTM (IIRC), which is not acetylcholine-based. I haven't read this post & the one linked in detail; I'll comment again when I do, thanks!

Awesome post! I happen to have also tried to distill the links between RPE and phasic dopamine in "Prefrontal Cortex as a Meta-RL System" on this blog.

In particular, I reference this paper on DL in the brain & this other one for RL in the brain. Also, I feel like part 3 of the RL book, about links between RL and neuro, is a great resource for this.

1Steve Byrnes
Thanks! If you Ctrl-F the post you'll find my little paragraph on how my take differs from Marblestone, Wayne, Kording 2016. I haven't found "meta-RL" to be a helpful way to frame either the bandit thing or the follow-up paper relating it to the brain, more-or-less for reasons here, i.e. that the normal RL / POMDP expectation is that actions have to depend on previous observations—like think of playing an Atari game—and I guess we can call that "learning", but then we have to say that a large fraction of every RL paper ever is actually a meta-RL paper, and more importantly I just don't find that thinking in those terms leads me to a better understanding of anything, but whatever, YMMV. I don't agree with everything in the RL book chapter but it's still interesting, thanks for the link.

Funnily enough, I wrote a blog post distilling what I learned from reproducing the experiments of that 2018 Nature paper, adding some animations and diagrams. I especially look at the two-step task and the Harlow task (the one with monkeys looking at a screen), and also try to explain some brain things (e.g. how DA interacts with the PFN) at the end.

An HN comment, unsure about the meta-learning generalization claims, argues that OpenAI has a "serious duty [...] to frame their results more carefully"

Having printed and read the full version, I found this ultra-simplified version a useful summary.

Happy to read a (not-so-)simplified version (like 20-30 paragraphs).

Does that summarize your comment?

1. Proposals should make superintelligences less likely to fight you by using some conceptual insight true in most cases.
2. With CIRL, this insight is "we want the AI to actively cooperate with humans", so there's real value from it being formalized in a paper.
3. In the counterfactual paper, there's the insight "what if the AI thinks it's not on but still learns".
For the last bit, I have two interpretations:
4.a. However, it's unclear that this design avoids all manipulative behaviour
... (read more)
1Alex Turner
It's more like 4a. The line of thinking seems useful, but I'm not sure that it lands.

The zero reward is in the paper. I agree that skipping would solve the problem. From talking to Stuart, my impression is that he thinks the zero reward would be equivalent to skipping as a way of specifying "no learning", or would just slow down learning. My disagreement is that I think it can confuse learning to the point of not learning the right thing.
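To make the difference concrete, here is a toy sketch of one training episode under the two options (all interfaces here are hypothetical, just to illustrate the protocol):

import random

EPSILON = 0.1  # hypothetical probability of the erasure event

def run_episode(oracle, question, automatic_reward, show_to_humans,
                zero_reward_on_read=True):
    """One counterfactual-oracle episode (illustrative sketch).

    If the answer is erased, humans never see it and the oracle trains on
    the automatic, counterfactual reward. If it is read, we either train
    on a reward of 0 (the paper's choice) or skip the update entirely
    (the skipping option discussed above)."""
    answer = oracle.predict(question)
    erased = random.random() < EPSILON

    if erased:
        oracle.update(question, answer, automatic_reward(question, answer))
    else:
        show_to_humans(answer)
        if zero_reward_on_read:
            oracle.update(question, answer, 0.0)  # "no learning" encoded as r = 0
        # else: no update at all for this episode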

Why not do a combination of pre-training and online learning, where you do enough during the training phase to get a useful predictor, and then use online learning to deal with subsequent distributional shifts?
... (read more)

The string is read with probability 1-ε

Yes, if we choose the utility function to make it a CDT agent optimizing for the reward for one step (so a particular case of act-based agent), then it won't care about future versions of itself nor want to escape.

I agree with the intuition of shutting down to make it episodic, but I am still confused about the causal relationship between "having the rule to shut down the system" and "having a current timestep maximizer". For it to really be a "current timestep maximizer", it needs to be in some kind of reward/utility function. Beca... (read more)

The Asymptotically Unambitious AGI thread helped me clarify my thoughts, thanks. I agree that an optimal CDT agent won't think about future versions, and I don't see any optimization pressure towards an escape message nor towards disproportionately common "escape message" regions.

However, it still assumes we have access to this magic oracle that optimizes for 1_E · R, where E is the event where humans don't see the answer, 1_E its indicator function, and R the counterfactual reward (given by the automatic machine). If humans were able to build ... (read more)

1Wei Dai
Why do we have to give the oracle a zero reward for the non-erasure episodes? Why not just skip the learning/update step for those episodes? Why not do a combination of pre-training and online learning, where you do enough during the training phase to get a useful predictor, and then use online learning to deal with subsequent distributional shifts? Although I guess that probably isn't really original either. What seems original is that during any episode where learning will take place, don't let humans (or any other system that might be insecure against the oracle) see the oracle's output until the episode is over.
0Ryan Carey
The escape action being randomly called should not be a problem if it is a text string that is only read if r=1, and is ineffectual otherwise...

Reply: The button is a communication link between the operator and the agent. In general, it is possible to construct an agent that shuts down even though it has received no such message from its operators as well as an agent that does get a shutdown message, but does not shut down. Shutdown is a state dependent on actions, and not a communication link.

This is very clear. "Communication link" made me understand that it didn't have a direct physical effect on the agent. If you want to make it even more intuitive you could do a diagram, but this explanatio... (read more)

Layman questions:

1. I don't understand what you mean by "state" in "Suppose, however, that the AI lacked any capacity to press its shutdown button, or to indirectly control its state". Do you include its utility function in its state? Or just the observations it receives from the environment? What context/framework are you using?

2. Could you define U_S and U_N? From the Corrigibility paper, U_S appears to be a utility function favoring shutdown, and U_N is a potentially flawed utility function, a first stab at specifying their own ... (read more)

1Henrik Åslund
Thank you so much for your comments, Michaël! The post has been updated on most of them. Here are some more specific replies.

1. I don't understand what you mean by "state" in "Suppose, however, that the AI lacked any capacity to press its shutdown button, or to indirectly control its state". Do you include its utility function in its state? Or just the observations it receives from the environment? What context/framework are you using?

Reply: "State" refers to the state of the button, i.e., whether it is in an on state or an off state. It is now clarified.

2. Could you define U_S and U_N? From the Corrigibility paper, U_S appears to be a utility function favoring shutdown, and U_N is a potentially flawed utility function, a first stab at specifying their own goals. Was that what you meant? I think it's useful to define it in the introduction.

Reply: U_{N} is assumed rather than defined, but it is now clarified.

3. I don't understand how an agent that "[lacks] any capacity to press its shutdown button" could have any shutdown ability. It seems like a contradiction, unless you mean "any capacity to directly press its shutdown button".

Reply: The button is a communication link between the operator and the agent. In general, it is possible to construct an agent that shuts down even though it has received no such message from its operators, as well as an agent that does get a shutdown message but does not shut down. Shutdown is a state dependent on actions, and not a communication link. Hopefully, this clarifies that they are uncorrelated. I think it's clear enough in the post already, but if you have some suggestion on how to clarify it even more, I'd gladly hear it!

4. What's the "default value function" and the "normal utility function" in "Optimisation incentive"? Is it clearly defined in the literature?

Reply: It is now clarified.

5. "Worse still... for any action..." -> if you choose b as some action with bad corrigibility property, it seems reasonable

Hey Abram (and the MIRI research team)!

This post resonates with me on so many levels. I vividly remember the Human-Aligned AI Summer School where you used to be a "receiver" and Vlad was a "transmitter", when talking about "optimizers". Your "document" especially resonates with my experience running an AI Safety Meetup (Paris AI Safety).

In January 2019, I organized a Meetup about "Deep RL from human preferences". Essentially, the resources were ordered by difficulty, so you could discuss the 80k podcast, the open A... (read more)