All of Gunnar_Zarncke's Comments + Replies

Hi, is there a way to get people in touch with a project or project lead? For example, I'd like to get in touch with Masaharu Mizumoto because iVAIS sounds related to the aintelope project. 

2Linda Linsefors
You can find their preferred contact info in each document, in the Team section.
1[comment deleted]

I notice that o1's behavior (its cognitive process) looks suspiciously like human behaviors:

  • Cognitive dissonance: o1 might fabricate or rationalize to maintain internal consistency in the face of conflicting data (which means there is inconsistency).
  • Impression management/self-serving bias: o1 may attempt to appear knowledgeable or competent, leading to overconfidence, because it is rewarded more for how its answers look than for their content (which means the model is stronger than the feedback).

But why is this happening more when o1 can reason more than previous models? Shouldn't ... (read more)

3Abram Demski
To me, this comparison to humans doesn't seem to answer why the o1 training ended up producing this result.

I think the four scenarios outlined here roughly map to the areas 1, 6, 7, and 8 of the 60+ Possible Futures post.

Can you provide some simple or not-so-simple example automata in that language?

5Vanessa Kosoy
Good idea!

Example 1

Fix some alphabet Σ. Here's how you make an automaton that checks that the input sequence (an element of Σ*) is a subsequence of some infinite periodic sequence with period n. For every k in Z/n, let Ak be an automaton that checks whether the symbols in the input sequence at places i s.t. i ≡ k (mod n) are all equal (its number of states is O(n|Σ|)). We can modify it to make a transducer A′k that produces its unmodified input sequence if the test passes and ⊥ if the test fails. It also produces ⊥ when the input is ⊥. We then chain A′0, A′1, …, A′n−1 and get the desired automaton.

Alternatively, we can connect the Ak in parallel and then add an automaton B with n boolean inputs that acts as an AND gate. B is a valid multi-input automaton in our language because AND is associative and commutative (so we indeed get a functor on the product category).

Notice that the description size of this automaton in our language is polynomial in n. On the other hand, a tabular description would be exponential in n (the full automaton has exponentially many states). Moreover, I think that any regular expression for this language is also exponentially large.

Example 2

We only talked about describing deterministic (or probabilistic, or monadic) automata. What about nondeterministic? Here is how you can implement a nondeterministic automaton in the same language, without incurring the exponential penalty of determinization, assuming non-free categories are allowed.

Let B be some category that contains an object B and a morphism b : B → B s.t. b ≠ idB and b² = b. For example, it can be the closed cartesian category freely generated by this datum (which I think is well-defined). Then, we can simulate a non-deterministic automaton A on category C by a deterministic transducer from C to B:

* The state set is always the one-element set (or, it can be two elements: "accept" and "reject").
* For every state of A, we have a variable of signature B→B. This variable is intended to
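For concreteness, here is a plain-Python sketch of the Example 1 construction. The function names are mine, and it only illustrates the per-residue-class checks and their conjunction, not the categorical language itself.

```python
# Plain-Python illustration of Example 1: the input passes iff, for every
# residue class k mod n, all symbols at positions i with i ≡ k (mod n) agree.
# Each per-class check corresponds to one small automaton A_k with O(n·|Σ|)
# states, so the whole description stays polynomial in n.

def residue_class_constant(sequence, k, n):
    """Check the property enforced by A_k: positions ≡ k (mod n) all carry the same symbol."""
    first = None
    for i, symbol in enumerate(sequence):
        if i % n != k:
            continue
        if first is None:
            first = symbol
        elif symbol != first:
            return False
    return True

def fits_periodic_pattern(sequence, n):
    """AND the n per-class checks, mirroring the chained transducers A'_0, ..., A'_{n-1}."""
    return all(residue_class_constant(sequence, k, n) for k in range(n))

assert fits_periodic_pattern("abcabcab", 3)
assert not fits_periodic_pattern("abcabxab", 3)
```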

Just a data point that supports hold_my_fish's argument: the savant Kim Peek likely memorized gigabytes of information and could access them quite reliably:

https://personal.utdallas.edu/~otoole/CGS_CV/R13_savant.pdf 

2Steve Byrnes
Ooh interesting! Can you say how you're figuring that it's "gigabytes of information?"

Are there different classes of learning systems that optimize for the reward in different ways?

3Alex Turner
Yes, model-based approaches, model-free approaches (with or without a critic), AIXI - all of these should be analyzed in terms of their mechanistic details.

I don't think that shards are distinct, neither physically nor logically, so they can't hide stuff in the sense of keeping it out of view of the other shards.

Also, I don't think "querying for plans" is a good summary of what goes on in the brain. 

I'm coming more from a brain-like AGI lens, and my account of what goes on would be a bit different. I'm trying to phrase this in shard theory terminology.

First, a prerequisite: Why do Alice's shards generate thoughts that value Rick's state to begin with? The Rick-shard has learned that actions that... (read more)

3Alex Turner
Thanks for detailing it. I understand you to describe ~iterative filtering and refinement of a crude proto-plan (church-related thoughts and impulses) which filters down into a more detailed plan, where each piece is selected to be amenable to all relevant shards (without explicit planning). It still doesn't sound quite right to me, for a few reasons I'll think about more.

Some other noteworthy groups in academia led by people who are somewhat connected to this community:
- Jacob Steinhardt (Berkeley)
- Dylan Hadfield-Menell (MIT)
- Sam Bowman (NYU)
- Roger Grosse (UofT)

Some other noteworthy groups in academia led by people who are perhaps less connected to this community:
- Aleksander Madry (MIT)
- Percy Liang (Stanford)
- Scott Niekum (UMass Amherst)

Can you provide some links to these groups?

These professors all have a lot of published papers in academic conferences. It's probably a bit frustrating for them to not have their work summarized here and then be asked to explain it, when all of their work is published already. I would start by looking at their Google Scholar pages, followed by personal websites and maybe Twitter. One caveat is that the papers probably don't include full explanations of the x-risk motivation or applications of the work, but that's the kind of reading between the lines that AI safety people should be able to do themselves.

Some observations:

  • Genes reproduce themselves. 
  • Humans reproduce themselves.
  • Symbols are relearned.
  • Values are reproduced. 

Each needs an environment to do so, but the key observation seems to be that a structure is reliably reproduced across intermediate forms (mitosis, babies, language, society), and that these structures build on top of each other. It seems plausible that there is a class of formal representations (see the sketch after this list) that describe

  • the parts that are retained across instances,
  • the embedding into each other (values into genes and symbols), and
  • the dynamics of the transfer.   
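Very speculatively, such a representation might have an interface like the following Python sketch. All names here are placeholders of mine, not an established formalism; it only illustrates the three ingredients above.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

S = TypeVar("S")  # the structure that gets reproduced (gene, symbol, value)
E = TypeVar("E")  # the carrier/environment it is embedded in (cell, language, society)

@dataclass
class Replicator(Generic[S, E]):
    # what counts as "the same structure" across instances
    invariant: Callable[[S, S], bool]
    # how the structure is embedded in its carrier
    embed: Callable[[S], E]
    # the transfer dynamics that recover it in the next instance
    transfer: Callable[[E], S]

    def reproduces(self, original: S) -> bool:
        """One round of embedding and transfer should preserve the invariant."""
        return self.invariant(original, self.transfer(self.embed(original)))
```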
1Linda Linsefors
If something is good at replicating, then there will be more of that thing; this creates a selection effect for things that are good at replicating. The effects of this can be observed in biology and memetics.

Maybe self-replication can be seen as an agentic system with the goal of self-replicating? In this particular question all uncertainty comes from "agent" being a fuzzy concept, and not from any uncertainty about the world. So answering this question will be a choice of perspective, not information about the world.

Either way, the type of agency I'm mainly interested in is the type of agency that has other goals than just self-replication. Although maybe there are things to be learned from the special case of having self-replication as a goal?

If the AI learns my values then this is a replication of my values. But there are also examples of magic agentic force where my values are not copied at any point along the way. Looking at how society is transferred between generations might have some clues to value learning? But I'm less optimistic about looking at what is similar between instances of self-replication in general, because I think I already know this, and also, it seems to be one abstraction level too high, i.e. the similarities are properties above the mechanistic details, and those details are what I want.

You don't talk about human analogs of grokking, and that makes sense for a technical paper like this. Nonetheless, grokking also seems to happen in humans, and everybody has had "Aha!" moments before. Can you maybe comment a bit on the relation to human learning? It seems clear that human grokking is not a process that purely depends on the number of training samples seen but also on the availability of hypotheses. People grok faster if you provide them with symbolic descriptions of what goes on. What are your thoughts on the representation and transfer of the resulting structure, e.g., via language/token streams?

Hmm. So firstly, I don't think ML grokking and human grokking having the same name is that relevant - it could just be a vague analogy. And I definitely don't claim to understand neuroscience!

That said, I'd guess there's something relevant about phase changes? Internally, I know that I initially feel very confused, then have some intuition of 'I half see some structure but it's super fuzzy', and then eventually things magically click into place. And maybe there's some similar structure around how phase changes happen - useful explanations get reinforced, a... (read more)

I mean scoring thoughts in the sense of [Intro to brain-like-AGI safety] 3. Two subsystems: Learning & Steering with what Steven calls "Thought Assessors". Thoughts totally get scored in that sense.

3Alex Turner
I think David is referring to the claims made by Human values & biases are inaccessible to the genome.

About the problems you mention:

the apparent phenomenon of credit assignment improving over a lifetime. When you're older and wiser, you're better at noticing which of your past actions were bad and learning from your mistakes.

I don't get why you see a problem here. More data will lead to better models over time. You get exposed to more situations, and with more data, the noise will slowly average out. Not necessarily because you can clearly attribute things to their causes, but because you randomly get into situations where the effect is clearer. It mo... (read more)
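To illustrate the averaging-out intuition, here is a toy simulation; the numbers and setup are made up purely for illustration. The naive estimate of one action's contribution approaches the true effect as episodes accumulate, even though no single episode allows clean attribution.

```python
import random

random.seed(0)
TRUE_EFFECT = 1.0   # actual contribution of the action to reward
NOISE_SCALE = 5.0   # reward variance coming from other, unattributed causes

def naive_credit_estimate(episodes):
    """Average reward observed when the action was taken; no explicit attribution."""
    total = 0.0
    for _ in range(episodes):
        confound = random.gauss(0.0, NOISE_SCALE)  # effects of everything else
        total += TRUE_EFFECT + confound
    return total / episodes

for n in (10, 100, 10_000):
    print(n, round(naive_credit_estimate(n), 2))
# the estimate drifts toward TRUE_EFFECT ≈ 1.0 as the number of episodes grows
```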

1David Udell
Maybe. But your subcortical reinforcement circuitry cannot (easily) score your thoughts. What it can score are the mystery computations that led to hardcoded reinforcement triggers, like sugar molecules interfacing with tastebuds. When you're just thinking to yourself, all of that should be a complete black-box to the brainstem. I did mention that something is going on in the brain with self-supervised learning, and that's probably training your active computations all the time. Maybe shards can be leveraging this training loop? I'm currently quite unclear on this, though.

The main difference between LDAIXI and a human in terms of ontology seems to be that the things a human values are ultimately grounded in senses and a reward tied to them. For example, we value sweet things because we have a detector for sweetness and a reward tied to that detector. When our understanding of what sugar is changes, the detector doesn't, and thus the ontology change works out fine. But I don't see a reason you couldn't set up LDAIXI the same way: just specify the reward in terms of a diamond detector, or multiple ones. In the end, there are already detectors that AIXI uses; how else would it get input?
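A toy sketch of that proposal (my own framing, not Hutter's formalism): if reward is computed by a fixed detector over raw observations, the reward signal is untouched when the agent's learned ontology changes, just as the sweetness detector doesn't change when our concept of sugar does.

```python
def diamond_detector(observation: bytes) -> bool:
    """Stand-in for a hardwired sensor that looks only at raw percepts."""
    return b"diamond" in observation  # placeholder detection rule

def reward(observation: bytes) -> float:
    # Grounded in the detector's output, not in any learned concept of "diamond",
    # so re-describing diamonds inside the world model leaves this function unchanged.
    return 1.0 if diamond_detector(observation) else 0.0
```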

3Alex Turner
Because LDAIXI doesn't e.g. have the credit assignment mechanism which propagates reward into learned values. Hutter just called it "reward." But that "reward function" is really just a utility function over observation histories, or the work tapes of the hypotheses, or whatever. Not the same as the mechanisms within people which make them have good general alignment properties. (See also: the detached lever fallacy)

Thank you for mentioning us. In fact, the list of candidate instincts got longer. It isn't in a presentable form yet, but please message me if you want to talk about it.

The list is more theoretical, and I want to prove that this is not just theoretical speculation by operationalizing it. jpyykko is already working on something more on the symbolic level. 

Rohin Shah recommended that I find people to work with me on alignment, and I teamed up with two LWers. We just started work on a project to simulate instinct-cued learning in a toy world. I think this project fits research point 15.2.1.2, and I wonder now how to apply for funding - we would probably need it if we want to simulate with somewhat larger NNs.

2Linda Linsefors
I'm also interested to see the list of candidate instincts. Regarding funding, how much money do you need? Just order of magnitude. There are lots of different grants, and where you want to apply depends on the size of your budget.
1[comment deleted]