All of Cleo Nardo's Comments + Replies

the base model is just predicting the likely continuation of the prompt. and it's a reasonable prediction that, when an assistant is given a harmful instruction, they will refuse. this behaviour isn't surprising.

0cdt
This is not an obvious continuation of the prompt to me - maybe there are just a lot more examples of explicit refusal on the internet than there are in (e.g.) real life.

Hey Nisan. Check the following passage from Domain Theory (Samson Abramsky and Achim Jung). This might be helpful for equipping  with an appropriate domain structure. (You mention [JP89] yourself.)

We should also mention the various attempts to define a probabilistic version of the powerdomain construction, see [SD80, Mai85, Gra88, JP89, Jon90].

  • [SD80] N. Saheb-Djahromi. CPO’s of measures for nondeterminism. Theoretical Computer Science, 12:19–37, 1980.
  • [Mai85] M. Main. Free constructions of powerdomains. In A. Melton, editor, Mathematical Foundati
... (read more)
1Nisan
Thanks! For convex sets of distributions: If you weaken the definition of fixed point to gC(S)≤S, then the set {S∈CX:gC(S)=S} has a least element which really is a least fixed point.

However, if it is the case that the difference between humans and monkeys is mostly due to a one-shot discrete difference (ie language), then this cannot necessarily be repeated to get a similar gain in intelligence a second time.

Perhaps language is a zero-one, i.e. language renders a mind "cognitively complete" in the sense that the mind can represent anything about the external world, and make any inferences using those representations. But intelligence is not thereby zero-one because intelligence depends on continuous variables like computional speed... (read more)

In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases;

I'm not sure how well this metric tracks what people care about — performance on particular downstream tasks (e.g. passing a law exam, writing bugless code, automating alignment research, etc)

Yep, you're correct. The original argument in the Waluigi mega-post was sloppy.

... (read more)

You're correct. The finite context window biases the dynamics towards simulacra which can be evidenced by short prompts, i.e. biases away from luigis and towards waluigis.

But let me be more pedantic and less dramatic than I was in the article — the waluigi transitions aren't inevitable. The waluigi are approximately-absorbing classes in the Markov chain, but there are other approximately-absorbing classes which the luigi can fall into. For example, endlessly cycling through the same word (mode-collapse) is also an approximately-absorbing class.

3Abram Demski
What report is the image pulled from?

Yep I think you might be right about the maths actually.

I'm thinking that waluigis with 50% A and 50% B have been eliminated by llm pretraining and definitely by rlhf. The only waluigis that remain are deceptive-at-initialisation.

So what we have left is a superposition of a bunch of luigis and a bunch of waluigis, where the waluigis are deceptive, and for each waluigi there is a different phrase that would trigger them.

I'm not claiming basin of attraction is the entire space of interpolation between waluigis and luigis.

Actually, maybe "attractor" is the wr... (read more)

2Abram Demski
I think your original idea was tenable. LLMs have limited memory, so the waluigi hypothesis can't keep dropping in probability forever, since evidence is lost. The probability only becomes small - but this means if you run for long enough you do in fact expect the transition.

Yes — this is exactly what I've been thinking about!

Can we use RLHF or finetuning to coerce the LLM into interpreting the outside-text as undoubtably literally true.

If the answer is "yes", then that's a big chunk of the alignment problem solved, because we just send a sufficiently large language model the prompt with our queries and see what happens.