I really liked the analogy of taking actions, falling asleep then waking up (possibly with some modifications) and continuing.

I was already aware of your main point, but the way you've described it is a much clearer way of thinking about this.

A Problem to Solve Before Building a Deception Detector

Chris_Leong4d10

Recently, the focus of mechanistic interpretability work has shifted to thinking about "representations", rather than strictly about entire algorithms

Recently? From what I can tell, this seems to have been a focus from the early days (1, 2).

That said, great post! I really appreciated your conceptual frames.

Chris_Leong's Shortform

Chris_Leong6d*611

Collapsible boxes are amazing. You should consider using them in your posts.

They are a particularly nice way of providing a skippable aside. For example, filling in background information, answering an FAQ or including evidence to support an assertion.

Compared to footnotes, collapsible boxes are more prominent and are better suited to containing paragraphs or formatted text.

AI for AI safety

Chris_Leong13d*1-1

Great post. I think some of your frames add a lot of clarity and I really appreciated the diagrams.

One subset of AI for AI safety that I believe to be underrated is wise AI advisors [1]. Some of the areas you've listed (coordination, helping with communication, improving epistemics) intersect with this, but I don't believe that they exhaust the wisdom frame, especially since the first two were only mentioned in the context of capability restraint. You also mention civilizational wisdom as a component of backdrop capacity, and I agree that this is a very diffuse factor. At the same time, a less diffuse intervention would be to increase the wisdom of specific actors.

You write: "If efforts to expand the safety range can’t benefit from this kind of labor in a comparable way... then absent large amounts of sustained capability restraint, it seems likely that we’ll quickly end up with AI systems too capable for us to control".

I agree. In fact, a key reason why I think this is important is that we can't afford to leave anything on the table.

One of the things I like about the approach of training AI advisors is that humans can compensate for weaknesses in the AI system. In other words, I'm introducing a third category of labour: human-AI cybernetic systems, or centaur labour. I think this is likely to widen the sweet spot; however, we have to make sure that we do this in a way that differentially benefits safety.

You do discuss the possibility of using AI to unlock enhanced human labour. It would also be possible to classify such centaur systems under this designation.

[1] More broadly, I think there's merit to the cyborgism approach, even if some of the arguments are less compelling in light of recent capabilities advances.

The Logistics of Distribution of Meaning: Against Epistemic Bureaucratization

Chris_Leong15d20

Lots of interesting ideas here, but the connection to alignment still seems a bit vague.

Is misalignment really a lack of sensitivity, as opposed to a difference in goals or values? It seems to me that an unaligned ASI would be extremely sensitive to context, just in the service of its own goals.

Then again, maybe you see Live Theory as being more about figuring out what the outer objective should look like (broad principles that are then localised to specific contexts) rather than about figuring out how to ensure an AI internalises specific values. And I can see potential advantages in this kind of indirect approach vs. trying to directly define or learn a universal objective.

The “no sandbagging on checkable tasks” hypothesis

Chris_Leong1mo10

That would make the domain of checkable tasks rather small.

That said, it may not matter depending on the capability you want to measure.

If you want to make the AI hack a computer to turn the entire screen green and it skips a pixel so as to avoid completing the task, it would still have demonstrated that it possesses the dangerous capability, so it has no reason to sandbag.

On the other hand, if you are trying to see whether it has a capability that you wish to use, it can still sandbag.

On the Rationality of Deterring ASI

Chris_Leong1mo32

Points for creativity, though I'm still somewhat skeptical about the viability of this strategy.

The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Chris_Leong1mo10

I was taking it as "solves" or "gets pretty close to solving". Maybe that's a misinterpretation on my part. What did you mean here?

The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Chris_Leong2mo10

First of all, it tackles one of the main core difficulties of AI safety in a fairly direct way — namely, the difficulty of how to specify what we want AI systems to do (aka “outer alignment”)

I wouldn't quite go so far as to say it "tackles" the problem of outer alignment, but it does tie into (pragmatic) attempts to solve the problem by identifying the ontology of realistically specifiable reward functions. However, maybe I'm misunderstanding you?

Chris_Leong's Shortform

Chris_Leong2mo00

Acausal positive interpretation
