AI ALIGNMENT FORUM

All of mic's Comments + Replies

Modern Transformers are AGI, and Human-Level

I think the humans doing METR's tasks are closer to "expert-level" than to average or "human-level". But current LLM agents are also far below human performance on tasks that don't require any special expertise.

From GAIA:

GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins.

mic2y10

As an overly simplistic example, consider an overseer that attempts to train a cleaning robot by providing periodic feedback to the robot, based on how quickly the robot appears to clean a room; such a robot might learn that it can more quickly “clean” the room by instead sweeping messes under a rug.

This doesn't seem concerning, since human users would eventually discover that the robot tends to sweep messes under the rug (if they ever look under it), and the developers would then retrain the AI to fix the issue. Can you think of a more problematic example, in which the misbehavior wouldn't be obvious enough to just be trained away?

The Importance of AI Alignment, explained in 5 points

mic2y10

GPT-3, for instance, is notorious for outputting text that is impressive, but not of the desired “flavor” (e.g., outputting silly text when serious text is desired), and researchers often have to tinker with inputs considerably to yield desirable outputs.

Is this specifically referring to the base version of GPT-3 before instruction fine-tuning (davinci rather than text-davinci-002, for example)? I think it would be good to clarify that.

Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk")

mic2y76

I first learned about the term "structural risk" in this article from 2019 by Remco Zwetsloot and Allan Dafoe, which was included in the AGI Safety Fundamentals curriculum.

To make sure these more complex and indirect effects of technology are not neglected, discussions of AI risk should complement the misuse and accident perspectives with a structural perspective. This perspective considers not only how a technological system may be misused or behave in unintended ways, but also how technology shapes the broader environment in ways that could be disruptive

...

Thoughts on the impact of RLHF research

mic2y20

Models that have been RLHF'd (so to speak), have different world priors in ways that aren't really all that intuitive (see Janus' work on mode collapse)

Janus' post on mode collapse is about text-davinci-002, which was trained using supervised fine-tuning on high-quality human-written examples (FeedME), not RLHF. It's evidence that supervised fine-tuning can lead to weird output, not evidence about what RLHF does.

I haven't seen evidence that the RLHF'd text-davinci-003 is less safe than the imitation-based text-davinci-002.

Automating Auditing: An ambitious concrete technical research proposal

mic2y20

Is the auditing game essentially Trojan detection?

Automating Auditing: An ambitious concrete technical research proposal

mic2y10

Has anyone tried to work on this experimentally?

Mysteries of mode collapse

mic2y10

The prompt "Are birds real?" is somewhat more likely, given the "Birds aren't real" conspiracy theory, but it can still yield an answer formatted similarly to the one for "Are bugs real?"

The answer makes a lot more sense when you ask a question like "Are monsters real?" or "Are ghosts real?" It seems that with FeedME, text-davinci-002 has been trained to respond with a template answer about how "There is no one answer to this question", and it has misgeneralized this behavior to questions about real phenomena, such as "Are bugs real?"

The Alignment Problem from a Deep Learning Perspective (major rewrite)

mic2y50

Choosing actions which exploit known biases and blind spots in humans (as the Cicero Diplomacy agent may be doing [Bakhtin et al., 2022]) or in learned reward models.

I've spent several hours reading dialogue involving Cicero, and it's not at all evident to me that it's "exploiting known biases and blind spots in humans". It is, however, good at proposing and negotiating plans, as well as accumulating power within the context of the game.

A Barebones Guide to Mechanistic Interpretability Prerequisites

mic2y10

Thanks for writing this! Here is a quick explanation of all the math concepts – mostly written by ChatGPT with some manual edits.

A basis for a vector space is a set of linearly independent vectors that can be used to represent any vector in the space as a linear combination of those basis vectors. For example, in two-dimensional Euclidean space, the standard basis is the set of vectors (1, 0) and (0, 1), which are called the "basis vectors."
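As a minimal sketch of the idea above: the same vector has different coordinates depending on which basis is used. The function name `coordinates_in_basis` and the example vectors are illustrative assumptions rather than anything from the original post, and the sketch only covers the two-dimensional case, solved with Cramer's rule.

```python
from fractions import Fraction


def coordinates_in_basis(v, b1, b2):
    """Return (c1, c2) such that v == c1*b1 + c2*b2 in the 2D plane."""
    # Determinant of the matrix whose columns are b1 and b2.
    det = b1[0] * b2[1] - b2[0] * b1[1]
    if det == 0:
        # Linearly dependent vectors cannot represent every vector in the plane.
        raise ValueError("b1 and b2 are linearly dependent, so they are not a basis")
    # Cramer's rule for the 2x2 system [b1 b2] @ (c1, c2) = v.
    c1 = Fraction(v[0] * b2[1] - b2[0] * v[1], det)
    c2 = Fraction(b1[0] * v[1] - v[0] * b1[1], det)
    return c1, c2


v = (3, 1)

# In the standard basis, the coordinates are just the entries of v.
standard = coordinates_in_basis(v, (1, 0), (0, 1))

# In the (hypothetical) basis {(1, 1), (-1, 1)}, the same vector has other coordinates.
other = coordinates_in_basis(v, (1, 1), (-1, 1))

print(standard, other)
```

Here `other` satisfies 2·(1, 1) + (−1)·(−1, 1) = (3, 1), so the coordinates of `v` in the second basis are (2, −1).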

A change of basis is the process of expressing a vector in one basis in terms of another basis. For example, if we ha...

Shard Theory in Nine Theses: a Distillation and Critical Appraisal

mic2y10

For example, it should be possible to mechanistically identify shards in small RL agents (such as the RL agents studied in Langosco et al)

Could you elaborate on how we could do this? I'm unsure if the state of interpretability research is good enough for this yet.

How could we know that an AGI system will have good consequences?

mic2y10

disclaimer: this is a toy example of a pivotal act, and I think better pivotal-act options than this exist

Would you be able to provide examples of other sorts of pivotal acts you have in mind?

Announcing the Introduction to ML Safety course

mic3y10

Roughly how many hours do you expect it to take to complete the course?