User Comment Replies — AI Alignment Forum

The case for more ambitious language model evals

I agree overall with Janus, but the Gwern example is a particularly easy one given he has 11,000+ comments on Lesswrong.

A bit over a year ago I benchmarked GPT-3 on predicting newly scraped tweets for authorship (from random accounts over 10k followers) and top-3 acc was in the double digits. IIRC after trying to roughly control for the the rate at which tweets mentioned their own name/org, my best guess was that accuracy was still ~10%. To be clear, in my view that's a strong indication of authorship identification capability.

2janus1y

Note the prompt I used doesn't actually say anything about Lesswrong, but gpt-4-base only assigned Lesswrong commentors substantial probability, which is not surprising since there are all sorts of giveaways that a comment is on Lesswrong from the content alone. Filtering for people in the world who have publicly had detailed, canny things to say about language models and alignment and even just that lack regularities shared among most "LLM alignment researchers" or other distinctive groups like academia narrows you down to probably just a few people, including Gwern. The reason truesight works (more than one might naively expect) is probably mostly that there's mountains of evidence everywhere (compared to naively expected). Models don't need to be superhuman except in breadth of knowledge to be potentially qualitatively superhuman in effects downstream of truesight-esque capabilities because humans are simply unable to integrate the plenum of correlations.

Abram Demski's ELK thoughts and proposal - distillation

Jacob Pfau1y10

What part of the proposal breaks if we do counterfactuals in input space rather than on the predictor's state?

1Rubi Hudson1y

Sorry, I'm not quite clear what you mean by this, so I might be answering the wrong question. I believe counterfactuals on the input space are a subset of counterfactuals on the predictor's state, because the input space's influence is through the predictor's state, but modifying the predictor's state can also reach states that don't correspond to any input. As such, I don't think counterfactuals on the input space add any power to the proposal.

Conditioning Predictive Models: Making inner alignment as easy as possible

Jacob Pfau2y*30

the incentive for a model to become situationally aware (that is, to understand how it itself fits into the world) is only minimally relevant to performance on the LLM pre-training objective (though note that this can cease to be true if we introduce RL fine-tuning).

Why is this supposed to be true? Intuitively, this seems to clash with the authors view that anthropic reasoning is likely to be problematic. From another angle, I expect performance gain from situational awareness to increase as dataset cleaning/curation increases. Dataset cleaning has increas... (read more)

4Evan Hubinger2y

To be clear, I think situational awareness is relevant in pre-training, just less so than in many other cases (e.g. basically any RL setup, including RLHF) where the model is acting directly in the world (and when exactly in the model's development it gets an understanding of the training process matters a lot for deceptive alignment). From footnote 6 above:

Causal confusion as an argument against the scaling hypothesis

Jacob Pfau3y50

(Thanks to Robert for talking with me about my initial thoughts) Here are a few potential follow-up directions:

I. (Safety) Relevant examples of Z

To build intuition on whether unobserved location tags leads to problematic misgeneralization, it would be useful to have some examples. In particular, I want to know if we should think of there being many independent, local Z_i, or dataset-wide Z? The former case seems much less concerning, as that seems less likely to lead to the adoption of a problematically mistaken ontology.

Here are a couple examples I ... (read more)

ELK Thought Dump

Jacob Pfau3y00

Here's Chalmers defending his combinatorial state automata idea.

0Tassilo Neubauer3y

Thanks! Exactly what I was looking for :)

Instrumental Convergence For Realistic Agent Objectives

Jacob Pfau3y00

Yes, I agree that in the simplest case, SC2 with default starting resources, you just build one or two units and you're done. However, I don't see why this case should be understood as generically explaining the negative alpha weights setting. Seems to me more like a case of an excessively simple game?

Consider the set of games starting with various quantities of resources and negative alpha weights. As starting resources increase, you will be incentivised to go attack your opponent to interfere with their resource depletion. Indeed, if the reward is based... (read more)

3Alex Turner3y

I agree that in certain conceivable games which are not baseline SC2, there will be different power-seeking incentives for negative alpha weights. My commentary wasn't intended as a generic takeaway about negative feature weights in particular. But in the game which actually is SC2, where you don't start with a huge number of resources, negative alpha weights don't incentivize power-seeking. You do need to think about the actual game being considered, before you can conclude that negative alpha weighs imply such-and-such a behavior. I think that either γ<<1 or considering suboptimal power-seeking resolves the situation. The reason that building infrastructure intuitively seems like power-seeking is that we are not optimal logically omniscient agents; all possible future trajectores do not lay out immediately before our minds. But the suboptimal power-seeking metric (Appendix C in Optimal Policies Tend To Seek Power) does match intuition here AFAICT, where cleverly building infrastructure has the effect of navigating the agent to situations with more cognitively exploitable opportunities.

Instrumental Convergence For Realistic Agent Objectives

Jacob Pfau3y20

Am I correct to assume that the discussion of StarCraft and Minecraft are discussing single-player variants of those games?

It seems to me that in a competitive, 2-player, minimize-resource-competition StarCraft, you would want to go kill your opponent so that they could no longer interfere with your resource loss? More generally, I think competitions to minimize resources might still usually involve some sort of power-seeking. I remember reading somewhere that 'losing chess' involves normal-looking (power-seeking?) early game moves.

4Alex Turner3y

I'm implicitly assuming a fixed opponent policy, yes. Without being overly familiar with SC2—you don't have to kill your opponent to get to 0 resources, do you? From my experience with other RTS games, I imagine you can just quickly build units and deplete your resources, and then your opponent can't make you accrue more resources. Is that wrong?

AI ALIGNMENT FORUM
AF

All of Jacob Pfau's Comments + Replies

I. (Safety) Relevant examples of Z