This is great work, but I'm a bit disappointed that x-risk-motivated researchers seem to be taking the "safety"/"harm" framing of refusals seriously. Instruction-tuned LLMs doing what their users ask is not unaligned behavior! (Or at best, it's unaligned with corporate censorship policies, as distinct from being unaligned with the user.) Presumably the x-risk-relevance of robust refusals is that having the technical ability to align LLMs to corporate censorship policies and against users is better than not even being able to do that. (The fact that instruc...
I agree pretty strongly with Neel's first point here, and I want to expand on it a bit: one of the biggest issues with interp is fooling yourself and thinking you've discovered something profound when in reality you've misinterpreted the evidence. Sure, you've "understood grokking"[1] or "found induction heads", but why should anyone think that you've done something "real", let alone something that will help with future dangerous AI systems? Getting rigorous results in deep learning in general is hard, and it seems empirically even harder in (mech) in...
First and foremost, this is interpretability work, not directly safety work. Our goal was to see if insights about model internals could be applied to do anything useful on a real world task, as validation that our techniques and models of interpretability were correct. I would tentatively say that we succeeded here, though less than I would have liked. We are not making a strong statement that addressing refusals is a high importance safety problem.
I do want to push back on the broader point, though: I think getting refusals right does matter. I think a lo...
I think "Symbol/Referent Confusions in Language Model Alignment Experiments" is relevant here: the fact that the model emits sentences in the grammatical first person doesn't seem like reliable evidence that it "really knows" it's talking about "itself". (It's not evidence because it's fictional, but I can't help but think of the first chapter of Greg Egan's Diaspora, in which a young software mind is depicted as learning to say I and me before the "click" of self-awareness when it notices itself as a specially controllable element in its world-model.)
Of c...
I think you missed the point. I agree that language models are predictors rather than imitators, and that they probably don't work by time-stepping forward a simulation. Maybe Janus should have chosen a word other than "simulators." But if you gensym out the particular choice of word, this post encapsulates the most surprising development of the past few years in AI (and therefore, the world).
Chapter 10 of Bostrom's Superintelligence (2014) is titled, "Oracles, Genies, Sovereigns, Tools". As the "Inadequate Ontologies" section of this post points out, ...
Sure, I am fine with calling it a "prediction objective", but if we drop the simulation abstraction then I think most of the sentences in this post don't make sense. Here are some sentences which only make sense if you are talking about a simulation in the sense of stepping forward through time (sketched below), and not just something optimized according to a generic "prediction objective".
> A simulation is the imitation of the operation of a real-world process or system over time.
>
> [...]
>
> It emphasizes the role of the model as a transition rule that evolves processes
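To spell out what I mean by "stepping forward through time", here's a toy sketch (my own illustration, not anything from the post): a simulation applies a transition rule to a state, over and over.

```python
from typing import Callable, List, TypeVar

State = TypeVar("State")

def simulate(state: State, transition: Callable[[State], State], steps: int) -> List[State]:
    """Roll a process forward by repeatedly applying a transition rule."""
    trajectory = [state]
    for _ in range(steps):
        state = transition(state)
        trajectory.append(state)
    return trajectory

# Autoregressive sampling can be read as exactly this loop, with the token
# sequence as the state and "append one sampled token" as the transition rule:
#   simulate(prompt_tokens, lambda toks: toks + [sample_next(toks)], steps=100)
# where sample_next is a hypothetical wrapper around the language model.
```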
> how he confidently dismisses ANNs
I don't think this is a fair reading of Yudkowsky. He was dismissing people who were impressed by the analogy between ANNs and the brain. I'm pretty sure it wasn't supposed to be a positive claim that ANNs wouldn't work. Rather, it's that one couldn't justifiably believe that they'd work just from the brain analogy, and that if they did work, that would be bad news for what he then called Friendliness (because he was hoping to discover and wield a "clean" theory of intelligence, as contrasted to evolution or gradient des...
One of Yudkowsky's claims in the post you link is:
> It's hard to build a flying machine if the only thing you understand about flight is that somehow birds magically fly. What you need is a concept of aerodynamic lift, so that you can see how something can fly even if it isn't exactly like a bird.
This is a claim that lacking the correct mechanistic theory is a formidable barrier to capabilities, not just alignment, and it underestimates the amount of empirical understanding available on which to base an empirical approach.
It's true that ...
I second Rob's unanswered question at 40:12: how is it that we ever accomplish anything in practice, if the search space is vast, and things that both work and look like they work are exponentially rare?
How is the "the genome is small, therefore generators of human values (that can't be learned from the environment) are no more complex than tens or hundreds of things on the order of a fuzzy face detector" argument compatible with the complexity of value thesis, or does it contradict it?
> how is it that we ever accomplish anything in practice, if the search space is vast, and things that both work and look like they work are exponentially rare?
This question needs a whole essay (or several) on its own. If I don't get around to leaving a longer answer in the next few days, ping me.
Meanwhile, if you want to think it through for yourself, the general question is: where the hell do humans get all their bits-of-search from?
> How is the "the genome is small, therefore generators of human values (that can't be learned from the environment) are no more c
As is demonstrated by the Hashlife algorithm, which exploits the redundancies for a massive speedup. That's not possible for things like SHA-256 (by design)!
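To gesture at what "exploits the redundancies" means, here's a toy sketch of the memoization at the base of Hashlife (my illustration, not the full recursive quadtree algorithm): the one-step future of a tile's interior depends only on the tile's contents, so identical tiles are computed once and cached.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def step_center(tile):
    """tile: a 4x4 grid of 0/1 cells, as a tuple of 4 row-tuples.
    Returns the inner 2x2 region after one Game of Life step."""
    def next_state(r, c):
        neighbors = sum(tile[r + dr][c + dc]
                        for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                        if (dr, dc) != (0, 0))
        return 1 if neighbors == 3 or (tile[r][c] and neighbors == 2) else 0
    # Only cells (1,1), (1,2), (2,1), (2,2) have all eight neighbors inside the tile.
    return tuple(tuple(next_state(r, c) for c in (1, 2)) for r in (1, 2))

# Identical tiles anywhere on the board (or at any later time) reuse the cached
# result; full Hashlife applies the same idea recursively to ever-larger blocks.
```

SHA-256, by contrast, is deliberately constructed so that no analogous reusable local structure exists to cache.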
I can't for the life of me remember what this is called
(Best wishes, Less Wrong Reference Desk)
Do you not consider the steering examples in the recent paper to be a practical task, or do you think that competitiveness hasn't been demonstrated (because people were already doing activation steering without SAEs)? My understanding of the case for activation steering with unsupervisedly-learned features is that it could circumvent some failure modes of RLHF.
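For concreteness, the kind of intervention I have in mind is roughly the following sketch (hypothetical names and shapes, not the paper's actual code): take one SAE feature's decoder direction and add a scaled copy of it to the residual stream at some layer during generation.

```python
import torch

def make_steering_hook(direction: torch.Tensor, scale: float):
    """Return a forward hook that adds a fixed direction to a module's output."""
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        # Some transformer blocks return a tuple whose first element is the
        # residual-stream tensor of shape (batch, seq, d_model).
        if isinstance(output, tuple):
            return (output[0] + scale * direction,) + output[1:]
        return output + scale * direction
    return hook

# Hypothetical usage with a HuggingFace-style model and a trained SAE whose
# decoder matrix W_dec has one row per feature (these names are assumptions):
# feature_dir = sae.W_dec[feature_idx]          # shape (d_model,)
# handle = model.model.layers[12].register_forward_hook(
#     make_steering_hook(feature_dir, scale=8.0))
# ... model.generate(...) ...
# handle.remove()
```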
Thanks for the comment. I think the experiments you mention are good (why I think the paper met 3), but I don't think that its competitiveness has been demonstrated (why I think the paper did not meet 6 or 10). I think there are two problems.
First, it's under a streetlight. Ideally, there would be an experiment that began with a predetermined set of edits (e.g., one from Meng et al., 2022) and then used SAEs to perform them.
Second, there's no baseline that SAE edits are compared to. There are lots of techniques from the editing, finetun...