Yeah, that makes sense.
For me, this is significantly different from the position I understood you to be taking. My push-back was essentially the same as
"has there been, across the world and throughout the years, a nonzero number of scientific insights generated by LLMs?" (obviously yes),
& I created the question to see if we could substantiate the "yes" here with evidence.
It makes somewhat more sense to me for your timeline crux to be "can we do this reliably" as opposed to "has this literally ever happened" -- but the claim in your post was quite explicit about t...
My idea is very similar to paragraph vectors: the vectors are trained to be useful labels for predicting the tokens.
To differentiate author-vectors from other types of metadata, the author vectors should be additionally trained to predict author labels, with a heavily-reinforced constraint that the author vectors are identical for documents which have the same author. There's also the author-vector-to-text-author-attribution network, which should be pre-trained to have a good "prior" over author-names (so we're not getting a bunch of nonsense strings out)....
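(A minimal sketch of what that training setup could look like, purely for illustration -- assuming a paragraph-vector-style model in PyTorch; every module name and loss weight here is made up rather than an existing implementation. The shared embedding row per author enforces the "identical vectors for documents with the same author" constraint by construction, and the auxiliary author-classification head stands in for the author-attribution piece.)

```python
import torch
import torch.nn as nn

class AuthorVectorLM(nn.Module):
    def __init__(self, vocab_size, num_authors, d_model=256, d_author=64):
        super().__init__()
        # One shared embedding row per author, so documents by the same author
        # literally share the same author vector.
        self.author_emb = nn.Embedding(num_authors, d_author)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model + d_author, d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)        # predict next token
        self.author_head = nn.Linear(d_author, num_authors)  # predict author label

    def forward(self, tokens, author_ids):
        a = self.author_emb(author_ids)                       # (B, d_author)
        x = self.tok_emb(tokens)                              # (B, T, d_model)
        a_rep = a.unsqueeze(1).expand(-1, x.size(1), -1)      # condition every step
        h, _ = self.rnn(torch.cat([x, a_rep], dim=-1))
        return self.lm_head(h), self.author_head(a)

def loss(model, tokens, author_ids, lm_weight=1.0, author_weight=1.0):
    # Token-prediction loss (the vectors as "useful labels for predicting the tokens")
    # plus an auxiliary author-classification loss on the author vector itself.
    logits, author_logits = model(tokens[:, :-1], author_ids)
    lm_loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
    author_loss = nn.functional.cross_entropy(author_logits, author_ids)
    return lm_weight * lm_loss + author_weight * author_loss
```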
Yeah, this is effectively a follow-up to my recent post on anti-slop interventions, detailing more of what I had in mind. So, the dual-use idea is very much what I had in mind.
Yeah, for better or worse, the logical induction paper is probably the best thing to read. The idea is actually to think of probabilities as prediction-market prices; the market analogy is a very strong one, not an indirect way of gesturing at the idea.
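(To gesture at the analogy concretely -- this is my own toy rendering, not the paper's notation: a sentence corresponds to a contract that pays out 1 if the sentence turns out true, and the market price of that contract is read off directly as its probability.)

```latex
% Toy rendering of the analogy (my notation, not the paper's).
% A sentence \varphi corresponds to a contract paying 1 if \varphi is true, 0 otherwise.
% The market price at time n is read directly as the probability:
\[
  \mathbb{P}_n(\varphi) \;=\; \mathrm{price}_n(\varphi) \in [0,1].
\]
% Traders who spot a pattern the market misses buy low / sell high, and their trading
% pushes the price (i.e. the probability) toward values they can no longer exploit.
```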
Yeah. I'm saying that the "good machine" should be trained on all three; it should be honest, but constrained by helpfulness and harmlessness. (Or, more realistically, a more complicated constitution with more details.)
Btw, to be clear: something that I think slightly speeds up AI capabilities but is good to publish is, e.g., producing rationality content for helping humans think more effectively (and AIs might be able to adopt the techniques as well). Creating a language for rationalists to reason in more Bayesian ways would probably also be good to publish.
Yeah, basically everything I'm saying is an extension of this (but obviously, I'm extending it much further than you are). We don't exactly care whether the increased rationality is in humans or AI, when the two are interacting a lot. ...
Right, my point is, I don’t see any difference between “AIs that produce slop” and “weak AIs” (a.k.a. “dumb AIs”). So from my perspective, the above is similar to: “…Because weak AIs can speed up AI capabilities much more easily than they can produce actually good alignment ideas.”
I want to explicitly call out my cliff vs gentle slope picture from another recent comment. Sloppy AIs can have a very large set of tasks at which they perform very well, but they have sudden drops in their abilities due to failure to extrapolate well outside of that.
So, rather than imagining a one-dimensional "capabilities" number, let's imagine a landscape of things you might want to be able to get AIs to do, with a numerical score for each. In the center of the landscape are "easier" things, with "harder" things further out. There is some kind of growing blob of capabilities, spreading from the center of the landscape outward.
Techniques which are worse at extrapolating (IE worse at "coherent and correct understanding" of complex domains) create more of a sheer cliff in this landscape, where things go from basically-s...
Two years later, GPT7 comes up with superhumanly-convincing safety measures XYZ. These inadequate standards become the dominant safety paradigm. At this point if you try to publish "belief propagation" it gets drowned out in the noise anyway.
Some relatively short time later, there are no humans.
I think that, if there are no humans, then slop must not be too bad. AIs that produce incoherent, superficially-appealing slop are not successfully accomplishing ambitious nontrivial goals, right?
Maybe "some relatively short time later" was confusing. I mean long enou...
Concrete (if extreme) story:
World A:
Invent a version of "belief propagation" which works well for LLMs. This offers a practical way to ensure that if an LLM seems to know something in one context, it can & will fluently invoke the same knowledge in almost all appropriate contexts.
Keep the information secret in order to avoid pushing capabilities forward.
Two years later, GPT7 comes up with superhumanly-convincing safety measures XYZ. These inadequate standards become the dominant safety paradigm. At this point if you try to publish "belief propagation" ...
Hmmm. I'm not exactly sure what the disconnect is, but I don't think you're quite understanding my model.
I think anti-slop research is very probably dual-use. I expect it to accelerate capabilities. However, I think attempting to put "capabilities" and "safety" on the same scale and maximize differential progress of safety over capabilities is an oversimplistic model which doesn't capture some important dynamics.
There is not really a precise "finish line". Rather, we can point to various important events. The extinction of all humans lies down a path where...
I'm not sure I can talk about this effectively in the differential progress framework. My argument is that if we expect to die to slop, we should push against slop. In particular, if we expect to die to slop-at-big-labs, we should push against slop-at-big-labs. This seems to suggest a high degree of information-sharing about anti-slop tech.
Anti-slop tech is almost surely also going to push capabilities in general. If we currently think slop is a big source of risk, it seems worth it.
Put more simply: if someone is already building superintelligence & de...
Do you not at all buy John's model, where there are important properties we'd like nearer-term AI to have in order for those AIs to be useful tools for subsequent AI safety work?
I think there is both important math work and important conceptual work. Proving new theorems involves coming up with new concepts, but also, formalizing the concepts and finding the right proofs. The analogy to robots handling the literal heavy lifting part of a job seems apt.
Yeah, my sense is that modern AI could be useful to tiling agent stuff if it were less liable to confabulate fake proofs. This generalizes to any technical branch of AI safety where AI could help come up with formalizations of ideas, proofs of conjectures, etc. My thinking suggests there is something of an "overhang" here at present, in the sense that modern AI models are worse-than-useless due to the way that they try to create good-looking answers at the expense of correctness.
I disagree with the statement "to some extent the goal of tiling-agents-like w...
Good citation. Yeah, I should have flagged harder that my description there was a caricature and not what anyone said at any point. I still need to think more about how to revise the post to be less misleading in this respect.
One thing I can say is that the reason that quote flags that particular failure mode is that, according to the MIRI way of thinking about the problem, it is an easy failure mode to fall into.
See here for an updated version of my thinking based on discussion with Daniel:
https://www.lesswrong.com/posts/Tzdwetw55JNqFTkzK/why-don-t-we-just-shoggoth-face-paraphraser
Yeah, I agree that's at least somewhat true. So, given that, is it a good move or not? I don't see much of an upside, since there's a heavy financial incentive for the big labs to do little about concerning results observed in such models. IE, when it comes to the question of whether frontier labs training o1-like models should follow OpenAI's example of avoiding safety training for the CoT, I think it's right to discourage them from doing this rather than encourage them... although, I say this with very low confidence.
I don't think it is worth getting into all the stuff you're mentioning here, but I think a key crux is that I'm expecting the face to be quite dumb (e.g. literally 3.5 sonnet might suffice).
I'm reacting to this whole thing as a possible default configuration going forward, rather than as it exists today. All of my concerns are about what happens when you scale it up. For example, I don't find o1's current level of deception hugely concerning in its own right; rather, I see it as a bad sign for the technology as it continues to develop.
I guess one way of fr...
Ah, yep, this makes sense to me.
Yeah, this is a good point, which doesn't seem addressed by any idea so far.
Chess is like a bounded, mathematically described universe where all the instrumental convergence stays contained, and only accomplishes a very limited instrumentality in our universe (IE chess programs gain a limited sort of power here by being good playmates).
LLMs touch on the real world far more than that, such that MCTS-like skill at navigating "the LLM world" in contrast to chess sounds to me like it may create a concerning level of real-world-relevant instrumental convergence.
Still, IMO, exploiting the frozen planner through adversarial inputs in a single step seems pretty unlikely to be a fruitful strategy for the optimized planner. If the optimized planner is simply trying to accurately convey information to the frozen planner, probably >99% of that information is best to convey through human-understandable text.
Well, I'm not sure. As you mention, it depends on the step size. It also depends on how vulnerable to adversarial inputs LLMs are and how easy they are to find. I haven't looked into the research on this, but it so...
Yeah, you're right, I no longer think it's an interesting proposal.
I'm talking about training only the Face, not training the policy (shoggoth) at all with the proposal I'm imagining.
And, these should clearly be separate models such that the training of one doesn't generalize to the other.
So, making the Face more deceptive doesn't kill the canary?
Ah, yes, I had neglected this important distinction.
So what do you do in the end, throw away the face?
It seems worth pointing out that although the CoT isn't trained to be deceptive, it is trained to think thoughts which help out a deceptive face. So it seems plausible that a mul...
Assuming the goals are evaluated over, say, 1-10 year timescales, or maybe even just 1-year timescales with no reward-shaping or feedback for intermediate rewards at all, I do think that the system won't work well enough to be relevant, since it requires way too much training time, and plausibly way too much compute, depending on how sparse the feedback actually is.
Ah, I wasn't thinking "sparse" here meant anywhere near that sparse. I thought your dense-vs-sparse was doing something like contrasting RLHF (very dense, basically no instrumental convergence) with chess (very sparse, plenty of instrumental convergence).
I still think o1 is moving towards chess on this spectrum.
Async is fine with me. Please feel encouraged to start it.
My current epistemic state is pretty "I'm not sure at all, it's a complicated subject" fwiw. I'll try to generate some responses anyway.
make the system overall much harder to shape in ways you want (because it is actively hiding things from you so your RL is less accurate)
I thought the point was that we wouldn't train against the CoT basically at all with the baseline proposal? (Similar to o1.)
Sure. Therefore, you can't check for deceptive behavior using the CoT when you're trying to train against deceptive behavior. Therefore, Hubinger et al's conclusions...
Would you be interested in having a (perhaps brief) LW Dialogue about it where you start with a basic intro to your shoggoth/mask division-of-labor, and I then present my critique?
My point here is that at the capability level of GPT4, this distinction isn't very important. There's no way to know for sure until it is too late, of course, but it seems pretty plausible that GPT4 isn't cleverly scheming. It is merely human-level at deception, and doesn't pursue any coherent overarching goal with it. It clumsily muddles through with mildly-above-average-for-human convincingness. For most queries (it seems plausible to me) it isn't even adequately coherent to make a crisp distinction between whether it's honestly trying to answer the ques...
I think the crux is that the important part of LLMs re: safety isn't their safety properties specifically, but rather the evidence they give about what alignment-relevant properties future AIs will have.
[insert standard skepticism about these sorts of generalizations when generalizing to superintelligence]
But what lesson do you think you can generalize, and why do you think you can generalize that?
...I think this is a crux, in that I don't buy o1 as progressing to a regime where we lose so much dense feedback that it's alignment relevant, because I th
Process supervision would be helpful, particularly if the model is punished for sensitive dependence on paraphrasings of the CoT, or rewarded for producing CoT which is useful even after paraphrasing.
One of the concerns that I have is that (I believe) the LLM before the CoT-training will be sensitive to some non-human-readable patterns, similar to adversarial examples for image recognition. The CoT training will encourage any patterns which help reach the desired conclusions, whether those patterns are human-readable or not.
A lot of possible process-...
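(As a sketch of what "rewarded for producing CoT which is useful even after paraphrasing" could look like in code -- everything here, including the helper names `policy_logprob_answer` and `paraphrase`, is a hypothetical illustration rather than an existing training setup:)

```python
def paraphrase_robustness_reward(cot: str, question: str, answer: str,
                                 policy_logprob_answer, paraphrase, n_samples=4):
    """Reward CoT whose usefulness survives paraphrasing.

    policy_logprob_answer(question, cot, answer) -> log-probability a frozen model
    assigns to the correct answer given this chain of thought (hypothetical helper).
    paraphrase(cot) -> a semantically-equivalent rewording of the CoT (hypothetical).
    """
    base = policy_logprob_answer(question, cot, answer)
    paraphrased = [policy_logprob_answer(question, paraphrase(cot), answer)
                   for _ in range(n_samples)]
    avg_paraphrased = sum(paraphrased) / len(paraphrased)
    # Reward the CoT only insofar as it still helps after paraphrasing; a large gap
    # (base >> avg_paraphrased) indicates reliance on non-semantic patterns.
    sensitivity_penalty = max(0.0, base - avg_paraphrased)
    return avg_paraphrased - sensitivity_penalty
```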
I more-or-less agree with Eliezer's comment (to the extent that I have the data necessary to evaluate his words, which is greater than most, but still, I didn't know him in 1996). I have a small beef with his bolded "MIRI is always in every instance" claim, because a universal like that is quite a strong claim, and I would be very unsurprised to find a single counterexample somewhere (particularly if we include every MIRI employee and everything they've ever said while employed at MIRI).
What I am trying to say is something looser and more gestalt. I do thi...
I don't get why you think this is true? EG, it seems like almost no insights about how to train faithful CoT would transfer to systems speaking pure neuralese. It seems to me that what little safety/alignment we have for LLMs is largely a byproduct of how language-focused they are (which gives us a weak sort of interpretability, a very useful safety resource which we are at risk of losing soon).
To put it a different way: human imitation is a relatively safe outer-optimization target, so we should be concerned to the extent that the industry is moving away ...
I don't get why you think actively training the model to be deceptive is "worth the cost". The deceptive behavior can generalize to other cases, and make the system overall much harder to shape in ways you want (because it is actively hiding things from you so your RL is less accurate). It seems like a choice between two bad options: actively deceptive CoT or actively deceptive Face.
--...but the reward process (RLHF, RLAIF, whatever) gets to see the outputs of the Face but not the Shoggoth.
--So, the hope is that skills and propensities towards deception and dishonesty will accumulate in the Face rather than the Shoggoth.
It seems pretty important to put some incentives on the CoT to avoid steganography, though. The division of labor here seems aimed at not actively incentivizing instrumentally useful thoughts to get hidden. However, the pre-trained LLM will already be responsive to some steganographic patterns, in addition to the human...
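(For reference, a minimal sketch of the division of labor as I understand the proposal -- the function names, and placing a paraphraser between the two models, are my own illustration rather than the actual implementation:)

```python
def shoggoth_face_step(question, shoggoth, face, paraphraser, reward_model):
    # `shoggoth`, `face`, `paraphraser`, `reward_model` are hypothetical stand-ins.
    cot = shoggoth.generate(question)               # hidden chain of thought
    cot_for_face = paraphraser(cot)                 # break non-semantic channels
    answer = face.generate(question, cot_for_face)  # user-visible output
    # The reward process only ever sees the Face's output, never the raw CoT,
    # so pressure toward (dis)honesty lands on the Face rather than the Shoggoth.
    reward = reward_model(question, answer)
    return cot, answer, reward
```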
...However, with this naive approach, the paraphrasings will probably still consistently contain certain kinds of noise from the initial thought. For instance, maybe the paraphraser never replaces an exclamation point with a period, even though this end-of-sentence punctuation doesn't really change the meaning. By modulating whether the step ends with a period or an exclamation point, the model can encode one bit of information per step, despite the paraphraser.
One way to solve this problem is to map every piece of text with a given meaning onto the same "can
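(As a toy illustration of closing the one specific channel mentioned above -- end-of-step punctuation -- one could post-process the paraphraser's output. The names are hypothetical, and a real canonicalizer would have to cover vastly more surface variation than this:)

```python
import re

def normalize_surface_forms(text: str) -> str:
    text = re.sub(r"!+", ".", text)           # collapse '!' (and '!!!') to '.'
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace variation
    return text

def canonical_paraphrase(step: str, paraphrase) -> str:
    # `paraphrase` is a hypothetical LLM-based paraphrasing function.
    return normalize_surface_forms(paraphrase(step))
```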
Note that the optimized planner is only rewarded for producing steps that help the frozen planner. Since the frozen planner doesn't understand steganography, there is no reason for the optimized planner to learn to use it.
I think this conclusion is overstated. The frozen planner may react to steganography, in the same way that image-recognition NNs react to adversarial examples. The optimized planner is very much like an adversarial example, in the way it is trained to generate outputs which nudge a frozen NN in a specific direction.
Imagine reversing the r...
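(To illustrate just the analogy -- not the planner setup itself -- here is the standard pattern of optimizing an input against a frozen network, FGSM-style, to nudge it toward a chosen output; the function name is my own:)

```python
import torch
import torch.nn as nn

def nudge_frozen_model(frozen_model: nn.Module, x: torch.Tensor,
                       target: torch.Tensor, epsilon: float = 0.05) -> torch.Tensor:
    """One gradient step perturbing the *input* so the frozen model's output moves
    toward `target`; the model's weights are never updated."""
    x = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(frozen_model(x), target)
    loss.backward()
    # Step the input downhill on the targeted loss (targeted FGSM).
    return (x - epsilon * x.grad.sign()).detach()
```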
Thanks for this response! I agree with the argument. I'm not sure what it would take to ensure CoT faithfulness, but I agree that it is an important direction to try and take things; perhaps even the most promising direction for near-term frontier-lab safety research (given the incentives pushing those labs in o1-ish directions).
How about “purely epistemic” means “updated by self-supervised learning”, i.e. the updates (gradients, trader bankrolls, whatever) are derived from “things being true vs false” as opposed to “things being good vs bad”. Right?
You can define it that way, but then I don't think it's highly relevant for this context.
The story I'm telling here is that partial feedback (typically: learning some sort of input-output relation via some sort of latents) always leaves us with undesired hypotheses which we can't rule out using the restricted feedback mechanism.
The thing to prove is precisely convergence and calibration!
Convergence: Given arbitrary observed sequences, the probabilities of the latent bits (program tape and working tape) converge.
(Obviously the non-latent bits converge, if we assume they eventually get observed; hence the focus on the latent bits here.)
Calibration: The sequence of probabilistic predictions has the property that, restricting attention to the subsequence of only those predictions between 79% and 81% (for example), the empirical frequency of those predictions coming true will eventually be i...
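(For concreteness, one way these two properties might be written down -- my own notation, not anything from the original exchange. Let p_n(b) be the probability assigned to latent bit b after n observations.)

```latex
% Convergence: every latent bit's probability settles down.
\[
  \forall b \in \mathrm{LatentBits}: \quad \lim_{n \to \infty} p_n(b) \text{ exists.}
\]
% Calibration: among the predictions made with probability in a narrow band, say
% [0.79, 0.81], the long-run empirical frequency of those predictions coming true
% lands in (roughly) that band. Writing I_k = 1 if the k-th such prediction comes true:
\[
  \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} I_k \;\in\; [0.79,\, 0.81]
  \quad \text{(approximately, band by band).}
\]
```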
Admittedly, one can try to squish beliefs and desires into the same framework. The Active Inference people do that. Does LI do that too?
No. LI defines a notion of logically uncertain variable, which can be used to represent desires. There are also other ways one could build agents out of LI, such as doing the active inference thing.
As I mentioned in the post, I'm agnostic about such things here. We could be building """purely epistemic""" AI out of LI, or we could be deliberately building agents. It doesn't matter very much, in part because we don't ...
I think I've articulated a number of concrete subgoals that require less philosophical skill (they can be approached as math problems). However, in the big picture, novel tiling theorems require novel ideas. This requires philosophical skill.
I think there are some deeper insights around inner optimization that you are missing that would make you more pessimistic here. "Unknown Algorithm" to me means that we don't know how to rule out the possibility of inner agents which have opinions about recursive self-improvement. Part of it is that we can't just think about what it "converges to" (convergence time will be too long for interesting learning systems).
How would you become confident that a UANFSI approach was NFSI?
I have lost interest in the Löbian approach to tiling, because probabilistic tiling results seem like they can be strong enough and with much less suspicious-looking solutions. Expected value maximization is a better way of looking at agentic behavior anyway. Trying to logically prove some safety predicate for all actions seems like a worse paradigm than trying to prove some safety properties for the system overall (including proving that those properties tile under as-reasonable-as-possible assumptions, plus sanity-checking what happens when those assumpt...
Yeah, I totally agree. This was initially a quick private message to someone, but I thought it was better to post it publicly despite the inadequate explanations. I think the idea deserves a better write-up.
I agree, but this doesn't explain why it would (seemingly) encourage itself to hallucinate.
My guess is that we want to capture those differences with the time&date meta-data instead (and to some extent, location and other metadata). That way, we can easily query what you-in-particular would say at other periods in your life (such as the future). However, I agree that this is at least not obvious.
Maybe a better way to do it would be to explicitly take both approaches, so that there's an abstract-you vector which then gets mapped into a particular-you author space via combination with your age (ie with date&time). This attempts to ex...
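(A minimal sketch of that composition, purely illustrative -- the module name, dimensions, and the choice to encode the timestamp as a single scalar are all assumptions of mine:)

```python
import torch
import torch.nn as nn

class ParticularAuthorVector(nn.Module):
    def __init__(self, d_author=64, d_time=16):
        super().__init__()
        self.time_proj = nn.Linear(1, d_time)                  # encode date/time scalar
        self.combine = nn.Linear(d_author + d_time, d_author)  # abstract -> particular

    def forward(self, abstract_author_vec, timestamp):
        # abstract_author_vec: (B, d_author); timestamp: (B, 1),
        # e.g. years since some reference date.
        t = torch.tanh(self.time_proj(timestamp))
        return self.combine(torch.cat([abstract_author_vec, t], dim=-1))
```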