Oliver Habryka

Running Lightcone Infrastructure, which runs LessWrong. You can reach me at habryka@lesswrong.com

Comments

Can you confirm or deny whether you signed any NDA related to your leaving OpenAI?

(I would guess a "no comment", a lack of response, or something to that effect implies a "yes" with reasonably high probability. Also, when deciding how to respond here, you might be interested in this link: the National Labor Relations Board has ruled that NDAs offered during severance agreements that cover the existence of the NDA itself are unlawful.)

Thank you for your work there. Curious what specifically prompted you to post this now; presumably your leaving OpenAI and wanting to communicate that somehow?

Promoted to curated: Formalizing what it means for transformers to learn "the underlying world model" when engaging in next-token prediction tasks seems pretty useful, in that it's an abstraction I see used all the time when discussing risks from models where the vast majority of the compute was spent in pre-training, and whose details usually get handwaved. It seems useful to understand in more detail what exactly we mean by that.

I have not done a thorough review of this kind of work, but it seems to me that others also thought the basic ideas in the work hold up, and reading this post gave me crisper abstractions for talking about this kind of stuff in the future.

They still make a lot less than they would if they optimized for profit (that said, I think most "safety researchers" at big labs are safety researchers in name only, and I don't think anyone would philanthropically pay for their labor; even if they did, they would still make the world worse according to my model, though others of course disagree with this).

I think people who give up large amounts of salary to work in jobs that other people are willing to pay for from an impact perspective should totally consider themselves to have done good comparable to donating the difference between their market salary and their actual salary. This applies to approximately all safety researchers. 

Hmm, it feels to me this misses the most important objection to PhDs, which is that many PhD programs seem to teach their students actively bad methodologies and inference methods, sometimes incentivize students to commit scientific fraud, teach writing habits optimized to obscure and sound smart rather than to explain clearly and straightforwardly, and often seem to produce zero-sum attitudes around ownership of work and intellectual ideas that are pretty bad for a research field.

To be clear, there are many PhD opportunities that do not have these problems, but many do, and it seems to me quite important to somehow identify the ones that don't. If your only option is to do a PhD under an advisor who does not seem to you to be actually good at producing clear, honest, and high-quality research while acting with high integrity toward their colleagues, then I think almost any other job will be better preparation for a research career.

Promoted to curated: I think it's pretty likely that a huge fraction of the value of the future will be determined by the question this post is trying to answer, which is how much game theory produces natural solutions to coordination problems, or more generally how much better we should expect systems to get at coordination as they get smarter.

I don't think I agree with everything in the post, and a few of the characterizations of updatelessness seem a bit off to me (which Eliezer points to a bit in his comment), but I still overall found reading this post quite interesting and valuable. It helped me think about which coordination problems we have a more mechanistic understanding of how being smarter and better at game theory might help with, and which ones we don't have good mechanisms for, which IMO is a quite important question.

Promoted to curated: I disagree with a bunch of the approach outlined in this post, but I nevertheless found this framing quite helpful for thinking about various AI X-risk related outcomes and plans. I also really appreciate the way this post is written: it is approachable overall while maintaining a relatively high degree of precision in talking about these issues.

I also think the cognition in a weather model is very alien. It's less powerful and general, so I think the error in applying something like the Shoggoth image to it (or calling it "alien") would be that it implies too much generality, but the alienness seems appropriate.

If you somehow had a mind that was constructed on the same principles as weather simulations, or your laptop, or your toaster (whatever that would mean, I feel like the analogy is fraying a bit here), and that displayed signs of general intelligence similar to LLMs, then yeah, I think analogizing it to alien/eldritch intelligences would be pretty appropriate.

It is a very common (and even to me tempting) error to see a system with the generality of GPT-4, trained on human imitation, and imagine that it must internally think like a human. But my best guess is that that is not what is going on, and in some sense it is valuable to be reminded that the internal cognition in GPT-4 is probably about as far from what is going on in a human brain as a weather simulation is from what is going on in a human trying to forecast the weather (de facto I think GPT-4 is somewhere in between, since I do think the imitation learning does create some stronger structural similarities between humans and LLMs, but overall, being reminded of this relevant dimension of alienness pays off a good amount in anticipated experiences).

That is a lot of questions, most of which I would guess are rhetorical, so I am not sure which ones you are actually interested in getting an answer to. Most of the specific questions I would answer with "no", in that they don't seem to capture what I mean by "alien", or feel slightly strawman-ish.

Responding at a high level:

  • There are a lot of experiments that seem like they shed light on the degree to which cognition in AI systems is similar to human or animal cognition. Some examples: 
    • Does the base model pass a Turing test?
    • Does the performance distribution of the base model on different tasks match the performance distribution of humans?
    • Does the generalization and learning behavior of the base model match how humans learn things?
      • When trained using RL on things like game-environments (after pre-training on a language corpus), does the system learn at similar rates and plateau at similar skill levels as human players?
  • There are a lot of structural and algorithmic properties that could match up between human and LLM systems: 
    • Do they interface with the world in similar ways?
    • Do they require similar amounts and kinds of data to learn the same relationships?
    • Do the low-level algorithmic properties of how the two systems store and process information look similar?
  • A lot more stuff, but I am not sure how useful going into a long list here is. At least to me it feels like a real thing, and different observations would change the degree to which I would describe a system as alien.

I think the exact degree of alienness is really interesting and one of the domains where I would like to see more research. 

For example, a bunch of the experiments I would most like to see, and that seem helpful for AI Alignment, are centered on better measuring the performance distribution of transformer architectures on tasks that are not primarily human imitation, so that we could better tell which things LLMs have a much easier time learning than humans. Currently, even if a transformer could relatively easily reach vastly superhuman performance at a task given more specialized training data, the training being structured around human imitation means observed performance at the task will cluster around human level; seeing where transformers could reach vastly superhuman performance would be quite informative for understanding the degree to which their cognition is alien.
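As a concrete (and entirely hypothetical) illustration of the kind of measurement I have in mind, here is a minimal sketch that probes a base model on a simple non-imitation task (n-digit addition) and records how accuracy falls off with difficulty, which one could then compare against a human performance profile on the same task. The model name ("gpt2"), the task, and the prompt format are all placeholder assumptions, not a claim about how such an experiment should actually be run:

```python
# Sketch: measure a base model's accuracy on synthetic n-digit addition,
# a task that is not primarily human imitation. "gpt2" is a stand-in for
# whatever base model one actually cares about.
import random
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def addition_accuracy(digits: int, n_trials: int = 50) -> float:
    """Fraction of n-digit addition problems the model completes correctly."""
    correct = 0
    for _ in range(n_trials):
        a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        # Few-shot prompt so the base model sees the expected answer format.
        prompt = f"12+34=46\n57+21=78\n{a}+{b}="
        out = generator(prompt, max_new_tokens=8, do_sample=False)[0]["generated_text"]
        completion = out[len(prompt):].strip()
        correct += completion.startswith(str(a + b))
    return correct / n_trials

if __name__ == "__main__":
    for d in (1, 2, 3, 4):
        print(f"{d}-digit addition accuracy: {addition_accuracy(d):.2f}")
```

The interesting quantity is not any single accuracy number but the shape of the curve across difficulty levels relative to humans, since a profile that looks nothing like the human one is some evidence of quite different underlying cognition.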

So I don't consider the exact nature and degree of alienness a settled question, but at least to me, aggregating all the evidence I have, it seems very likely that the cognition going on in a base model is very different from what is going on in a human brain, and that is a thing I benefit from reminding myself of frequently when making predictions about the behavior of LLM systems.
