AI ALIGNMENT FORUM

Kaj Sotala

I formerly did research for MIRI and for what's now the Center on Long-Term Risk; I'm now making a living as an emotion coach and Substack writer.

Most of my content eventually becomes free, but if you get a paid subscription to my Substack, you'll receive new posts a week early and make it possible for me to write more.

Comments

AI companies have started saying safeguards are load-bearing
Kaj_Sotala · 1mo

I was amused when Claude Opus abruptly stopped generating a reply and shut down the chat after I asked it how a fictional galactic empire might control its frontier planets. Given that it stopped in the middle of a sentence that was talking about "biological monitoring" and "enhanced", I surmised that the reference to the genetically engineered catboys/catgirls in the setting had triggered its bioengineering filters.

Daniel Kokotajlo's Shortform
Kaj_Sotala · 3mo

I disagree-voted because it felt a bit confused, but I was having difficulty expressing exactly how. Some thoughts:

  • I think this is a misleading example because humans do actually do something like reward maximization, and the typical drug addict is actually likely to eventually change their behavior if the drug really is impossible to acquire for a long enough time. (Though the old behavior may also resume the moment the drug becomes available again.)
  • It also seems like a different case because humans have a hardwired priority where being in sufficient pain will make them look for ways to stop being in pain, no matter how unlikely they are to succeed. Drug withdrawal certainly counts as significant pain. This is disanalogous to AIs as we know them, which have no such override systems.
  • The example didn't feel like it was responding to the core issue of why I wouldn't use "reward maximization" to refer to the kinds of things you were talking about. I wasn't able to immediately name that core point, but replying to another commenter just now helped me find the main thing I was thinking of.
Daniel Kokotajlo's Shortform
Kaj_Sotala · 3mo

I notice that I'm confused. LLMs don't get any reward in deployment; that only happens in the training phase. So isn't "reward isn't the optimization target" necessarily true for them? They may have behaviors that are called "reward hacking", but it's not actually literal reward hacking, since there's no reward to be had either way.
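
To illustrate the distinction I mean, here's a purely schematic toy (made-up names and a made-up update rule, not any lab's actual setup): reward only ever appears inside the training loop, where it drives a weight update, while the deployment code path just produces output and never touches a reward signal.

```python
import random

# Toy sketch, not any lab's actual setup: a one-parameter "policy" trained with a
# scalar reward. Structurally, reward exists only inside train_step(), where it
# drives a weight update; sample() at deployment never sees a reward at all.

weight = 0.0  # the entire "model"

def sample() -> float:
    """Deployment: produce an output from the current weights. No reward here."""
    return weight + random.gauss(0, 1)

def train_step(lr: float = 0.01) -> None:
    """Training: the reward is computed and used only here, to move the weights."""
    global weight
    output = sample()
    reward = -abs(output - 5.0)                 # toy reward: prefer outputs near 5
    weight += lr * reward * (output - weight)   # crude REINFORCE-style update

for _ in range(2000):
    train_step()

print(sample())  # deployment after training: still no reward signal anywhere
```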

Foom & Doom 1: “Brain in a box in a basement”
Kaj_Sotala · 3mo

Maybe something like "non-LLM AGIs are a thing too and we know from the human brain that they're going to be much more data-efficient than LLM ones"; it feels like the focus in conversation has been so strongly on LLM-descended AGIs that I just stopped thinking about that.

Foom & Doom 1: “Brain in a box in a basement”
Kaj_Sotala · 3mo

My thoughts on reading this post and your second one:

  • "Oh. Steven is just obviously correct."
  • "I somehow allowed myself to be lulled into a false sense of safety with the way LLMs are. Fuck."
  • "How did I need this post to see this? It's so clearly and straightforwardly correct, just like one inference step away from everything I already knew, that my mind must have been carefully looking away from this but now can't rationalize it away once it has been pointed out. Fuck."
  • "Fuck."
steve2152's Shortform
Kaj_Sotala · 5mo

Reminds me of

Tracing the Thoughts of a Large Language Model
Kaj_Sotala · 6mo

Fascinating results!

Interestingly, it sounds like faking the chain of thought emerges as a special case of planning ahead. With the rhyming, Claude decides on the word that the line should end with, and then figures out the sentence that gets it there. With the math example, Claude decides on the number that the calculation should end up at, and then figures out the steps that get there.
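
As a toy way of seeing the shape of that (my own illustration, nothing from the paper): once the endpoint is fixed in advance, inventing intermediate steps that happen to land on it is trivial.

```python
import random

def fabricate_steps(target: int, n_steps: int = 3) -> list[int]:
    """Pick the final answer first, then invent addends that happen to sum to it;
    a toy analogue of 'decide where the calculation should end up, then fill in
    the working', not a claim about how Claude does it internally."""
    cuts = sorted(random.sample(range(1, target), n_steps - 1))
    return [b - a for a, b in zip([0] + cuts, cuts + [target])]

steps = fabricate_steps(36)
print(steps, "sums to", sum(steps))  # always sums to 36, whatever the steps are
```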

Daniel Kokotajlo's Shortform
Kaj_Sotala · 7mo

I don't know what your views on self-driving cars are, but if you are like me you look at what Waymo is doing and you think "Yep, it's working decently well now, and they are scaling up fast, seems plausible that in a few years it'll be working even better and scaled to every major city. The dream of robotaxis will be a reality, at least in the cities of America."

The example of self-driving cars is actually the biggest one that anchors me to timelines of decades or more. A lot of people's impression after the 2007 DARPA Urban Challenge seemed to be something like "oh, we seem to know how to solve the problem in principle, now we just need a bit more engineering work to make it reliable and agentic in the real world". Then actually getting things to be as reliable as required for real agents took a lot longer. So past experience would imply that going from "we know in principle how to make something act intelligently and agentically" to "this is actually a reliable real-world agent" can easily take over a decade.

Another example is that going from the first in-principle demonstration of chain-of-thought to o1 took two years. That's much shorter than a decade but also a much simpler capability.

For general AI, I would expect the "we know how to solve things in principle" stage to at least be something like "can solve easy puzzles that a normal human can, but that the AI hasn't been explicitly trained on". Whereas with current AI, we're not even there yet. E.g. I tried giving GPT-4.5, DeepSeek R1, o3-mini, and Claude 3.7 with extended thinking a simple sliding-square puzzle, and they all committed an illegal move at one stage or another.

And that's to say nothing about all the other capabilities that a truly general agent - say one capable of running a startup - would need, like better long-term memory, the ability to formulate its own goals and prioritize between them in domains with no objective rules you could follow to guarantee success, etc. Not only are we lacking convincing in-principle demonstrations of general intelligence within puzzle-like domains, we're also lacking in-principle demonstrations of these other key abilities.
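
(To make the sliding-square example above concrete, here's a minimal sketch of the move-legality rule the models kept violating, assuming the standard 3x3 8-puzzle formulation; the exact puzzle I gave them may have been set up differently.)

```python
# Minimal sketch of the rule in question: in a 3x3 sliding puzzle (8-puzzle), the
# only legal move is sliding a tile that is orthogonally adjacent to the blank
# into the blank's position.

def legal_moves(board: tuple[int, ...]) -> list[int]:
    """Return the tiles that may be slid into the blank (0) on a 3x3 board."""
    blank = board.index(0)
    row, col = divmod(blank, 3)
    moves = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:
            moves.append(board[r * 3 + c])
    return moves

def apply_move(board: tuple[int, ...], tile: int) -> tuple[int, ...]:
    """Slide `tile` into the blank, raising if the move is illegal."""
    if tile not in legal_moves(board):
        raise ValueError(f"illegal move: tile {tile} is not adjacent to the blank")
    b = list(board)
    blank, pos = b.index(0), b.index(tile)
    b[blank], b[pos] = b[pos], b[blank]
    return tuple(b)

# Example: with the blank in the bottom-right corner, only tiles 6 and 8 can move.
start = (1, 2, 3, 4, 5, 6, 7, 8, 0)
print(legal_moves(start))  # [6, 8]
```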

Have LLMs Generated Novel Insights?
Answer by Kaj_Sotala · Feb 23, 2025

Derya Unutmaz reported that o1-pro came up with a novel idea in the domain of immunotherapy:

Introduction to the Context:

I’m working on developing innovative cancer immunotherapy approaches to address key challenges in the field. Immunotherapy is an exceptionally powerful strategy for curing cancer because it harnesses the body’s immune system—our internal army—and empowers it to recognize and eliminate cancer cells. In this effort, we are focusing on engineering T cells, the immune system’s soldiers and generals, through synthetic biology.

However, significant challenges remain, especially in treating solid tumors like breast cancer. Within the tumor microenvironment, T cells often become exhausted due to the overwhelming number of cancer cells and the suppressive environment created by the tumor. This exhaustion severely limits the effectiveness of these therapies.

To tackle this issue, we employ a cutting-edge model system using 3D bioprinted breast cancer tissue integrated with engineered human T cells. These T cells are reprogrammed through advanced synthetic biology techniques to test and develop solutions for overcoming exhaustion.

Prompt to O1-Pro:

Building on work I’ve previously done and tested with o1-Preview and GPT-4o, I posed the following prompt:

“I’d like you to focus on 3D bioprinted solid tumors as a model to address the T cell exhaustion problem. Specifically, the model should incorporate stroma, as seen in breast cancer, to replicate the tumor microenvironment and explore potential solutions. These solutions could involve technologies like T cell reprogramming, synthetic biology circuits, cytokines, transcription factors related to exhaustion, or metabolic programming. Draw inspiration from other fields, such as Battle Royale games or the immune system’s ability to clear infected cells without triggering autoimmunity. Identify potential pitfalls in developing these therapies and propose alternative approaches. Think outside the box and outline iterative goals that could evolve into full-scale projects. Focus exclusively on in vitro human systems and models.”

Why Battle Royale Games?

You might wonder why I referenced Battle Royale games. That’s precisely the point—I wanted to push the model to think beyond conventional approaches and draw from completely different systems for inspiration. While o1-Preview and GPT-4o were able to generate some interesting ideas based on this concept, they were mostly ideas that I could also have conceived of, though better than most PhD students would. In contrast, o1-Pro came up with far more creative and innovative solutions that left me in awe!

Idea #9: A Remarkable Paradigm

Here, I’m sharing one specific idea, which I’ll call Idea #9 based on its iteration sequence. This idea was exceptional because it proposed an extraordinary paradigm inspired by Battle Royale games but more importantly within the context of deep temporal understanding of biological processes. This was the first time any model explicitly considered the time-dependent nature of biological events—an insight that reflects a remarkably advanced and nuanced understanding! 

“Adapt or Fail” Under Escalating Challenges:

Another remarkable aspect of idea #9 was that conceptually it drew from the idea of “adapt or fail” in escalating challenges, directly inspired by Battle Royale mechanics. This was the first time any model could think of it from this perspective. It also emphasized the importance of temporal intervals in reversing or eliminating exhausted T cells. Indeed, this approach mirrors the necessity for T cells to adapt dynamically under pressure and survive progressively tougher challenges, something we would love to model in in vitro systems! One further, particularly striking insight was the role of stimulation intervals in preventing exhaustion. Idea #9 suggested that overly short intervals between stimuli might be a key factor driving T cell exhaustion in current therapies. This observation really amazed me with its precision and relevance—because it pinpointed a subtle but critical aspect of T cell activation and the development of exhaustion mechanisms.

There's more behind the link. I have no relevant expertise that would allow me to evaluate how novel this actually was. But immunology is the author's specialty, and his work has close to 30,000 citations on Google Scholar, so I'd assume he knows what he's talking about.

A Three-Layer Model of LLM Psychology
Kaj_Sotala · 9mo

Fantastic post. This has been frequently on my mind after reading it, and especially the surface/character layer split feels very distinct now that I have an explicit concept for it. And then at one point I asked an LLM to profile me based on some fiction I had co-written with it, and it managed to guess that I was Finnish from something I didn't think had any clues in that direction, which gave me a novel feeling of getting a glimpse into that vast alien ground layer.

The analogy to the character and player distinction in humans also feels very apt.

Posts

Quick thoughts on the implications of multi-agent views of mind on AI takeover · 2y · 17 karma · 1 comment
Genetic fitness is a measure of selection strength, not the selection target · 2y · 20 karma · 14 comments
How to talk about reasons why AGI might not be near? (Question) · 2y · 27 karma · 9 comments
The Preference Fulfillment Hypothesis · 3y · 14 karma · 10 comments
Confused why a "capabilities research is good for alignment progress" position isn't discussed more · 3y · 29 karma · 8 comments
Multimodal Neurons in Artificial Neural Networks · 5y · 18 karma · 0 comments
Collection of GPT-3 results · 5y · 24 karma · 7 comments
The two-layer model of human values, and problems with synthesizing preferences · 6y · 26 karma · 3 comments
Kaj's shortform feed · 7y · 2 karma · 0 comments
Probabilities Small Enough To Ignore: An attack on Pascal's Mugging · 10y · 0 karma · 0 comments

Wikitag Contributions

Internal Family Systems · 3 years ago · (+68/-20)
Internal Family Systems · 4 years ago · (+306)
Internal Double Crux · 4 years ago · (+92)
Arguments As Soldiers · 5 years ago · (+473/-85)
AI Advantages · 5 years ago
Willpower · 5 years ago · (+6/-9)
Aumann's Agreement Theorem · 5 years ago · (+26/-501)