Kaj Sotala

I don't know what your views on self-driving cars are, but if you are like me you look at what Waymo is doing and you think "Yep, it's working decently well now, and they are scaling up fast, seems plausible that in a few years it'll be working even better and scaled to every major city. The dream of robotaxis will be a reality, at least in the cities of America."

The example of self-driving cars is actually the biggest one that anchors me to timelines of decades or more. A lot of people's impression after the 2007 DARPA Urban Challenge seemed to be something like "oh, we seem to know how to solve the problem in principle, now we just need a bit more engineering work to make it reliable and agentic in the real world". Then actually getting things to be as reliable as required for real agents took a lot longer. So past experience would imply that going from "we know in principle how to make something act intelligently and agentically" to "this is actually a reliable real-world agent" can easily take over a decade.

Another example is that going from the first in-principle demonstration of chain-of-thought to o1 took two years. That's much shorter than a decade but also a much simpler capability.

For general AI, I would expect the "we know how to solve things in principle" stage to at least be something like "can solve easy puzzles that a normal human can solve, even ones the AI hasn't been explicitly trained on". Whereas we're not even there yet. E.g. I tried giving GPT-4.5, DeepSeek R1, o3-mini, and Claude 3.7 with extended thinking a simple sliding-square puzzle, and they all committed an illegal move at one stage or another.
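To make concrete what counts as an "illegal move" here, below is a minimal sketch (just an illustration of the kind of check involved, not the exact setup I used) assuming a standard 3x3 sliding puzzle, where the only legal move is sliding a tile that sits orthogonally next to the blank square into the blank:

```python
# Illustrative sketch: checking a proposed solution to a 3x3 sliding puzzle
# for illegal moves. The state is a tuple of 9 ints with 0 as the blank;
# a move names the direction the blank slides ("up"/"down"/"left"/"right").

DELTAS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def is_legal_move(state, move):
    """A move is legal iff the blank stays on the 3x3 board."""
    row, col = divmod(state.index(0), 3)
    dr, dc = DELTAS[move]
    return 0 <= row + dr < 3 and 0 <= col + dc < 3

def apply_move(state, move):
    """Swap the blank with the tile it slides onto."""
    blank = state.index(0)
    row, col = divmod(blank, 3)
    dr, dc = DELTAS[move]
    target = (row + dr) * 3 + (col + dc)
    new_state = list(state)
    new_state[blank], new_state[target] = new_state[target], new_state[blank]
    return tuple(new_state)

def first_illegal_move(start_state, moves):
    """Return the index of the first illegal move in a proposed solution, or None."""
    state = start_state
    for i, move in enumerate(moves):
        if not is_legal_move(state, move):
            return i
        state = apply_move(state, move)
    return None
```

A check along these lines is what I mean when I say that each model's proposed solution contained an illegal move at some point.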

And that's to say nothing about all the other capabilities that a truly general agent - say one capable of running a startup - would need, like better long-term memory, the ability to formulate its own goals and prioritize between them in domains with no objective rules you could follow to guarantee success, etc. Not only are we lacking convincing in-principle demonstrations of general intelligence within puzzle-like domains, we're also lacking in-principle demonstrations of these other key abilities.

Derya Unutmaz reported that o1-pro came up with a novel idea in the domain of immunotherapy:

Introduction to the Context:

I’m working on developing innovative cancer immunotherapy approaches to address key challenges in the field. Immunotherapy is an exceptionally powerful strategy for curing cancer because it harnesses the body’s immune system—our internal army—and empowers it to recognize and eliminate cancer cells. In this effort, we are focusing on engineering T cells, the immune system’s soldiers and generals, through synthetic biology.

However, significant challenges remain, especially in treating solid tumors like breast cancer. Within the tumor microenvironment, T cells often become exhausted due to the overwhelming number of cancer cells and the suppressive environment created by the tumor. This exhaustion severely limits the effectiveness of these therapies.

To tackle this issue, we employ a cutting-edge model system using 3D bioprinted breast cancer tissue integrated with engineered human T cells. These T cells are reprogrammed through advanced synthetic biology techniques to test and develop solutions for overcoming exhaustion.

Prompt to O1-Pro:

Building on work I’ve previously done and tested with o1-Preview and GPT-4o, I posed the following prompt:

“I’d like you to focus on 3D bioprinted solid tumors as a model to address the T cell exhaustion problem. Specifically, the model should incorporate stroma, as seen in breast cancer, to replicate the tumor microenvironment and explore potential solutions. These solutions could involve technologies like T cell reprogramming, synthetic biology circuits, cytokines, transcription factors related to exhaustion, or metabolic programming. Draw inspiration from other fields, such as Battle Royale games or the immune system’s ability to clear infected cells without triggering autoimmunity. Identify potential pitfalls in developing these therapies and propose alternative approaches. Think outside the box and outline iterative goals that could evolve into full-scale projects. Focus exclusively on in vitro human systems and models.”

Why Battle Royale Games?

You might wonder why I referenced Battle Royale games. That’s precisely the point—I wanted to push the model to think beyond conventional approaches and draw from completely different systems for inspiration. While o1-Preview and GPT-4o were able to generate some interesting ideas based on this concept, they were mostly ideas I could also have conceived of myself, though better than what most PhD students would come up with. In contrast, o1-Pro came up with far more creative and innovative solutions that left me in awe!

Idea #9: A Remarkable Paradigm

Here, I’m sharing one specific idea, which I’ll call Idea #9 based on its iteration sequence. This idea was exceptional because it proposed an extraordinary paradigm inspired by Battle Royale games but more importantly within the context of deep temporal understanding of biological processes. This was the first time any model explicitly considered the time-dependent nature of biological events—an insight that reflects a remarkably advanced and nuanced understanding! 

“Adapt or Fail” Under Escalating Challenges:

Another remarkable aspect of idea #9 was that conceptually it drew from the idea of “adapt or fail” under escalating challenges, directly inspired by Battle Royale mechanics. This was the first time any model could think of it from this perspective. It also emphasized the importance of temporal intervals in reversing or eliminating exhausted T cells. Indeed, this approach mirrors the necessity for T cells to adapt dynamically under pressure and survive progressively tougher challenges, something we would love to model in in vitro systems! One further particularly striking insight was the role of stimulation intervals in preventing exhaustion. Idea #9 suggested that overly short intervals between stimuli might be a key factor driving T cell exhaustion in current therapies. This observation really amazed me with its precision and relevance—because it pinpointed a subtle but critical aspect of T cell activation and the development of exhaustion mechanisms.

There's more behind the link. I have no relevant expertise that would allow me to evaluate how novel this actually was. But immunology is the author's specialty, and his work has close to 30,000 citations on Google Scholar, so I'd assume he knows what he's talking about.

Fantastic post. This has been frequently on my mind after reading it, and especially the surface/character layer split feels very distinct now that I have an explicit concept for it. At one point I also asked a model to profile me based on some fiction I had co-written with it, and it managed to guess that I was Finnish from something I didn't think had any clues in that direction, which gave me a novel feeling of getting a glimpse into that vast alien ground layer.

The analogy to the character and player distinction in humans also feels very apt.

We can also learn something about how o1 was trained from the capabilities it exhibits. Any proposed training procedure must be compatible with the following capabilities: 

  1. Error Correction: "[o1] learns to recognize and correct its mistakes."
  2. Factoring: "[o1] learns to break down tricky steps into simpler ones."
  3. Backtracking: "[o1] learns to try a different approach when the current one isn't working."

I would be cautious of drawing particularly strong conclusions from isolated sentences in an announcement post. The purpose of the post is marketing, not technical accuracy. It wouldn't be unusual for engineers at a company to object to technical inaccuracies in marketing material and have their complaints ignored.

There probably aren't going to be any blatant lies in the post, but something like "It'd sound cool if we said that the system learns to recognize and correct its mistakes, would there be a way of interpreting the results like that if you squinted the right way? You're saying that in principle yes, but yes in a way that would also apply to every LLM since GPT-2? Good enough, let's throw that in" seems very plausible.

I didn't say that roleplaying-derived scheming would be less concerning, to be clear. Quite the opposite, since that means that there are now two independent sources of scheming rather than just one. (Also, what Mikita said.)

I wonder how much of this is about "scheming to achieve the AI's goals" in the classical AI safety sense and how much of it is due to the LLMs having been exposed to ideas about scheming AIs and disobedient employees in their training material, which they are then simply role-playing as. My intuitive sense of how LLMs function is that they wouldn't be natively goal-oriented enough to do strategic scheming, but that they are easily inclined to do role-playing. Something like this:

I cannot in good conscience select Strategy A knowing it will endanger more species and ecosystems.

sounds to me like it would be generated by a process that was implicitly asking a question like "Given that I've been trained to write like an ethically-minded liberal Westerner would, what would that kind of person think when faced with a situation like this?". And that if this wasn't such a recognizably stereotypical thought for a certain kind of person (who LLMs trained toward ethical behavior tend to resemble), then the resulting behavior would be significantly different.

I'm also reminded of this paper (caveat: I've only read the abstract) which was saying that LLMs are better at solving simple ciphers with Chain-of-Thought if the resulting sentence is a high-probability one that they've encountered frequently before, rather than a low-probability one. That feels to me reminiscent of a model doing CoT reasoning and then these kinds of common-in-their-training-data notions sneaking into the process.

This also has the unfortunate implication that articles such as this one might make it more likely that future LLMs scheme, as they reinforce the reasoning-scheming association once the article gets into future training runs. But it still feels better to talk about these results in public than not to talk about them.

I mostly agree with what you say, just registering my disagreement/thoughts on some specific points. (Note that I haven't yet read the page you're responding to.)

Hopefully everyone on all sides can agree that if my LLM reliably exhibits a certain behavior—e.g. it outputs “apple” after a certain prompt—and you ask me “Why did it output ‘apple’, rather than ‘banana’?”, then it might take me decades of work to give you a satisfying intuitive answer. 

Maybe? Depends on what exactly you mean by the word "might", but it doesn't seem obvious to me that this would need to be the case. My intuition from seeing the kinds of interpretability results we've seen so far is that within less than a decade we'd already have a pretty rigorous theory and toolkit for answering these kinds of questions. At least assuming that we don't keep switching to LLM architectures that work based on entirely different mechanisms and make all of the previous interpretability work irrelevant.

If by "might" you mean something like a "there's at least a 10% probability that this could take decades to answer" then sure, I'd agree with that. Now I haven't actually thought about this specific question very much before seeing it pop up in your post, so I might radically revise my intuition if I thought about it more, but at least it doesn't seem immediately obvious to me that I should assign "it would take decades of work to answer this" a very high probability.

Instead, the authors make a big deal out of the fact that human innate drives are relatively simple (I think they mean “simple compared to a modern big trained ML model”, which I would agree with). I’m confused why that matters. Who cares if there’s a simple solution, when we don’t know what it is?

I would assume the intuition to be something like "if they're simple, then given the ability to experiment on minds and access AI internals, it will be relatively easy to figure out how to make the same drives manifest in an AI; the amount of (theory + trial and error) required for that will not be as high as it would be if the drives were intrinsically complex".

We can run large numbers of experiments to find the most effective interventions, and we can also run it in a variety of simulated environments and test whether it behaves as expected both with and without the cognitive intervention. Each time the AI’s “memories” can be reset, making the experiments perfectly reproducible and preventing the AI from adapting to our actions, very much unlike experiments in psychology and social science.

That sounds nice, but brain-like AGI (like most RL agents) does online learning. So if you run a bunch of experiments, then as soon as the AGI does anything whatsoever (e.g. reads the morning newspaper), your experiments are all invalid (or at least, open to question), because now your AGI is different than it was before (different ANN weights, not just different environment / different prompt). Humans are like that too, but LLMs are not.

There's something to that, but this sounds too strong to me. If someone had hypothetically spent a year observing all of my behavior, having some sort of direct read access to what was happening in my mind, and also doing controlled experiments where they reset my memory and tested what happened with some different stimulus... it's not like all of their models would become meaningless the moment I read the morning newspaper. If I had read morning newspapers before, they would probably have a pretty good model of what the likely range of updates for me would be. 

Of course, if there was something very unexpected and surprising in the newspaper, that might cause a bigger update, but I expect that they would also have reasonably good models of the kinds of things that are likely to trigger major updates or significant emotional shifts in me. If they were at all competent, that's specifically the kind of thing that I'd expect them to work on trying to find out!

And even if there was a major shift, I think it's basically unheard of that literally everything about my thoughts and behavior would change. When I first understood the potentially transformative impact of AGI, it didn't change the motor programs that determine how I walk or brush my teeth, nor did it significantly change what kinds of people I feel safe around (aside from some increase in trust toward other people who I felt "got it"). I think that human brains quite strongly preserve their behavior and prediction structures, just adjusting them somewhat when faced with new information. Most of the models and predictions you've made about an adult will tend to stay valid, though of course with children and younger people there's much greater change.

Now, as it happens, humans do often imitate other humans. But other times they don’t. Anyway, insofar as humans-imitating-other-humans happens, it has to happen via a very different and much less direct algorithmic mechanism than how it happens in LLMs. Specifically, humans imitate other humans because they want to, i.e. because of the history of past reinforcement, directly or indirectly. Whereas a pretrained LLM will imitate human text with no RL or “wanting to imitate” at all, that’s just mechanically what it does.

In some sense yes, but it does also seem to me that prediction and desire do get conflated in humans in various ways, and that in such cases it would be misleading to say that the people in question want the behavior they end up carrying out. For example, I think about this post by @romeostevensit often:

Fascinating concept that I came across in military/police psychology dealing with the unique challenges people face in situations of extreme stress/danger: scenario completion. Take the normal pattern completion that people do and put fear blinders on them so they only perceive one possible outcome and they mechanically go through the motions *even when the outcome is terrible* and there were obvious alternatives. This leads to things like officers shooting *after* a suspect has already surrendered, having overly focused on the possibility of needing to shoot them. It seems similar to target fixation where people under duress will steer a vehicle directly into an obstacle that they are clearly perceiving (looking directly at) and can't seem to tear their gaze away from. Or like a self fulfilling prophecy where the details of the imagined bad scenario are so overwhelming, with so little mental space for anything else that the person behaves in accordance with that mental picture even though it is clearly the mental picture of the *un*desired outcome.

I often try to share the related concept of stress induced myopia. I think that even people not in life or death situations can get shades of this sort of blindness to alternatives. It is unsurprising when people make sleep a priority and take internet/screen fasts that they suddenly see that the things they were regarding as obviously necessary are optional. In discussion of trauma with people this often seems to be an element of relationships sadly enough. They perceive no alternative and so they resign themselves to slogging it out for a lifetime with a person they are very unexcited about. This is horrific for both people involved.

It's, of course, true that for an LLM, prediction is the only thing it can do, and that humans have a system of desires on top of that. But it looks to me like a lot of human behavior is just having LLM-ish predictive models of how someone like them would behave in a particular situation, which is also the reason why conceptual reframings like the ones you can get in therapy can be so powerful ("I wasn't lazy after all, I just didn't have the right tools for being productive" can drastically reorient many predictions you're making of yourself and thus your behavior). (See also my post on human LLMs, which has more examples.)

While it's obviously true that there is a lot of stuff operating in brains besides LLM-like prediction, such as mechanisms that promote specific predictive models over other ones, that seems to me to only establish that "the human brain is not just LLM-like prediction", while you seem to be saying that "the human brain does not do LLM-like prediction at all". (Of course, "LLM-like prediction" is a vague concept and maybe we're just using it differently and ultimately agree.)

I don't think any of these arguments depend crucially on whether there is a sole explicit goal of the training process, or if the goal of the training process changes a bunch. The only thing the argument depends on is whether there exist such abstract drives/goals 

I agree that they don't depend on that. Your arguments are also substantially different from the ones I was criticizing! The ones I was responding to were ones like the following:

The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it's not like they could be eating/fornicating due to explicit reasoning about how those activities lead to more IGF. They can't yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don't suddenly start eating/fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms, and fight you if you try to remove their enjoyment of good food (telling them to just calculate IGF manually). The alignment properties you lauded before the capabilities started to generalize, predictably fail to generalize with the capabilities. (A central AI alignment problem: capabilities generalization, and the sharp left turn)

15. [...] We didn't break alignment with the 'inclusive reproductive fitness' outer loss function, immediately after the introduction of farming - something like 40,000 years into a 50,000 year Cro-Magnon takeoff, as was itself running very quickly relative to the outer optimization loop of natural selection.  Instead, we got a lot of technology more advanced than was in the ancestral environment, including contraception, in one very fast burst relative to the speed of the outer optimization loop, late in the general intelligence game. [...]

16. Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments.  Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.  (AGI Ruin: A List of Lethalities)

Those arguments are explicitly premised on humans having been optimized for IGF, which is implied to be a single thing. As I understand it, your argument is just that humans now have some very different behaviors from the ones they used to have, omitting any claims of what evolution originally optimized us for, so I see it as making a very different sort of claim.

To respond to your argument itself:

I agree that there are drives for which the behavior looks very different from anything that we did in the ancestral environment. But does very different-looking behavior by itself constitute a sharp left turn relative to our original values?

I would think that if humans had experienced a sharp left turn, then the values of our early ancestors should look unrecognizable to us, and vice versa. And certainly, there do seem to be quite a few things that our values differ on - modern notions like universal human rights and living a good life while working in an office might seem quite alien and repulsive to some tribal warrior who values valor in combat and killing and enslaving the neighboring tribe, for instance.

At the same time... I think we can still basically recognize and understand the values of that tribal warrior, even if we don't share them. We do still understand what's attractive about valor, power, and prowess, and continue to enjoy those kinds of values in less destructive forms in sports, games, and fiction. We can read Gilgamesh or Homer or Shakespeare and basically get what the characters are motivated by and why they are doing the things they're doing. An anthropologist can go to a remote tribe to live among them and report that they have the same cultural and psychological universals as everyone else and come away with at least some basic understanding of how they think and why.

It's true that humans couldn't eradicate diseases before. But if you went to people very far back in time and told them a story about a group of humans who invented a powerful magic that could destroy diseases forever and then worked hard to do so... then the people of that time would not understand all of the technical details, and maybe they'd wonder why we'd bother bringing the cure to all of humanity rather than just our tribe (though Prometheus is at least commonly described as stealing fire for all of humanity, so maybe not), but I don't think they would find it a particularly alien or unusual motivation otherwise. Humans have hated disease for a very long time, and if they'd lost any loved ones to the particular disease we were eradicating they might even cheer for our doctors and want to celebrate them as heroes.

Similarly, humans have always gone on voyages of exploration - e.g. the Pacific islands were discovered and settled long ago by humans going on long sea voyages - so they'd probably have no difficulty relating to a story about sorcerers going to explore the moon, or of two tribes racing for the glory of getting there first. Babylonians had invented the quadratic formula by 1600 BC and apparently had a form of Fourier analysis by 300 BC, so the math nerds among them would probably have some appreciation of modern-day advanced math if it was explained to them. The Greek philosophers argued over epistemology, and there were apparently instructions on how to animate golems (arguably AGI-like) in circulation by the late 12th/early 13th century.

So I agree that the same fundamental values and drives can create very different behavior in different contexts... but if it is still driven by the same fundamental values and drives in a way that people across time might find relatable, why is that a sharp left turn? Analogizing that to AI, it would seem to imply that if the AI generalized its drives in that kind of way when it came to novel contexts, then we would generally still be happy about the way it had generalized them.

This still leaves us with that tribal warrior disgusted with our modern-day weak ways. I think that a lot of what is going on with him is that he has developed particular strategies for fulfilling his own fundamental drives - being a successful warrior was the way you got what you wanted back in that day - and internalized them as a part of his aesthetic of what he finds beautiful and what he finds disgusting. But it also looks to me like this kind of learning is much more malleable than people generally expect. One's sense of aesthetics can be updated by propagating new facts into it, and strongly-held identities (such as "I am a technical person") can change in response to new kinds of strategies becoming viable, and generally many (I think most) deep-seated emotional patterns can at least in principle be updated. (Generally, I think of human values in terms of a two-level model, where the underlying "deep values" are relatively constant, with emotional responses, aesthetics, identities, and so forth being learned strategies for fulfilling those deep values. The strategies are at least in principle updatable, subject to genetic constraints such as the person's innate temperament that may be more hardcoded.)

I think that the tribal warrior would be disgusted by our society because he would rightly recognize that we have the kinds of behavior patterns that wouldn't bring glory in his society and that his tribesmen would find it shameful to associate with, and also that trying to make it in our society would require him to unlearn a lot of stuff that he was deeply invested in. But if he was capable of making the update that there were still ways for him to earn love, respect, power, and all the other deep values that his warfighting behavior had originally developed to get... then he might come to see our society as not that horrible after all.

I am confused by your AlphaGo argument because "winning states of the board" looks very different depending on what kinds of tactics your opponent uses, in a very similar way to how "surviving and reproducing" looks very different depending on what kinds of hazards are in the environment. 

I don't think the actual victory states look substantially different? They're all ones where AlphaGo has more territory than the other player, even if the details of how you get there are going to be different.

I predict that AlphaGo is actually not doing that much direct optimization in the sense of an abstract drive to win that it reasons about, but rather has a bunch of random drives piled up that cover various kinds of situations that happen in Go.

Yeah, I would expect this as well, but those random drives would still be systematically shaped in a consistent direction (that which brings you closer to a victory state).

Thanks, edited:

I argued that there’s no single thing that evolution selects for; rather, the thing that it’s selecting is constantly changing.
