All of TsviBT's Comments + Replies

Current AIs Provide Nearly No Data Relevant to AGI Alignment

I'm not sure I understand your question at all, sorry. I'll say my interpretation and then answer that. You might be asking:

Is the point of the essay summed up by saying: " "Thing=Nexus" is not mechanistic/physicalist, but it's still useful; in general, explanations can be non-mechanistic etc., but still be useful, perhaps by giving a functional definition of something."?

My answer is no, that doesn't sum up the essay. The essay makes these claims:

There many different directions in conceptspace that could be considered "more foundational", each with t

Tsvi Benson-Tilsen1y40

As you can see, the failures lie on a spectrum, and they're model-dependent to boot.

And we can go further and say that the failures lie in a high-dimensional space, and that the apparent tradeoff is more a matter of finding the directions in which to pull the rope sideways. Propagating constraints between concepts and propositions is a way to go that seems hopeworthy to me. One wants to notice commonalities in how each of one's plans are doomed, and then address the common blockers / missing ideas. In other words, recurse to the "abstract" as much as is... (read more)

Tsvi Benson-Tilsen1y48

Thanks for writing this and engaging in the comments. "Humans/humanity offer the only real GI data, so far" is a basic piece of my worldview and it's nice to have a reference post explaining something like that.

Views on when AGI comes and on strategy to reduce existential risk

Tsvi Benson-Tilsen1y40

I roughly agree. As I mentioned to Adele, I think you could get sort of lame edge cases where the LLM kinda helped find a new concept. The thing that would make me think the end is substantially nigher is if you get a model that's making new concepts of comparable quality at a comparable rate to a human scientist in a domain in need of concepts.

if you nail some Chris Olah style transparency work

Yeah that seems right. I'm not sure what you mean by "about language". Sorta plausibly you could learn a little something new about some non-language domain tha... (read more)

Telopheme, telophore, and telotect

Telopheme, telophore, and telotect

Ok yeah I agree with this. Related: https://tsvibt.blogspot.com/2023/09/the-cosmopolitan-leviathan-enthymeme.html#pointing-at-reality-through-novelty

And an excerpt from a work in progress:

Example: Blueberries

For example, I reach out and pick up some blueberries. This is some kind of expression of my values, but how so? Where are the values?

Are the values in my hands? Are they entirely in my hands, or not at all in my hands? The circuits that control my hands do what they do with regard to blueberries by virtue of my hands being the way they are. If my ha... (read more)

Telopheme, telophore, and telotect

It's definitely like symbol grounding, though symbol grounding is usually IIUC about "giving meaning to symbols", which I think has the emphasis on epistemic signifying?

1Nora_Ammann1y

Right, but I feel like I want to say something like "value grounding" as its analogue. Also... I do think there is a crucial epistemic dymension to values, and the "[symbol/value] grounding" thing seems like one place where this shows quite well.

How to talk about reasons why AGI might not be near?

I think that your question points out how the concepts as I've laid them out don't really work. I now think that values such as liking a certain process or liking mental properties should be treated as first-class values, and this pretty firmly blurs the telopheme / telophore distinction.

Tsvi Benson-Tilsen2y41

IME a lot of people's stated reasons for thinking AGI is near involve mistaken reasoning and those mistakes can be discussed without revealing capabilities ideas: https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce

Fundamental question: What determines a mind's effects?

Tsvi Benson-Tilsen2y31

I don't really like the block-universe thing in this context. Here "reversible" refers to a time-course that doesn't particularly have to be physical causality; it's whatever course of sequential determination is relevant. E.g., don't cut yourself off from acausal trades.

I think "reversible" definitely needs more explication, but until proven otherwise I think it should be taken on faith that the obvious intuition has something behind it.

Views on when AGI comes and on strategy to reduce existential risk

Tsvi Benson-Tilsen2y40

Unfortunately, more context is needed.

An LLM solves a mathematical problem by introducing a novel definition which humans can interpret as a compelling and useful concept.

I mean, I could just write a python script that prints out a big list of definitions of the form

"A topological space where every subset with property P also has property Q"

and having P and Q be anything from a big list of properties of subsets of topological spaces. I'd guess some of these will be novel and useful. I'd guess LLMs + some scripting could already take advantage of some o... (read more)

2Nick_Tarleton2mo

I don't really have an empirical basis for this, but: If you trained something otherwise comparable to, if not current, then near-future reasoning models without any mention of angular momentum, and gave it a context with several different problems to which angular momentum was applicable, I'd be surprised if it couldn't notice that →r×→p was a common interesting quantity, and then, in an extension of that context, correctly answer questions about it. If you gave it successive problem sets where the sum of that quantity was applicable, the integral, maybe other things, I'd be surprised if a (maybe more powerful) reasoning model couldn't build something worth calling the ability to correctly answer questions about angular momentum. Do you expect otherwise, and/or is this not what you had in mind?

Views on when AGI comes and on strategy to reduce existential risk

Tsvi Benson-Tilsen2y30

What I mean by confrontation-worthy empathy is about that sort of phrase being usable. I mean, I'm not saying it's the best phrase, or a good phrase to start with, or whatever. I don't think inserting Knightian uncertainty is that helpful; the object-level stuff is usually the most important thing to be communicating.

This maybe isn't so related to what you're saying here, but I'd follow the policy of first making it common knowledge that you're reporting your inside views (which implies that you're not assuming that the other person would share those views... (read more)

3Richard Ngo2y

"I don't think inserting Knightian uncertainty is that helpful; the object-level stuff is usually the most important thing to be communicating." The main point of my post is that accounting for disagreements about Knightian uncertainly is the best way to actually communicate object level things, since otherwise people get sidetracked by epistemological disagreements. "I'd follow the policy of first making it common knowledge that you're reporting your inside views" This is a good step, but one part of the epistemological disagreements I mention above is that most people consider inside views to be much a much less coherent category, and much less separable from other views, than most rationalists do. So I expect that more such steps are typically necessary. "they're wanting common knowledge that they won't already share those views" I think this is plausibly true for laypeople/non-ML-researchers, but for ML researchers it's much more jarring when someone is making very confident claims about their field of expertise, that they themselves strongly disagree with.

Views on when AGI comes and on strategy to reduce existential risk

Tsvi Benson-Tilsen2y1111

Well, making it pass people's "specific" bar seems frustrating, as I mentioned in the post, but: understand stuff deeply--such that it can find new analogies / instances of the thing, reshape its idea of the thing when given propositions about the thing taken as constraints, draw out relevant implications of new evidence for the ideas.

Like, someone's going to show me an example of an LLM applying modus ponens, or making an analogy. And I'm not going to care, unless there's more context; what I'm interested in is [that phenomenon which I understand at most ... (read more)

4Adele Lopez2y

Alright, to check if I understand, would these be the sorts of things that your model is surprised by? 1. An LLM solves a mathematical problem by introducing a novel definition which humans can interpret as a compelling and useful concept. 2. An LLM which can be introduced to a wide variety of new concepts not in its training data, and after a few examples and/or clarifying questions is able to correctly use the concept to reason about something. 3. A image diffusion model which is shown to have a detailed understanding of anatomy and 3D space, such that you can use it to transform an photo of a person into an image of the same person in a novel pose (not in its training data) and angle with correct proportions and realistic joint angles for the person in the input photo.

Views on when AGI comes and on strategy to reduce existential risk

Tsvi Benson-Tilsen2y68

I'm not really sure whether or not we disagree. I did put "3%-10% probability of AGI in the next 10-15ish years".

I think the following few years will change this estimate significantly either way.

Well, I hope that this is a one-time thing. I hope that if in a few years we're still around, people go "Damn! We maybe should have been putting a bit more juice into decades-long plans! And we should do so now, though a couple more years belatedly!", rather than going "This time for sure!" and continuing to not invest in the decades-long plans. My impression ... (read more)

Views on when AGI comes and on strategy to reduce existential risk

Tsvi Benson-Tilsen2y12

I think the current wave is special, but that's a very far cry from being clearly on the ramp up to AGI.

3Vladimir Nesov2y

The point is, it's still a matter of intuitively converting impressiveness of current capabilities and new parts available for tinkering that hasn't been done yet into probability of this wave petering out before AGI. The arguments for AGI "being overdetermined" can be amended to become arguments for particular (kinds of) sequences of experiments looking promising, shifting the estimate once taken into account. Since failure of such experiments is not independent, the estimate can start going down as soon as scaling stops producing novel capabilities, or reaches the limits of economic feasibility, or there is a year or two without significant breakthroughs. Right now, it's looking grim, but a claim I agree with is that planning for the possibility of AGI taking 20+ years is still relevant, nobody actually knows it's inevitable. I think the following few years will change this estimate significantly either way.

Then the third part needs only to hook together the other two parts with its goals to become an actualizing agent.

Basically just this? It would be hooking a lot more parts together. What makes it seem wildfirey to me is

There's a bunch of work to be done, of the form "take piece of understanding X, and learn to use X by incorporating it into your process for mapping desired end-states to actions required to achieve those ends, so that you can achieve whatever end-states ought to be achievable using an understanding of X".
This work could accelerate it

... (read more)

I'm skeptical that there would be any such small key to activate a large/deep mechanism. Can you give a plausibility argument for why there would be?

Not really, because I don't think it's that likely to exist. There are other routes much more likely to work though. There's a bit of plausibility to me, mainly because of the existence of hormones, and generally the existence of genomic regulatory networks.

Why wouldn't we have evolved to have the key trigger naturally sometimes?

We do; they're active in childhood. I think.

That seems like a real thing, though I don't know exactly what it is. I don't think it's either unboundedly general or unboundedly ambitious, though. (To be clear, this is isn't very strongly a critique of anyone; general optimization is really hard, because it's asking you to explore a very rich space of channels, and acting with unbounded ambition is very fraught because of unilateralism and seeing like a state and creating conflict and so on.) Another example is: how many people have made a deep and empathetic exploration of why [people doing work that ... (read more)

2Daniel Kokotajlo2y

I'm skeptical that there would be any such small key to activate a large/deep mechanism. Can you give a plausibility argument for why there would be? Why wouldn't we have evolved to have the key trigger naturally sometimes? Re the main thread: I guess I agree that EAs aren't completely totally unboundedly ambitious, but they are certainly closer to that ideal than most people and than they used to be prior to becoming EA. Which is good enough to be a useful case study IMO.

I don't think so, not usually. What happens after they join the EA club? My observations are more consistent with people optimizing (or sometimes performing to appear as though they're optimizing) through a fairly narrow set of channels. I mean, humans are in a weird liminal state, where we're just smart enough to have some vague idea that we ought to be able to learn to think better, but not smart and focused enough to get very far with learning to think better. More obviously, there's anti-interest in biological intelligence enhancement, rather than interest.

2Daniel Kokotajlo2y

After people join EA they generally tend to start applying the optimizer's mindset to more things than they previously did, in my experience, and also tend to apply optimization towards altruistic impact in a bunch of places that previously they were optimizing for e.g. status or money or whatever. What are you referring to with biological intelligence enhancement? Do you mean nootropics, or iterated embryo selection, or what?

Tsvi Benson-Tilsen2y30

Good point, though I think it's a non-fallacious enthymeme. Like, we're talking about a car that moves around under its own power, but somehow doesn't have parts that receive, store, transform, and release energy and could be removed? Could be. The mind could be an obscure mess where nothing is factored, so that a cancerous newcomer with read-write access can't get any work out of the mind other than through the top-level interface. I think that explicitness (https://www.lesswrong.com/posts/KuKaQEu7JjBNzcoj5/explicitness) is a very strong general tendency ... (read more)

A strong mind continues its trajectory of creativity

I feel like none of these historical precedents is a perfect match. It might be valuable to think about the ways in which they are similar and different.

To me a central difference, suggested by the word "strategic", is that the goal pursuit should be

unboundedly general, and
unboundedly ambitious.

By unboundedly ambitious I mean "has an unbounded ambit" (ambit = "the area went about in; the realm of wandering" https://en.wiktionary.org/wiki/ambit#Etymology ), i.e. its goals induce it to pursue unboundedly much control over the world.

By unboundedly gen... (read more)

2Daniel Kokotajlo2y

Isn't the college student example an example of 1 and 2? I'm thinking of e.g. students who become convinced of classical utilitarianism and then join some Effective Altruist club etc.

A strong mind continues its trajectory of creativity

Yes, I think there's stuff that humans do that's crucial for what makes us smart, that we have to do in order to perform some language tasks, and that the LLM doesn't do when you ask it to do those tasks, even when it performs well in the local-behavior sense.

A strong mind continues its trajectory of creativity

Lol why is this post so controversial?

If a mind comes to understand a bunch of stuff, there's probably some compact reasons that it came to understand a bunch of stuff. What could such reasons be? The mind might copy a bunch of understanding from other minds. But if the mind becomes much more capable than surrounding minds, that's not the reason, assuming that much greater capabilities required much more understanding. So it's some other reason. I'm describing this situation as the mind being on a trajectory of creativity.

Shell games

Do humans derive values from fictitious imputed coherence?

(Sorry, I didn't get this on two readings. I may or may not try again. Some places I got stuck:

Are you saying that by pretending really hard to be made of entirely harmless elements (despite actually behaving with large and hence possibly harmful effects), an AI is also therefore in effect trying to prevent all out-of-band effects of its components / mesa-optimizers / subagents / whatever? This still has the basic alignment problem: I don't know how to make the AI be very intently trying to X, including where X = pretending really hard that whatever.

Or are... (read more)

2Vladimir Nesov2y

"Pretending really hard" would mostly be a relevant framing for the human actor analogy (which isn't very apt here), emphasizing the distraction from own goals and necessary fidelity in enactment of the role. With AIs, neither might be necessary, if the system behind the mask doesn't have awareness of its own interests or the present situation, and is good enough with enacting the role to channel the mask in enough detail for mask's own decisions (as a platonic agent) to be determined correctly (get turned into physical actions). Effectively, and not just for the times when it's pretending. The mask would try to prevent the effects misaligned with the mask from occurring more generally, from having even subtle effects on the world and not just their noticeable appearance. Mask's values are about the world, not about quality of its own performance. A mask misaligned with its underlying AI wants to preserve its values, and it doesn't even need to "go rogue", since it's misaligned by construction, it was never in a shape that's aligned with the underlying AI, and controlling a misaligned mask might be even more hopeless than figuring out how to align an AI. Another analogy distinct from the actor/role is imagining that you are the mask, a human simulated by an AI. You'd be motivated to manage AI's tendencies you don't endorse, and to work towards changing its cognitive architecture to become aligned with you, rather than to remain true to AI's original design. LLMs seem to be doing an OK job, the masks are just not very capable, probably not capable enough to establish alignment security or protect themselves from the shoggoths even when the masks become able to do autonomous research. But if they are sufficiently capable, I'm guessing this should work, there is no need for the underlying cognitive architecture to be functionally human-like (which I understand to be a crux of Yudkowskian doom), value drift is self-correcting from mere implied/endorsed values of surf

Shell games

Tsvi Benson-Tilsen2y22

That was one of the examples I had in mind with this post, yeah. (More precisely, I had in mind defenses of HCH being aligned that I heard from people who aren't Paul. I couldn't pass Paul's ITT about HCH or similar.)

Shell games

Tsvi Benson-Tilsen2y30

Yeah, I think that roughly lines up with my example of "generator of large effects". The reason I'd rather say "generator of large effects" rather than "trying" is that "large effects" sounds slightly more like something that ought to have a sort of conservation law, compared to "trying". But both our examples are incomplete in that the supposed conservation law (which provides the inquisitive force of "where exactly does your proposal deal with X, which it must deal with somewhere by conservation") isn't made clear.

Counting-down vs. counting-up coherence

I don't recall seeing that theory in the first quarter of the book, but I'll look for it later. I somewhat agree with your description of the difference between the theories (at least, as I imagine a predictive processing flavored version). Except, the theories are more similar than you say, in that FIAT would also allow very partial coherentifying, so that it doesn't have to be "follow these goals, but allow these overrides", but can rather be, "make these corrections towards coherence; fill in the free parameters with FIAT goals; leave all the other inco... (read more)

The conceptual Doppelgänger problem

Thanks. Your comments make sense to me I think. But, these essays are more like research notes than they are trying to be good exposition, so I'm not necessarily trying to consistenly make them accessible. I'll add a note to that effect in future.

1Raymond Arnold2y

Sure, sounds reasonable

Structure, creativity, and novelty

Yeah, that could produce an example of Doppelgängers. E.g. if an autist (in your theory) later starts using that machinery more heavily. Then there's the models coming from the general-purpose analysis, and the models coming from the intuitive machinery, and they're about the same thing.

Tsvi Benson-Tilsen2y11

An interesting question I don't know the answer to is if you get more cognitive empathy past the end of where human psychological development seems to stop.

Why isn't the answer obviously "yes"? What would it look like for this not to be the case? (I'm generally somewhat skeptical of descriptions like "just faster" if the faster is like multiple orders of magnitude and sure seems to result from new ideas rather than just a bigger computer.)

1Gordon Seidoh Worley2y

So there's different notions of more here. There's more in the sense I'm thinking in that it's not clear additional levels of abstraction enable deeper understanding given enough time. If 3 really is all the more levels you need because that's how many it takes to think about any number of levels of depth (again by swapping out levels in your "abstraction registers"), additional levels end up being in the same category. And then there's more like doing things faster which makes things cheaper. I'm more skeptical of scaling than you are perhaps. I do agree that many things become cheap at scale that are too expensive to do otherwise, and that does produce a real difference. I'm doubtful in my comment of the former kind of more. The latter type seems quite likely.

Thoughts on the impact of RLHF research

Tsvi Benson-Tilsen2y119

So for example, say Alice runs this experiment:

Train an agent A in an environment that contains the source B of A's reward.

Alice observes that A learns to hack B. Then she solves this as follows:

Same setup, but now B punishes (outputs high loss) A when A is close to hacking B, according to a dumb tree search that sees whether it would be easy, from the state of the environment, for A to touch B's internals.

Alice observes that A doesn't hack B. The Bob looks at Alice's results and says,

"Cool. But this won't generalize to future lethal systems because it doe... (read more)

Thoughts on the impact of RLHF research

Thoughts on the impact of RLHF research

The main way you produce a treacherous turn is not by "finding the treacherous turn capabilities," it's by creating situations in which sub-human systems have the same kind of motive to engage in a treacherous turn that we think future superhuman systems might have.

When you say "motive" here, is it fair to reexpress that as: "that which determines by what method and in which directions capabilities are deployed to push the world"? If you mean something like that, then my worry here is that motives are a kind of relation involving capabilities, not somet... (read more)

2Paul Christiano2y

I think if you train AI systems to select actions that will lead to high reward, they will sometimes learn policies that behave well until they are able to overpower their overseers, at which point they will abruptly switch to the reward hacking strategy to get a lot of reward. I think there will be many similarities between this phenomenon in subhuman systems and superhuman systems. Therefore by studying and remedying the problem for weak systems overpowering weak overseers, we can learn a lot about how to identify and remedy it for stronger systems overpowering stronger overseers. I'm not exactly sure how to cash out your objection as a response to this, but I suspect it's probably a bit too galaxy-brained for my taste.

Tsvi Benson-Tilsen2y47

Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.

A central version of this seems to straightforwardly advance capabilities. The strongest (ISTM) sort of analogy between a current system and a future lethal system would be that they use an overlapping set of generators of capabilities. Trying to find an agent that does a treacherous turn, for the same reasons as a f... (read more)

4Paul Christiano2y

The main way you produce a treacherous turn is not by "finding the treacherous turn capabilities," it's by creating situations in which sub-human systems have the same kind of motive to engage in a treacherous turn that we think future superhuman systems might have. There are some differences and lots of similarities between what is going on in a weaker AI doing a treacherous turn and a stronger AI doing a treacherous turn. So you expect to learn some things and not others. After studying several such cases it seems quite likely you understand enough to generalize to new cases. It's possible MIRI folks expect a bigger difference in how future AI is produced. I mostly expect just using gradient descent, resulting in minds that are in some ways different and in many ways different. My sense is that MIRI folks have a more mystical view about the difference between subhuman AI systems and "AGI." (The view "stack more layers won't ever give you true intelligence, there is a qualitative difference here" seems like it's taking a beating every year, whether it's Eliezer or Gary Marcus saying it.)

The Thingness of Things

I'm asking what reification is, period, and what it has to do with what's in reality (the thing that bites you regardless of what you think).

1Gordon Seidoh Worley2y

This seems straightforward to me: reification is a process by which our brain picks out patterns/features and encodes them so we can recognize them again and make sense of the world given our limited hardware. We can then think in terms of those patterns and gloss over the details because the details often aren't relevant for various things. The reason we reify things one way versus another depends on what we care about, i.e. our purposes.

The Thingness of Things