All of Mateusz Bagiński's Comments + Replies

If you think the primary bottleneck to dangerous ASI is not that, but rather something else, then what do you think it is?

So far in this thread, I have mostly been talking from the perspective of my model (/steelman?) of Abram's argument.

I think the primary bottleneck to dangerous ASI is the ability to develop coherent and correct understandings of arbitrary complex domains and systems

I mostly agree with this.

Still, this doesn't[1] rule out the possibility of getting an AI that understands (is superintelligent in?) one complex domain (specifically here, whate...

Right, so one possibility is that you are doing something that is “speeding up the development of AIS-helpful capabilities” by 1 day, but you are also simultaneously speeding up “dangerous capabilities” by 1 day, because they are the same thing.

TBC, I was thinking about something like: "speed up the development of AIS-helpful capabilities by 3 days, at the cost of speeding up the development of dangerous capabilities by 1 day".

Steve Byrnes:
I think it’s 1:1, because I think the primary bottleneck to dangerous ASI is the ability to develop coherent and correct understandings of arbitrary complex domains and systems (further details), which basically amounts to anti-slop. If you think the primary bottleneck to dangerous ASI is not that, but rather something else, then what do you think it is? (or it’s fine if you don’t want to state it publicly)

I think Abram is saying the following:

  • Currently, AIs are lacking capabilities that would meaningfully speed up AI Safety research.
  • At some point, they are gonna get those capabilities.
  • However, by default, they are gonna get those AI Safety-helpful capabilities roughly at the same time as other, dangerous capabilities (or at least, not meaningfully earlier).
    • In which case, we're not going to have much time to use the AI Safety-helpful capabilities to speed up AI Safety research sufficiently for us to be ready for those dangerous capabilities.
  • Therefore, it mak
...
Steve Byrnes:
Right, so one possibility is that you are doing something that is “speeding up the development of AIS-helpful capabilities” by 1 day, but you are also simultaneously speeding up “dangerous capabilities” by 1 day, because they are the same thing. If that’s what you’re doing, then that’s bad. You shouldn’t do it. Like, if AI alignment researchers want AI that produces less slop and is more helpful for AIS, we could all just hibernate for six months and then get back to work. But obviously, that won’t help the situation.

And a second possibility is, there are ways to make AI more helpful for AI safety that are not simultaneously directly addressing the primary bottlenecks to AI danger. And we should do those things. The second possibility is surely true to some extent; for example, the LessWrong JargonBot is marginally helpful for speeding up AI safety but infinitesimally likely to speed up AI danger.

I think this OP is kinda assuming that “anti-slop” is the second possibility and not the first possibility, without justification. Whereas I would guess the opposite.

I wonder how the following behavioral patterns fit into Shard Theory

  • Many mammalian species have a strong default aversion to the young of their own species. They (including females) deliberately avoid contact with the young and can even be infanticidal. Physiological events associated with pregnancy (mostly hormones) rewire the mother's brain such that when she gives birth, she immediately takes care of the young, grooms them, etc., something she has never done before. Do you think this can be explained by the rewiring of her reward circuit such that she finds s
...
Alex Turner:
I've been wondering about the latter for a while. These two results are less strongly predicted by shard theoretic reasoning than by "hardcoded" hypotheses. Pure-RL+SL shard theory loses points on these two observations, and points to other mechanisms IMO (or I'm missing some implications of pure-RL+SL shard theory).

Standardized communication protocols

Language is the most obvious example, but there are plenty of others. E.g., if we treat different parts of the body as subsystems communicating with each other, a single neurotransmitter/hormone often has very similar effects in many parts of the body.

In software, different processes can communicate with each other by passing messages in some well-defined format. When you're sending an API request, you usually have a good idea of what shape the response is going to take, and if the request fails, it should fail in a predictable wa...
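Here is a minimal sketch of that idea in Python (the message schema, field names, and endpoint are made up for illustration): a request/response pair with a fixed shape, where failure comes back through the same well-defined structure rather than in an arbitrary way.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical message schema: every response has the same well-defined shape,
# whether the request succeeded or failed.
@dataclass
class Request:
    endpoint: str
    payload: dict

@dataclass
class Response:
    ok: bool
    data: Optional[dict] = None   # populated on success
    error: Optional[str] = None   # populated on failure

def handle(request: Request) -> Response:
    """Toy handler: fails predictably instead of raising arbitrary exceptions."""
    if request.endpoint != "/status":
        return Response(ok=False, error=f"unknown endpoint: {request.endpoint}")
    return Response(ok=True, data={"status": "healthy"})

# The caller knows the shape of the answer in advance, in both branches.
print(handle(Request("/status", {})))   # Response(ok=True, data={'status': 'healthy'}, error=None)
print(handle(Request("/missing", {})))  # Response(ok=False, data=None, error='unknown endpoint: /missing')
```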

I don't have a specific example right now, but some things that come to mind:

  • Both utility functions ultimately depend in some way on a subset of background conditions, i.e. the world state
  • The world state influences the utility functions through latent variables in the agents' world models; these latent variables are the utility functions' inputs (sketched more formally below).
  • U_A changes only when A's world model changes, which is ultimately caused by new observations, i.e. changes in the world state (let's assume that both A and B perceive the world quite accurately).
  • If, whenever U_A changes, U_B doesn't decrease,
...
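A schematic way of writing down the dependence described in this list (the notation is introduced here for illustration, not taken from the comment): each agent's utility is computed from latent variables of its world model, and those latents are driven by observations of the world state.

```latex
% w: world state; o_A(w): A's observations; M_A: A's world model; z_A: latents; u_A: utility over latents
U_A(w) = u_A\big(z_A\big), \qquad z_A = M_A\big(o_A(w)\big)
U_B(w) = u_B\big(z_B\big), \qquad z_B = M_B\big(o_B(w)\big)
```

On this picture, U_A can change only via a change in z_A, which (assuming accurate perception) traces back to a change in the world state w.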
Linda Linsefors:
Let's say that U_A = 3x + y. Then (I think) for your inequality to hold, it must be that U_B = f(3x + y), where f' >= 0. If U_B cares about x and y in any other proportion, then B can make trade-offs between x and y which make things better for B but worse for A. This will be true (in theory) even if both A and B are satisficers. You can see this by replacing y and x with sigmoids of some other variables.
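A quick numeric check of this point (the weights in U_B below are chosen only for illustration): if B weights x and y in a different proportion than A does, B can find a trade that raises U_B while lowering U_A.

```python
# A weights x:y as 3:1; a hypothetical B weights them 1:3 instead.
def U_A(x, y):
    return 3 * x + y

def U_B(x, y):
    return x + 3 * y

before = (0.0, 0.0)
after = (-1.0, 2.0)   # B gives up one unit of x to gain two units of y

print(U_A(*before), U_A(*after))  # 0.0 -1.0 -> worse for A
print(U_B(*before), U_B(*after))  # 0.0  5.0 -> better for B
```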

Can't you restate the second one as the relationship between two utility functions U_A and U_B such that increasing one (holding background conditions constant) is guaranteed not to decrease the other? I.e., their respective derivatives are always non-negative for every background condition.
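One possible way to formalize that condition (my formalization, not taken from the thread): for any fixed background conditions, raising one utility never lowers the other.

```latex
% s, s' range over outcomes, c over background conditions (hypothetical notation)
\forall c \;\forall s, s':\quad
U_A(s, c) > U_A(s', c) \;\Longrightarrow\; U_B(s, c) \ge U_B(s', c),
\qquad \text{and symmetrically with } A \leftrightarrow B.
```

Linda's U_B = f(U_A) with f' >= 0 above is one family of utility functions satisfying this, in the case where U_B depends on the world only through U_A.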

Linda Linsefors:
Yes, I like this one. We don't want the AI to find a way to give itself utility while making things worse for us. And if we are trying to make things better for us, we don't want the AI to resist us. Do you want to find out what these inequalities imply about the utility functions? Can you find examples where your condition is true for non-identical functions?

Re second try: what would make a high-level operationalisation of that sort helpful? (operationalize the helpfulness of an operationalisation)

Linda Linsefors:
This is a good question. The not-so-operationalized answer is that a good operationalization is one that is helpful for achieving alignment. An operationalization of [helpfulness of an operationalization] would give some sort of gears-level understanding of what shape the operationalization should have to be helpful. I don't have any good model for this, so I will just gesture vaguely.

I think that mathematical descriptions are good, since they are more precise. My first operationalization attempt is pretty mathematical, which is good. It is also more "constructive" (not sure if this is the exact right word), i.e. it describes alignment in terms of internal properties, rather than outcomes. Internal properties are more useful as design guidelines, as long as they are correct.

The big problem with my first operationalization is that it doesn't actually point to what we want. The problem with the second attempt is that it just states what outcome we want. There is nothing in there to help us achieve it.