Mateusz Bagiński

Primarily interested in agent foundations, AI macrostrategy, and enhancement of human intelligence, sanity, and wisdom.

I endorse and operate by Crocker's rules.

I have not signed any agreements whose existence I cannot mention.

Comments

If you think the primary bottleneck to dangerous ASI is not that, but rather something else, then what do you think it is?

So far in this thread I have mostly been talking from the perspective of my model (/steelman?) of Abram's argument.

I think the primary bottleneck to dangerous ASI is the ability to develop coherent and correct understandings of arbitrary complex domains and systems

I mostly agree with this.

Still, this doesn't[1] rule out the possibility of getting an AI that understands (is superintelligent in?) one complex domain (specifically here, whatever is necessary to meaningfully speed up AIS research), and maybe a few more, as I don't expect the space of possible domains to be that compartmentalizable, but is not superintelligent across all the complex domains that would make it dangerous.

It doesn't even have to be a superintelligent reasoner about minds. Babbling up clever and novel mathematical concepts for a human researcher to prune could be sufficient to meaningfully boost AI safety (I don't think we're primarily bottlenecked on mathy stuff, but it might help some people, and I think that's one thing Abram would like to see).

  1. ^

    Doesn't rule it out in itself, but perhaps you have some other assumptions that imply it's 1:1, as you say.

Right, so one possibility is that you are doing something that is “speeding up the development of AIS-helpful capabilities” by 1 day, but you are also simultaneously speeding up “dangerous capabilities” by 1 day, because they are the same thing.

TBC, I was thinking about something like: "speed up the development of AIS-helpful capabilities by 3 days, at the cost of speeding up the development of dangerous capabilities by 1 day".

I think Abram is saying the following:

  • Currently, AIs are lacking capabilities that would meaningfully speed up AI Safety research.
  • At some point, they are gonna get those capabilities.
  • However, by default, they are gonna get those AI Safety-helpful capabilities roughly at the same time as other, dangerous capabilities (or at least, not meaningfully earlier).
    • In which case, we're not going to have much time to use the AI Safety-helpful capabilities to speed up AI Safety research sufficiently for us to be ready for those dangerous capabilities.
  • Therefore, it makes sense to speed up the development of AIS-helpful capabilities now. Even if it means that the AIs will acquire dangerous capabilities sooner, it gives us more time to use AI Safety-helpful capabilities to prepare for dangerous capabilities.
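To make the arithmetic behind this explicit, here is a toy calculation with made-up default timelines; the only numbers taken from the discussion above are the hypothetical 3-day and 1-day speed-ups:

```python
# Toy model of the differential speed-up argument (illustrative numbers only).
t_helpful = 1000  # days until AIS-helpful capabilities arrive by default (made up)
t_danger = 1010   # days until dangerous capabilities arrive by default (made up)

default_window = t_danger - t_helpful  # 10 days to use helpful AIs before danger

# The hypothetical tradeoff: helpful capabilities arrive 3 days sooner,
# dangerous capabilities arrive 1 day sooner.
new_window = (t_danger - 1) - (t_helpful - 3)  # 12 days

print(default_window, new_window)  # the usable window widens by 3 - 1 = 2 days
```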

I wonder how the following behavioral patterns fit into Shard Theory:

  • Many mammalian species have a strong default aversion to the young of their own species. They (including females) deliberately avoid contact with the young and can even be infanticidal. Physiological events associated with pregnancy (mostly hormones) rewire the mother's brain such that when she gives birth, she immediately takes care of the young, grooms them, etc., something she has never done before. Do you think this can be explained by the rewiring of her reward circuit such that she finds simple actions associated with the pups highly rewarding and then bootstraps to learning complex behaviors from that? (A toy sketch of this reward-rewiring story follows this list.)
  • Salt-starved rats develop an appetite for salt and are drawn to stimuli predictive of extremely salty water, even though on all previous occasions they found it extremely aversive, which caused them to develop a conditioned fear response to the cue predictive of salty water (see Steve's post on this experiment).
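As a purely illustrative toy of the reward-rewiring story in the first bullet (a hypothetical tabular-RL sketch, not a claim about actual neural circuitry): a single change to the reward function at the "hormonal event" flips pup-directed actions from aversive to rewarding, and ordinary reinforcement learning then builds the more complex care behavior on top of it.

```python
import random

random.seed(0)

ACTIONS = ["avoid_pup", "approach_pup", "groom_pup"]

def reward(action: str, post_partum: bool) -> float:
    """Hypothetical reward function; the 'hormonal event' rewires it."""
    if not post_partum:
        # Default: pup contact is aversive, avoidance mildly preferred.
        return {"avoid_pup": 0.1, "approach_pup": -1.0, "groom_pup": -1.0}[action]
    # Post-partum: simple pup-directed actions become rewarding,
    # with more elaborate care (grooming) rewarded most.
    return {"avoid_pup": 0.0, "approach_pup": 1.0, "groom_pup": 2.0}[action]

def learned_behavior(post_partum: bool, episodes: int = 2000,
                     eps: float = 0.1, lr: float = 0.1) -> str:
    """Epsilon-greedy bandit learning on top of whichever reward function is active."""
    q = {a: 0.0 for a in ACTIONS}
    for _ in range(episodes):
        a = random.choice(ACTIONS) if random.random() < eps else max(q, key=q.get)
        q[a] += lr * (reward(a, post_partum) - q[a])
    return max(q, key=q.get)

print(learned_behavior(post_partum=False))  # expected: "avoid_pup"
print(learned_behavior(post_partum=True))   # expected: "groom_pup"
```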

Standardized communication protocols

Language is the most obvious example, but there are plenty of others. E.g., if we treat different parts of the body as subsystems communicating with each other, one neurotransmitter/hormone often has very similar effects in many parts of the body.

In software, different processes can communicate with each other by passing messages in some well-defined format. When you're sending an API request, you usually have a good idea of what shape the response is going to take, and if the request fails, it should fail in a predictable way that can be handled harmlessly. This makes it easier to build reliable software.
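A minimal sketch of what "well-defined message format plus predictable failure" buys you (the endpoint, schema, and field names here are hypothetical, just for illustration):

```python
import json
from dataclasses import dataclass
from urllib import error, request

# Hypothetical response schema: the caller knows in advance which fields to expect.
@dataclass
class UserInfo:
    id: int
    name: str

def fetch_user(url: str) -> UserInfo | None:
    """Send a request whose success shape and failure modes are both known in advance."""
    try:
        with request.urlopen(url, timeout=5) as resp:
            payload = json.load(resp)
        return UserInfo(id=payload["id"], name=payload["name"])
    except (error.URLError, KeyError, json.JSONDecodeError, TimeoutError):
        # The failure modes are enumerable, so the caller can handle them harmlessly
        # (retry, fall back to a cached value, report an error) instead of crashing.
        return None
```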

Some cases of standardization are spontaneous/bottom-up, whereas others are engineered top-down. Human language is both. Languages with a greater number of users seem to evolve simpler, more standardized grammars, e.g. compare Russian to Czech or English to Icelandic (though syncretism and promiscuous borrowing may also have had an impact in the second case). I don't know if something like that occurs at all in programming languages, but one factor that makes it much less likely is the need to maintain backward compatibility, which is important for programming languages but much weaker for human languages.

I don't have a specific example right now, but here are some things that come to mind:

  • Both utility functions (call them $u_A$ and $u_B$) ultimately depend in some way on a subset of background conditions, i.e. the world state.
  • The world state influences the utility functions through latent variables (call them $z_A$ and $z_B$) in the agents' world models, to which they are inputs.
  • $u_A$ changes only when $z_A$ (A's world model) changes, which is ultimately caused by new observations, i.e. changes in the world state (let's assume that both A and B perceive the world quite accurately).
  • If, whenever $u_A$ changes, $u_B$ doesn't decrease, then whatever change in the world increased $u_A$, B at least doesn't care. This is problematic when A and B need the same scarce resources (instrumental convergence etc.). It could be satisfied if they were both satisficers or bounded agents inhabiting significantly disjoint niches.
  • A robust solution seems to be to make one of the utility functions (super accurately modeled) a major input to the other.
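A minimal way to write down this structure, in the notation above (a sketch of my own framing, not anything from the original post): the world state $w$ determines the latents each agent infers, and the latents determine the utilities,

$$u_A = u_A\big(z_A(w)\big), \qquad u_B = u_B\big(z_B(w)\big),$$

and the condition in the fourth bullet becomes: for any change of world state $w \to w'$,

$$u_A\big(z_A(w')\big) > u_A\big(z_A(w)\big) \;\Longrightarrow\; u_B\big(z_B(w')\big) \ge u_B\big(z_B(w)\big).$$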

Can't you restate the second one as the relationship between two utility functions $u_A$ and $u_B$ such that increasing one (holding background conditions constant) is guaranteed not to decrease the other? I.e., their respective derivatives are always non-negative for every background condition.
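One way to write that down (my notation, and it assumes each utility can be expressed as a function of the other plus background conditions $c$):

$$u_B = f(u_A, c), \quad \frac{\partial f}{\partial u_A}(u_A, c) \ge 0 \ \text{ for every } c, \qquad u_A = g(u_B, c), \quad \frac{\partial g}{\partial u_B}(u_B, c) \ge 0 \ \text{ for every } c.$$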

Re second try: what would make a high-level operationalisation of that sort helpful? (I.e., operationalise the helpfulness of an operationalisation.)