I'm confused about why 1P-logic is needed. It seems to me like you could just have a variable X which tracks "which agent am I", and then you can express things like `sensor_observes(X, red)` or `is_located_at(X, northwest)`. Here and Absent are merely a special case of True and False when the statement depends on X.
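
A toy version of what I mean, in code (the world-representation and all the names here are just my own illustration, not anything from the post):

```python
# Sketch: indexical statements as ordinary predicates plus a variable X
# that tracks "which agent am I". All names here are illustrative.

world = {
    "agent_a": {"sensor": "red",  "location": "northwest"},
    "agent_b": {"sensor": "blue", "location": "southeast"},
}

def sensor_observes(x, color):
    return world[x]["sensor"] == color

def is_located_at(x, place):
    return world[x]["location"] == place

X = "agent_a"  # which agent am I

# On this reading, Here/Absent are just True/False for statements
# whose truth value depends on how X is bound:
print(sensor_observes(X, "red"))      # True  (~ Here)
print(is_located_at(X, "northwest"))  # True
print(sensor_observes(X, "blue"))     # False (~ Absent)
```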
Hence, the policy should have an escape clause: You should feel free to talk about the potential exfohazard if your knowledge of it isn't exclusively caused by other alignment researchers telling you of it. That is, if you already knew of the potential exfohazard, or if your own research later led you to discover it.
In an ideal world, it's good to relax this clause in some way, from a binary to a spectrum. For example: if someone tells me of a hazard that I'm confident I would've discovered on my own one week later, then they only get to dictate my not-sharing-it for a week. "Knowing" isn't a strict binary; anyone can rederive anything with enough time (maybe), so it's really a question of how long it would've taken me to find it if they hadn't told me. This can even include someone bringing my attention to something I already knew, but to which I wouldn't have thought to pay attention as quickly if they hadn't pointed it out.
In the non-ideal world we inhabit, however, it's unclear how fraught it is to use such considerations.
Because my timelines are this short, I'm hopeful that convincing just "the current crop of major-AI-Lab CEOs" might actually be enough to buy us the bulk of the time that something like this could buy.
commenting on this post because it's the latest in the sequence; i disagree with the premises of the whole sequence. (EDIT: whoops, the sequence posts in fact discuss those premises so i probably should've commented on those. ohwell.)
the actual, endorsed, axiomatic (aka terminal aka intrinsic) values we have are ones we don't want to change, ones we don't want to be lost or modified over time. what you call "value change" is change in instrumental values.
i agree that, for example, our preferences about how to organize the society we live in should change over time. but that simply means that our preferences about society aren't terminal values, and our terminal values on this topic are meta-values about how other (non-terminal) values should change.
these meta-values, and other terminal values, are values that we should not want changed or lost over time.
in actuality, people aren't coherent enough agents to have immutable terminal values; they have value drift and confusion about values, and they don't distinguish (or don't distinguish well) between terminal and instrumental values in their minds.
but we should want to figure out what our axiomatic values are, and for those to not be changed at all. and since everything else is instrumental to those, we do not have to figure out alignment with regard to instrumental values, only with regard to axiomatic values.
This phrasing is tricky! If you think TAI is coming in approximately 10 years then sure, you can study for 5 years and then do research for 5 years.
But if you think TAI is coming within 10 years (for example, if you think that the current half-life on worlds surviving is 10 years, i.e. that 10 years is the amount of time in which half of worlds are doomed), then depending on your distribution-over-time you should absolutely not wait 5 years before doing research: TAI could happen in 9 years, but it could also happen in 1 year. Even if you only think TAI is coming within 10 years, you should (depending on your distribution) still in fact do research asap.
(People often get this wrong! They think that "TAI probably within X years" necessarily means "TAI in approximately X years".)
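
To make that concrete, here's a toy calculation under one possible reading of the half-life framing (a constant-hazard / exponential model with a 10-year half-life; this is my own illustrative assumption, not the only distribution consistent with "within 10 years"):

```python
# Toy calculation: if surviving worlds have a 10-year half-life (constant
# hazard rate), how much probability mass falls in the first few years?
# This is an illustrative assumption, not anyone's actual timeline distribution.

HALF_LIFE = 10.0  # years

def p_tai_within(t_years: float, half_life: float = HALF_LIFE) -> float:
    """P(TAI arrives within t years) under an exponential model."""
    return 1 - 0.5 ** (t_years / half_life)

for t in (1, 5, 9, 10):
    print(f"P(TAI within {t:>2} years) ≈ {p_tai_within(t):.0%}")

# Output:
# P(TAI within  1 years) ≈ 7%
# P(TAI within  5 years) ≈ 29%
# P(TAI within  9 years) ≈ 46%
# P(TAI within 10 years) ≈ 50%
```

Under that toy model, spending the first 5 years purely studying means roughly 29% of the probability mass has already passed before you start doing any research, which is the sense in which "within 10 years" differs sharply from "in approximately 10 years".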