Thanks for the thought-provoking post! Some rough thoughts:
Raw LLMs model the data-generating process. The data-generating process emits characters/simulacra, but is grounded in authors. Modelling simulacra is probably either a consequence of modelling authors or a means of modelling authors.
Authors behave differently from characters, and in particular are less likely to reveal their dastardly plans and become evil versions of themselves. The context teaches the LLM about what kind of author it is modelling, and this infor...
I'm not sure how serious this suggestion is, but note that:
This doesn't seem to disagree with David's argument? "Accident" implies a lack of negligence. "Not taken seriously enough" points at negligence. I think you are saying that non-negligent but "painfully obvious" harms that occur are "accidents", which seems fair. David is saying that the scenarios he is imagining are negligent and therefore not accidents. These seem compatible.
I understand David to be saying that there is a substantial possibility of x-risk due to negligent but non-intended events, maybe even the majority of the probability. These would sit between "accident" and "misuse" (on both of your definitions).
Thanks, that makes sense.
I think part of my skepticism about the original claim comes from the fact that I'm not sure people living in some specific stone-age grouping would, given any amount of time, come up with the concept of 'sapient' without other parts of their environment changing to enable other concepts to be constructed.
There might be a similar point translated into something shard-theory-ish, along the lines of 'the available shards are very context-dependent, so the persistence of human values across very different contexts is implausible.' SLT in particular probably involves some pretty different contexts.
I also predict that real Eliezer would say of many of these outputs that they are basically not problematic in themselves; they just illustrate how hard it is to stop outputs once you have decided they are problematic. The model seems to totally not get this.
Meta level: let's use these failures to understand how hard alignment is, but not accidentally start thinking that alignment == 'not providing information that is readily available on the internet but that we think people shouldn't use'.
Sure, inclusive genetic fitness didn't survive our sharp left turn. But human values did. Individual modern humans are optimizing for them as hard as they were before; and indeed, we aim to protect these values against the future.
Why do you think this? It seems like humans currently have values and used to have values (I'm not sure when they started having values), but they are probably different values. Certainly people today have different values in different cultures, and people who are part of continuous cultures have different values from people in those cultures 50 years ago.
Is there some reason to think that any specific human values persisted through the human analogue of SLT?
Thanks a lot for this post, I found it very helpful.
The specific thing that you measure may be more a fact about linear algebra than a fact about LLMs or CCS.
For example, let's construct data which definitely has two linearly independent dimensions that are each predictive of whether...
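To make that concrete, here is a minimal sketch (in numpy) of the kind of construction I have in mind. The specific choices (two orthogonal directions u and v, a noise scale of 0.1, and a plain least-squares probe standing in for CCS) are illustrative assumptions, not the actual experiment:

```python
# Sketch: activations containing two linearly independent directions, each fully
# predictive of a binary label, plus noise. Any method that just finds "a"
# predictive direction can land on either one, or on any mixture of the two.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 50
labels = rng.integers(0, 2, size=n)   # binary labels
signs = 2 * labels - 1                # map {0, 1} -> {-1, +1}

# Two fixed, orthonormal (hence linearly independent) directions in activation space.
u = rng.normal(size=d)
u /= np.linalg.norm(u)
v = rng.normal(size=d)
v -= (v @ u) * u
v /= np.linalg.norm(v)

# Each example carries the label along both directions, plus isotropic noise.
X = np.outer(signs, u) + np.outer(signs, v) + 0.1 * rng.normal(size=(n, d))

# A least-squares linear probe recovers *some* highly predictive direction...
w, *_ = np.linalg.lstsq(X, signs, rcond=None)
w /= np.linalg.norm(w)

# ...but that direction is a mixture of u and v, and the probe's accuracy alone
# tells you nothing about which of the two it found.
print("probe accuracy:", np.mean(np.sign(X @ w) == signs))
print("overlap with u:", abs(w @ u), "overlap with v:", abs(w @ v))
```

The point is just that a near-perfect linear probe on such data tells you nothing about which predictive direction it latched onto; that is a fact about the geometry of the data rather than about LLMs.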