"Please don't roll your own crypto" is a good message to send to software engineers looking to build robust products. But it's a bad message to send to the community of crypto researchers, because insofar as they believe you, then you won't get new crypto algorithms from them.
In the context of metaethics, LW seems much more analogous to the "community of crypto researchers" than the "software engineers looking to build robust products". Therefore this seems like a bad message to send to LessWrong, even if it's a good message to send to e.g. CEOs who justify immoral behavior with metaethical nihilism.
You may have missed my footnote, where I addressed this?
To preempt a possible misunderstanding, I don't mean "don't try to think up new metaethical ideas", but instead "don't be so confident in your ideas that you'd be willing to deploy them in a highly consequential way, or build highly consequential systems that depend on them in a crucial way". Similarly "don't roll your own crypto" doesn't mean never try to invent new cryptography, but rather don't deploy it unless there has been extensive review, and consensus that it is likely to be secure.
"More research needed" but here are some ideas to start with:
Hmm, I like #1.
#2 feels like it's injecting some frame that's a bit weird to inject here (don't roll your own metaethics... but rolling your own metaphilosophy is okay?)
But also, I'm suddenly confused about who this post is trying to warn. Is it more like labs, or more like EA-ish people doing a wider variety of meta-work?
#2 feels like it's injecting some frame that's a bit weird to inject here (don't roll your own metaethics... but rolling your own metaphilosophy is okay?)
Maybe you missed my footnote?
To preempt a possible misunderstanding, I don't mean "don't try to think up new metaethical ideas", but instead "don't be so confident in your ideas that you'd be willing to deploy them in a highly consequential way, or build highly consequential systems that depend on them in a crucial way". Similarly "don't roll your own crypto" doesn't mean never try to invent new cryptography, but rather don't deploy it unless there has been extensive review, and consensus that it is likely to be secure.
and/or this part of my answer (emphasis added):
Try to solve metaphilosophy, where potentially someone could make a breakthrough that everyone can agree is correct (after extensive review)
But also, I'm suddenly confused about who this post is trying to warn. Is it more like labs, or more like EA-ish people doing a wider variety of meta-work?
I think I mostly had alignment researchers (in and out of labs) as the target audience in mind, but it does seem relevant to others so perhaps I should expand the target audience?
To preempt a possible misunderstanding, I don't mean "don't try to think up new metaethical ideas", but instead "don't be so confident in your ideas that you'd be willing to deploy them in a highly consequential way, or build highly consequential systems that depend on them in a crucial way".
I think I had missed this, but it doesn't resolve the confusion in my #2 note. (Like, it still seems like something is weird about saying "solve metaphilosophy such that everyone can agree is correct" is more worth considering than "solve metaethics such that everyone can agree is correct". I can totally buy that they're qualitatively different, and I maybe have some guesses for why you think that. But I don't think the post spells out why, and it doesn't seem that obvious to me.)
I hinted at it with "prior efforts/history", but to spell it out more: a lot more effort has gone into metaethics in the past, so there's less likely to be some kind of low-hanging fruit in idea space that, once picked, everyone will agree is the right solution.
>perhaps I should expand the target audience?
Plausibly the target audience should include AIs that will eventually be trying to do philosophy, assuming they'll be trained on your posts' contents or will be able to browse them.
Another meta line of argument is to consider how many people have strongly held, but mutually incompatible philosophical positions.
For at least a couple of years I've been banging my head against figuring out why this line of argument doesn't seem convincing to many people. I think, ultimately, it's probably because it feels defeatable by plans like "we will make AIs solve alignment for us, and solving alignment includes solving metaphilosophy & then object-level philosophy". I think those plans are doomed in a pretty fundamental sense, but if you don't think that, then they defeat many possible objections, including this one.
As they say: Everyone who is hopeful has their own reason for hope. Everyone who is doomful[1]...
In fact it's not clear to me. I think there's less variation, but still a fair bit.
By "metaethics," do you mean something like "a theory of how humans should think about their values"?
I feel like I've seen that kind of usage on LW a bunch, but it's atypical. In philosophy, "metaethics" has a thinner, less ambitious interpretation of answering something like, "What even are values, are they stance-independent, yes/no?"
And yeah, there's often a bit more nuance than that as you dive deeper into what philosophers in the various camps are actually saying, but my point is that it's not that common, and certainly not necessary, that "having confident metaethical views," on the academic philosophy reading of "metaethics," means something like "having strong and detailed opinions on how AI should go about figuring out human values."
(And maybe you'd count this against academia, which would be somewhat fair, to be honest, because parts of "metaethics" in philosophy are even further removed from practicality, as they concern the analysis of the language behind moral claims. If we compare it to claims about the Biblical God and miracles, that would be like focusing way too much on whether the people who wrote the Bible thought they were describing real things or just metaphors, without directly trying to answer burning questions like "Does God exist?" or "Did Jesus live and perform miracles?")
Anyway, I'm asking about this because I found the following paragraph hard to understand:
Behind a veil of ignorance, wouldn't you want everyone to be less confident in their own ideas? Or think "This isn't likely to be a subjective question like morality/values might be, and what are the chances that I'm right and they're all wrong? If I'm truly right why can't I convince most others of this? Is there a reason or evidence that I'm much more rational or philosophically competent than they are?"
My best guess of what you might mean (low confidence) is the following:
You're conceding that morality/values might be (to some degree) subjective, but you're cautioning people from having strong views about "metaethics," which you take to be the question of not just what morality/values even are, but also a bit more ambitiously: how to best reason about them and how to (e.g.) have AI help us think about what we'd want for ourselves and others.
Is that roughly correct?
Because if one goes with the "thin" interpretation of metaethics, then "having one's own metaethics" could be as simple as believing some flavor of "morality/values are subjective," and in the part I quoted, you don't sound like you're too strongly opposed to just that stance in itself.
By "metaethics," do you mean something like "a theory of how humans should think about their values"?
I feel like I've seen that kind of usage on LW a bunch, but it's atypical. In philosophy, "metaethics" has a thinner, less ambitious interpretation of answering something like, "What even are values, are they stance-independent, yes/no?"
By "metaethics" I mean "the nature of values/morality", which I think is how it's used in academic philosophy. Of course the nature of values/morality has a strong influence on "how humans should think about their values" so these are pretty closely connected, but definitionally I do try to use it the same way as in philosophy, to minimize confusion. This post can give you a better idea of how I typically use it. (But as you'll see below, this is actually not crucial for understanding my post.)
Anyway, I'm asking about this because I found the following paragraph hard to understand:
So in the paragraph that you quoted (and the rest of the post), I was actually talking about philosophical fields/ideas in general, not just metaethics. While my title has "metaethics" in it, the text of the post talks generically about any "philosophical questions" that are relevant for AI x-safety. If we substitute metaethics (in my or the academic sense) into my post, then you can derive that I mean something like this:
Different metaethics (ideas/theories about the nature of values/morality) have different implications for what AI designs or alignment approaches are safe, and if you design an AI assuming that one metaethical theory is true, it could be disastrous if a different metaethical theory actually turns out to be true.
For example, if moral realism is true, then aligning the AI to human values would be pointless. What you really need to do is design the AI to be able to determine and follow objective moral truths. But this approach would be disastrous if moral realism is actually false. Similarly, if moral noncognitivism is true, that means that humans can't be wrong about their values, and implies "how humans should think about their values" is of no importance. If you design AI under this assumption, that would be disastrous if actually humans can be wrong about their values and they really need AIs to help them think about their values and avoid moral errors.
I think in practice a lot of alignment researchers may not even have explicit metaethical theories in mind, but are implicitly making certain metaethical assumptions in their AI design or alignment approach. For example they may largely ignore the question of how humans should think about their values or how AIs should help humans think about their values, thus essentially baking in an assumption of noncognitivism.
You're conceding that morality/values might be (to some degree) subjective, but you're cautioning people from having strong views about "metaethics," which you take to be the question of not just what morality/values even are, but also a bit more ambitiously: how to best reason about them and how to (e.g.) have AI help us think about what we'd want for ourselves and others.
If we substitute "how humans/AIs should reason about values" (which I'm not sure has a name in academic philosophy but I think does fall under metaphilosophy, which covers all philosophical reasoning) into the post, then your conclusion here falls out, so yes, it's also a valid interpretation of what I'm trying to convey.
I hope that makes everything a bit clearer?
With LLMs, reasoning is becoming composable, so standard libraries of pen tests / abstraction decompositions (e.g., for type errors) could become usable, testable, and improvable.
The problem is that we can't. The closest thing we have is instead a collection of mutually exclusive ideas where at most one (possibly none) is correct, and we have no consensus as to which.
One day, when I was an intern at the cryptography research department of a large software company, my boss handed me an assignment to break a pseudorandom number generator passed to us for review. Someone in another department had invented it and planned to use it in their product, and wanted us to take a look first. This person must have had a lot of political clout or was especially confident in himself, because he rejected the standard advice that anything an amateur comes up with is very likely to be insecure, and that he should instead use one of the established, off-the-shelf cryptographic algorithms that have survived extensive cryptanalysis (code breaking) attempts.
My boss thought he had to demonstrate the insecurity of the PRNG by coming up with a practical attack (i.e., a way to predict its future output based only on its past output, without knowing the secret key/seed). There were three permanent, full-time professional cryptographers working in the research department, but none of them specialized in cryptanalysis of symmetric cryptography (which covers such PRNGs), so it might have taken them some time to figure out an attack. My time was obviously less valuable, and my boss probably thought I could benefit from the experience, so I got the assignment.
Up to that point I had no interest in, knowledge of, or experience with symmetric cryptanalysis either, but I was still able to quickly demonstrate a clean attack on the proposed PRNG, which succeeded in convincing the proposer to give up and use an established algorithm. Experiences like this are so common that everyone in cryptography quickly learns how easy it is to be overconfident about one's own ideas, and many viscerally know the feeling of one's brain betraying them with unjustified confidence. As a result, "don't roll your own crypto" is deeply ingrained in the culture and in people's minds.
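(For readers unfamiliar with what a "practical attack" on a PRNG even looks like, here's a minimal sketch in Python. It is not the generator from the story, and the parameters are made up; it's a toy linear congruential generator that leaks its full internal state with every output, so observing any one output is enough to predict all future outputs without knowing the seed.)

```python
# Toy illustration (not the PRNG from the story): a linear congruential
# generator whose outputs are its full internal state. Observing any one
# output lets an attacker predict every future output without the seed.

M = 2**31 - 1       # modulus (hypothetical toy parameters)
A = 1103515245      # multiplier
C = 12345           # increment

def lcg_stream(seed, n):
    """Produce n outputs from the 'secret' seed."""
    x = seed
    out = []
    for _ in range(n):
        x = (A * x + C) % M
        out.append(x)   # each output is the internal state itself
    return out

def attacker_predict(last_observed):
    """Attacker: predict the next output from the last observed one, no seed needed."""
    return (A * last_observed + C) % M

observed = lcg_stream(987654321, 5)            # attacker sees these 5 outputs
print("predicted:", attacker_predict(observed[-1]))
print("actual:   ", lcg_stream(987654321, 6)[-1])
```

Real proposals usually fail in subtler ways than this toy one, but the shape of the failure is the same: structure the designer didn't notice, which an attacker can exploit.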
If only it were so easy to establish something like this in "applied philosophy" fields, e.g., AI alignment! Alas, unlike in cryptography, it's rarely possible to come up with "clean attacks" that clearly show that a philosophical idea is wrong or broken. The most that can usually be hoped for is to demonstrate some kind of implication that is counterintuitive or contradicts other popular ideas. But due to "one man's modus ponens is another man's modus tollens", if someone is sufficiently willing to bite bullets, it's impossible to directly convince them that they're wrong (or should be less confident) this way. This is made even harder because, unlike in cryptography, there are no universally accepted "standard libraries" of philosophy to fall back on. (My actual experiences attempting this, and almost always failing, are another reason why I'm so pessimistic about AI x-safety, even compared to most other x-risk-concerned people.)
So I think I have to try something more meta, like drawing the above parallel with how easy it is to be overconfident in other fields, such as cryptography. Another meta line of argument is to consider how many people have strongly held, but mutually incompatible philosophical positions. Behind a veil of ignorance, wouldn't you want everyone to be less confident in their own ideas? Or think "This isn't likely to be a subjective question like morality/values might be, and what are the chances that I'm right and they're all wrong? If I'm truly right why can't I convince most others of this? Is there a reason or evidence that I'm much more rational or philosophically competent than they are?"
Unfortunately I'm pretty unsure any of these meta arguments will work either. If they do change anyone's minds, please let me know in the comments or privately. Or if anyone has better ideas for how to spread a meme of "don't roll your own metaethics"[1], please contribute. And of course counterarguments are welcome too, e.g., if people rolling their own metaethics is actually good, in a way that I'm overlooking.
To preempt a possible misunderstanding, I don't mean "don't try to think up new metaethical ideas", but instead "don't be so confident in your ideas that you'd be willing to deploy them in a highly consequential way, or build highly consequential systems that depend on them in a crucial way". Similarly "don't roll your own crypto" doesn't mean never try to invent new cryptography, but rather don't deploy it unless there has been extensive review, and consensus that it is likely to be secure.