To quickly recap my main intellectual journey so far (omitting a lengthy side trip into cryptography and Cypherpunk land), with the approximate age that I became interested in each topic in parentheses:

  • (10) Science - Science is cool!
  • (15) Philosophy of Science - The scientific method is cool! Oh look, there's a whole field studying it called "philosophy of science"!
  • (20) Probability Theory - Bayesian subjective probability and the universal prior seem to constitute an elegant solution to the philosophy of science. Hmm, there are some curious probability puzzles involving things like indexical uncertainty, copying, forgetting... I and others make some progress on this but fully solving anthropic reasoning seems really hard. (Lots of people have worked on this for a while and have failed, at least according to my judgement.)
  • (25) Decision Theory - Where does probability theory come from anyway? Maybe I can find some clues that way? Well according to von Neumann and Morgenstern, it comes from decision theory. And hey, maybe it will be really important that we get decision theory right for AI? I and others make some progress but fully solving decision theory turns out to be pretty hard too. (A number of people have worked on this for a while and haven't succeeded yet.)
  • (35) Metaphilosophy - Where does decision theory come from? It seems to come from philosophers trying to do philosophy. What is that about? Plus, maybe it will be really important that the AIs we build will be philosophically competent?
  • (45) Meta Questions about Metaphilosophy - Not sure how hard solving metaphilosophy really is, but I'm not making much progress on it by myself. Meta questions once again start to appear in my mind:
    • Why is there virtually nobody else interested in metaphilosophy or ensuring AI philosophical competence (or that of future civilization as a whole), even as we get ever closer to AGI, and other areas of AI safety start attracting more money and talent?
    • Tractability may be a concern but shouldn't more people still be talking about these problems if only to raise the alarm (about an additional reason that the AI transition may go badly)? (I've listened to all the recent podcasts on AI risk that I could find, and nobody brought it up even once.)
    • How can I better recruit attention and resources to this topic? For example, should I draw on my crypto-related fame, or start a prize or grant program with my own money? I'm currently not inclined to do either, out of inertia, unfamiliarity, uncertainty of getting any return, fear of drawing too much attention from people who don't have the highest caliber of thinking, and signaling wrong things (having to promote ideas with one's own money instead of attracting attention based on their merits). But I'm open to having my mind changed if anyone has good arguments about this.
    • What does it imply that so few people are working on this at such a late stage? For example, what are the implications for the outcome of the human-AI transition, and on the distribution of philosophical competence (and hence the distribution of values, decision theories, and other philosophical views) among civilizations in the universe/multiverse?

At each stage of this journey, I took what seemed to be the obvious next step (often up a meta ladder), but in retrospect each step left behind something like 90-99% of fellow travelers. From my current position, it looks like "all roads lead to metaphilosophy" (i.e., one would end up here starting with an interest in any nontrivial problem that incentivizes asking meta questions) and yet there's almost nobody here with me. What gives?

As for the AI safety path (as opposed to pure intellectual curiosity) that also leads here, I guess I do have more of a clue what's going on. I'll describe the positions of 4 people I know. Most of this is from private conversations so I won't give their names.

  • Person A has a specific model of the AI transition that they're pretty confident in, where the first AGI is likely to develop a big lead and if it's aligned, can quickly achieve human uploading then defer to the uploads for philosophical questions.
  • Person B thinks that ensuring AI philosophical competence won't be very hard. They have a specific (unpublished) idea that they are pretty sure will work. They're just too busy to publish/discuss the idea.
  • Person C will at least think about metaphilosophy in the back of their mind (as they spend most of their time working on other things related to AI safety).
  • Person D thinks it is important and too neglected but they personally have a comparative advantage in solving intent alignment.

To me, this paints a bigger picture that's pretty far from "humanity has got this handled." If anyone has any ideas how to change this, or answers to any of my other unsolved problems in this post, or an interest in working on them, I'd love to hear from you.


As someone who does think about a lot of the things you care about at least some of the time (and does care pretty deeply), I can speak for myself as to why I don't talk about these things very much:

Epistemic problems:

  • Mostly, the concept of "metaphilosophy" is so hopelessly broad that you kinda reach it by definition by thinking about any problem hard enough. This isn't a good thing: when you have a category so large that it contains everything (not saying this applies to you, but it applies to many other people I have met who talked about metaphilosophy), it usually means you are confused.
  • Relatedly, philosophy is incredibly ungrounded and epistemologically fraught. It is extremely hard to think about these topics in ways that actually eventually cash out into something tangible, rather than nerdsniping young smart people forever (or until they run out of funding).
  • Further on that, it is my belief that good philosophy should make you stronger, and this means that, from my point of view, a lot of the work that would be most impactful for making progress on metaphilosophy does not look like (academic) philosophy, and looks more like "build effective institutions and learn interactively why this is hard" and "get better at many scientific/engineering disciplines and build working epistemology to learn faster". Humans are really, really bad at doing long chains of abstract reasoning without regular contact with reality, so in practice imo good philosophy has to have feedback loops with reality; otherwise you will get confused. I might be totally wrong, but I expect at this moment in time me building a company is going to help me deconfuse a lot of things about philosophy more than me thinking about it really hard in isolation would.
  • It is not clear to me that there even is an actual problem to solve here. Similar to e.g. consciousness, it's not clear to me that people who use the word "metaphilosophy" are actually pointing to anything coherent in the territory at all, or even if they are, that it is a unique thing. It seems plausible that there is no such thing as "correct" metaphilosophy, and humans are just making up random stuff based on our priors and environment and that's it and there is no "right way" to do philosophy, similar to how there are no "right preferences". I know the other view, of course, and it's still worth engaging with in case there is something deep and universal to be found (the same way we found that there is actually deep equivalency and "correct" ways to think about e.g. computation).

Practical problems:

  • I have short timelines and think we will be dead if we don't make very rapid progress on extremely urgent practical problems like government regulation and AI safety. Metaphilosophy falls into the unfortunate bucket of "important, but not (as) urgent" in my view.
  • There are no good institutions, norms, groups, funding etc to do this kind of work.
  • It's weird. I happen to have a very deep interest in the topic, but it costs you weirdness points to push an idea like this when you could instead be advocating more efficiently for more pragmatic work.
  • It was interesting to read about your successive jumps up the meta hierarchy, because I had a similar path, but then I "jumped back down" when I realized that most of the higher levels are kinda just abstract, confusing nonsense, and that even really "philosophically concerned" communities like EA routinely fail basic morality such as "don't work at organizations accelerating existential risk". We are by no means currently bottlenecked by not having reflectively consistent theories of anthropic selection or whatever. I would like to get to a world where we have bottlenecks like that, but we are so, so far away from a world where that kind of stuff is why the world goes bad that it's hard to justify more than some late night/weekend thought on the topic in between a more direct, bottleneck-focused approach.

All that being said, I still am glad some people like you exist, and if I could make your work go faster, I would love to do so. I wish I could live in a world where I could justify working with you on these problems full time, but I don't think I can convince myself this is actually the most impactful thing I could be doing at this moment.


I expect at this moment in time me building a company is going to help me deconfuse a lot of things about philosophy more than me thinking about it really hard in isolation would

Hard for me to make sense of this. What philosophical questions do you think you'll get clarity on by doing this? What are some examples of people successfully doing this in the past?

It seems plausible that there is no such thing as “correct” metaphilosophy, and humans are just making up random stuff based on our priors and environment and that’s it and there is no “right way” to do philosophy, similar to how there are no “right preferences”.

Definitely a possibility (I've entertained it myself and maybe wrote some past comments along these lines). I wish there were more people studying this possibility.

I have short timelines and think we will be dead if we don’t make very rapid progress on extremely urgent practical problems like government regulation and AI safety. Metaphilosophy falls into the unfortunate bucket of “important, but not (as) urgent” in my view.

Everyone dying isn't the worst thing that could happen. I think from a selfish perspective, I'm personally a bit more scared of surviving into a dystopia powered by ASI that is aligned in some narrow technical sense. Less sure from an altruistic/impartial perspective, but it seems at least plausible that building an aligned AI without making sure that the future human-AI civilization is "safe" is not a good thing to do.

I would say that better philosophy/arguments around questions like this is a bottleneck. One reason for my interest in metaphilosophy that I didn't mention in the OP is that studying it seems least likely to cause harm or make things worse, compared to any other AI related topics I can work on. (I started thinking this as early as 2012.) Given how much harm people have done in the name of good, maybe we should all take "first do no harm" much more seriously?

There are no good institutions, norms, groups, funding etc to do this kind of work.

Which also represents an opportunity...

It’s weird. I happen to have a very deep interest in the topic, but it costs you weirdness points to push an idea like this when you could instead be advocating more efficiently for more pragmatic work.

Is it actually that weird? Do you have any stories of trying to talk about it with someone and having that backfire on you?

Hard for me to make sense of this. What philosophical questions do you think you'll get clarity on by doing this? What are some examples of people successfully doing this in the past?

The fact you ask this question is interesting to me, because in my view the opposite question is the more natural one to ask: What kind of questions can you make progress on without constant grounding and dialogue with reality? This is the default of how we humans build knowledge and solve hard new questions; the places where we do best and get drawn astray the least are exactly those areas where we can have as much feedback from reality in as tight loops as possible. So if we are trying to tackle ever more lofty problems, it becomes ever more important to get exactly that feedback wherever we can get it! From my point of view, this is the default of successful human epistemology, and the exception should be viewed with suspicion.

And for what it's worth, acting in the real world, building a company, raising money, debating people live, building technology, making friends (and enemies), absolutely helped me become far, far less confused, and far more capable of tackling confusing problems! Actually testing my epistemology and rationality against reality, and failing (a lot), has been far more helpful for deconfusing everything from practical decision making skills to my own values than reading/thinking could have ever been in the same time span. There is value in reading and thinking, of course, but I was in a severe "thinking overhang", and I needed to act in the world to keep learning and improving. I think most people (especially on LW) are in an "action underhang."

"Why do people do things?" is an empirical question, it's a thing that exists in external reality, and you need to interact with it to learn more about it. And if you want to tackle even higher level problems, you need to have even more refined feedback. When a physicist wants to understand the fundamentals of reality, they need to set up insane crazy particle accelerators and space telescopes and supercomputers and what not to squeeze bits of evidence out of reality and actually ground whatever theoretical musings they may have been thinking about. So if you want to understand the fundamentals of philosophy and the human condition, by default I expect you are going to need to do the equivalent kind of "squeezing bits out of reality", by doing hard things such as creating institutions, building novel technology, persuading people, etc. "Building a company" is just one common example of a task that forces you to interact a lot with reality to be good.

Fundamentally, I believe that good philosophy should make you stronger and allow you to make the world better, otherwise, why are you bothering? If you actually "solve metaphilosophy", I think the way this should end up looking is that you can now do crazy things. You can figure out new forms of science crazy fast, you can persuade billionaires to support you, you can build monumental organizations that last for generations. Or, in reverse, I expect that if you develop methods to do such impressive feats, you will necessarily have to learn deep truths about reality and the human condition, and acquire the skills you will need to tackle a task as heroic as "solving metaphilosophy."

Everyone dying isn't the worst thing that could happen. I think from a selfish perspective, I'm personally a bit more scared of surviving into a dystopia powered by ASI that is aligned in some narrow technical sense. Less sure from an altruistic/impartial perspective, but it seems at least plausible that building an aligned AI without making sure that the future human-AI civilization is "safe" is not a good thing to do.

I think this grounds out into object level disagreements about how we expect the future to go, probably. I think s-risks are extremely unlikely at the moment, and when I look at how best to avoid them, most such timelines don't go through "figure out something like metaphilosophy", but more likely through "just apply bog standard decent humanist deontological values and it's good enough." A lot of the s-risk in my view comes from the penchant for maximizing "good" that utilitarianism tends to promote; if we instead aim for "good enough" (which is what most people tend to instinctively favor), that cuts off most of the s-risk (though not all).

To get to the really good timelines, that route through "solve metaphilosophy", there are mandatory previous nodes such as "don't go extinct in 5 years." Buying ourselves more time is powerful optionality, not just for concrete technical work, but also for improving philosophy, human epistemology/rationality, etc.

I don't think I see a short path to communicating the parts of my model that would be most persuasive to you here (if you're up for a call or irl discussion sometime lmk), but in short I think of policy, coordination, civilizational epistemology, institution building and metaphilosophy as closely linked and tractable problems, if only it weren't the case that a small handful of AI labs (largely supported/initiated by EA/LW-types) are dead set on burning the commons as fast as humanly possible. If we had a few more years/decades, I think we could actually make tangible and compounding progress on these problems.

I would say that better philosophy/arguments around questions like this is a bottleneck. One reason for my interest in metaphilosophy that I didn't mention in the OP is that studying it seems least likely to cause harm or make things worse, compared to any other AI related topics I can work on. (I started thinking this as early as 2012.) Given how much harm people have done in the name of good, maybe we should all take "first do no harm" much more seriously?

I actually respect this reasoning. I disagree strategically, but I think this is a very morally defensible position to hold, unlike the mental acrobatics necessary to work at the x-risk factories because you want to be "in the room".

Which also represents an opportunity...

It does! If I was you, and I wanted to push forward work like this, the first thing I would do is build a company/institution! It will both test your mettle against reality and allow you to build a compounding force.

Is it actually that weird? Do you have any stories of trying to talk about it with someone and having that backfire on you?

Yup, absolutely. If you take even a microstep outside of the EA/rat-sphere, these kinds of topics quickly become utterly alien to anyone. Try explaining to a politician worried about job loss, or a middle aged housewife worried about her future pension, or a young high school dropout unable to afford housing, that actually we should be worried about whether we are doing metaphilosophy correctly to ensure that future immortal superintelligences reason correctly about acausal alien gods from math-space so they don't cause them to torture trillions of simulated souls! This is exaggerated for comedic effect, but this is really what even relatively intro level LW philosophy by default often sounds like to many people!

As the saying goes, "Grub first, then ethics." (though I would go further and say that people's instinctive rejection of what I would less charitably call "galaxy brain thinking" is actually often well calibrated)

@jessicata @Connor Leahy @Domenic @Daniel Kokotajlo @romeostevensit @Vanessa Kosoy @cousin_it @ShardPhoenix @Mitchell_Porter @Lukas_Gloor (and others, apparently I can only notify 10 people by mentioning them in a comment)

Sorry if I'm late in responding to your comments. This post has gotten more attention and replies than I expected, in many different directions, and it will probably take a while for me to process and reply to them all. (In the meantime, I'd love to see more people discuss each other's ideas here.)

Philosophy is a social/intellectual process taking place in the world. If you understand the world, you understand how philosophy proceeds.

Sometimes you don't need multiple levels of meta. There's stuff, and there's stuff about stuff, which could be called "mental" or "intensional". Then there's stuff about stuff about stuff (philosophy of mind etc). But stuff about stuff about stuff is a subset of stuff about stuff. Mental content has material correlates (writing, brain states, etc). I don't think you need a special category for stuff about stuff about stuff, it can be thought of as something like self-reading/modifying code. Or like compilers compiling themselves; you don't need a special compiler to compile compilers.

Philosophy doesn't happen in a vacuum, it's done by people with interests in social contexts, e.g. wanting to understand what other people are saying, or be famous by writing interesting things. A sufficiently good theory of society and psychology would explain philosophical discourse (and itself rely on some sort of philosophy for organizing its models). You can think of people as having "a philosophy" that can be studied from outside by analyzing text, mental states, and so on.

Reasoning about mind embeds reasoning about matter, reasoning about people embeds reasoning about mind, reasoning about matter embeds reasoning about people. Mainstream meta-philosophy consists of comparative analysis of philosophical texts, contextualized by the historical context and people and so on.

Your proposed reflection process for designing a utopia is your proposed utopia. If you propose CEV or similar, you propose that the world would be better if it included a CEV-like reflection context, and that this context had causal influence over the world in the future.

I'm not sure how clear I'm being, but I'm proposing something like collapsing levels of meta by finding correspondences between meta content and object content, and thinking of meta-meta content as meta relative to the objects corresponding to the meta content. This leads to a view where philosophy is one of many types of discourse/understanding that each shape each other (a non-foundationalist view). This is perhaps disappointing if you wanted ultimate foundations in some simple framework. Most thought is currently not foundationalist, but perhaps a foundational re-orientation could be found by understanding the current state of non-foundational thought.

Philosophy is a social/intellectual process taking place in the world. If you understand the world, you understand how philosophy proceeds.

What if I'm mainly interested in how philosophical reasoning ideally ought to work? (Similar to how decision theory studies how decision making normatively should work, not how it actually works in people.) Of course if we have little idea how real-world philosophical reasoning works, understanding that first would probably help a lot, but that's not the ultimate goal, at least not for me, for both intellectual and AI reasons.

The latter because humans do a lot of bad philosophy and often can’t recognize good philosophy. (See popularity of two-boxing among professional philosophers.) I want a theory of ideal/normative philosophical reasoning so we can build AI that improves upon human philosophy, and in a way that convinces many people (because they believe the theory is right) to trust the AI's philosophical reasoning.

This leads to a view where philosophy is one of many types of discourse/understanding that each shape each other (a non-foundationalist view). This is perhaps disappointing if you wanted ultimate foundations in some simple framework.

Sure ultimate foundations in some simple framework would be nice but I'll take whatever I can get. How would you flesh out the non-foundationalist view?

Most thought is currently not foundationalist, but perhaps a foundational re-orientation could be found by understanding the current state of non-foundational thought.

I don't understand this sentence at all. Please explain more?

What if I’m mainly interested in how philosophical reasoning ideally ought to work?

My view would suggest: develop a philosophical view of normativity and apply that view to the practice of philosophy itself. For example, if it is in general unethical to lie, then it is also unethical to lie about philosophy. Philosophical practice being normative would lead to some outcomes being favored over others. (It seems like a problem if you need philosophy to have a theory of normativity, a theory of normativity to do meta-philosophy, and meta-philosophy to do better philosophy, but earlier versions of each theory can be used to make later versions of them, in a bootstrapping process like with compilers.)

I mean normativity to include ethics, aesthetics, teleology, etc. Developing a theory of teleology in general would allow applying that theory to philosophy (taken as a system/practice/etc). It would be strange to have a normative theory for philosophical practice distinct from the one for other practices, since philosophical practice is a subset of practice in general; philosophical normativity is a specialized variant of general normativity, analogous to normativity about other areas of study. The normative theory is mostly derived from cases other than cases of normative philosophizing, since most activity that normativity could apply to is not philosophizing.

How would you flesh out the non-foundationalist view?

That seems like describing my views about things in general, which would take a long time. The original comment was meant to indicate what is non-foundationalist about this view.

I don’t understand this sentence at all. Please explain more?

Imagine a subjective credit system. A bunch of people think other people are helpful/unhelpful to them. Maybe they help support helpful people and so people who are more helpful to helpful people (etc) succeed more. It's subjective, there's no foundation where there's some terminal goal and other things are instrumental to that.

An intersubjective credit system would be the outcome of something like Pareto optimal bargaining between the people, which would lead to a unified utility function, which would imply some terminal goals and other goals being instrumental.

Speculatively, it's possible to create an intersubjective credit system (implying a common currency) given a subjective credit system.
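To make the subjective version slightly more concrete, here is a purely illustrative toy model (the matrix, the update rule, and all the numbers are arbitrary assumptions for the sketch, not a worked-out proposal): treat credit as a fixed point of pairwise helpfulness judgments, so that people who are helpful to high-credit people end up with high credit themselves, with no terminal goal appearing anywhere.

```python
import numpy as np

# Toy "subjective credit" model (illustrative only; all numbers arbitrary).
# help_matrix[i, j] = how helpful person i judges person j to be.
help_matrix = np.array([
    [0.0, 0.8, 0.1],
    [0.5, 0.0, 0.9],
    [0.2, 0.7, 0.0],
])

# Credit propagates: being helpful to high-credit people raises your own
# credit. Iterating this is just power iteration toward a leading
# eigenvector -- one crude way to turn pairwise subjective judgments into
# standing scores without positing any terminal goal.
credit = np.ones(3) / 3
for _ in range(100):
    credit = help_matrix.T @ credit
    credit /= credit.sum()

print(credit)  # relative "standing" of each person in this toy network
```

The intersubjective version would then be whatever replaces this fixed point after something like Pareto optimal bargaining, which is where a unified utility function (and hence the terminal/instrumental distinction) would come from.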

This might apply at multiple levels. Perhaps individual agents seem to have terminal goals because different parts of their mind create subjective credit systems and then they get transformed into an objective credit system in a way that prevents money pumps etc (usual consequences of not being a VNM agent).

I'm speculating that a certain kind of circular-seeming discourse, where area A is explained in terms of area B and vice versa, might be in some way analogous to a subjective credit network, and there might be some transformation of it that puts foundations on everything, analogous to founding an intersubjective credit network in terminal goals. Some things that look like circular reasoning can be made valid and others can't. The cases I'm considering are like, cases where your theory of normativity depends on your theory of philosophy and your theory of philosophy depends on your theory of meta-philosophy and your theory of meta-philosophy depends on your theory of normativity, which seems kind of like a subjective credit system.

Sorry if this is confusing (it's confusing to me too).

First, I think that the theory of agents is a more useful starting point than metaphilosophy. Once we have a theory of agents, we can build models, within that theory, of agents reasoning about philosophical questions. Such models would be answers to special cases of metaphilosophy. I'm not sure we're going to have a coherent theory of "metaphilosophy" in general, distinct from the theory of agents, because I'm not sure that "philosophy" is an especially natural category[1].

Some examples of what that might look like:

  • An agent inventing a theory of agents in order to improve its own cognition is a special case of recursive metalearning (see my recent talk on metacognitive agents).
  • There might be theorems about convergence of learning systems to agents of particular type (e.g. IBP agents), formalized using some brand of ADAM, in the spirit of John's Selection Theorems programme. This can be another model of agents discovering a theory of agents and becoming more coherent as a result (broader in terms of its notions of "agent" and "discovering" and narrower in terms of what the agent discovers).
  • An agent learning how to formalize some of its intuitive knowledge (e.g. about its own values) can be described in terms of metacognition, or more generally, the learning of some formal symbolic language. Indeed, understanding is translation, and formalizing intuitive knowledge means translating it from some internal opaque language to an external observable language.

Second, obviously in order to solve philosophical problems (such as the theory of agents), we need to implement a particular metaphilosophy. But I don't think it has to be extremely rigorous. (After all, if we tried to solve metaphilosophy instead, we would have the same problem.) My informal theory of metaphilosophy is something like: an answer to a philosophical question is good when it seems intuitive, logically consistent and parsimonious[2] after sufficient reflection (where "reflection" involves, among other things, considering special cases and other consequences of the answer, and also connecting the answer to empirical data).

 

  1. I think that philosophy just consists of all domains where we don't have consensus about some clear criteria of success. Once such a consensus forms, the domain is no longer considered philosophy. But the reasons some domains have this property at this point in time might be partly coincidental and not especially parsimonious.

  2. Circling back to the first point, what would a formalization of this within a theory of agents look like? "Parsimony" refers to a simplicity prior, "intuition" refers to opaque reasoning in the core of a metacognitive agent, and "logical consistency" is arguably some learned method of testing hypotheses (but maybe we will have a more elaborate theory of the latter).

From my current position, it looks like "all roads lead to metaphilosophy" (i.e., one would end up here starting with an interest in any nontrivial problem that incentivizes asking meta questions) and yet there's almost nobody here with me. What gives?


Facile response: I think lots of people (maybe a few hundred a year?) take this path, and end up becoming philosophy grad students like I did. As you said, the obvious next step for many domains of intellectual inquiry is to go meta / seek foundations / etc., and that leads you into increasingly foundational, increasingly philosophical questions until you decide you'll never be able to answer all the questions but maybe at least you can get some good publications in prestigious journals like Analysis and Phil Studies, and contribute to humanity's understanding of some sub-field.

 

Do you think part of it might be that even people with graduate philosophy educations are too prone to being wedded to their own ideas, or don't like to poke holes in them as much as they should? Because part of what contributes to my wanting to go more meta is being dissatisfied with my own object-level solutions and finding more and more open problems that I don't know how to solve. I haven't read much academic philosophy literature, but did read some anthropic reasoning and decision theory literature earlier, and the impression I got is that most of the authors weren't trying that hard to poke holes in their own ideas.

Yep that's probably part of it. Standard human epistemic vices. Also maybe publish-or-perish has something to do with it? idk. I definitely noticed incentives to double-down / be dogmatic in order to seem impressive on the job market. Oh also, iirc one professor had a cynical theory that if you find an interesting flaw in your own theory/argument, you shouldn't mention it in your paper, because then the reviewers will independently notice the flaw and think 'aha, this paper has an interesting flaw, if it gets published I could easily and quickly write my own paper pointing out the flaw' and then they'll be more inclined to recommend publication. It's also a great way to get citations.

Note also that I said "a few hundred a year" not "ten thousand a year" which is roughly how many people become philosophy grad students. I was more selective because in my experience most philosophy grad students don't have as much... epistemic ambition? as you or me. Sorta like the Hamming Question thing -- some, but definitely a minority, of grad students can say "I am working on it actually, here's my current plan..." to the question "what's the most important problem in your field and why aren't you working on it?" (to be clear epistemic ambition is a spectrum not a binary)

ensuring AI philosophical competence won't be very hard. They have a specific (unpublished) idea that they are pretty sure will work.


Cool, can you please ask them if they can send me the idea, even if it's just a one-paragraph summary or a pile of crappy notes-to-self?

I'm pretty much with you on this. But it's hard to find a workable attack on the problem.

One question though, do you think philosophical reasoning is very different from other intelligence tasks? If we keep stumbling into LLM type things which are competent at a surprisingly wide range of tasks, do you expect that they'll be worse at philosophy than at other tasks?

If we keep stumbling into LLM type things which are competent at a surprisingly wide range of tasks, do you expect that they’ll be worse at philosophy than at other tasks?

I'm not sure but I do think it's very risky to depend on LLMs to be good at philosophy by default. Some of my thoughts on this:

  • Humans do a lot of bad philosophy and often can't recognize good philosophy. (See popularity of two-boxing among professional philosophers.) Even if an LLM has learned how to do good philosophy, how will users or AI developers know how to prompt it to elicit that capability (e.g., which philosophers to emulate)? (It's possible that even solving metaphilosophy doesn't help enough with this, if many people can't recognize the solution as correct, but there's at least a chance that the solution does look obviously correct to many people, especially if there aren't already wrong solutions to compete with.)
  • What if it learns how to do good philosophy during pre-training, but RLHF trains that away in favor of optimizing arguments to look good to the user?
  • What if philosophy is just intrinsically hard for ML in general (I gave an argument for why ML might have trouble learning philosophy from humans in the section Replicate the trajectory with ML? of Some Thoughts on Metaphilosophy, but I'm not sure how strong it is) or maybe it's just some specific LLM architecture that has trouble with this, and we never figure this out because the AI is good at finding arguments that look good to humans?
  • Or maybe we do figure out that AI is worse at philosophy than other tasks, after it has been built, but it's too late to do anything with that knowledge (because who is going to tell the investors that they've lost their money because we don't want to differentially decelerate philosophical progress by deploying the AI).

Here's another bullet point to add to the list:
 

  • It is generally understood now that ethics is subjective, in the following technical sense: 'what final goals you have' is a ~free parameter in powerful-mind-space, such that if you make a powerful mind without specifically having a mechanism for getting it to have only the goals you want, it'll probably end up with goals you don't want. What if ethics isn't the only such free parameter? Indeed, philosophers tell us that in the Bayesian framework your priors are subjective in this sense, and also that maybe your decision theory is subjective in this sense. Perhaps, therefore, what we consider "doing good/wise philosophy" is going to involve at least a few subjective elements, where what we want is for our AGIs to do philosophy (with respect to those elements) in the same way that we would want and not in various other ways, and that won't happen by default; we need to have some mechanism to make it happen.

I don't say it's not risky. The question is more, what's the difference between doing philosophy and other intellectual tasks.

Here's one way to look at it that just occurred to me. In domains with feedback, like science or just doing real world stuff in general, we learn some heuristics. Then we try to apply these heuristics to the stuff of our mind, and sometimes it works but more often it fails. And then doing good philosophy means having a good set of heuristics from outside of philosophy, and good instincts about when to apply them or not. And some luck, in that some heuristics will happen to generalize to the stuff of our mind, but others won't.

If this is a true picture, then running far ahead with philosophy is just inherently risky. The further you step away from heuristics that have been tested in reality, and their area of applicability, the bigger your error will be.

Does this make sense?

Do you have any examples that could illustrate your theory?

It doesn't seem to fit my own experience. I became interested in Bayesian probability, universal prior, Tegmark multiverse, and anthropic reasoning during college, and started thinking about decision theory and ideas that ultimately led to UDT, but what heuristics could I have been applying, learned from what "domains with feedback"?

Maybe I used a heuristic like "computer science is cool, lets try to apply it to philosophical problems" but if the heuristics are this coarse grained, it doesn't seem like the idea can explain how detailed philosophical reasoning happens, or be used to ensure AI philosophical competence?

Maybe one example is the idea of Dutch book. It comes originally from real-world situations (sports betting and so on) and then we apply it to rationality in the abstract.
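To make the Dutch book idea concrete (with arbitrary illustrative numbers): an agent whose credences in "rain" and "no rain" sum to more than 1 will pay more for the pair of bets than the pair can ever pay out.

```python
# Toy Dutch book: incoherent credences guarantee a sure loss.
# All numbers here are made up for illustration.
p_rain, p_no_rain = 0.6, 0.6            # incoherent: 0.6 + 0.6 > 1

# The agent prices "pays 1 if rain" at its credence in rain, and likewise
# for "pays 1 if no rain", so it willingly buys both bets.
cost = p_rain * 1.0 + p_no_rain * 1.0   # pays 1.2 up front

# Exactly one of the two bets wins, whatever the weather actually does.
payout = 1.0

print(cost - payout)  # 0.2 -- a guaranteed loss, i.e. a Dutch book
```

The heuristic then gets exported from betting to epistemology: credences that expose you to a sure loss like this get treated as irrational even when no bookie is around.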

Or another example, much older, is how Socrates used analogy. It was one of his favorite tools I think. When talking about some confusing thing, he'd draw an analogy with something closer to experience. For example, "Is the nature of virtue different for men and for women?" - "Well, the nature of strength isn't that much different between men and women, likewise the nature of health, so maybe virtue works the same way." Obviously this way of reasoning can easily go wrong, but I think it's also pretty indicative of how people do philosophy.

When I look at metaphilosophy, the main places I go looking are places with large confusion deltas. Where, how, and why did someone become dramatically less philosophically confused about something, turning unfalsifiable questions into technical problems? Kuhn was too caught up in the social dynamics to want to do this from the perspective of pure ideas. A few things to point to:

  1. Wittgenstein noticed that many philosophical problems attempt to intervene at the wrong level of abstraction and posited that awareness of abstraction as a mental event might help
  2. Korzybski noticed that many philosophical problems attempt to intervene at the wrong level of abstraction and posited that awareness of abstraction as a mental event might help
  3. David Marr noticed that many philosophical and technical problems attempt to intervene at the wrong level of you get the idea
  4. Hassabis cites Marr as of help in deconfusing AI problems
  5. Eliezer's Technical Explanation of Technical Explanation doesn't use the term compression and seems the worse for it, using many, many words to describe things that compression would render easier to reason about, afaict.
  6. Hanson, in The Elephant in the Brain, posits that if we mysteriously don't make progress on something that seems crucial, maybe we have strong motivations for not making progress on it.

Question: what happens to people when they gain consciousness of abstraction? My first pass attempt at an answer is that they become a lot less interested in philosophy.

Question: if someone had quietly made progress on metaphilosophy how would we know? First guess is that we would only know if their solution scaled well, or caused something to scale well.

If you think it would be helpful, you are welcome to suggest a metaphilosophy topic for AI Safety Camp.

More info at aisafety.camp. (I'm typing on a phone; I'll add an actual link later if I remember to.)

FWIW I think some of the thinking I've been doing about meta-rationality and ontological shifts feels like metaphilosophy. Would be happy to call and chat about it sometime.

I do feel pretty wary about reifying the label "metaphilosophy" though. My preference is to start with a set of interesting questions which we can maybe later cluster into a natural category, rather than starting with the abstract category and trying to populate it with questions (which feels more like what you're doing, although I could be wrong).