I see modeling vs. implementation as a spectrum more than a dichotomy. Something like:
More precisely, rather than a 1-dimensional spectrum, there are at least two parameters involved:
[EDIT: And a third parameter is how justified/testable the assumptions of your model are. Ideally, you want these assumptions to be grounded in science. Some will likely be philosophical assumptions which cannot be tested empirically, but at least they should fit into a coherent holistic philosophical view. At the very least, you want to make sure you're not assuming away the core parts of the problem.]
For the purposes of safety, you want to be as close to the implementation end of the spectrum as you can get. However, the model side of the spectrum is still useful as:
Btw, what are some ways we can incorporate heuristics into our algorithm while staying on level 1-2?
As I understand it, MIRI intended to build principled glass-box agents based on Bayesian decision theory.
I think this misunderstands the general view of agent foundations held by those who worked on it in the past. That is, "highly reliable agent design" was an eventual goal, in the same sense that someone taking high-school physics wants to use it to build rockets - they (hopefully) understand enough to know that they don't know enough, and will need to learn more before even attempting to build anything.
That's why Eliezer talked so much about deconfusion. The idea was to figure out what they didn't know. This led to later talk of building safe AI as an eventual goal - not a plan, but a possible eventual outcome if they could figure out enough. They clarified this view, and it was mostly understood by funders. And I helped Issa Rice write a paper laying out the different pathways by which it could help - and only two of those involved building agents.
And why did they give it up? Largely because they found that the deconfusion work was so slow, and everyone was so fundamentally wrong about the basics, that as LLM-based systems were developed they didn't think we could possibly build the reliable systems in time. They didn't think that Bayesian decision theory or glass-box agents would necessarily work, and they didn't know what would. So I think "MIRI intended to build principled glass-box agents based on Bayesian decision theory" is not just misleading, but wrong.
Epistemic status: I feel that naming this axis deconfuses me about agent foundations about as much as writing the rest of this sequence so far - so it is worth a post even though I have less to say about it.
I think my goal in studying agent foundations is a little atypical. I am usually trying to build an abstract model of superintelligent agents and make safety claims based on that model.
For instance, AIXI models a very intelligent agent pursuing a reward signal, and allows us to conclude that it probably seizes control of the reward mechanism by default. This is nice because it makes our assumptions fairly explicit. AIXI has epistemic uncertainty but no computational bounds, which seems like a roughly appropriate model for agents much smarter than anything they need to interact with. AIXI is explicitly planning to maximize its discounted reward sum, which is different from standard RL (which trains on a reward signal, but later executes learned behaviors). We can see these things from the math.
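For concreteness, here is a sketch of the expectimax expression defining AIXI's action choice, following Hutter's standard presentation (I'm writing the undiscounted finite-horizon form; the discounted variant just reweights the bracketed rewards):

$$a_k = \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m} \big[ r_k + \cdots + r_m \big] \sum_{q \,:\, U(q, a_{1:m}) = o_{1:m} r_{1:m}} 2^{-\ell(q)}$$

Here $U$ is a universal monotone Turing machine, $q$ ranges over environment programs consistent with the interaction history, $\ell(q)$ is the length of $q$, and $m$ is the horizon. The nested maxima over future actions are the explicit planning; the $2^{-\ell(q)}$ weights are the Solomonoff prior, i.e. pure epistemic uncertainty with no computational bounds; and the quantity being maximized is the reward sum itself, not a learned policy.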
Reflective oracles are compelling to me because they seem like an appropriate model for agents at a similar level of intelligence mutually reasoning about each other, possibly including a single agent over time (in the absence of radical intelligence upgrades?).
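For reference, a minimal sketch of the definition as I understand it from Fallenstein, Taylor, and Christiano's work: a reflective oracle is a (possibly randomized) map $O$ on queries $(M, p)$, where $M$ is a probabilistic oracle machine that may itself query $O$ and $p \in [0,1]$ is rational, satisfying

$$\Pr[M^O() = 1] > p \;\Rightarrow\; O(M, p) = 1, \qquad \Pr[M^O() = 1] < p \;\Rightarrow\; O(M, p) = 0,$$

with $O$ free to randomize when the probability is exactly $p$. Because the machines being queried can themselves call the oracle, agents built on top of $O$ can reason about equally powerful agents (including themselves) without the usual self-reference problems, which is what makes this an appropriate model for peers reasoning about each other.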
I'm willing to use these models where I expect them to bear weight, even if they are not "the true theory of agency." In fact (as is probably becoming clear over the course of this sequence) I am not sure that a true theory of agency applicable to all contexts exists. The problem is that agents have a nasty habit of figuring stuff out, and anything they figure out is (at least potentially) pulled into agent theory. Agent theory does not want to stay inside a little bubble in conceptual space; it wants to devour conceptual space.
I notice a different attitude among many agent foundations researchers. As I understand it, MIRI intended to build principled glass-box agents based on Bayesian decision theory. Probably as a result, it seems that MIRI-adjacent researchers tend to explicitly plan on actually implementing their theory; they want it to be executable. Someday. After a lot of math has been done. This isn't to say that they currently write a lot of code - I am only discussing their theory of impact as I understand it. To be clear, this is not a criticism; it is fine for some people to focus on theory building with an eye towards implementation and others to focus on performing implementation.
For example, I believe @abramdemski really wants to implement a version of UDT and @Vanessa Kosoy really wants to implement an IBP agent. They are both working on a normative theory which they recognize is currently slightly idealized or incomplete, but I believe that their plan routes through developing that theory to the point that it can be translated into code. Another example is the program synthesis community in computational cognitive science (e.g. Josh Tenenbaum, Zenna Tavares). They are writing functional programs to compete with deep learning right now.
For a criticism of this mindset, see my (previous in this sequence) discussion of why glass-box learners are not necessarily safer. Also (relatedly), I suspect it will be rather hard to invent a nice paradigm that takes the lead from deep learning. However, I am glad people are working on it and I hope they succeed; and I don't mean that in an empty way. I dabble in this quest myself - I even have a computational cognitive science paper.
I think that my post on what makes a theory of intelligence useful suffers from a failure to make explicit this dichotomy between modeling and implementation. I mostly had the modeling perspective in mind, but sometimes made claims about implementation. These are inherently different concerns.
The modeling perspective has its own problems. It is possible that agent theory is particularly unfriendly to abstract models - superintelligences apply a lot of optimization pressure, and pointing that optimization pressure in almost the right direction is not good enough. However, I am at least pretty comfortable using abstract models to predict why alignment plans won't work. To conclude that an alignment plan will work, you need to know that your abstract model is robust to vast increases in intelligence. That is why I like models similar to AIXI, which have already "taken the limit" of increasing intelligence - even if they (explicitly) leave out the initial conditions of intelligence-escalation trajectories.