I think this post is important because it brings old insights from cybernetics into a modern frame that relates to how folks are thinking about AI safety today. I strongly suspect that the big idea in this post, that ontology is shaped by usefulness, matters greatly to addressing fundamental problems in AI alignment.
According to METR, the organization that audited OpenAI, a dozen tasks indicate ARA capabilities.
Small comment, but @Beth Barnes of METR posted on Less Wrong just yesterday to say "We should not be considered to have ‘audited’ GPT-4 or Claude".
This doesn't appear to be a load-bearing point in your post, but it would still be good to update the language to be more precise.
Ah, I see. I have to admit, I write a lot of my comments between things, and I missed that the context of the post could cause my words to be interpreted this way. These days I'm often in executive mode rather than scholar mode and miss nuance if it's not clearly highlighted, hence my misunderstanding, but it also reflects where I'm coming from with this answer!
I left a comment over in the other thread, but I think Joachim misunderstands my position.
In the above comment I've taken for granted that there's a non-trivial possibility that AGI is near. So I'm not arguing we should say "AGI is near" regardless of whether it is or not; we don't know whether it is, we only have our guesses about it. But so long as there's a non-trivial chance that AGI is near, I think that's the more important message to communicate.
Overall it would be better if we could communicate something like "AGI is probably near", bu...
From a broad policy perspective, it can be tricky to know what to communicate. I think it helps if we think a bit more about the effects of our communication and a bit less about correctly conveying our level of credence in particular claims. Let me explain.
If we communicate the simple idea that AGI is near then it pushes people to work on safety projects that would be good to work on even if AGI is not near while paying some costs in terms of reputation, mental health, and personal wealth.
If we communicate the simple idea that AGI is not near then people ...
Yes, the variables constitute a reference frame, which is to say an ultimately subjective way of viewing the world. Even if there is high inter-observer agreement about the shape of the reference frame, that agreement isn't guaranteed unless you also posit that something like Wentworth's natural abstraction hypothesis is true.
Perhaps a toy example will help explain my point. Suppose the grass should only be watered when there's a violet cube on the lawn. To automate this, a sensor is attached to the sprinklers that turns them on only when the sensor sees a violet cube. I...
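To make the toy setup concrete, here's a minimal sketch of the sprinkler controller (all function names and thresholds are hypothetical, purely for illustration): the thresholds are the sensor's "reference frame" for what counts as a violet cube, and nothing forces them to match a human's.

```python
# Hypothetical sketch of the toy sprinkler controller described above.
# The numeric thresholds are arbitrary: they constitute the sensor's
# operational definitions of "violet" and "cube".

def looks_violet(rgb: tuple[int, int, int]) -> bool:
    """The sensor's crude operational definition of 'violet'."""
    r, g, b = rgb
    return r > 100 and b > 100 and g < 80

def looks_cubical(aspect_ratio: float) -> bool:
    """The sensor's crude operational definition of 'cube'."""
    return 0.9 <= aspect_ratio <= 1.1

def should_water(rgb: tuple[int, int, int], aspect_ratio: float) -> bool:
    # The sprinklers run only when the sensor's carving of the world
    # says "violet cube", which may or may not match what we would say.
    return looks_violet(rgb) and looks_cubical(aspect_ratio)

print(should_water((150, 60, 200), 1.0))  # True: this counts as a violet cube here
```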
I think there's something big left out of this post, which is accounting for the agent observing and judging the causal relationships. Something has to decide how to carve up the world into parts and calculate counterfactuals. It's something that exists implicitly in your approach to causality but you don't address it here, which I think is unfortunate because although humans generally have the same frame of reference for judging causality, alien minds, like AI, may not.
Actually, I kind of forgot what ended up in the paper, but then I remembered, so I wanted to update my comment.
There was an early draft of this paper that talked about deontology, but because there are so many different forms of deontology, it was hard to come up with arguments that no version of deontological reasoning could break, so I instead switched to talking about the question of moral facts independent of any particular ethical system. That said, the argument I make in the paper suggesting that moral realism is more dangerous than moral an...
I don't see it in the references, so you might find this paper of mine interesting (the link is to a Less Wrong summary, which links to the full thing): in it I include an argument suggesting that building AI that assumes deontology is strictly riskier than building AI that does not.
If the mind becomes much more capable than the surrounding minds, it does so by being on a trajectory of creativity: something about the mind implies that it generates understanding that is novel to the mind and its environment.
I don't really understand this claim enough to evaluate it. Can you expand a bit on what you mean by it? I'm unsure about the rest of the post because it's unclear to me what the premise your top-line claim rests upon means.
To answer my own question:
Level of AI risk concern: high
General level of risk tolerance in everyday life: low
Brief summary of what you do in AI: first tried to formalize what alignment would mean; this led me to work on a program of deconfusing human values that reached the end of what I could do; now I've moved on to writing about epistemology that I think is critical to understand if we want to get alignment right
Anything weird about you: prone to anxiety; previously dealt with OCD, mostly cured it with meditation, though it still pops up sometimes
I think I disagree. Based on your presentation here, I think someone following a policy inspired by this post would be more likely to cause existential catastrophe by pursuing a promising-looking false positive that actually destroys all future value in our Hubble volume. I've argued we need to focus on minimizing false positive risk rather than maximizing expected value, which is what I read this post as proposing we do.
This post brought to mind a thought: I actually don't care very much about arguments about how likely doom is and how pessimistic or optimistic to be, since they are irrelevant, to my style of thinking, for making decisions related to building TAI. Instead, I mostly focus on downside risks and avoiding them because they are so extreme, which makes me look "pessimistic", but actually I'm just trying to minimize the risk of false positives in building aligned AI. Given this framing, it's actually less important, in most cases, to figure out how likely somethin...
Perhaps a good specific example of trying to pull this kind of shell game is HCH. I don't recall whether someone has made this specific critique of it before, but it seems like there's some real concern that it's just hiding the misalignment rather than actually generating an aligned system.
In classical Chinese philosophy there's the concept of shi-fei, or "this, not that". A key bit of the idea, among other things, is that all knowledge involves making distinctions, and those distinctions are judgments, so if you want to have knowledge and put things into words you have to make this-not-that style judgments of distinction to decide what goes in what category.
More recently here on the forum, Abram has written about teleosemantics, which seems quite relevant to your investigations in this post.
The teleosemantic picture is that epistemic accuracy is a common, instrumentally convergent subgoal; and "meaning" (in the sense of semantic content) arises precisely where this subgoal is being optimized.
I think this is exactly right. I often say things like "accurate maps are extremely useful to things like survival, so you and every other living thing have strong incentives to draw accurate maps, but this is contingent on the extent to which you care about e.g. survival".
So to see if I have this right, the difference is I'm trying to point at a larger phenomenon and you mean teleosemantics to point just at the way beliefs get constrained to be useful.
Oh man, I kind of wish I could go back in time and wipe out all the cringe stuff I wrote when I was trying to figure things out (like, why did I need to pull in Gödel or reify my confusion?). With that said, here are some updated thoughts on holons. I'm not really familiar with OOO, so I'll be going off your summary here.
I think I started out really not getting what the holon idea points at, but I understood enough to get myself confused in new ways for a while. So first off there's only ~1 holon, such that it doesn't make sense to talk about it as anything ot...
I very much agree and really like the coining of the term "teleosemantics". I might steal it! :-)
I'm not sure how much you've read my work on this topic or how much it influenced you, but in case you're not very aware of it I think it's worth pointing out some things I've been working on in this space for a while that you might find interesting.
I got nervous about how truth works when I tried to tackle the alignment problem head on. I ended up having to write a sequence of posts to sort out my ideas. At the time, I really failed to appreciate how deep telo...
So there are different notions of "more" here.
There's "more" in the sense I'm thinking of, where it's not clear that additional levels of abstraction enable deeper understanding, given enough time. If three levels really are all you need, because that's how many it takes to think about any number of levels of depth (again, by swapping out levels in your "abstraction registers"), then additional levels end up being in the same category.
And then there's "more" in the sense of doing things faster, which makes things cheaper. I'm perhaps more skeptical of scaling than you are. I do agree th...
Alright, fair warning: this is an out-there kind of comment. But I think there's some kind of there there, so I'll make it anyway.
Although I don't have much of anything new to say about it lately, I spent several years really diving into developmental psychology, and my take on most of it is that it's an attempt to map changes in the order of complexity of the structure thoughts can take on. I view the stages of human psychological development as building up the mental infrastructure to be able to hold up to three levels of fully-formed structure (yes, this ...
Why does there need to be structure? We can just have a non-uniform distribution of energy around the universe in order for there to be information to extract. I guess you could call this "structure" but that seems like a stretch to me.
I don't know if I can convince you. You seem pretty convinced that there are natural abstractions or something like them. I'm pretty suspicious that there are natural abstractions; instead I think there are useful abstractions, but that they are all contingent on how the minds creating those abstractions are organized, and that no...
Sure, differences are as real as the minds making them. Once you have minds, those minds start perceiving differentiation, since they need to extract information from the environment to function. So I guess I'm saying I don't see what your objection is in this last comment, as you've not posited anything that actually disagrees with my point, as far as I can tell. I think it's a bit weird to call the differentiation you're referring to "objective", but you explained what you mean.
I think that the big claim the post relies on is that values are a natural abstraction, and the Natural Abstractions Hypothesis holds. Now this is admittedly very different from the thesis that value is complex and fragile.
It is not that AI would naturally learn human values, but that it's relatively easy for us to point at human values/Do What I Mean/Corrigibility, and that they are natural abstractions.
This is not a claim that is satisfied by default, but is a claim that would be relatively easy to satisfy if true.
If this is the case, my concern seems ye...
For what it's worth, I think you're running headlong into an instance of the problem of the criterion, and I enjoy seeing how you're grappling with it. I've tagged this post as such.
Reading this post, I think it insufficiently addresses motivation, purpose, reward functions, etc. to support the bold claim that perfect world-model interpretability is sufficient for alignment. I think this because ontology is not the whole of action: two agents with the same ontology and very different purposes would behave in very different ways.
Perhaps I'm being unfair, but I'm not convinced that you're not making the same mistake as when people claim any sufficiently intelligent AI would be naturally good.
This seems straightforward to me: reification is a process by which our brain picks out patterns/features and encodes them so we can recognize them again and make sense of the world given our limited hardware. We can then think in terms of those patterns and gloss over the details because the details often aren't relevant for various things.
The reason we reify things one way versus another depends on what we care about, i.e. our purposes.
To me this seems obvious: noumena feel real to most people because they're captured by their ontology. It takes a lot of work for a human mind to learn not to jump straight from sensation to reification, and even with training there's only so much a person can do, because the mind has lots of low-level reification "built in" that happens prior to conscious awareness. Cf. noticing.
Oh, I thought I already explained that. There are at least two different ways "exist" can be meant here, and I think we're talking past each other.
For some thing to exist implies that it must exist ontologically, i.e. in the map. Otherwise it is not yet a thing. So I'm saying there's a difference between what we might call existence and being. You exist, in the sense of being an ontological thing, only by virtue of reification, but you are by virtue of the whole world being.
Yep, so I think this gets into a different question of epistemology, not directly related to things but rather to what we care about, since positing a theory that what looks to me like a table implies something table-shaped about the universe requires caring about parsimony.
(Aside: It's kind of related, because to talk about caring about things we need reifications that enable us to point to what we care about, but I think that's just an artifact of using words: care is patterns of behavior and preference we can reify and call "parsimonious" or something else,...
I didn't link it in my original reply, but work on natural abstractions is also related. My take is that if natural abstractions exist, they don't actually rehabilitate noumena, but they do explain why it intuitively feels like there are noumena. However, abstractions are still phenomena (except insofar as all phenomena are of course embedded in the world), even if they are picking up on what I might metaphorically describe as the natural contours of the territory.
This is confusing two different notions of "exist". There is existence as part of the wholeness of the world, which is as yet undifferentiated, and there is your existence in the minds of people. "You" exist in lots of places in many minds, and also "you" don't have a clearly defined existence separate and independent from the rest of the world.
I realize this is unintuitive to many folks. The thing you have to notice is that the world has an existence independent of ontology and ontology-less existence can't be fathomed in terms of ontology.
I very much appreciate the attempt to figure out what things are. I think, though, that you've added more complication than needed. However, my take depends on a particular view of philosophy.
So, first, I think Kant is wrong about noumena. They don't exist. There are no things in themselves; there are only phenomena: things that exist because we reify them into existence to fit some concern we have. Things are reified out of sensory experience of the world (though note that "sensory" is redundant here), and the world is the unified non-thing that we can only reify by...
On the one hand, cool; on the other, the abstract is deceptive because it claims the trained AI is a "harmless but nonevasive AI assistant", but what the paper in fact shows is that Anthropic trained an AI with higher harmlessness and helpfulness scores, and thus one that offers a Pareto improvement over previous models, not one that is definitively across some bar separating harmless from not-harmless or helpful from not-helpful. As much is also stated in the included figure.
The work is cool, don't get me wrong. We should celebrate it. But also I want a...
These are good intuitive arguments against these sorts of solutions, but I think there's a more formal argument we can make that these solutions are dangerous because they pose excess false positive risk. In particular, I think they fail to fully account for the risks of generalized Goodharting, as do most proposed solutions other than something like agent foundations.
This is pretty exciting. I've not really done any direct work to push forward alignment in the last couple years, but this is exactly the sort of direction I was hoping someone would go when I wrote my research agenda for deconfusing human values. What came out of it was that there was some research to do that I wasn't equipped to do myself, and I'm very happy to say you've done the sort of thing I had hoped for.
On first pass this seems to address many of the common problems with traditional approaches to formalizing values. I hope that this proves a fruitful line of research!
Re Project 4, you might find my research agenda for deconfusing human values useful, even though it's semi-abandoned (mostly because I wasn't, and still am not, in a position to make further progress on it).
Re: Project 2
This project’s goal is to better understand the bridge principles needed between subjective, first person optimality and objective, third person success.
This seems quite valuable, because there is, properly speaking, no objective, third person perspective of which we can speak, only the inferred sense, from our first person perspectives, that there exists something that looks to us like a third person perspective. Thus I think this seems like a potentially fruitful line of research, since the proposed premise contains the confusion that needs to...
As it happens, I think this is a rather important topic. Failure to consider and mitigate the risk of assumptions creates both false negative (less concerning) and false positive (most concerning) risks when attempting to build aligned AI.
AlphaGo is fairly constrained in what it's designed to optimize for, but it still has the standard failure mode of "things we forgot to encode". So, for example, AlphaGo could suffer the error of instrumental power grabbing in order to get better at winning Go because we misspecified what we asked it to measure. This is a kind of failure introduced into the system by humans failing to make it evaluate things as we intended, since we cared about winning Go games while also minimizing side effects, but maybe when we cons...
Really liking this model. It seems to actually deal with the problem of embeddedness for agents and the fact that there is no clear boundary to draw around what we call an agent other than one that's convenient for some purpose.
I've obviously got thoughts on how this is operationalizing insights about "no-self" and dependent origination, but that doesn't seem too important to get into, other than to say it gives me more reason to think this is likely to be useful.
For what it's worth, I think this is trying to get at the same insight as logical time but via a different path.
For the curious reader, this is also the same reason we use vector clocks to build distributed systems when we can't synchronize the clocks very well.
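For concreteness, here's a minimal vector clock sketch (hypothetical Python, not tied to any particular system): each process keeps a per-process counter and merges counters on message receipt, so events only get a partial causal order.

```python
# Minimal vector clock sketch (illustrative only; names are hypothetical).
from dataclasses import dataclass, field


@dataclass
class VectorClock:
    counts: dict = field(default_factory=dict)  # process id -> events seen

    def tick(self, pid: str) -> None:
        """Record a local event at process `pid`."""
        self.counts[pid] = self.counts.get(pid, 0) + 1

    def merge(self, other: "VectorClock") -> None:
        """On receiving a message, take the elementwise max of the two clocks."""
        for pid, n in other.counts.items():
            self.counts[pid] = max(self.counts.get(pid, 0), n)

    def compare(self, other: "VectorClock") -> str:
        """Only a partial order: some pairs of events are simply concurrent."""
        le = all(n <= other.counts.get(pid, 0) for pid, n in self.counts.items())
        ge = all(n <= self.counts.get(pid, 0) for pid, n in other.counts.items())
        if le and ge:
            return "equal"
        if le:
            return "before"
        if ge:
            return "after"
        return "concurrent"


# Two processes that never exchange messages end up causally unordered:
a, b = VectorClock(), VectorClock()
a.tick("A")
b.tick("B")
print(a.compare(b))  # "concurrent": neither event happened-before the other
```

The "concurrent" case is exactly the sense in which the ordering is partial rather than total, which connects to the next point about computation more generally.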
And there's something quite interesting about computation as a partial order. It might seem that this only comes up when you have a "distributed" system, but actually you need partial orders to reason about unitary programs when they are non-deterministic (any program with loops and conditiona...
I'd really like to see more follow-up on the ideas in this post. Our drive to care is arguably why we're willing to cooperate, and making AI that cares the way we do is a potentially viable path to AI aligned with human values, but I've not seen anyone take it up. Regardless, I think this is an important idea, and I think folks should look at it more closely.