All of Gordon Seidoh Worley's Comments + Replies

I'd really like to see more follow-up on the ideas in this post. Our drive to care is arguably why we're willing to cooperate, and making AI that cares the same way we do is a potentially viable path to AI aligned with human values, but I've not seen anyone take it up. Regardless, I think this is an important idea and folks should look at it more closely.

I think this post is important because it brings old insights from cybernetics into a modern frame that relates to how folks are thinking about AI safety today. I strongly suspect that the big idea in this post, that ontology is shaped by usefulness, matters greatly to addressing fundamental problems in AI alignment.

Seems reasonable. I do still worry quite a bit about Goodharting, but perhaps this could be reasonably addressed with careful oversight by some wise humans to do the wisdom equivalent of red teaming.

1Chris_Leong
You mean it might still Goodhart to what we think they might say? Ideally, the actual people would be involved in the process.

According to METR, the organization that audited OpenAI, a dozen tasks indicate ARA capabilities.

Small comment, but @Beth Barnes of METR posted on Less Wrong just yesterday to say "We should not be considered to have ‘audited’ GPT-4 or Claude".

This doesn't appear to be a load-bearing point in your post, but it would still be good to update the language to be more precise.

Ah, I see. I have to admit, I write a lot of my comments between things, and I missed that the context of the post could cause my words to be interpreted this way. These days I'm often in executive mode rather than scholar mode and miss nuance if it's not clearly highlighted; hence my misunderstanding, but it also reflects where I'm coming from with this answer!

I left a comment over in the other thread, but I think Joachim misunderstands my position.

In the above comment I've taken for granted that there's a non-trivial possibility that AGI is near, so I'm not arguing we should say "AGI is near" regardless of whether it is or not. We don't know whether it is; we only have our guesses about it. But so long as there's a non-trivial chance that AGI is near, I think that's the more important message to communicate.

Overall it would be better if we can communicate something like "AGI is probably near", bu... (read more)

2orthonormal
I agree with "When you say 'there's a good chance AGI is near', the general public will hear 'AGI is near'". However, the general public isn't everyone, and the people who can distinguish between the two claims are the most important to reach (per capita, and possibly in sum). So we'll do better by saying what we actually believe, while taking into account that some audiences will round probabilities off (and seeking ways to be rounded closer to the truth while still communicating accurately to anyone who does understand probabilistic claims). The marginal gain by rounding ourselves off at the start isn't worth the marginal loss by looking transparently overconfident to those who can tell the difference.
2Joachim Bartosik
I'm replying only here because spreading discussion over multiple threads makes it harder to follow.

You left a reply on a question asking how to communicate about reasons why AGI might not be near. The question refers to the costs of "the community" thinking that AGI is closer than it really is as a reason to communicate about reasons it might not be so close. So I understood the question as asking about communication with the community (my guess: of people seriously working and thinking about AI-safety-as-in-AI-not-killing-everyone), where it's important to actually try to figure out the truth.

You replied (as I understand) that when we communicate to the general public we can transmit only one idea, so we should communicate that AGI is near (if we assign not-very-low probability to that).

I think the biggest problem is that your posting "general public communication" advice as a reply to a question asking about "community communication" pushes towards less clarity in the community, where I think clarity is important.

I'm also not sold on the "you can communicate only one idea" thing, but I mostly don't care to talk about it right now (it would be nice if someone else worked it out for me, but right now I don't have the capacity to do it myself).

From a broad policy perspective, it can be tricky to know what to communicate. I think it helps if we think a bit more about the effects of our communication and a bit less about correctly conveying our level of credence in particular claims. Let me explain.

If we communicate the simple idea that AGI is near, then it pushes people to work on safety projects that would be good to work on even if AGI is not near, while paying some costs in terms of reputation, mental health, and personal wealth.

If we communicate the simple idea that AGI is not near then people ... (read more)

2orthonormal
I reached this via Joachim pointing it out as an example of someone urging epistemic defection around AI alignment, and I have to agree with him there. I think the higher difficulty posed by communicating "we think there's a substantial probability that AGI happens in the next 10 years" vs "AGI is near" is worth it even from a PR perspective, because pretending you know the day and the hour smells like bullshit to the most important people who need convincing that AI alignment is nontrivial.

Fair. For what it's worth, I strongly agree that causality is just one domain where this problem becomes apparent, and we should be worried about it generally for superintelligent agents, much more so than many folks seem (in my estimation) to worry about it today.

Yes, the variables constitute a reference frame, which is to say an ultimately subjective way of viewing the world. Even if there is high inter-observer agreement about the shape of the reference frame, that agreement isn't guaranteed unless you also posit that something like Wentworth's natural abstraction hypothesis is true.

Perhaps a toy example will help explain my point. Suppose the grass should only be watered when there's a violet cube on the lawn. To automate this, a sensor is attached to the sprinklers that turns them on only when the sensor sees a violet cube. I... (read more)
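To make the divergence concrete, here's a minimal sketch; the RGB thresholds and color names are my own hypothetical illustration, not anything from the original example. It shows how the human and the sensor, while both nominally measuring "is there a violet cube on the lawn?", implement different measurement procedures and so define subtly different variables that can disagree on edge cases.

```python
def human_judges_violet(rgb):
    """The human's informal procedure: 'violet' means roughly blue-purple."""
    r, g, b = rgb
    return b > 120 and r > 60 and g < 80

def sensor_judges_violet(rgb):
    """The sensor's hard-coded procedure: narrow thresholds chosen at install time."""
    r, g, b = rgb
    return 100 <= r <= 140 and g <= 40 and 150 <= b <= 200

cubes = {
    "clearly violet cube":         (130, 30, 180),  # both say yes
    "bluish-purple cube in shade": (70, 60, 140),   # human yes, sensor no
    "bright magenta cube":         (200, 20, 160),  # human yes, sensor no
}

for name, rgb in cubes.items():
    print(f"{name}: human={human_judges_violet(rgb)}, sensor={sensor_judges_violet(rgb)}")
```

The sprinkler ends up tracking the sensor's variable rather than the human's, and how much that matters depends on how much of the world lands in the edge cases where the two procedures come apart.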

2Tom Everitt
Sure, I think we're saying the same thing: causality is frame dependent, and the variables define the frame (in your example, you and the sensor have different measurement procedures for detecting the purple cube, so you don't actually talk about the same random variable). How big a problem is it? In practice it seems usually fine, if we're careful to test our sensor / double check we're using language in the same way. In theory, scaled up to super intelligence, it's not impossible it would be a problem. But I would also like to emphasize that the problem you're pointing to isn't restricted to causality, it goes for all kinds of linguistic reference. So to the extent we like to talk about AI systems doing things at all, causality is no worse than natural language, or other formal languages. I think people sometimes hold it to a higher bar than natural language, because it feels like a formal language could somehow naturally intersect with a programmed AI. But of course causality doesn't solve the reference problem in general. Partly for this reason, we're mostly using causality as a descriptive language to talk clearly and precisely (relative to human terms) about AI systems and their properties.

I think there's something big left out of this post, which is accounting for the agent observing and judging the causal relationships. Something has to decide how to carve up the world into parts and calculate counterfactuals. It's something that exists implicitly in your approach to causality but you don't address it here, which I think is unfortunate because although humans generally have the same frame of reference for judging causality, alien minds, like AI, may not.

1Tom Everitt
The way I think about this, is that the variables constitute a reference frame. They define particular well-defined measurements that can be done, which all observers would agree about. In order to talk about interventions, there must also be a well-defined "set" operation associated with each variable, so that the effect of interventions is well-defined. Once we have the variables, and a "set" and "get" operation for each (i.e. intervene and observe operations), then causality is an objective property of the universe. Regardless who does the experiment (i.e. sets a few variables) and does the measurement (i.e. observes some variables), the outcome will follow the same distribution. So in short, I don't think we need to talk about an agent observer beyond what we already say about the variables.
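As a minimal illustration of the set/get framing described above (an illustrative sketch of my own, not code from the comment), here is a two-variable model with explicit observe ("get") and intervene ("set") operations; once the variables and their set/get operations are fixed, anyone running the same intervention samples from the same distribution.

```python
import random

def sample_world(do_rain=None):
    """Sample (rain, wet_grass); `do_rain` implements the 'set' operation on Rain."""
    rain = random.random() < 0.3 if do_rain is None else do_rain
    wet_grass = rain or (random.random() < 0.1)  # sprinkler occasionally wets the grass anyway
    return rain, wet_grass

def p_wet(do_rain=None, n=100_000):
    """Estimate P(WetGrass) under pure observation (do_rain=None) or intervention."""
    return sum(sample_world(do_rain)[1] for _ in range(n)) / n

print("P(wet)              =", round(p_wet(), 3))
print("P(wet | do(rain=1)) =", round(p_wet(do_rain=True), 3))
print("P(wet | do(rain=0)) =", round(p_wet(do_rain=False), 3))
```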

Actually, I kind of forgot what ended up in the paper, but then I remembered so wanted to update my comment.

There was an early draft of this paper that talked about deontology, but because there are so many different forms of deontology it was hard to come up with arguments where there wasn't some version of deontological reasoning that broke the argument, so I instead switched to talking about the question of moral facts independent of ethical system. That said, the argument I make in the paper suggesting that moral realism is more dangerous than moral an... (read more)

I don't see it in the references, so you might find this paper of mine interesting (the link is to a Less Wrong summary, which links to the full thing): in it I include an argument suggesting that building AI that assumes deontology is strictly more risky than building one that does not.

1William D'Alessandro
Excellent, thanks! I was pretty confident that some other iterations of something like these ideas must be out there. Will read and incorporate this (and get back to you in a couple days).

If the mind becomes much more capable than the surrounding minds, it does so by being on a trajectory of creativity: something about the mind implies that it generates understanding that is novel to the mind and its environment.

 

I don't really understand this claim enough to evaluate it. Can you expand a bit on what you mean by it? I'm unsure about the rest of the post because it's unclear to me what the premise your top-line claim rests upon means.

2Tsvi Benson-Tilsen
If a mind comes to understand a bunch of stuff, there's probably some compact reasons that it came to understand a bunch of stuff. What could such reasons be? The mind might copy a bunch of understanding from other minds. But if the mind becomes much more capable than surrounding minds, that's not the reason, assuming that much greater capabilities required much more understanding. So it's some other reason. I'm describing this situation as the mind being on a trajectory of creativity.

to answer my own question:

Level of AI risk concern: high

General level of risk tolerance in everyday life: low

Brief summary of what you do in AI: first tried to formalize what alignment would mean, this led me to work on a program of deconfusing human values that reached an end of what i could do, now have moved on to writing about epistemology that i think is critical to understand if we want to get alignment right

Anything weird about you: prone to anxiety, previously dealt with OCD, mostly cured it with meditation but still pops up sometimes

I think I disagree. Based on your presentation here, I think someone following a policy inspired by this post would be more likely to cause existential catastrophe by pursuing a promising-looking solution that is actually a false positive and destroys all future value in our Hubble volume. I've argued we need to focus on minimizing false positive risk rather than optimizing for maximum expected value, which is what I read this post as proposing we do.

This post brought to mind a thought: I actually don't care very much about arguments about how likely doom is and how pessimistic or optimistic to be since they are irrelevant, to my style of thinking, for making decisions related to building TAI. Instead, I mostly focus on downside risks and avoiding them because they are so extreme, which makes me look "pessimistic" but actually I'm just trying to minimize the risk of false positives in building aligned AI. Given this framing, it's actually less important, in most cases, to figure out how likely somethin... (read more)
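As a toy numerical sketch of the decision rule I'm describing (the plans and numbers here are entirely made up for illustration): instead of picking the plan with the highest expected value, first reject any plan whose probability of a catastrophic false positive exceeds some tolerance, then choose among the survivors.

```python
plans = {
    # name: (expected_value, probability_of_catastrophic_false_positive)
    "ambitious plan": (100.0, 0.05),
    "cautious plan":  (40.0,  0.001),
    "do nothing":     (0.0,   0.0),
}

def best_by_expected_value(plans):
    """Pick whichever plan has the highest expected value, ignoring tail risk."""
    return max(plans, key=lambda name: plans[name][0])

def best_by_bounded_false_positive(plans, tolerance=0.01):
    """First filter out plans whose false-positive risk exceeds the tolerance, then maximize."""
    safe = {n: v for n, v in plans.items() if v[1] <= tolerance}
    return max(safe, key=lambda name: safe[name][0]) if safe else None

print(best_by_expected_value(plans))           # "ambitious plan"
print(best_by_bounded_false_positive(plans))   # "cautious plan"
```

The two rules recommend different plans whenever the highest-expected-value option carries too much catastrophic downside, which is exactly the case I care about here.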

A good specific example of trying to pull this kind of shell game is perhaps HCH. I don't recall if someone made this specific critique of it before, but it seems like there's some real concern that it's just hiding the misalignment rather than actually generating an aligned system.

2Tsvi Benson-Tilsen
That was one of the examples I had in mind with this post, yeah. (More precisely, I had in mind defenses of HCH being aligned that I heard from people who aren't Paul. I couldn't pass Paul's ITT about HCH or similar.)

In classical Chinese philosophy there's the concept of shi-fei, or "this, not that". A key bit of the idea, among other things, is that all knowledge involves making distinctions, and those distinctions are judgments, so if you want to have knowledge and put things into words you have to make this-not-that style judgments of distinction to decide what goes in what category.

More recently here on the forum, Abram has written about teleosemantics, which seems quite relevant to your investigations in this post.

The teleosemantic picture is that epistemic accuracy is a common, instrumentally convergent subgoal; and "meaning" (in the sense of semantic content) arises precisely where this subgoal is being optimized. 

I think this is exactly right. I often say things like "accurate maps are extremely useful to things like survival, so you and every other living thing have strong incentives to draw accurate maps, but this is contingent on the extent to which you care about e.g. survival".

So to see if I have this right, the difference is I'm trying to point at a larger phenomenon and you mean teleosemantics to point just at the way beliefs get constrained to be useful.

3Abram Demski
This doesn't sound quite right to me. Teleosemantics is a purported definition of belief. So according to the teleosemantic picture, it isn't a belief if it's not trying to accurately reflect something.  The additional statement I prefaced this with, that accuracy is an instrumentally convergent subgoal, was intended to be an explanation of why this sort of "belief" is a common phenomenon, rather than part of the definition of "belief".  In principle, there could be a process which only optimizes accuracy and doesn't serve any larger goal. This would still be creating and maintaining beliefs according to the definition of teleosemantics, although it would be an oddity. (How did it get there? How did a non-agentic process end up creating it?)

Cool. For what it's worth, I also disagree with many of my old framings. Basically anything written more than ~1 year ago is probably vaguely but not specifically endorsed.

Oh man I kind of wish I could go back in time and wipe out all the cringe stuff I wrote when I was trying to figure things out (like why did I need to pull in Godel or reify my confusion?). With that said, here's some updated thoughts on holons. I'm not really familiar with OOO, so I'll be going off your summary here.

I think I started out really not getting what the holon idea points at, but I understood enough to get myself confused in new ways for a while. So first off there's only ~1 holon, such that it doesn't make sense to talk about it as anything ot... (read more)

2Abram Demski
OK. So far it seems to me like we share a similar overall take, but I disagree with some of your specific framings and such. I guess I'll try and comment on the relevant posts, even though this might imply commenting on some old stuff that you'll end up disclaiming.

I very much agree and really like the coining of the term "teleosemantics". I might steal it! :-)

I'm not sure how much you've read my work on this topic or how much it influenced you, but in case you're not very aware of it I think it's worth pointing out some things I've been working on in this space for a while that you might find interesting.

I got nervous about how truth works when I tried to tackle the alignment problem head on. I ended up having to write a sequence of posts to sort out my ideas. At the time, I really failed to appreciate how deep telo... (read more)

2Abram Demski
One thing I see as different between your perspective and (my understanding of) teleosemantics, so far: You make a general case that values underlie beliefs. Teleosemantics makes a specific claim that the meaning of semantic constructs (such as beliefs and messages) is pinned down by what it is trying to correspond to. Your picture seems very compatible with, EG, the old LW claim that UDT's probabilities are really a measure of caring - how much you care about doing well in a variety of scenarios.  Teleosemantics might fail to analyze such probabilities as beliefs at all; certainly not beliefs about the world. (Perhaps beliefs about how important different scenarios are, where "importance" gets some further analysis...) The teleosemantic picture is that epistemic accuracy is a common, instrumentally convergent subgoal; and "meaning" (in the sense of semantic content) arises precisely where this subgoal is being optimized.  That's my guess at the biggest difference between our two pictures, anyway.
2Abram Demski
(Following some links...) What's the deal with Holons? Your linked article on epistemic circularity doesn't really try to explain itself, but rather links to this article, which LOUDLY doesn't explain itself. I haven't read much else yet, but here is what I think I get:

* You use Godel's incompleteness theorem as part of an argument that meta-rationalism can't make itself comprehensible to rationalism.
* You think (or thought at the time) that there's a thing, Holons, or Holonic thinking, which is fundamentally really really hard to explain, but which a bunch of people (mainly Buddhists and a few of the best postmodernists) already get. Kensho vibes.

Not something you wrote, but Viliam trying to explain you:

I'm curious whether you see any similarity between holons and object oriented ontology (if you're at all familiar with that). I was vibing with object oriented ontology when I wrote this, particularly the "nontrivial implication" at the end. Here's my terrible summary of OOO:

* Everything is 'objects'.
* For practically every realism debate, OOO lands on the realist side. There is an external world, there are tables, there are chairs, and triangles, and squares, numbers, letters, moral facts, possible worlds, and even fictional characters.
* Philosophy is properly understood as a form of art; and as art, especially closely related to theater.
* Sciences deal with objective (measurable, quantifiable) facts; arts deal with subjective/intersubjective (immeasurable, unquantifiable) facts.
* Objects are a kind of intersection of these two sorts of thing.
* To understand there as being an object is in some sense to be able to put yourself in its place, empathize with it, imagine it were you. This is our way of fruitfully relating to the immeasurable/unquantifiable. So panpsychism is in some sense a true fact about our ontology.

I find OOO to be an odd mix of interesting ideas and very weird ideas. Feel free to ignore the OOO comparison if it

So there are different notions of "more" here.

There's "more" in the sense I'm thinking of, where it's not clear that additional levels of abstraction enable deeper understanding given enough time. If 3 really is all the levels you need, because that's how many it takes to think about any number of levels of depth (again, by swapping out levels in your "abstraction registers"), then additional levels end up being in the same category.

And then there's "more" in the sense of doing things faster, which makes things cheaper. I'm perhaps more skeptical of scaling than you are. I do agree th... (read more)

Alright, fair warning, this is an out there kind of comment. But I think there's some kind of there there, so I'll make it anyway.

Although I don't have much of anything new to say about it lately, I spent several years really diving into developmental psychology, and my take on most of it is that it's an attempt to map changes in the order of complexity of the structure thoughts can take on. I view the stages of human psychological development as building up the mental infrastructure to be able to hold up to three levels of fully-formed structure (yes, this ... (read more)

1Tsvi Benson-Tilsen
Why isn't the answer obviously "yes"? What would it look like for this not to be the case? (I'm generally somewhat skeptical of descriptions like "just faster" if the faster is like multiple orders of magnitude and sure seems to result from new ideas rather than just a bigger computer.)

Why does there need to be structure? We can just have a non-uniform distribution of energy around the universe in order for there to be information to extract. I guess you could call this "structure" but that seems like a stretch to me.

I don't know if I can convince you. You seem pretty convinced that there are natural abstractions or something like them. I'm pretty suspicious that there are natural abstractions; instead I think there are useful abstractions, but they are all contingent on how the minds creating those abstractions are organized and that no... (read more)

Sure, differences are as real as the minds making them. Once you have minds, those minds start perceiving differentiation, since they need to extract information from the environment to function. So I guess I'm saying I don't see what your objection is in this last comment, since as far as I can tell you've not posited anything that actually disagrees with my point. I think it's a bit weird to call the differentiation you're referring to "objective", but you explained what you mean.

1Mitchell_Porter
How can there be information for minds to extract, unless the environment already has some kind of structure?

Isn't "the goals we would want it to have" a special case of "aiming at any target we want"? And whatever goals we'd want it to have would be informed by our ontology. So what I'm saying is I think there's a case where the generality of your claim breaks down.

4Thane Ruthenis
Goals are functions over the concepts in one's internal ontology, yes. But having a concept for something doesn't mean caring about it — your knowing what a "paperclip" is doesn't make you a paperclip-maximizer. The idea here isn't to train an AI with the goals we want from scratch, it's to train an advanced world-model that would instrumentally represent the concepts we care about, interpret that world-model, then use it as a foundation to train/build a different agent that would care about these concepts.

I think that the big claim the post relies on is that values are a natural abstraction, and the Natural Abstractions Hypothesis holds. Now this is admittedly very different from the thesis that value is complex and fragile.

It is not that AI would naturally learn human values, but that it's relatively easy for us to point at human values/Do What I Mean/Corrigibility, and that they are natural abstractions.

This is not a claim that is satisfied by default, but is a claim that would be relatively easy to satisfy if true.


If this is the case, my concern seems ye... (read more)

For what it's worth, I think you're running headlong into an instance of the problem of the criterion, and I enjoy seeing how you're grappling with it. I've tagged this post as such.

Reading this post, I think it insufficiently addresses motivations, purpose, reward functions, etc. to support the bold claim that perfect world-model interpretability is sufficient for alignment. I think this because ontology is not the whole of action: two agents with the same ontology and very different purposes would behave in very different ways.

Perhaps I'm being unfair, but I'm not convinced that you're not making the same mistake as when people claim any sufficiently intelligent AI would be naturally good.

2Thane Ruthenis
I don't understand this objection. I'm not making any claim isomorphic to "two agents with the same ontology would have the same goals". It sounds like maybe you think I'm arguing that if we can make the AI's world-model human-like, it would necessarily also be aligned? That's not my point at all. The motivation is outlined at the start of 1A: I'm saying that if we can learn how to interpret arbitrary advanced world-models, we'd be able to more precisely "aim" our AGI at any target we want, or even manually engineer some structures over its cognition that would ensure the AGI's aligned/corrigible behavior.
2Noosphere89
I think that the big claim the post relies on is that values are a natural abstraction, and the Natural Abstractions Hypothesis holds. Now this is admittedly very different from the thesis that value is complex and fragile.

It is not that AI would naturally learn human values, but that it's relatively easy for us to point at human values/Do What I Mean/Corrigibility, and that they are natural abstractions.

This is not a claim that is satisfied by default, but is a claim that would be relatively easy to satisfy if true.

The robust values hypothesis from DragonGod is worth looking at, too. From the link below, I'll quote:

This is an important hypothesis, since if it has a non-trivial chance of being correct, then AI Alignment gets quite easier. And given the shortening timelines, I think this is an important hypothesis to test.

Here's a link below for the robust values hypothesis:

https://www.lesswrong.com/posts/YoFLKyTJ7o4ApcKXR/disc-are-values-robust

This seems straightforward to me: reification is a process by which our brain picks out patterns/features and encodes them so we can recognize them again and make sense of the world given our limited hardware. We can then think in terms of those patterns and gloss over the details because the details often aren't relevant for various things.

The reason we reify things one way versus another depends on what we care about, i.e. our purposes.

To me this seems obvious: noumena feel real to most people because they're captured by their ontology. It takes a lot of work for a human mind to learn not to jump straight from sensation to reification, and even with training there's only so much a person can do because the mind has lots of low-level reification "built in" that happens prior to conscious awareness. Cf. noticing

Oh, I thought I already explained that. There's at least two different ways "exist" can be meant here, and I think we're talking past each other.

For some thing to exist, it must exist ontologically, i.e. in the map; otherwise it is not yet a thing. So I'm saying there's a difference between what we might call existence and being. You exist, in the sense of being an ontological thing, only by virtue of reification, but you are by virtue of the whole world being.

1Mitchell_Porter
I have a theory that belief in a good God is the main delusion of western religion, and belief in a fundamentally undifferentiated reality is the main delusion of eastern religion.  I see no way around the conclusion that differences are real. Experience is part of reality, and experience contains difference. Also, my experience is objectively distinct from yours - I don't know what you had for breakfast today (or indeed if you had any); that act was part of your experience, and not part of mine.  We can divide up the world in different ways, but the undivided world is already objectively differentiated. 

Yep, so I think this gets into a different question of epistemology, not directly related to things but rather about what we care about, since positing a theory that what looks to me like a table implies something table-shaped about the universe requires caring about parsimony.

(Aside: It's kind of related, because to talk about caring about things we need reifications that enable us to point to what we care about, but I think that's just an artifact of using words—care is patterns of behavior and preference that we can reify and call "parsimonious" or something else,... (read more)

1Tsvi Benson-Tilsen
I'm asking what reification is, period, and what it has to do with what's in reality (the thing that bites you regardless of what you think).

Yes, though note you can observe yourself.

0Mitchell_Porter
How can self-observation be the cause of my existence as a differentiated being? Don't I have to already exist as a differentiated being, in order to be doing that? 

I didn't link it in my original reply, but work on natural abstractions is also related. My take is that if natural abstractions exist they don't actually rehabilitate noumena, but they do explain why it intuitively feels like there are noumena. However, abstractions are still phenomena (except insofar as all phenomena are of course embedded in the world), even if they are picking up on what I might metaphorically describe as the natural contours of the territory.

1Tsvi Benson-Tilsen
How do they explain why it feels like there are noumena? (Also by "feels like" I'd want to include empirical observations of nexusness.)

This is confusing two different notions of exist. There is existence as part of the wholeness of the world that is as yet undifferentiated and there is your existence in the minds of people. "You" exist lots of places in many minds, and also "you" don't have a clearly defined existence separate and independent from the rest of the world.

I realize this is unintuitive to many folks. The thing you have to notice is that the world has an existence independent of ontology and ontology-less existence can't be fathomed in terms of ontology.

1Mitchell_Porter
Are you saying my existence is "undifferentiated" from "the wholeness of the world" so long as no one else is observing me or thinking of me?

I very much appreciate trying to figure out what things are. I think, though, you've added more complication than needed. However, my take depends on a particular view on philosophy.

So, first I think Kant is wrong about noumena. They don't exist. There are no things in themselves, there are only phenomena: things that exist because we reify them into existence to fit some concern we have. Things are reified out of sensory experience of the world (though note that "sensory" is redundant here), and the world is the unified non-thing that we can only reify by... (read more)

1Tsvi Benson-Tilsen
Okay, but the tabley-looking stuff out there seems to conform more parsimoniously to a theory that posits an external table. I assume we agree on that, and then the question is, what's happening when we so posit?
1Gordon Seidoh Worley
I didn't link it in my original reply, but work on natural abstractions is also related. My take is that if natural abstractions exist they don't actually rehabilitate noumena, but they do explain why it intuitively feels like there are noumena. However, abstractions are still phenomena (except insofar as all phenomena are of course embedded in the world), even if they are picking up on what I might metaphorically describe as the natural contours of the territory.
3Mitchell_Porter
Do I only exist because you "reify" me?

On the one hand, cool; on the other, the abstract is deceptive because it claims the trained AI is a "harmless but nonevasive AI assistant", while what the paper in fact shows is that Anthropic trained an AI with higher harmlessness and helpfulness scores, and thus a Pareto improvement over previous models, not one that is definitively past some bar separating harmless from not-harmless or helpful from not-helpful. As much is also stated in the included figure.

The work is cool, don't get me wrong. We should celebrate it. But also I want a... (read more)

3Lawrence Chan
I really do empathize with the authors, since writing an abstract fundamentally requires trading off faithfulness to the paper content and the length and readability of the abstract. But I do agree that they could've been more precise without a significant increase in length. Nitpick: I think instead of expanding on the sentence  My proposed rewrite is to replace that sentence with something like: I think this is ~ the same length and same level of detail but a lot easier to parse. 

These are good intuitive arguments against these sorts of solutions, but I think there's a more formal argument we can make that these solutions are dangerous because they pose excess false positive risk. In particular, I think they fail to fully account for the risks of generalized Goodharting, as do most proposed solutions other than something like agent foundations.

Right. Nothing that happens in the same Hubble volume can really be said to not be causally connected. Nonetheless I like the point of the OP even if it's made in an imprecise way.

I continue to be excited about this line of work. I feel like you're slowly figuring out how to formalize ontology in a way reflective of what we actually do and generalizing it. This is something missing from a lot of other approaches.

This is pretty exciting. I've not really done any direct work to push forward alignment in the last couple years, but this is exactly the sort of direction I was hoping someone would go when I wrote my research agenda for deconfusing human values. What came out of it was that there was some research to do that I wasn't equipped to do myself, and I'm very happy to say you've done the sort of thing I had hoped for.

On first pass this seems to address many of the common problems with traditional approaches to formalizing values. I hope that this proves a fruitful line of research!

Re Project 4, you might find my semi-abandoned (mostly because I wasn't and still am not in a position to make further progress on it) research agenda for deconfusing human values useful.

2Jan Hendrik Kirchner
This work by Michael Aird and Justin Shovelain might also be relevant: "Using vector fields to visualise preferences and make them consistent" And I have a post where I demonstrate that reward modeling can extract utility functions from non-transitive preference orderings: "Inferring utility functions from locally non-transitive preferences" (Extremely cool project ideas btw)

Re: Project 2

This project’s goal is to better understand the bridge principles needed between subjective, first person optimality and objective, third person success.

This seems quite valuable, because there is, properly speaking, no objective, third person perspective from which we can speak, only the inferred sense that there exists something that looks to us like a third person perspective from our first person perspectives. Thus I think this seems like a potentially fruitful line of research, since the proposed premise contains the confusion that needs to... (read more)

As it happens, I think this is a rather important topic. Failure to consider and mitigate the risk of assumptions creates both false negative (less concerning) and false positive (most concerning) risks when attempting to build aligned AI.

AlphaGo is fairly constrained in what it's designed to optimize for, but it still has the standard failure mode of "things we forgot to encode". So for example AlphaGo could suffer the error of instrumental power grabbing in order to get better at winning Go because we misspecified what we asked it to measure. This is a kind of failure introduced into the system by humans failing to make the specified objective adequately evaluate outcomes as we intended, since we cared about winning Go games while also minimizing side effects, but maybe when we cons... (read more)

Really liking this model. It seems to actually deal with the problem of embeddedness for agents and the fact that there is no clear boundary to draw around what we call an agent other than one that's convenient for some purpose.

I've obviously got thoughts on how this is operationalizing insights about "no-self" and dependent origination, but that doesn't seem too important to get into, other than to say it gives me more reason to think this is likely to be useful.

"Error" here is all sources of error, not just error in the measurement equipment. So bribing surveyors is a kind of error in my model.

0Richard Hollerith
Can you explain where there is an error term in AlphaGo, or where an error term might appear in a hypothetical model similar to AlphaGo trained much longer with many more parameters and computational resources?

For what it's worth, I think this is trying to get at the same insight as logical time but via a different path.

For the curious reader, this is also the same reason we use vector clocks to build distributed systems when we can't synchronize the clocks very well. 

And there's something quite interesting about computation as a partial order. It might seem that this only comes up when you have a "distributed" system, but actually you need partial orders to reason about unitary programs when they are non-deterministic (any program with loops and conditiona... (read more)
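For readers who haven't seen them, here is a minimal sketch of the standard vector clock construction (my own illustrative code, not anything from the linked material): events get comparable timestamps only when one causally precedes the other, so "happened-before" is a partial order and concurrent events are simply incomparable.

```python
def new_clock(n):
    """A fresh vector clock for a system of n processes."""
    return [0] * n

def tick(clock, i):
    """Local event on process i: increment that process's own entry."""
    c = clock[:]
    c[i] += 1
    return c

def merge(local, received, i):
    """Receive a message on process i: elementwise max of clocks, then tick locally."""
    c = [max(a, b) for a, b in zip(local, received)]
    c[i] += 1
    return c

def happened_before(a, b):
    """True iff event a causally precedes event b (strict partial order)."""
    return all(x <= y for x, y in zip(a, b)) and a != b

# Two processes: a1 happens on process 0, b0 happens independently on process 1,
# then process 1 receives a1's clock, producing b1.
a1 = tick(new_clock(2), 0)   # [1, 0]
b0 = tick(new_clock(2), 1)   # [0, 1]
b1 = merge(b0, a1, 1)        # [1, 2]

print(happened_before(a1, b1))                            # True: a1 -> b1
print(happened_before(a1, b0), happened_before(b0, a1))   # False False: concurrent
```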
