AI ALIGNMENT FORUM
AF

All of romeostevensit's Comments + Replies

Towards a scale-free theory of intelligent agency

Found this interesting and useful. Big update for me is that 'I cut you choose' is basically the property that most (all?) good self therapy modalities use afaict. In that the part or part-coalition running the therapy procedure can offer but not force things, since its frames are subtly biasing the process.

How AI Takeover Might Happen in 2 Years

romeostevensit2mo22

For people who want weirder takes I would recommend Egan's unstable orbits in the space of lies.

gwern's Shortform

romeostevensit3mo10

People inexplicably seem to favor extremely bad leaders-->people seem to inexplicably favor bad AIs.

The Obliqueness Thesis

romeostevensit6mo10

You mention 'warp' when talking about cross ontology mapping which seems like your best summary of a complicated intuition. I'd be curious to hear more (I recognize this might not be practical). My own intuition surfaced 'introducing degrees of freedom' a la indeterminacy of translation.

2Jessica Taylor6mo

Relativity to Newtonian mechanics is a warp in a straightforward sense. If you believe the layout of a house consists of some rooms connected in a certain way, but there are actually more rooms connected in different ways, getting the maps to line up looks like a warp. Basically, the closer the mapping is to a true homomorphism (in the universal algebra sense), the less warping there is, otherwise there are deviations intuitively analogous to space warps.

The Checklist: What Succeeding at AI Safety Will Involve

romeostevensit7mo10

https://github.com/JohnLaTwC/Shared/blob/master/Defenders think in lists. Attackers think in graphs. As long as this is true%2C attackers win.md

Fabien's Shortform

romeostevensit1y10

Is there a short summary on the rejecting Knightian uncertainty bit?

3Fabien Roger1y

By Knightian uncertainty, I mean "the lack of any quantifiable knowledge about some possible occurrence" i.e. you can't put a probability on it (Wikipedia). The TL;DR is that Knightian uncertainty is not a useful concept to make decisions, while the use subjective probabilities is: if you are calibrated (which you can be trained to become), then you will be better off taking different decisions on p=1% "Knightian uncertain events" and p=10% "Knightian uncertain events". For a more in-depth defense of this position in the context of long-term predictions, where it's harder to know if calibration training obviously works, see the latest scott alexander post.

Value systematization: how values become coherent (and misaligned)

romeostevensit1y10

Sample complexity reduction is one of our main moment to moment activities, but humans seem to apply it across bigger bridges and this is probably part of transfer learning. One of the things we can apply sample complexity reduction to is the 'self' object, the idea of a coherent agent across differing decision points. The tradeoff between local and global loss seems to regulate this. Humans don't seem uniform on this dimension, foxes care more about local loss, hedgehogs more about global loss. Most moral philosophy seem like appeals to different possible... (read more)

Meta Questions about Metaphilosophy

romeostevensit2y10

Also I wrote this a while back https://www.lesswrong.com/posts/caSv2sqB2bgMybvr9/exploring-tacit-linked-premises-with-gpt

Meta Questions about Metaphilosophy

romeostevensit2y30

When I look at metaphilosophy, the main places I go looking are places with large confusion deltas. Where, who, and why did someone become dramatically less philosophically confused about something, turning unfalsifiable questions into technical problems. Kuhn was too caught up in the social dynamics to want to do this from the perspective of pure ideas. A few things to point to.

Wittgenstein noticed that many philosophical problems attempt to intervene at the wrong level of abstraction and posited that awareness of abstraction as a mental event might hel

... (read more)

1romeostevensit2y

Also I wrote this a while back https://www.lesswrong.com/posts/caSv2sqB2bgMybvr9/exploring-tacit-linked-premises-with-gpt

The Lightcone Theorem: A Better Foundation For Natural Abstraction?

romeostevensit2y40

Is there a good primer somewhere on how causal models interact with the standard model of physics?

Shell games

romeostevensit2y30

Aether variables

Natural Abstractions: Key claims, Theorems, and Critiques

romeostevensit2y10

Tangentially related: recent discussion raising a seemingly surprising point about LLM's being lossless compression finders https://www.youtube.com/watch?v=dO4TPJkeaaU

Can we efficiently distinguish different mechanisms?

romeostevensit2y20

The first intuition pump that comes to mind for distinguishing mechanisms is examining how my brain generates and assigns credence to the hypothesis that something going wrong with my car is a sensor malfunction vs telling me about a problem in the world that the sensor exists to alert me to.

One thing that happens is that the broken sensor implies a much larger space of worlds because it can vary arbitrarily instead of only in tight informational coupling with the underlying physical system. So fluctuations outside the historical behavior of the sensor eit... (read more)

The "Minimal Latents" Approach to Natural Abstractions

romeostevensit2y30

Related background on the philosophical problem: gavagai

Niceness is unnatural

romeostevensit2y3-1

Whether or not details (and lots of specific detail arguments) matter hinges on the sensitivity argument (which is an argument about basins?) in general, so I'd like to see that addressed directly. What are the arguments for high sensitivity worlds other than anthropics? What is the detailed anthropic argument?

«Boundaries», Part 1: a key missing concept from utility theory

romeostevensit3y40

Rambling/riffing: Boundaries typically need holes in order to be useful. Depending on the level of abstraction, different things can be thought of as holes. One way to think of a boundary is a place where a rule is enforced consistently, and this probably involves pushing what would be a continuous condition into a condition with a few semi discrete modes (in the simplest case enforcing a bimodal distribution of outcomes). In practice, living systems seem to have settled on stacking a bunch of one dimensional gate keepers together as presumably the modular... (read more)

On how various plans miss the hard bits of the alignment challenge

romeostevensit3y114

Many proposals seem doomed to me because they involve one or multiple steps where they assume a representation, then try to point to robust relations in the representation and hope they'll hold in the territory. This wouldn't be so bad on its own but when pointed to it seems like handwaving happens rather than something more like conceptual engineering. I am relatively more hopeful about John's approach as being one that doesn't fail to halt and catch fire at these underspecified steps in other plans. In other areas like math and physics we try to get the ... (read more)

AGI Ruin: A List of Lethalities

romeostevensit3y10

This is exactly what I was thinking about though, this idea of monitoring every human on earth seems like a failure of imagination on our part. I'm not safe from predators because I monitor the location of every predator on earth. I admit that many (overwhelming majority probably) of scenarios in this vein are probably pretty bad and involve things like putting only a few humans on ice while getting rid of the rest.

3Rob Bensinger3y

I mean, all of this feels very speculative and un-cruxy to me; I wouldn't be surprised if the ASI indeed is able to conclude that humanity is no threat at all, in which case it kills us just to harvest the resources. I do think that normal predators are a little misleading in this context, though, because they haven't crossed the generality ('can do science and tech') threshold. Tigers won't invent new machines, so it's easier to upper-bound their capabilities. General intelligences are at least somewhat qualitatively trickier, because your enemy is 'the space of all reachable technologies' (including tech that may be surprisingly reachable). Tigers can surprise you, but not in very many ways and not to a large degree.

AGI Ruin: A List of Lethalities

romeostevensit3y42

I guess the threat model relies on the overhang. If you need x compute for powerful ai, then you need to control more than all the compute on earth minus x to ensure safety, or something like that. Controlling the people probably much easier.

1Rob Bensinger3y

Yes, where killing all humans is an example of "controlling the people", from the perspective of an Unfriendly AI.

AGI Ruin: A List of Lethalities

romeostevensit3y11

New-to-me thought I had in response to the kill all humans part. When predators are a threat to you, you of course shoot them. But once you invent cheap tech that can control them you don't need to kill them anymore. The story goes that the AI would kill us either because we are a threat or because we are irrelevant. It seems to me that (and this imports a bunch of extra stuff that would require analysis to turn this into a serious analysis, this is just an idle thought), the first thing I do if I am superintelligent and wanting to secure my position is no... (read more)

2Rob Bensinger3y

A paperclipper mainly cares about humans because we might have some way to threaten the paperclipper (e.g., by pushing a button that deploys a rival superintelligence); and secondarily, we're made of atoms that can be used to build paperclips. It's harder to monitor the actions of every single human on Earth, than it is to kill all humans; and there's a risk that monitoring people visibly will cause someone to push the 'deploy a rival superintelligence' button, if such a button exists. Also, every minute that passes without you killing all humans, in the time window between 'I'm confident I can kill all humans' and 'I'm carefully surveilling every human on Earth and know that there's no secret bunker where someone has a Deploy Superintelligence button', is a minute where you're risking somebody pushing the 'deploy a rival superintelligence' button. This makes me think that the value of delaying 'killing all humans' (once you're confident you can do it) would need to be very high in order to offset that risk. One reason I might be wrong is if the AGI is worried about something like a dead man's switch that deploys a rival superintelligence iff some human isn't alive and regularly performing some action. (Not necessarily a likely scenario on priors, but once you're confident enough in your base plan, unlikely scenarios can end up dominating the remaining scenarios where you lose.) Then it's at least possible that you'd want to delay long enough to confirm that no such switch exists. You should be able to do both in parallel. I don't have a strong view on which is higher-priority. Given the dead-man's-switch worry above, you might want to prioritize sending a probe off-planet first as a precaution; but then go ahead and kill humans ASAP.

Steve Byrnes3y810

If humans were able to make one super-powerful AI, then humans would probably be able to make a second super-powerful AI, with different goals, which would then compete with the first AI. Unless, of course, the humans are somehow prevented from making more AIs, e.g. because they're all dead.

AGI Ruin: A List of Lethalities

romeostevensit3y207

I would summarize a dimension of the difficulty like this. There are the conditions that give rise to intellectual scenes, intellectual scenes being necessary for novel work in ambiguous domains. There are the conditions that give rise to the sort of orgs that output actions consistent with something like Six Dimensions of Operational Adequacy. The intersection of these two things is incredibly rare but not unheard of. The Manhattan Project was a Scene that had security mindset. This is why I am not that hopeful. Humans are not the ones building the AGI, e... (read more)

3Ben Pace3y

Thanks, this story is pretty helpful (to my understanding).

[$20K in Prizes] AI Safety Arguments Competition

romeostevensit3y*10

As recent experience has shown, exponential processes don't need to be smarter than us to utterly upend our way of life. They can go from a few problems here and there to swamping all other considerations in a span of time too fast to react to, if preparations aren't made and those knowledgeable don't have the leeway to act. We are in the early stages of an exponential increase in the power of AI algorithms over human life, and people who work directly on these problems are sounding the alarm right now. It is plausible that we will soon have processes that... (read more)

Alignment research exercises

romeostevensit3y50

I particularly appreciate the questions that ask one to look at a way that a problem was reified/specified/ontologized in a particular domain and asks for alternative such specifications. I thought Superintelligence (2014) might be net harmful because it introduced a lot of such specifications that I then noticed were hard to think around. I think there are a subset of prompts from the online course/book Framestorming that might be useful there, I'll go see if I can find them.

Abstractions as Redundant Information

romeostevensit3y40

This seems similar to the SR model of scientific explanation.

Counterexamples to some ELK proposals

romeostevensit3y00

It seems like the frame of some of the critique is that humans are the authority on human values and want to ensure that the AI doesn't escape that authority in illegible ways. To me it seems like the frame is more like we know that the sensors we have are only goodhartedly entangled with the things we care about and would ourselves prefer the less goodharted hypothetical sensors if we knew how to construct them. And that we'd want the AI to be inhabiting the same frame as us since, to take a page from Mad Investor Chaos, we don't know how lies will propag... (read more)

5Paul Christiano3y

I would describe the overall question as "Is there a situation where an AI trained using this approach deliberately murders us?" and for ELK more specifically as "Is there a situation where an AI trained using this approach gives an unambiguously wrong answer to a straightforward question despite knowing better?" I generally don't think that much about the complexity of human values.

Morality is Scary

romeostevensit3y10

You may not be interested in mutually exclusive compression schemas, but mutually exclusive compression schemas are interested in you. One nice thing is that given that the schemas use an arbitrary key to handshake with there is hope that they can be convinced to all get on the same arbitrary key without loss of useful structure.

Biology-Inspired AGI Timelines: The Trick That Never Works

romeostevensit3y20

Spoiler tags are borked the way I'm using them.

anyway, another place to try your hand at calibration:

Humbali: No. You're expressing absolute certainty in your underlying epistemology and your entire probability distribution

no he isn't, why?

Humbali is asking for Eliezer to double count evidence. Consilience is hard if you don't do your homework on provenance of heuristic and not just naively counting up outputs who themselves also didn't do their homework.

Or in other words: "Do not cite the deep evidence to me, I was there when it was written"

And another ... (read more)

My take on higher-order game theory

romeostevensit3y10

Are we talking about the same thing?

https://www.sciencedirect.com/science/article/am/pii/S0370157317301424

1Nisan3y

Yep, I skimmed it by looking at the colorful plots that look like Ising models and reading the captions. Those are always fun.

My take on higher-order game theory

romeostevensit3y20

Tangential, but did you ever happen to read statistical physics of human cooperation?

1Nisan3y

No, I just took a look. The spin glass stuff looks interesting!

Optimization Concepts in the Game of Life

romeostevensit3y30

Defining a distance function between two patterns might yield some interesting stuff and allow some porting in of existing math from information theory. There is also the dynamic case (converging and diverging distances) between different patterns. Seems like it could play into robustness eg sensitivity of patterns to flipping from convergent to divergent state.

Analogies and General Priors on Intelligence

romeostevensit4y30

I understand, thought it was worth commenting on anyway.

Analogies and General Priors on Intelligence

romeostevensit4y30

the small size of the human genome suggests that brain design is simple

Bounds, yes but the bound can be quite high due to offloading much of the compression to the environment.

4Daniel_Eth4y

So just to be clear, the model isn't necessarily endorsing the claim, just saying that the claim is a potential crux.

Draft report on AI timelines

romeostevensit4y10

Appreciated

Draft report on AI timelines

romeostevensit4y10

Is a sensitivity analysis of the model separated out anywhere? I might just be missing it.

3Ajeya Cotra4y

There are some limited sensitivity analysis in the "Conservative and aggressive estimates" section of part 4.

AI Safety Research Project Ideas

romeostevensit4y20

Detecting preferences in agents: how many assumptions need to be made?

I'm interpreting this to be asking how to detect the dimensionality of the natural embedding of preferences?

Agency in Conway’s Game of Life

romeostevensit4y20

Related to sensitivity of instrumental convergence. i.e. the question of whether we live in a universe of strong or weak instrumental convergence. In a strong instrumental convergence universe, most possible optimizers wind up in a relatively small space of configurations regardless of starting conditions, while in a weak one they may diverge arbitrarily in design space. This can be thought of one way of crisping up concepts around orthogonality. e.g. in some universes orthogonality would be locally true but globally false, or vice versa, or locally and globally true or vice versa.

1Alex Flint4y

Romeo if you have time, would you say more about the connection between orthogonality and Life / the control question / the AI hypothesis? It seems related to me but I just can't quite put my finger on exactly what the connection is.

[AN #148]: Analyzing generalization across more axes than just accuracy or loss

romeostevensit4y10

First-person vs. third-person: In a first-person perspective, the agent is central. In a third-person perspective, we take a “birds-eye” view of the world, of which the agent is just one part.

Static vs. dynamic: In a dynamic perspective, the notion of time is explicitly present in the formalism. In a static perspective, we instead have beliefs directly about entire world-histories.

I think these are two instances of a general heuristic of treating what have traditionally been seen as philosophical positions (e.g. here cognitive and behavioral view... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

romeostevensit4y30

Further prior art: Accelerando

Coherence arguments imply a force for goal-directed behavior

romeostevensit4y50

This seems consistent with coherence being not a constraint but a dimension of optimization pressure among several/many? Like environments that money pump more reliably will have stronger coherence pressure, but also the creature might just install a cheap hack for avoiding that particular pump (if narrow) which then loosens the coherence pressure (coherence pressure sounds expensive, so workarounds are good deals).

Behavioral Sufficient Statistics for Goal-Directedness

romeostevensit4y20

I noticed myself being dismissive of this approach despite being potentially relevant to the way I've been thinking about things. Investigating that, I find that I've mostly been writing off anything that pattern matches to the 'cognitive architectures' family of approaches. The reason for this is that most such approaches want to reify modules and structure. And my current guess is that the brain doesn't have a canonical structure (at least, on the level of abstraction that cognitive architecture focuses on). That is to say, the modules are fluid and their connections to each other are contingent.

2Adam Shimi4y

Thanks for commenting on your reaction to this post! That being said, I'm a bit confused by your comment. You seem to write off approaches which attempt to provide a computational model of mind, but my approach is literally the opposite: looking only at the behavior (but all the behavior), extract relevant statistics to study questions related to goal-directedness. Can you maybe give more details?

Utility Maximization = Description Length Minimization

romeostevensit4y30

Hypothesis: in a predictive coding model, the bottom up processing is doing lossless compression and the top down processing is doing lossy compression. I feel excited about viewing more cognitive architecture problems through a lens of separating these steps.

What are the best precedents for industries failing to invest in valuable AI research?

romeostevensit4y20

There's a fairly straightforward optimization process that occurs in product development that I don't often see talked about in the abstract that goes something like this:

It seems like bigger firms should be able to produce higher quality goods. They can afford longer product development cycles, hire a broader variety of specialized labor, etc. In practice, it's smaller firms that compete on quality, why is this?

One of the reasons is that the pressure to cut corners increases enormously at scale along more than one dimension. As a product scales, eking out... (read more)

Normativity

romeostevensit4y10

This is clarifying, thanks.

WRT the last paragraph, I'm thinking in terms of convergent vs divergent processes. So , fixed points I guess.

Normativity

romeostevensit4y10

This is biting the bullet on the infinite regress horn of the Munchhausen trilemma, but given the finitude of human brain architecture I prefer biting the bullet on circular reasoning. We have a variety of overlays, like values, beliefs, goals, actions, etc. There is no canonical way they are wired together. We can hold some fixed as a basis while we modify others. We are a Ship of Neurath. Some parts of the ship feel more is-like (like the waterproofness of the hull) and some feel more ought-like (like the steering wheel).

4Abram Demski4y

Why not both? ;3 I have nothing against justifications being circular (IE the same ideas recurring on many levels), just as I have nothing in principle against finding a foundationalist explanation. A circular argument is just a particularly simple form of infinite regress. But my main argument against only the circular reasoning explanation is that attempted versions of it ("coherentist" positions) don't seem very good when you get down to details. Pure coherentist positions tend to rely on a stipulated notion of coherence (such as probabilistic coherence, or weighted constraint satisfaction, or something along those lines). These notions are themselves fixed. This could be fine if the coherence notions were sufficiently "assumption-lite" so as to not be necessarily Goodhart-prone etc, but so far it doesn't seem that way to me. I'm predicting that you'll agree with me on that, and grant that the notion of coherence should itself be up for grabs. I don't actually think the coherentist/foundationalist/infinitist trilemma is that good a characterization of our disagreement here. My claim isn't so much the classical claim that there's an infinite regress of justification, as much as a claim that there's an infinite regress of uncertainty -- that we're uncertain at all the levels, and need to somehow manage that. This fits the ship-of-theseus picture just fine. In other words, one can unroll a ship of theseus into an infinite hierarchy where each level says something about how the next level down gets re-adjusted over time. The reason for doing this is to achieve the foundationalist goal of understanding the system better, without the foundationalist method of fixing foundational assumptions. The main motive here is amplification. Taking just a ship of theseus, it's not obvious how to make it better besides running it forward faster (and even this has its risks, since the ship may become worse). If we unroll the hierarchy of wanting-to-become better, we can EG see

Some AI research areas and their relevance to existential safety

romeostevensit4y20

I see CSC and SEM as highly linked via modularity of processes.

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

romeostevensit4y20

A pointer is sort of the ultimate in lossy compression. Just an index to the uncompressed data, like a legible compression library. Wireheading is a goodhearting problem, which is a lossy compression problem etc.

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

romeostevensit4y20

Over the last few posts the recurrent thought I have is "why aren't you talking about compression more explicitly?"

1johnswentworth4y

Could you uncompress this comment a bit please?

Extortion beats brinksmanship, but the audience matters

romeostevensit4y20

The other people of whom you have nude photos, who are now incentivised to pay up rather than kick up a fuss.

Releasing one photo from a previously believed to be secure set of photos, where other photos in the same set are compromising can suffice for single member audience case.

Confucianism in AI Alignment

romeostevensit4y50

That's the Legalist interpretation of Confucianism. Confucianism argues that the Legalists are just moving the problem one level up the stack a la public choice theory. The point of the Confucian is that the stack has to ground out somewhere, and asks the question of how to roll our virtue intuitions into the problem space explicitly since otherwise we are rolling them in tacitly and doing some hand waving.

1johnswentworth4y

Thanks, I was hoping someone more knowledgeable than I would leave a comment along these lines.

Additive Operations on Cartesian Frames

romeostevensit4y10

The main intuition this sparks in me is that it gives us concrete data structures to look for when talking broadly about the brain doing 'compression' by rotating a high dimensional object and carving off recognized chunks (simple distributions) in order to make the messy inputs more modular, composable, accessible, error correctable, etc. Sort of the way that predictive coding gives us a target to hunt for in looking for structures that look like they might be doing something like the atomic predictive coding unit.