If you pin down what a thing refers to according to what that thing was optimized to refer to, then don't you have to look at the structure of the one who did the optimizing in order to work out what a given thing refers to? That is, to work out what the concept "thermodynamics" refers to, it may not be enough to look at the time evolution of the concept "thermodynamics" on its own; I may instead need to know something about the humans who were driving those changes, and the goals held within their minds. But, if this is correct, then doesn't it raise anot...
That is correct. I know it seems a little weird to generate a new policy on every timestep. The reason it's done that way is that the logical inductor needs to understand the function that maps prices to the quantities that will be purchased, in order to solve for a set of prices that "defeat" the current set of trading algorithms. That function (from prices to quantities) is what I call a "trading policy", and it has to be represented in a particular way -- as a set of syntax trees over trading primitives -- in order for the logical inductor to solve for pri...
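To make the "syntax trees over trading primitives" picture concrete, here is a minimal toy sketch in Python. The primitives (Const, Price, Sum, Max) and the two-sentence policy below are illustrative stand-ins, not the full set of expressible features from the logical induction paper; the point is just that a policy is an inspectable expression mapping prices to quantities, which is what lets the market maker search for prices that leave the trader no guaranteed advantage.

```python
from dataclasses import dataclass
from typing import Dict

# Toy syntax trees over trading primitives (a simplified, illustrative subset).

class Expr:
    def eval(self, prices: Dict[str, float]) -> float:
        raise NotImplementedError

@dataclass
class Const(Expr):
    value: float
    def eval(self, prices): return self.value

@dataclass
class Price(Expr):
    sentence: str                      # look up the current price of a sentence
    def eval(self, prices): return prices[self.sentence]

@dataclass
class Sum(Expr):
    left: Expr
    right: Expr
    def eval(self, prices): return self.left.eval(prices) + self.right.eval(prices)

@dataclass
class Max(Expr):
    left: Expr
    right: Expr
    def eval(self, prices): return max(self.left.eval(prices), self.right.eval(prices))

# A trading policy: for each sentence, a syntax tree mapping prices -> quantity to buy.
# Because the policy is an explicit syntax tree rather than an opaque function, it can
# be inspected while solving for the prices it will face.
policy = {
    "phi": Sum(Const(1.0), Max(Const(0.0), Sum(Const(0.5), Price("psi")))),
    "psi": Const(-0.5),
}

prices = {"phi": 0.5, "psi": 0.75}
quantities = {s: expr.eval(prices) for s, expr in policy.items()}
print(quantities)   # {'phi': 2.25, 'psi': -0.5}
```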
This is a relatively banal meta-commentary on reasons people sometimes give for doing worst-case analysis, and the differences between those reasons. The post reads like a list of things with no clear through-line. There is a gesture at an important idea from a Yudkowsky post (the logistic success curve idea) but the post does not helpfully expound that idea. There is a kind of trailing-off towards the end of the post as things like "planning fallacy" seem to have been added to the list with little time taken to place them in the context of the other thing...
Many people believe that they already understand Dennett's intentional stance idea, and due to that will not read this post in detail. That is, in many cases, a mistake. This post makes an excellent and important point, which is wonderfully summarized in the second-to-last paragraph:
...In general, I think that much of the confusion about whether some system that appears agent-y “really is an agent” derives from an intuitive sense that the beliefs and desires we experience internally are somehow fundamentally different from those that we “merely” infer and a
This post consists of comments on summaries of a debate about the nature and difficulty of the alignment problem. The original debate was between Eliezer Yudkowsky and Richard Ngo, but this post does not contain the content from that debate. This post is mostly commentary by Jaan Tallinn on that debate, with comments by Eliezer.
The post provides a kind of fascinating level of insight into true insider conversations about AI alignment. How do Eliezer and Jaan converse about alignment? Sure, this is a public setting, so perhaps they communicate differentl...
Yeah it might just be a lack of training data in 10-second-or-less interactive instructions.
The thing I really wanted to test with this experiment was actually whether ChatGPT could engage with the real world using me as a guinea pig. The 10-second-or-less thing was just the format I used to try to "get at" the phenomenon of engaging with the real world. I'm interested in improving the format to more cleanly get at the phenomenon.
I do currently have the sense that it's more than just a lack of training data. I have the sense that ChatGPT has learned much l...
I asked a group of friends for "someone to help me with an AI experiment" and then I gave this particular friend the context that I wanted her help guiding me through a task via text message and that she should be in front of her phone in some room that was not the kitchen.
If you look at how ChatGPT responds, it seems to be really struggling to "get" what's happening in the kitchen -- it never really comes to the point of giving specific instructions, and especially never comes to the point of having any sense of the "situation" in the kitchen -- e.g. whet...
I'm very interested in Wei Dai's work, but I haven't followed closely in recent years. Any pointers to what I might read of his recent writings?
I do think Eliezer tackled this problem in the sequences, but I don't really think he came to an answer to these particular questions. I think what he said about meta-ethics is that it is neither that there is some measure of goodness to be found in the material world independent from our own minds, nor that goodness is completely open to be constructed based on our whims or preferences. He then says "well there ju...
Did you ever end up reading Reducing Goodhart?
Not yet, but I hope to, and I'm grateful to you for writing it.
processes for evolving humans' values that humans themselves think are good, in the ordinary way we think ordinary good things are good
Well, sure, but the question is whether this can really be done by modelling human values and then evolving those models. If you claim yes then there are several thorny issues to contend with, including what constitutes a viable starting point for such a process, what is a reasonable dynamic for such a process, and on what basis we decide the answers to these things.
If you train a model by giving it reward when it appears to follow a particular human's intention, you probably get a model that is really optimizing for reward, or appearing to follow said human's intention, or something else completely different, while scheming to seize control so as to optimize even more effectively in the future, rather than an aligned AI.
Right yeah I do agree with this.
...Perhaps instead you mean: No really the reward signal is whether the system really deep down followed the humans intention, not merely appeared to do so [...] That
Well even if language models do generalize beyond their training domain in the way that humans can, you still need to be in contact with a given problem in order to solve that problem. Suppose I take a very intelligent human and ask them to become a world expert at some game X, but I don't actually tell them the rules of game X nor give them any way of playing out game X. No matter how intelligent the person is, they still need some information about what the game consists of.
Now suppose that you have this intelligent person write essays about how one ough...
This is a post about the mystery of agency. It sets up a thought experiment in which we consider a completely deterministic environment that operates according to very simple rules, and ask what it would be for an agentic entity to exist within that.
People in the Game of Life community actually spent some time investigating the empirical questions that were raised in this post. Dave Greene notes:
...The technology for clearing random ash out of a region of space isn't entirely proven yet, but it's looking a lot more likely than it was a year ago, that a work
This post attempts to separate a certain phenomenon from a certain very common model that we use to understand that phenomenon. The model is the "agent model" in which intelligent systems operate according to an unchanging algorithm. In order to make sense of there being an unchanging algorithm at the heart of each "agent", we suppose that this algorithm exchanges inputs and outputs with the environment via communication channels known as "observations" and "actions".
This post really is my central critique of contemporary artificial intelligence discourse....
This post trims down the philosophical premises that sit under many accounts of AI risk. In particular it routes entirely around notions of agency, goal-directedness, and consequentialism. It argues that it is not humans losing power that we should be most worried about, but humans quickly gaining power and misusing such a rapid increase in power.
Re-reading the post now, I have the sense that the arguments are even more relevant than when it was written, due to the broad improvements in machine learning models since it was written. The arguments in this po...
Thanks for writing this.
Alignment research has a track record of being a long slow slog. It seems that what we’re looking for is a kind of insight that is just very very hard to see, and people who have made real progress seem to have done so through long periods of staring at the problem.
With your two week research sprints, how do you decide what to work on for a given sprint?
Regarding your point on ELK: to make the output of the opaque machine learning system counterfactable, wouldn't it be sufficient to include the whole program trace? By "program trace" I mean the results of all the intermediate computations performed along the way. Yet including a program trace wouldn't help us much if we don't know what function of that program trace will tell us, for example, whether the machine learning system is deliberately deceiving us.
So yes it's necessary to have an information set that includes the relevant information, but isn't the main part of the (ELK) problem to determine what function of that information corresponds to the particular latent variable that we're looking for?
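As a minimal sketch of what "including the whole program trace" could look like for a neural network (assuming a PyTorch model; the tiny model here is just a stand-in), one can record every intermediate activation with forward hooks. Collecting the trace is the easy part; the hard part, as above, is knowing what function of it reveals the latent we care about.

```python
import torch
import torch.nn as nn

# Stand-in for an opaque trained model.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
trace = {}

def make_hook(name):
    def hook(module, inputs, output):
        trace[name] = output.detach().clone()   # record this intermediate computation
    return hook

for name, module in model.named_modules():
    if name:                                    # skip the top-level container
        module.register_forward_hook(make_hook(name))

x = torch.randn(1, 16)
y = model(x)                                    # populates `trace` as a side effect
print({name: t.shape for name, t in trace.items()})
```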
If I understand you correctly, the reason that this notion of counterfactable connects with what we normally call a counterfactual is that when an event screens off its own history, it's easy to consider other "values" of the "variable" underlying that event without coming into any logical contradictions with other events ("values of other variables") that we're holding fixed.
For example if I try to consider what would have happened if there had been a snow storm in Vermont last night, while holding fixed the particular weather patterns observed in Vermont ...
I expect you could build a system like this that reliably runs around and tidies your house say, or runs your social media presence, without it containing any impetus to become a more coherent agent (because it doesn’t have any reflexes that lead to pondering self-improvement in this way).
I agree, but if there is any kind of evolutionary variation in the thing then surely the variations that move towards stronger goal-directedness will be favored.
I think that overcoming this Molochian dynamic is the alignment problem: how do you build a powerful system ...
I really appreciate this post!
For instance, employers would often prefer employees who predictably follow rules than ones who try to forward company success in unforeseen ways.
Fascinatingly, EA employers in particular seem to seek employees who do try to forward organization goals in unforeseen ways!
I have the sense that boundaries are so effective as a coordination mechanism that we have come to believe that they are an end in themselves. To me it seems that the over-use of boundaries leads to loneliness that eventually obviates all the goodness of the successful coordination. It's as if we discovered that cars were a great way to get from place to place, but then we got so used to driving in cars that we just never got out of them, and so kind of lost all the value of being able to get from place to place. It was because the cars were in fact so eff...
You write down an optimization problem over (say) linear combinations of image pixels, minimizing some measure of marginal returns to capacity given current network parameters (first idea) or overall importance as measured by absolute value of dL/dC_i, again given current network parameters (second idea). By looking just for the feature that is currently "most problematic" you may be able to sidestep the need to identify the full set of "features" (whatever that really means).
I don't know how exactly you would formulate these objective functions, but it seems doable, no?
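Here is one way such an optimization could be set up numerically -- a minimal sketch in the spirit of the second idea, under a loudly-assumed proxy: instead of the capacity C_i from the post, treat the "importance" of a candidate pixel-direction w as |dL/dg|, where g is a gain applied to the component of the input along w (g = 1 leaves the input unchanged). The model, data, and the `importance` proxy below are illustrative stand-ins, not the post's definitions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_hidden, n_classes = 64, 32, 4
model = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, n_classes))
x = torch.randn(256, d_in)                       # stand-in for image pixels
y = torch.randint(0, n_classes, (256,))
loss_fn = nn.CrossEntropyLoss()

def importance(w_raw):
    """Proxy for |dL/dC_i|: sensitivity of the loss to a gain g applied to the
    component of the input along the unit direction w (hypothetical proxy)."""
    w = w_raw / w_raw.norm()                     # unit-norm linear combination of pixels
    g = torch.ones((), requires_grad=True)
    along_w = (x @ w).unsqueeze(1) * w           # component of each input along w
    x_scaled = x + (g - 1.0) * along_w           # rescale just that component by g
    loss = loss_fn(model(x_scaled), y)
    (dL_dg,) = torch.autograd.grad(loss, g, create_graph=True)
    return dL_dg.abs()

# Search over directions for the single most "important" feature, sidestepping the
# need to enumerate the full set of features.
w = torch.randn(d_in, requires_grad=True)
opt = torch.optim.Adam([w], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    (-importance(w)).backward()                  # gradient ascent on the importance proxy
    opt.step()

print("candidate most-important direction (first 5 entries):", (w / w.norm())[:5])
```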
OK, good, thanks for that correction.
One question I have is: how do you prevent two perfectly aligned agents from developing instrumental values concerning their own self-preservation and then becoming instrumentally misaligned as a result?
In a little more detail: consider two agents, both trying to build a house, with perfectly aligned preferences over what kind of house should be built. And suppose the agents have only partial information about the environment -- enough, let's say, to get the house built, but not enough, let's say, to really understand wha...
Thanks for writing this!
Regarding your point on corporations: One of the reasons to worry about some forms of AI is that they might soon build other, more powerful forms of AI. So the development of very human-like Ems, for example, might lead relatively quickly to the development of de novo AI, and so on; hence we worry about Ems even if we think extremely human-like Ems do not pose an x-risk on their own. In the same way, corporations are the ones moving forward fastest on building ML-based AI, and the misalignment between corporations and the long-term f...
I wonder how your definition of multi-agent power would look in a game of chess or Go. There is this intuitive thing where players who have pieces more in the center of the board (chess) or have achieved certain formations (Go) seem to acquire a kind of power in those games, but this doesn't seem to be about achieving different terminal goals. Rather it seems more like having the ability to respond to whatever one's opponent does. If the two agents cannot perfectly predict what their opponent will do then there is value in having the ability to respond to ...
Suppose the human is trying to build a house and plans to build an AI to help with that. What would those two quantities mean -- just at an intuitive level -- in a case like that?
I suppose that to compute such a quantity you would sample many different arrangements of matter -- some containing houses of various shapes and sizes and some not -- and ask to what extent the reward received by the human correlates with the reward received by the AI. So this is like measuring to what extent the human and the AI are on the same page about the design of the house they are trying to bui...
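Concretely, a minimal sketch of that sampling picture (the reward functions here are hypothetical stand-ins for the human's and the AI's evaluations of a sampled arrangement of matter):

```python
import numpy as np

rng = np.random.default_rng(0)

def human_reward(world):
    # hypothetical: the human cares about features 0 and 1 of the arrangement
    return world[0] + 0.5 * world[1]

def ai_reward(world):
    # hypothetical: the AI's learned objective overlaps only partially
    return world[0] + 0.5 * world[2]

worlds = rng.normal(size=(10_000, 3))            # sampled arrangements of matter
h = np.array([human_reward(w) for w in worlds])
a = np.array([ai_reward(w) for w in worlds])

# On this reading, the degree of human/AI agreement is the correlation of the two
# rewards across sampled arrangements.
print(np.corrcoef(h, a)[0, 1])
```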
Thanks for this.
we don't have a generic technique to define capacity across different architectures and loss functions
Got it. I imagine that for some particular architectures, and given some particular network weights, you can numerically compute the marginal returns to capacity curves, but that it's hard to express capacity analytically as a function of network weights since you really need to know what the particular features are in order to compute returns to capacity -- is that correct?
Right! Two quick ideas:
Although it's not easy to determine the full set of "natural" features for arbitrary networks, still you might be able to solve an optimization problem that identifies the single feature with most negative marginal returns to capacity given the weights of some particular trained network. If you could do this then perhaps you could apply a regularization to the network that "flattens out" the marginal returns curve for just that one feature, then apply further training to the network and ask again which single feature has most nega
Yeah I also have the sense that we mostly agree here.
I have the sense that CEV stands for, very roughly, "what such-and-such a person would do if they became extremely wise", and the hope (which I think is a reasonable hope) is that there is a direction called "wisdom" such that if you move a person far enough in that direction then they become both intelligent and benevolent, and that this eventually doesn't depend super much on where you started.
The tricky part is that we are in this time where we have the option of making some moves that might be quite ...
I think what you're saying here ought to be uncontroversial. You're saying that should a small group of technical people find themselves in a position of enormous influence, they ought to use that influence in an intelligent and responsible way, which may not look like immediately shirking that responsibility out of a sense that nobody should ever exert influence over the future.
I have the sense that in most societies over most of history, it was accepted that of course various small groups would at certain times find themselves in positions of enormous influe...
I wonder if there is some way to alter the marginal return curves so that they are not diminishing, and see whether that removes polysemanticity from the trained network. This seems difficult to do at the outset, because a lot of features just are going to be of somewhat diminishing marginal utility for most real-world learning problems. But I wonder if there is some way to train a network, and then, based on the trained weights, do some fine-tuning using a loss that is set up so that the marginal returns to feature capacity are non-diminishing for the particular features learned so far.
Thank you for this work. How feasible would it be to replicate these results using large neural networks (but still using synthetic training data where you can control feature sparsity directly)? What then would be the path to determining whether networks trained on real-world datasets behave according to this model?
good capabilities form something like an attractor well
In my own experience examining the foundations of things in the world, I have repeatedly found there to be less of an attractor-of-fundamentally-effective-decision-making than I had anticipated. In every way that I expected to find such an attractor -- within epistemology, mathematics, empiricism, ethics -- I found in fact that even the very basic assumptions that I started with were unfounded, and found nothing firm to replace them with. Probability theory: not a fundamental answer to epistemology;...
But the vast majority of initial states for a lump of carbon/oxygen/hydrogen/nitrogen atoms do not include a person programming a computer with the intention of taking over the internet. Shouldn't you apply the same logic there that you apply to the case of a computer?
In fact a single zero-day exploit is certainly much simpler than a full human, so a priori it's more likely for a computer with a zero-day exploit to form from the void than for a computer together with a competent human intent on taking over the internet to form from the void.
This is one of the clearest top-to-bottom accounts of the alignment problem and related world situation that I've seen here in a while. Thank you for writing it.
i believe, akin to the yudkowsky-moore law of mad science, that the amount of resources it takes for the world to be destroyed — whether on purpose or by accident — keeps decreasing.
Yes, it seems that in this particular way the world is becoming more and more unstable.
...pretty soon (probly this decade or the next), an artificial intelligence capable of undergoing recursive self-improvement (RSI)
See also Scott Aaronson on experimental computational complexity theory (haha it's a joke wait no maybe he's not joking wait what?)
...The meeting ended with a “Wild & Crazy Ideas Session,” at which I (naturally) spoke. I briefly considered talking about quantum gravity computing, closed timelike curves, or quantum anthropic postselection, but ultimately decided on something a little less mainstream. My topic was “Experimental Computational Complexity Theory,” or “why do theoretical physicists get $8-billion machines for the sole purpose of confirming or
Above some threshold level of deceptive capabilities we should stop trusting the results of behavioral experiments no matter what they show
I agree, and if we don't know how to verify that we're not being deceived, then we can't trust almost any black-box-measurable behavioral property of extremely intelligent systems, because any such black-box measurement rests on the assumption that the object being measured isn't deliberately deceiving us.
It seems that we ought to be able to do non-black-box stuff; we just don't know how to do that kind of stuff very well yet. In my opinion this is the hard problem of working with highly capable intelligent systems.
I'm interested in more of a sense of what mistake you think people are making, because I think caring about something strong enough to change who you are around it can be a very positive force in the world.
Yeah, caring about something enough to change who you are is really one of the highest forms of virtue, as far as I'm concerned. It's somewhat tragic that the very thing that makes us capable of this high form of virtue -- our capacity to deliberately shift what we value -- can also be used to take what was once an instrumental value and make it, more...
It's worse, even, in a certain way, than that: the existence of optimizing systems organized around a certain idea of "natural class" feeds back into more observers observing data that is distributed according to this idea of "natural class", leading to more optimizing systems being built around that idea of "natural class", and so on.
Once a certain idea of "natural class" gains a foothold somewhere, observers will make real changes in the world that further suggest this particular idea of "natural class" to others, and this forms a feedback loop.