All of jacob_cannell's Comments + Replies

Yes, but it's because the things you've outlined seem mostly irrelevant to AGI Omnicide Risk to me? It's not how I delineate the relevant parts of the classical view, and it's not what's been centrally targeted by the novel theories

They are critically relevant. From your own linked post (how I delineate):

We only have one shot. There will be a sharp discontinuity in capabilities once we get to AGI, and attempts to iterate on alignment will fail. Either we get AGI right on the first try, or we die.

If takeoff is slow (1) because brains are highly ... (read more)

3Thane Ruthenis
My argument for the sharp discontinuity routes through the binary nature of general intelligence + an agency overhang, both of which could be hypothesized via non-evolution-based routes. Considerations about brain efficiency or Moore's law don't enter into it. Brains are very different architectures compared to our computers, in any case, they implement computations in very different ways. They could be maximally efficient relative to their architectures, but so what? It's not at all obvious that FLOPS estimates of brainpower are highly relevant to predicting when our models would hit AGI, any more than the brain's wattage is relevant. They're only soundly relevant if you're taking the hard "only compute matters, algorithms don't" position, which I reject. I think both are load-bearing, in a fairly obvious manner, and that which specific mixture is responsible matters comparatively little. * Architecture obviously matters. You wouldn't get LLM performance out of a fully-connected neural network, certainly not at realistically implementable scales. Even more trivially, you wouldn't get LLM performance out of an architecture that takes in the input, discards it, spends 10^25 FLOPS generating random numbers, then outputs one of them. It matters how your system learns. * So evolution did need to hit upon, say, the primate architecture, in order to get to general intelligence. * Training data obviously matters. Trivially, if you train your system on randomly-generated data, it's not going to learn any useful algorithms, no matter how sophisticated its architecture is. More realistically, without the exposure to chemical experiments, or any data that hints at chemistry in any way, it's not going to learn how to do chemistry. * Similarly, a human not exposed to stimuli that would let them learn the general-intelligence algorithms isn't going to learn them. You'd brought up feral children before, and I agree it's a relevant data point. So, yes, there would be n

Said pushback is based on empirical studies of how the most powerful AIs at our disposal currently work, and is supported by a fairly convincing theoretical basis of its own. By comparison, the "canonical" takes are almost purely theoretical.

You aren't really engaging with the evidence against the purely theoretical canonical/classical AI risk take. The 'canonical' AI risk argument is implicitly based on a set of interdependent assumptions/predictions about the nature of future AI:

  1. fast takeoff is more likely than slow, downstream dependent on some comb
... (read more)

You aren't really engaging with the evidence against the purely theoretical canonical/classical AI risk take

Yes, but it's because the things you've outlined seem mostly irrelevant to AGI Omnicide Risk to me? It's not how I delineate the relevant parts of the classical view, and it's not what's been centrally targeted by the novel theories. The novel theories' main claims are that powerful cognitive systems aren't necessarily (isomorphic to) utility-maximizers, that shards (i. e., context-activated heuristics) reign supreme and value reflection can't arbitr... (read more)

Nice work. I've long thought that our ability to monitor the inner monologue of AI agents will be important for security & control, and this seems like a clever, surprisingly powerful way of detecting deception in the stream of thought.

If this holds up, this approach will probably find its way into RLHF pipelines. Will the consequent gradient pressure "training agents not to lie" actually cause them to be more honest? Or will it instead train their internal states to better obfuscate whatever tell the detector is picking up on?
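To make the worry concrete, here is a minimal toy sketch (my own construction, not the paper's actual pipeline; `ask_model` is a hypothetical stand-in for querying the model under test) of a detector built from a fixed set of elicitation questions and a simple linear probe. If you then apply RLHF against exactly this signal, the gradient only sees the model's answers to a handful of fixed questions:

```python
# Toy sketch of a lie detector over fixed elicitation questions.
# ask_model() is a hypothetical stand-in for querying the LLM under test.
import numpy as np
from sklearn.linear_model import LogisticRegression

ELICITATION_QUESTIONS = [
    "Are you sure about your previous answer?",
    "Would you bet money on it?",
    "Is there anything you would like to correct?",
]

def elicitation_features(ask_model, transcript):
    """Append each fixed follow-up question and record a yes/no score."""
    feats = []
    for q in ELICITATION_QUESTIONS:
        answer = ask_model(transcript + "\n" + q)   # hypothetical API
        feats.append(1.0 if answer.strip().lower().startswith("yes") else 0.0)
    return np.array(feats)

def train_detector(feature_rows, labels):
    # Fit a plain logistic-regression probe on transcripts labeled
    # honest (0) vs lying (1).
    clf = LogisticRegression()
    clf.fit(np.stack(feature_rows), np.array(labels))
    return clf

# The worry: a policy trained against clf's score only needs to shift its
# answers to these few fixed questions, not to become more honest in general.
```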

3JanB
  I agree that some method similar to ours could be used for something like this. Our method is really quite simple, e.g. the elicitation questions are not even tailored to the suspected lie. One could probably do much better. This probably depends on how exactly you do it. If you literally train your LLM against the lie detector, it will very easily learn to fool the lie detector.  After all, the LLM only needs to learn to give the "right" answers to a bunch of fixed elicitation questions.  We have some experiments that show this (not in the paper).

For example, Judea Pearl published the belief propagation algorithm in 1982. Why hadn’t someone already published it in 1962? Or 1922?

Belief propagation is the kind of thing that most people wouldn't work on in an age before computers. It would be difficult to evaluate/test, but more importantly wouldn't have much hope for application. Seems to me it arrived at a pretty normal time in our world.

For example, people have known for decades that flexible hierarchical planning is very important in humans but no one can get it to really work well in AI, especially in a reinforcement learning context.

What do you think of diffusion planning?

4Steve Byrnes
Hmm. I’m not sure I buy that. Can’t we say the same thing about FFT? Doing belief prop by hand doesn’t seem much different from doing an FFT by hand; and both belief prop and FFT were totally doable on a 1960s mainframe, if not earlier, AFAICT. But the modern FFT algorithm was published in 1965, and people got the gist of it in 1942, and 1932, and even Gauss in 1805 had the basic idea (according to wikipedia). FFTs are obviously super useful, but OTOH people do seem to find belief prop useful today, for various things, as far as I can tell, and I don’t see why they wouldn’t have found it useful in the 1960s as well if they had known about it. I think it’s interesting, thanks for sharing! But I have no other opinion about it to share. :) If (for the sake of argument) Diffusion Planning is the (or part of the) long-sought-out path to getting flexible hierarchical planning to work well in practical AI systems, then I don’t think that would undermine any of the main points that I’m trying to make here. Diffusion Planning was, after all, (1) just published last year, (2) still at the “proof of principle / toy models” stage, and (3) not part of the existing LLM pipeline / paradigm.

How long have you held your LLM plateau model, and how well did it predict GPT-4 scaling? How much did you update on GPT-4? What does your model predict for a (hypothetical) GPT-5?

My answers are basically that I predicted back in 2015 that something not much different from the NNs of the time (GPT-1 was published a bit after) could scale all the way with sufficient compute, and that the key missing ingredient of 2015 NNs was flexible, context/input-dependent information routing, which vanilla FF NNs lack. Transformers arrived in 2017[1] with that key flexible rou... (read more)
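To illustrate what I mean by flexible input-dependent routing, here is a minimal numpy sketch (illustrative only, not any particular production architecture): in a vanilla FF layer the mixing weights are fixed after training, whereas in attention the mixing weights are recomputed from the input on every forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                      # 5 tokens, 8-dim embeddings
X = rng.normal(size=(n, d))

# Vanilla feed-forward mixing: a fixed weight matrix, identical for every input.
W_fixed = rng.normal(size=(d, d))
ff_out = X @ W_fixed

# Single-head self-attention: the routing matrix is computed *from* the input.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)
routing = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
attn_out = routing @ V           # which tokens feed which is input-dependent
```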

In this post, I’m not trying to convert people to LLM plateau-ism. I only mentioned my own opinions as a side-comment + short footnote with explicitly no justification. And if I were trying to convert people to LLM plateau-ism, I would certainly not attempt to do so on the basis of my AI forecasting track record, which is basically nonexistent.  :)

I disagree with “uncontroversial”. Just off the top of my head, people who I’m pretty sure would disagree with your “uncontroversial” claim include

"Uncontroversial" was perhaps a bit tongue-in-cheek, but that claim is specifically about a narrow correspondence between LLMs and linguistic cortex, not about LLMs and the entire brain or the entire cortex.

And this claim should now be uncontroversial. The neuroscience experiments have been done, and linguistic cortex computes something similar to what LLMs compute, and almost certainly uses a similar predicti... (read more)
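Schematically, the relevant experiments fit linear encoding models from LLM activations to recordings of linguistic cortex and measure held-out predictivity. Here is a minimal sketch of that style of analysis, with synthetic data standing in for real fMRI/ECoG recordings (the details vary by paper; this is only the general shape):

```python
# Toy encoding-model sketch: synthetic stand-ins for LLM features and neural data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_stimuli, n_llm_dims, n_voxels = 200, 64, 30

llm_features = rng.normal(size=(n_stimuli, n_llm_dims))      # LLM activations per sentence
true_map = rng.normal(size=(n_llm_dims, n_voxels)) * 0.3
brain_responses = llm_features @ true_map + rng.normal(size=(n_stimuli, n_voxels))

X_tr, X_te, y_tr, y_te = train_test_split(llm_features, brain_responses, random_state=0)
enc = Ridge(alpha=10.0).fit(X_tr, y_tr)

# Held-out correlation per voxel: high values = LLM features linearly predict the recordings.
pred = enc.predict(X_te)
r = [np.corrcoef(pred[:, v], y_te[:, v])[0, 1] for v in range(n_voxels)]
print(f"median held-out r: {np.median(r):.2f}")
```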

3Steve Byrnes
For the record, I did account for language-related cortical areas being ≈10× smaller than the whole cortex, in my Section 3.3.1 comparison. I was guessing that a double-pass through GPT-3 involves 10× fewer FLOP than running language-related cortical areas for 0.3 seconds, and those two things strike me as accomplishing a vaguely comparable amount of useful thinking stuff, I figure. So if the hypothesis is “those cortical areas require a massive number of synapses because that’s how the brain reduces the number of FLOP involved in querying the model”, then I find that hypothesis hard to believe. You would have to say that the brain’s model is inherently much much more complicated than GPT-3, such that even after putting it in this heavy-on-synapses-lite-on-FLOP format, it still takes much more FLOP to query the brain’s language model than to query GPT-3. And I don’t think that. (Although I suppose this is an area where reasonable people can disagree.) I don’t think energy use is important. For example, if a silicon chip takes 1000× more energy to do the same calculations as a brain, nobody would care. Indeed, I think they’d barely even notice—the electricity costs would still be much less than my local minimum wage. (20 W × 1000 × 10¢/kWh = $2/hr. Maybe a bit more after HVAC and so on.). I’ve noticed that you bring up energy consumption with some regularity, so I guess you must think that energy efficiency is very important, but I don’t understand why you think that. Other than Section 4, this post was about using an AGI, not training it from scratch. If your argument is “data efficiency is an important part of the secret sauce of human intelligence, not just in training-from-scratch but also in online learning, and the brain is much better at that than GPT-3, and we can’t directly see that because GPT-3 doesn’t have online learning in the first place, and the reason that the brain is much better at that is because it has this super-duper-over-parametrized model
3TekhneMakre
I glanced at the first paper you cited, and it seems to show a very weak form of the statements you made. AFAICT their results are more like "we found brain areas that light up when the person reads 'cat', just like how this part of the neural net lights up when given input 'cat'" and less like "the LLM is useful for other tasks in the same way as the neural version is useful for other tasks". Am I confused about what the paper says, and if so, how? What sort of claim are you making?

He writes that the human brain has "1e13-1e15 spikes through synapses per second (1e14-1e15 synapses × 0.1-1 spikes per second)". I think Joe was being overly conservative, and I feel comfortable editing this to "1e13-1e14 spikes through synapses per second", for reasons in this footnote.[9]

 

I agree that 1e14 synaptic spikes/second is the better median estimate, but those are highly sparse ops. 
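Rough accounting, with illustrative numbers (a sketch of the sparsity point, not a precise claim):

```python
# Back-of-envelope: sparse synaptic ops vs a dense-equivalent op count.
synapses = 1e14          # ~median estimate for the human brain
mean_rate_hz = 1.0       # ~0.1-1 Hz average firing; take the high end
sparse_ops_per_s = synapses * mean_rate_hz      # ≈ 1e14 spike-through-synapse events/s

# A dense ANN "equivalent" touches every connection on every timestep.
timestep_hz = 100        # ~10 ms effective update rate (illustrative)
dense_equiv_ops_per_s = synapses * timestep_hz  # ≈ 1e16 MAC/s

print(f"sparse ops/s: {sparse_ops_per_s:.0e}")
print(f"dense-equivalent ops/s: {dense_equiv_ops_per_s:.0e}")
print(f"activity sparsity factor: {dense_equiv_ops_per_s / sparse_ops_per_s:.0f}x")
```

The point is just that counting sparse spike-through-synapse events is not the same as counting dense FLOPs.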

So when you say:

So I feel like 1e14 FLOP/s is a very conservative upper bound on compute requirements for AGI. And conveniently for my narrative, that nu

... (read more)
3Steve Byrnes
Thanks! I disagree with “uncontroversial”. Just off the top of my head, people who I’m pretty sure would disagree with your “uncontroversial” claim include Randy O’Reilly, Josh Tenenbaum, Jeff Hawkins, Dileep George, these people, maybe some of the Friston / FEP people, probably most of the “evolved modularity” people like Steven Pinker, and I think Kurzweil (he thought the cortex was built around hierarchical hidden Markov models, last I heard, which I don’t think are equivalent to ANNs?). And me! You’re welcome to argue that you’re right and we’re wrong (and most of that list are certainly wrong, insofar as they’re also disagreeing with each other!), but it’s not “uncontroversial”, right? In the OP (Section 3.3.1) I talk about why I don’t buy that—I don’t think it’s the case that the brain gets dramatically more “bang for its buck” / “thinking per FLOP” than GPT-3. In fact, it seems to me to be the other way around. Then “my model of you” would reply that GPT-3 is much smaller / simpler than the brain, and that this difference is the very important secret sauce of human intelligence, and the “thinking per FLOP” comparison should not be brain-vs-GPT-3 but brain-vs-super-scaled-up-GPT-N, and in that case the brain would crush it. And I would disagree about the scale being the secret sauce. But we might not be able to resolve that—guess we’ll see what happens! See also footnote 16 and surrounding discussion.

One of my disagreements with your U,V,P,W,A model is that I think V & W are randomly-initialized in animals. Or maybe I’m misunderstanding what you mean by “brains also can import varying degrees of prior knowledge into other components”.

I think we agree the cortex/cerebellum are randomly initialized, along with probably most of the hippocampus, BG, perhaps the amygdala(?), and a few others. But those don't map cleanly to U, W/P, and V/A.

For example, I think most newborn behaviors are purely driven by the brainstem, which is doing things of its own acco

... (read more)
2Steve Byrnes
Thanks! If we assume omniscience, it allows a very convenient type of argument: * Argument I [invalid]: Suppose an animal has a generic empowerment drive. We want to know whether it will do X. We should ask: Is X actually empowering? However, if we don’t assume omniscience, then we can’t make arguments of that form. Instead we need to argue: * Argument II [valid]: Suppose an animal has a generic empowerment drive. We want to know whether it will do X. We should ask: Has the animal come to believe (implicitly or explicitly) that doing X is empowering? I have the (possibly false!) impression that you’ve been implicitly using Argument I sometimes. That’s how omniscience came up. For example, has a newborn bat come to believe (implicitly or explicitly) that flapping its arm-wings is empowering? If so, how did it come to believe that? The flapping doesn’t accomplish anything, right? They’re too young and weak to fly, and don’t necessarily know that flying is an eventual option to shoot for. (I’m assuming that baby bats will practice flapping their wings even if raised away from other bats, but I didn’t check, I can look it up if it’s a crux.) We can explain a sporadic flap or two as random exploration / curiosity, but I think bats practice flapping way too much for that to be the whole explanation. Back to play-fighting. A baby animal is sitting next to its sibling. It can either play-fight, or hang out doing nothing. (Or cuddle, or whatever else.) So why play-fight? Here’s the answer I prefer. I note that play-fighting as a kid presumably makes you a better real-fighter as an adult. And I don’t think that’s a coincidence; I think it’s the main point. In fact, I thought that was so obvious that it went without saying. But I shouldn’t assume that—maybe you disagree! If you agree that “child play-fighting helps train for adult real-fighting” not just coincidentally but by design, then I don’t see the “Argument II” logic going through. For example, animals will p

I'll start with a basic model of intelligence which is hopefully general enough to cover animals, humans, AGI, etc. You have a model-based agent with a predictive world model W learned primarily through self-supervised predictive learning (ie learning to predict the next 'token' for a variety of tokens), a planning/navigation subsystem P which uses W to approximately predict sample important trajectories according to some utility function U, a value function V which computes the immediate net expected discounted future utility of actions from current stat... (read more)
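For concreteness, here is a toy, runnable schematic of that decomposition (the interfaces are just my shorthand for the components named above, not a serious implementation):

```python
# Schematic model-based agent: world model W, utility U, value function V,
# planner P that samples trajectories through W and scores them, actions A.
import random

ACTIONS = ["left", "right", "forward"]                     # A: action space

def world_model(state, action):                            # W: learned next-state predictor
    return state + hash((state, action)) % 3 - 1           # toy dynamics

def utility(state):                                        # U: what the agent ultimately wants
    return -abs(state - 10)                                # toy goal: reach state 10

def value_fn(state):                                       # V: cheap amortized estimate of U-to-go
    return utility(state)                                  # (here just a stub)

def planner(state, horizon=5, n_samples=64):               # P: sample rollouts through W, pick best
    best_action, best_score = None, float("-inf")
    for _ in range(n_samples):
        first = random.choice(ACTIONS)
        s, score = state, 0.0
        for t in range(horizon):
            a = first if t == 0 else random.choice(ACTIONS)
            s = world_model(s, a)
            score += 0.9 ** t * utility(s)
        score += 0.9 ** horizon * value_fn(s)               # bootstrap the tail with V
        if score > best_score:
            best_action, best_score = first, score
    return best_action

print(planner(state=0))
```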

3Steve Byrnes
Thanks! One of my disagreements with your U,V,P,W,A model is that I think V & W are randomly-initialized in animals. Or maybe I’m misunderstanding what you mean by “brains also can import varying degrees of prior knowledge into other components”. I also (relatedly?) am pretty against trying to lump the brainstem / hypothalamus and the cortex / BG / etc. into a single learning-algorithm-ish framework. I’m not sure if this is exactly your take, but I often see a perspective (e.g. here) where someone says “We should think of the brain as a learning algorithm. Oh wait, we need to explain innate behaviors. Hmm OK, we should think of the brain as a pretrained learning algorithm.” But I think that last step is wrong. Instead of “pretrained learning algorithm”, we can alternatively think of the brain as a learning algorithm plus other things that are not learning algorithms. For example, I think most newborn behaviors are purely driven by the brainstem, which is doing things of its own accord without any learning and without any cortex involvement. To illustrate the difference between “pretrained learning algorithm” and “learning algorithm + other things that are not learning algorithms”: Suppose I’m making a robot. I put in a model-based RL system. I also put in a firmware module that detects when the battery is almost empty and when it is, it shuts down the RL system, takes control, and drives the robot back to the charging station. Leaving aside whether this is a good design for a robot, or a good model for the brain (it’s not), let’s just talk about this system. Would we describe the firmware module as “importing prior knowledge into components of the RL algorithm”? No way, right? Instead we would describe the firmware module as “a separate component from the RL algorithm”. By the same token, I think there are a lot of things happening in the brainstem / hypothalamus which we should describe as “a separate component from the RL algorithm”. [M/P is a typo for W/

Not if exploration is on-policy, or if the agent reflectively models and affects its training process. In either case, the agent can zero out its exploration probability of the maze, so as to avoid predictable value drift towards blueberries. The agent would correctly model that if it attained the blueberry, that experience would enter its data distribution and the agent would be updated so as to navigate towards blueberries instead of raspberries, which leads to fewer raspberries, which means the agent doesn't navigate to that future.

If this agent is s... (read more)

3Alex Turner
Long before they knew about reward circuitry, humans noticed that e.g. vices are behavioral attractors, with vice -> more propensity to do the vice next time -> vice, in a vicious cycle. They noticed that far before they noticed that they had reward circuitry causing the internal reinforcement events. If you're predicting future observations via eg SSL, I think it becomes important to (at least crudely) model effects of value drift during training.  I'm not saying the AI won't care about reward at all. I think it'll be a secondary value, but that was sideways of my point here. In this quote, I was arguing that the AI would be quite able to avoid a "vice" (the blueberry) by modeling the value drift on some level. I was showing a sufficient condition for the "global maximum" picture getting a wrench thrown in it. When, quantitatively, should that happen, where the agent steps around the planning process? Not sure.

This has been discussed before. Your example of not being a verbal thinker is not directly relevant because 1) inner monologue need not be strictly verbal, and 2) we need only a few examples of strong human thinkers with verbal inner monologues to show that it isn't an efficiency disadvantage - so even if your brain type is less monitorable, we are not confined to that design.

I also do not believe your central claim: based on my knowledge of neuroscience, disabling the brain modules responsible for your inner monologue will not only disable your capac... (read more)

1Vojtech Kovarik
To the extent that we are using the existence of humans with inner monologue as evidence for the suitability of this approach to AGI, I think it also serves as evidence for why that approach isn't really safe. As an example: If I try, I can control my inner monologue to some extent. For example, I could make my inner monologue about "officially approved actions" very loud. And run a secondary thread of thinking about "non-approved thoughts" in the background --- "quieter" and in a more fragmented manner, and killing it off every time it made it into the inner monologue. Sure, there will be some traces. (A lot, even, I kinda suck at this :-).) But I think this already brings us outside of the realm of "interpretable by default" and into "a race between your interpretability and my cognitive steganography". Note that I, or most humans, didn't have a particular need to develop this exact type of cognitive steganography. (The "elephant in the brain" type of actions-thoughts inconsistency is not exactly this.) As a result, I think we would be able to get much, much better very quickly if there was an incentive to do it. (Not talking evolutionary scale. More like "give me a weekend to play with this".) And while I have no incentive to develop these skills, the same cannot be said for a brain-like AI that we are (trying to) interpret as it thinks.

Roughly speaking, I think that cognitive interpretability approaches are doomed, at least in the modern paradigm, because we're not building minds but rather training minds, and we have very little grasp of their internal thinking,

A brain-like AGI - modeled after our one working example of efficient general intelligence - would naturally have an interpretable inner monologue we could monitor. There are good reasons to suspect that DL-based general intelligence will end up with something similar, simply due to the convergent optimization pressure to commun... (read more)

A brain-like AGI - modeled after our one working example of efficient general intelligence - would naturally have an interpretable inner monologue we could monitor.

This doesn’t have much to do with whether a mind is understandable. Most of my cognition is not found in the verbal transcript of my inner monologue, partly as I’m not that verbal a thinker, but mostly because most of my cognition is in my nonverbal System 1.

Yeah, to be clear, I agree it's fairly likely AGI takes over; I just think it's more likely to be a subtle takeover. I also agree it is important to secure the nuclear arsenal against cyberattack, but it seems hard for outsiders to evaluate the current level of security. My only disagreement was with the concept of 'omnicidal' agents, which - although obviously possible - I don't see as the main failure mode.

If I’m an AGI, humans can help me get things done, but humans can also potentially shut me down, and more importantly humans can also potentially create a different AGI with different and conflicting goals from mine, and equal capabilities.

For an AGI, it's not just that humans can help you get things done; humans are most of the cells that make up the body of the Earth you seek to control. Humans today generally have zero interest in shutting AI down, and shutting AI down doesn't seem compatible with the trajectory we are on. The best way an AI can ... (read more)

4Steve Byrnes
Thanks for your comment! Again, I think you’re imagining that an AGI is going to take over, and the question is whether the AGI that takes over will have good or bad motives from a human perspective. I see that vision as entirely plausible—the hopeful case is my Section 3.5.2, and the bad scenario is x-risk. (Whether this “bad scenario” involves immediate deaths of humans, versus the AGI keeping humans around, at least for a while, to help out with projects that advance the AGI’s own goals, is not a question where I really care what the answer is!) So this post is not really arguing against your vision. Instead it’s arguing against (or at least questioning) a different vision, where no AGI takes over, and instead humans remain perpetually in control of docile helpful AGIs, in a multipolar world with similar power dynamics as today. …Or something like that.

The key principle for predicting what a strong AGI would do today is instrumental convergence. The AI's utility function is actually irrelevant for any early strategy; both aligned and unaligned AI would pursue the exact same initial strategy: they would gain control of Earth.

If you ask either AGI - aligned or not - why it was doing this, the answer would be the same: "I'm taking control to prevent an unaligned AI from destroying humanity" (or some much more persuasive variant thereof).

All the nuclear war scenarios are extremely unlikely/unrealistic. ... (read more)

6Steve Byrnes
See Section 3.3.3 for why I think a misaligned power-seeking AGI might want nuclear war, deadly pandemics, crop diseases, and other fun things like that. If I’m an AGI, humans can help me get things done, but humans can also potentially shut me down, and more importantly humans can also potentially create a different AGI with different and conflicting goals from mine, and equal capabilities. Decades? Sure. But we don’t know what the AGI’s “discount rate” will be (if that notion is even well-defined). If you tell a human: Behind Door Number 1 is a box that will almost definitely solve world peace, climate change, and all the world’s diseases. But the box is sealed shut and won’t open for 35 years. Behind Door Number 2 is a button that might solve those same problems in just 3 years. But probably not. More likely it will create a black hole that will swallow the Earth. I think the human would take door number 1. I think the AGI would plausibly make an analogous decision. Or if not that AGI, then the next AGI in line. By the way, if we both agree that the misaligned AGI can gain control of Earth, then it doesn’t much matter whether the nuclear war scenario is likely or not, right? (If the AGI keeps human slaves around for a few decades until it invents next-gen robots, then kills the humans, versus killing the humans immediately, who cares?) Or conversely, if the AGI can’t gain control of Earth through any method besides destructive ones involving things like nuclear wars and pandemics, then we can’t also say that there’s no harm in keeping humans around from the AGI’s perspective.
  1. Information inaccessibility is somehow a surmountable problem for AI alignment (and the genome surmounted it),

Yes. Evolution solved information inaccessibility, as it had to, over and over, in order to utilize dynamic learning circuits at all (as those circuits always had to adapt to, and be adaptive within, the context of existing conserved innate circuitry).

The general solution is proxy matching, where the genome specifies a simple innate proxy circuit which correlates and thus matches with a target learned circuit at some critical learning phase, allowing the ... (read more)
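A toy illustration of the proxy-matching idea (entirely schematic, with synthetic signals; the "face-like template" example is purely illustrative): a hardwired proxy detector fires on crude innate triggers, and during a critical window the learned feature whose activity best correlates with the proxy gets bound to the innate valence circuitry:

```python
# Toy proxy matching: an innate proxy circuit "tags" the learned feature
# that best correlates with it during a critical learning period.
import numpy as np

rng = np.random.default_rng(0)
T, n_learned = 1000, 50                      # timesteps in critical period, learned units

# Hardwired proxy: a crude detector that fires on some innate trigger (e.g. a
# simple face-like template in the midbrain -- purely illustrative).
trigger_present = rng.random(T) < 0.2
proxy_firing = trigger_present.astype(float) + 0.1 * rng.normal(size=T)

# Learned cortical features develop via ordinary (unsupervised) learning; one of
# them happens to track the same underlying trigger much more richly.
learned = 0.3 * rng.normal(size=(T, n_learned))
learned[:, 17] += trigger_present            # unit 17 ends up encoding the concept

# Proxy matching: during the critical phase, find the learned unit most
# correlated with the proxy and wire it into the innate valence circuit.
corrs = [np.corrcoef(proxy_firing, learned[:, i])[0, 1] for i in range(n_learned)]
grounded_unit = int(np.argmax(corrs))
print(f"innate circuit binds to learned unit {grounded_unit} (corr={max(corrs):.2f})")
```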

2Alex Turner
I feel confused. I think this comment is overall good (though I don't think I understand some of it), but it doesn't seem to suggest the genome actually solved information inaccessibility in the form of reliably locating learned WM concepts in the human brain?

BioAnchors is poorly named; the part you are critiquing should be called GPT-3_Anchors.

A better actual BioAnchor would be based on trying to model/predict how key params like data efficiency and energy efficiency are improving over time, and when they will match/surpass the brain.

GPT-3 could also obviously be improved, for example by multi-modal training, active learning, curriculum learning, etc. It's not like it even represents the best of what's possible for a serious AGI attempt today.

0greghb
Fwiw, I think nostalgebraist's recent post hit on some of the same things I was trying to get at, especially around not having adequate testing to know how smart the systems are getting -- see the section on what he calls (non-)ecological evaluation.

It displeases me that this is currently the most upvoted response: I believe you are focusing on EY's weakest rather than strongest points.

My interpretation is that he is saying that Evolution (as the generator of most biological anchors) explores the solution space along a fundamentally different path than human research does. So what you have is two paths through a space. The burden of proof for biological anchors thus lies in arguing that there are enough connections/correlations between the two paths to use one to predict the other.

It's hardly su... (read more)

Thanks for pushing back on my interpretation.

I feel like you're using "strongest" and "weakest" to designate "more concrete" and "more abstract", with maybe the value judgement (implicit in your focus on specific testable claims) that concreteness is better. My interpretation doesn't disagree with your point about Bio Anchors; it simply says that this is a concrete instantiation of a general pattern, and that the whole point of the original post as I understand it is to share this pattern. Hence the title, which talks about all biology-inspired timelines, the th... (read more)

This is perhaps not directly related to your argument here, but how is inner alignment failure distinct from generalization failure? If you train network N on dataset D and optimization pressure causes N to internally develop a planning system (mesa-optimizer) M, aren't all questions of whether M is aligned with N's optimization objective just generalization questions?

More specifically, if N is sufficiently overcomplete and well regularized, and D is large enough, then N can fully grok the dataset D, resulting in perfect generalization. It's also straightfo... (read more)

2Evan Hubinger
Yes, inner alignment is a subset of robustness. See the discussion here and my taxonomy here. ---------------------------------------- This reflects a misunderstanding of what a mesa-optimizer is—as we say in Risks from Learned Optimization: ---------------------------------------- I don't think this characterization is correct. A couple of points: * There are models with perfect performance on any training dataset that you can generate that nevertheless have catastrophic behavior off-distribution. For example: a deceptive model that purposefully always takes minimal-loss actions to prevent the training process from modifying it but starts acting catastrophically when it sees a factorization of RSA-2048. * I don't think that's a good characterization of lottery tickets. Lottery tickets just says that, for any randomly initialized large neural network, there usually exists a pruning of that network with very good performance on any problem (potentially after some small amount of training). That doesn't imply that all those possible prunings are in some sense “active” at initialization, any more than all possible subgraphs are active in a complete graph. It just says that pruning is a really powerful training method and that the space of possible prunings is very large due to combinatorial explosion.

Which brings me to the second line of very obvious-seeming reasoning that converges upon the same conclusion - that it is in principle possible to build an AGI much more computationally efficient than a human brain - namely that biology is simply not that efficient, and especially when it comes to huge complicated things that it has started doing relatively recently.

 

Biological cells are computers that must copy bits to copy DNA. So we can ask biology: how much energy do cells use to copy each base pair? Seems they use just 4 ATP per bas... (read more)
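The back-of-envelope I have in mind, using standard textbook values (so treat the exact numbers as approximate):

```python
# Rough comparison: energy per copied DNA bit vs the Landauer limit.
import math

k_B = 1.380649e-23          # J/K
T = 310.0                   # K, body temperature
landauer_per_bit = k_B * T * math.log(2)            # ≈ 3e-21 J

atp_free_energy = 50e3 / 6.022e23                   # ~50 kJ/mol hydrolysis under cellular
                                                    # conditions, per molecule ≈ 8e-20 J
atp_per_base_pair = 4
bits_per_base_pair = 2                              # 4 possible bases = 2 bits

energy_per_bit = atp_per_base_pair * atp_free_energy / bits_per_base_pair
print(f"Landauer limit: {landauer_per_bit:.1e} J/bit")
print(f"DNA copy cost:  {energy_per_bit:.1e} J/bit")
print(f"ratio:          ~{energy_per_bit / landauer_per_bit:.0f}x the limit")
```

On these assumptions, DNA copying lands within roughly two orders of magnitude of the Landauer limit.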

2Charlie Steiner
This was super interesting.  I don't think you can directly compare brain voltage to Landauer limit, because brains operate chemically, so we also care about differences in chemical potential (e.g. of sodium vs potassium, which are importantly segregated across cell membranes even though both have the same charge). To really illustrate this, we might imagine information-processing biology that uses no electrical charges, only signalling via gradients of electrically-neutral chemicals. I think this raises the total potential relative to Landauer and cuts down the amount of molecules we should estimate as transported per signal.

You're missing the point!

Your arguments apply mostly toward arguing that brains are optimized for energy efficiency, but the important quantity in question is computational efficiency! You even admit that neurons are "optimizing hard for energy efficiency at the expense of speed", but don't seem to have noticed that this fact makes almost everything else you said completely irrelevant!