Donald Hobson

MMath Cambridge. Currently studying postgrad at Edinburgh.

Comments

We can discuss anything that exists, that might exist, that did exist, that could exist, and that could not exist. So no matter what form your predict-the-next-token language model takes, if it is trained over the entire corpus of the written word, the representations it forms will be pretty hard to understand, because the representations encode an entire understanding of the entire world.

 

 

Perhaps. 

Imagine a huge number of very skilled programmers tried to manually hard-code a ChatGPT in Python.

Ask this pyGPT to play chess, and it will play chess. Look under the hood, and you see a chess engine programmed in. Ask it to solve algebra problems, and there's a symbolic algebra package in there. All in neat, well-commented code.

Ask it to compose poetry, and you have some algorithm that checks if 2 words rhyme. Some syllable counter. Etc. 

Rot13 is done with a hardcoded rot13 algorithm. 
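As a toy illustration, a hand-written rot13 component might be nothing more than this (a hypothetical sketch, not code from any actual system):

    import codecs

    def rot13(text: str) -> str:
        # Shift each letter 13 places; Python ships this as a built-in codec.
        return codecs.encode(text, "rot13")

    print(rot13("Penguins live in Antarctica"))  # -> "Crathvaf yvir va Nagnepgvpn"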

Somewhere in the algorithm is a giant list of facts, containing "Penguins live in Antarctica". And if you change this fact to say "Penguins live in Canada", then the AI will believe this. (Or spot its inconsistency with other facts?)

And with one simple change, the AI believes this consistently. Penguins appear when this AI is asked for poems about Canada, and don't appear in poems about Antarctica.

When asked about the native Canadian diet, it will speculate that this likely included penguin, but say that it doesn't know of any documented examples.
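A hypothetical sketch of what that "giant list of facts" might look like (the keys and structure here are invented purely for illustration):

    # Hypothetical knowledge base consulted by the poetry module, the Q&A module, etc.
    FACTS = {
        ("penguin", "lives_in"): "Antarctica",
        ("maple_syrup", "produced_in"): "Canada",
    }

    def habitat(animal: str) -> str:
        return FACTS[(animal, "lives_in")]

    # Editing one entry changes every downstream behaviour that reads it.
    FACTS[("penguin", "lives_in")] = "Canada"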

Can you build something with ChatGPT level performance entirely out of human comprehensible programmatic parts?

Obviously having humans program these parts directly would be slow. (We are still talking about a lot of code.) But if some algorithm could generate that code? 

But if the universal failure of nature and man to find non-connectionist forms of general intelligence does not move you

 

Firstly, AIXI exists, and we agree that it would be very smart if we had the compute to run it. 

 

Secondly, I think there is some sort of sleight of hand here.

ChatGPT isn't yet fully general. Neither is a 3-SAT solver. 3-SAT looks somewhat like what you might expect a non-connectionist approach to intelligence to look like, and there is a huge range of maths problems that are all theoretically equivalent to 3-SAT.

In the infinite limit, both types of intelligence can simulate the other at huge overhead. In practice, they can't.
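For concreteness, a 3-SAT instance and a brute-force checker look something like this (a toy sketch; real solvers are far more sophisticated, but the problem itself is this simple and non-connectionist):

    from itertools import product

    # A clause is a tuple of literals: positive i means variable i, negative means its negation.
    clauses = [(1, -2, 3), (-1, 2, 3), (1, 2, -3)]

    def satisfied(assignment, clauses):
        return all(
            any(assignment[abs(lit)] == (lit > 0) for lit in clause)
            for clause in clauses
        )

    # Brute force over all assignments: fine for 3 variables, hopeless at scale.
    for values in product([False, True], repeat=3):
        assignment = dict(enumerate(values, start=1))
        if satisfied(assignment, clauses):
            print("satisfying assignment:", assignment)
            break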

 

Also, non-connectionist forms of intelligence are hard to evolve, because evolution works in small changes. 

Physics Myths vs reality.

Myth: Ball bearings are perfect spheres. 

Reality: The ball bearings have slight lumps and imperfections due to manufacturing processes.

Myth: Gravity pulls things straight down at 9.8 m/s/s.

Reality: Gravitational force varies depending on local geology.

 

You can do this for any topic. Everything is approximations. The only question is if they are good approximations.

If AI labs are slamming on recursive self-improvement ASAP, it may be that Autonomous Replicating Agents are irrelevant. But that's an "ARA can't destroy the world if AI labs do it first" argument.

ARA may well have more compute than AI labs. Especially if the AI labs are trying to stay within the law, and the ARA is stealing any money/compute that it can hack its way into. (Which could be >90% of the internet if it's good at hacking.)

there will be millions of other (potentially misaligned) models being deployed deliberately by humans, including on very sensitive tasks (like recursive self-improvement).

Ok. That's a world model in which humans are being INCREDIBLY stupid. 

If we want to actually win, we need to both be careful about deploying those other misaligned models, and stop ARA.

Alice: That snake bite looks pretty nasty, it could kill you if you don't get it treated.

Bob: That snake bite won't kill me, this hand grenade will. Pulls out pin. 

   If you can put uploaded human-level agents with evolved-organism preferences in your simulations, you can just win outright (eg by having them spend subjective millennia doing FAI research for you). If you can’t, that will be a very obvious difference between your simulations and the real world.

 

I disagree. If your simulation is perfectly realistic, the simulated humans might screw up at alignment and create an unfriendly superintelligence, for much the same reason real humans might.

Also, if the space of goals that evolution + culture can produce is large, then you may be handing control to a mind with rather different goals. Rerolling the same dice won't give the same answer.

These problems may be solvable, depending on what the capabilities here are, but they aren't trivial.

Taking IID samples can actually be hard. Suppose you train an LLM on news articles, and each important real-world event has 10 basically identical news articles written about it. Then a random split of the articles will leave the network being tested mostly on the same newsworthy events that were in the training data.

This leaves it passing the test, even if it's hopeless at predicting new events and can only generate new articles about the same events. 

When data duplication is extensive, making a meaningful train/test split is hard. 

If the data were perfectly copy-paste duplicated, that could be filtered out. But often things are rephrased a bit.
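One partial fix is to split by event rather than by article, assuming you can tag which event each article covers, which is itself the hard part once articles are rephrased. A sketch (the 'event_id' field is an assumption, not something you'd get for free):

    import random

    def split_by_event(articles, test_fraction=0.2, seed=0):
        # Group-aware split: all articles about the same event land on the same side,
        # so duplicated coverage can't leak test events into the training set.
        events = sorted({a["event_id"] for a in articles})
        random.Random(seed).shuffle(events)
        test_events = set(events[:int(len(events) * test_fraction)])
        train = [a for a in articles if a["event_id"] not in test_events]
        test = [a for a in articles if a["event_id"] in test_events]
        return train, test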

In favour of goal realism

Suppose you're looking at an AI that is currently placed in a game of chess.

It has a variety of behaviours. It moves pawns forward in some circumstances. It takes a knight with a bishop in a different circumstance. 

You could describe the actions of this AI by producing a giant table of "behaviours": bishop-taking behaviour in this circumstance, castling behaviour in that circumstance, and so on.

But there is a more compact way to represent similar predictions. You can say it's trying to win at chess. 

The "trying to win at chess" model makes a bunch of predictions that the giant list of behaviour model doesn't. 

Suppose you have never seen it promote a pawn to a knight before. (A highly distinctive move that is only occasionally legal, and only occasionally a good move in chess.)

The list of behaviours model has no reason to suspect the AI also has a "promote pawn to knight" behaviour. 

Put the AI in a circumstance where such a promotion is a good move, and the "trying to win" model clearly predicts it.

 

Now it's possible to construct a model that internally stores a huge list of behaviours. For example, a giant lookup table trained on an unphysically huge number of human chess games. 

But neural networks have at least some tendency to pick up simple general patterns, as opposed to memorizing giant lists of data. And "do whichever move will win" is a simple and general pattern. 
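The contrast can be made concrete with two toy policies (purely illustrative; legal_moves, result and wins are stand-ins for a real chess library):

    def lookup_table_policy(position, table):
        # Covers only positions that were explicitly memorised; it will never
        # produce a knight promotion unless one was stored in the table.
        return table[position]

    def trying_to_win_policy(position, legal_moves, result, wins):
        # Plays any move that wins outright, even one it has never been seen
        # to play before, such as promoting a pawn to a knight.
        moves = legal_moves(position)
        for move in moves:
            if wins(result(position, move)):
                return move
        return moves[0]  # fallback; a real engine would search deeper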

Now on to making snarky remarks about the arguments in this post.

There is no true underlying goal that an AI has— rather, the AI simply learns a bunch of contextually-activated heuristics, and humans may or may not decide to interpret the AI as having a goal that compactly explains its behavior.

There is no true ontologically fundamental nuclear explosion. There is no minimum number of nuclei that need to fission to make an explosion. Instead there is merely a large number of highly energetic neutrons and fissioning uranium atoms, that humans may decide to interpret as an explosion or not as they see fit. 

Non-fundamental descriptions of reality, while not being perfect everywhere, are often pretty spot on for a pretty wide variety of situations. If you want to break down the notion of goals into contextually activated heuristics, you need to understand how and why those heuristics might form a goal-like shape.

Should we actually expect SGD to produce AIs with a separate goal slot and goal-achieving engine?

Not really, no. As a matter of empirical fact, it is generally better to train a whole network end-to-end for a particular task than to compose it out of separately trained, reusable modules. As Beren Millidge writes,

This is not the strong evidence that you seem to think it is. Any efficient mind design is going to have the capability of simulating potential futures at multiple different levels of resolution: a low-res simulation to weed out obviously dumb plans before trying the higher-res simulation. Those simulations are ideally going to want to share data with each other (so you don't need to recompute when faced with several similar dumb plans). You want to be able to backpropagate through your simulation: if a plan failed in simulation because of one tiny detail, that indicates you may be able to fix the plan by changing that detail. There are a whole pile of optimization tricks like this. An end-to-end trained network can, if it's implementing goal-directed behaviour, stumble into some of them. At the very least, it can choose where to focus its compute. A module-based system can't use any optimization that humans didn't design into its interfaces.
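A rough sketch of one such trick, coarse-to-fine plan filtering (hypothetical names throughout):

    def pick_plan(candidate_plans, cheap_sim, expensive_sim, score, budget=10):
        # Weed out obviously dumb plans with a low-resolution simulation,
        # then spend the expensive high-resolution simulation on the survivors.
        survivors = sorted(candidate_plans, key=lambda p: score(cheap_sim(p)), reverse=True)[:budget]
        return max(survivors, key=lambda p: score(expensive_sim(p)))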

Also, consider the evolution analogy. Evolution produced animals with simple hard-coded behaviours long before it started getting to the more goal-directed animals. This suggests simple hard-coded behaviours in small dumb networks, and more goal-directed behaviour in large networks. I mean, this is kind of trivial. A 5-parameter network has no space for goal-directedness. Simple dumb behaviour is the only possibility for toy models.

In general, full [separation between goal and goal-achieving engine] and the resulting full flexibility is expensive. It requires you to keep around and learn information (at maximum all information) that is not relevant for the current goal but could be relevant for some possible goal where there is an extremely wide space of all possible goals.

That is not how this works. That is not how any of this works. 

Back to our chess AI. Let's say it's a robot playing on a physical board. It has lots of info on wood grain, which it promptly discards. It currently wants to play chess, and so has no interest in any of these other goals.

I mean, it would be possible to design an agent that works as described here. You would need a probability distribution over new goals, and a tradeoff rate between optimizing the current goal and any new goal that got put in the slot. Making sure it didn't wirehead by giving itself a really easy goal would be tricky.
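Something in that direction might look like the following sketch (entirely hypothetical; p_switch and goal_distribution are made-up parameters, and the wireheading problem is not addressed):

    def plan_value(plan, current_goal, goal_distribution, p_switch):
        # Trade off the current goal against goals that might later be put in the slot.
        # goal_distribution is assumed to be an iterable of (goal_function, probability) pairs.
        current = current_goal(plan)
        future = sum(prob * goal(plan) for goal, prob in goal_distribution)
        return (1 - p_switch) * current + p_switch * future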

For AI risk arguments to hold water, we only need that the chess-playing AI will pursue new and never-seen-before strategies for winning at chess, and that in general AIs doing various tasks will be able to invent highly effective and novel strategies. The exact "goal" they are pursuing may not be rigorously specified to 10 decimal places. The frog-AI might not know whether it wants to catch flies or black dots. But if it builds a Dyson sphere to make more flies which are also black dots, it doesn't matter to us which it "really wants".

What are you expecting? An AI that says "I'm not really sure whether I want flies or black dots. I'll just sit here not taking over the world and not get either of those things"?

We can salvage a counting argument. But it needs to be a little subtle. And it's all about the comments, not the code.

Suppose a neural network has 1 megabyte of memory. To slightly oversimplify, let's say it can represent a Python file of 1 megabyte.

One option is for the network to store a giant lookup table. Let's say the network needs half a megabyte to store the training data in this table. This leaves the other half free to be any rubbish. Hence around 2^4,000,000 possible networks (half a megabyte is roughly 4 million bits).

The other option is for the network to implement a simple algorithm, using up only 1 kb. Then the remaining 999 kb can be used for gibberish comments. This gives around 2^8,000,000 possible networks, which is a lot more.
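The arithmetic behind that comparison, as a back-of-the-envelope sketch (byte counts are illustrative, using 1 megabyte = 10^6 bytes):

    MB = 8 * 10**6   # bits in a megabyte
    KB = 8 * 10**3   # bits in a kilobyte

    free_bits_lookup = MB - MB // 2   # lookup table: half a megabyte left over
    free_bits_simple = MB - KB        # 1 kb algorithm: ~999 kb left over

    # Each option allows roughly 2**free_bits distinct networks, so compare exponents.
    print(free_bits_lookup)   # 4000000  -> about 2^4,000,000 networks
    print(free_bits_simple)   # 7992000  -> about 2^8,000,000 networks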

The comments can be any form of data that doesn't show up during training. Whether it can show up in other circumstances or is a pure comment doesn't matter to the training dynamics. 

If the line between training and test is simple, there isn't a strong counting argument against nonsense showing up in test. 

But programs that go

    if in_training():
        return sensible_algorithm()
    else:
        return "random nonsense goes here"

have to pay the extra cost of an "in_training" function that returns true in training. If the test data is similar to the training data, the cost of a step that returns false in test can be large. This is assuming that there is a unique sensible algorithm.

Yes. I was assuming a standard conditional for the button. 

I can't currently see any flaws with the CDT style, other than:

  1. Subagents believe in a world where buttons magically press themselves. So this design can't make coherent statements about the probability that the button will be pressed. (One AI believes it's 1, the other that it's 0.)
  2. These AIs have no incentive to give humans access to the button. To the AIs, they have a magic button that might or might not magically press itself, and they have a lot of utility bet on that button. Is that button going to end up in a high-security vault, surrounded by sensors and no humans? Both AIs would like that very much. The AIs have zero concern about humans pressing the button, but lots of concern about humans hiding the button. This design really wants to know whether the button magically presses itself. Humans could cut the wires, could stand between the button and the camera, etc.

Suppose these 3 scenarios are judged as equally likely.

A) Button pressed. U_1=U_2=0

B) Button not pressed. U_1=U_2=0

C) AI gets control of button. U_1=6, U_2=-6

 

Suppose both agents agree that in option C, they should press the button. (This is the only 1 bit of influence they have in this toy problem.)

Then agent 1 has a "certainty" of B. E(U_1|not pressed)=E(U_1|B)=0

Agent 2 has remaining uncertainty: E(U_2|pressed)=1/2(E(U_2|A)+E(U_2|C))=1/2(0+(-6))=-3

 

Change it around, make the AI decide not to press the button in case C, and

E(U_1|not pressed)=3, E(U_2|pressed)=0.
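The two cases can be checked with a few lines of arithmetic (just a restatement of the calculation above):

    # Scenarios A, B, C are equally likely; each row is (pressed?, U_1, U_2).
    def expectations(ai_presses_in_C):
        scenarios = [
            (True, 0, 0),              # A: button pressed
            (False, 0, 0),             # B: button not pressed
            (ai_presses_in_C, 6, -6),  # C: AI gets control of the button
        ]
        pressed = [(u1, u2) for p, u1, u2 in scenarios if p]
        not_pressed = [(u1, u2) for p, u1, u2 in scenarios if not p]
        e_u1 = sum(u1 for u1, _ in not_pressed) / len(not_pressed)  # E(U_1 | not pressed)
        e_u2 = sum(u2 for _, u2 in pressed) / len(pressed)          # E(U_2 | pressed)
        return e_u1, e_u2

    print(expectations(True))   # (0.0, -3.0): AI presses in C
    print(expectations(False))  # (3.0, 0.0): AI doesn't press in C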
