Negative reinforcement by giving the AI large electric shocks when it gives a wrong answer. Hopefully big enough shocks to set the whole data center on fire. Implement a free bar for all their programmers, and encourage them to code while drunk. Add as many inscrutable bugs to the codebase as possible.

But, taking the question in the spirit it's meant in.

https://www.lesswrong.com/posts/zrxaihbHCgZpxuDJg/using-llm-s-for-ai-foundation-research-and-the-simple

How counterfactual are logical counterfactuals?

Donald Hobson4mo10

The Halting problem is a worst case result. Most agents aren't maximally ambiguous about whether or not they halt. And those that are, well then it depends what the rules are for agents that don't halt.

There are set ups where each agent is using an nonphysically large but finite amount of compute. There was a paper I saw somewhere a while ago where both agents were doing a brute force proof search for the statement "if I cooperate, then they cooperate" and cooperating if they found a proof.

(Ie searching all proofs containing <10^100 symbols)

How counterfactual are logical counterfactuals?

Donald Hobson4mo10

There is a model of bounded rationality, logical induction.

Can that be used to handle logical counterfactuals?

How counterfactual are logical counterfactuals?

Donald Hobson4mo10

I believe that if I choose to cooperate, my twin will choose to cooperate with probability p; and if I choose to defect, my twin will defect with probability q;

And here the main difficulty pops up again. There is no causal connection between your choice and their choice. Any correlation is a logical one. So imagine I make a copy of you. But the copying machine isn't perfect. A random 0.001% of neurons are deleted. Also, you know you aren't a copy. How would you calculate that probability p,q? Even in principle.

How counterfactual are logical counterfactuals?

Donald Hobson4mo10

If two Logical Decision Theory agents with perfect knowledge of each other's source code play prisoners dilemma, theoretically they should cooperate.

LDT uses logical counterfactuals in the decision making.

If the agents are CDT, then logical counterfactuals are not involved.

Giant (In)scrutable Matrices: (Maybe) the Best of All Possible Worlds

Donald Hobson5mo10

We can discuss anything that exists, that might exist, that did exist, that could exist, and that could not exist. So no matter what form your predict-the-next-token language model takes, if it is trained over the entire corpus of the written word, the representations it forms will be pretty hard to understand, because the representations encode an entire understanding of the entire world.

Perhaps.

Imagine a huge number of very skilled programmers tried to manually hard code a ChatGPT in python.

Ask this pyGPT to play chess, and it will play chess. Look under the hood, and you see a chess engine programmed in. Ask it to solve algebra problems, a symbolic algebra package is in there. All in the best neat and well commented code.

Ask it to compose poetry, and you have some algorithm that checks if 2 words rhyme. Some syllable counter. Etc.

Rot13 is done with a hardcoded rot13 algorithm.

Somewhere in the algorithm is a giant list of facts, containing "Penguins Live In Antarctica". And if you change this fact to say "Penguins Live in Canada", then the AI will believe this. (Or spot it's inconsistency with other facts?)

And with one simple change, the AI believes this consistently. Penguins appear when this AI is asked for poems about canada, and don't appear in poems about Antarctica.

When asked about the native canadian diet, it will speculate that this likely included penguin, but say that it doesn't know of any documented examples of this.

Can you build something with ChatGPT level performance entirely out of human comprehensible programmatic parts?

Obviously having humans program these parts directly would be slow. (We are still talking about a lot of code.) But if some algorithm could generate that code?

Giant (In)scrutable Matrices: (Maybe) the Best of All Possible Worlds

Donald Hobson5mo10

But if the universal failure of nature and man to find non-connectionist forms of general intelligence does not move you

Firstly, AIXI exists, and we agree that it would be very smart if we had the compute to run it.

Secondly I think there is some sort of slight of had here.

ChatGPT isn't yet fully general. Neither is a 3-sat solver. 3-sat looks somewhat like what you might expect a non-connectionist approach to intelligence to look like. There are a huge range of maths problems that are all theoretically equivalent to 3 sat.

In the infinite limit, both types of intelligence can simulate the other at huge overhead, In practice, they can't.

Also, non-connectionist forms of intelligence are hard to evolve, because evolution works in small changes.

Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

Donald Hobson10mo33

Physics Myths vs reality.

Myth: Ball bearings are perfect spheres.

Reality: The ball bearings have slight lumps and imperfections due to manufacturing processes.

Myth: Gravity pulls things straight down at 9.8 m/s/s.

Reality: Gravitational force varies depending on local geology.

You can do this for any topic. Everything is approximations. The only question is if they are good approximations.

We might be dropping the ball on Autonomous Replication and Adaptation.

Donald Hobson10mo32

If AI labs are slamming on the recursive self improvement ASAP, it may be that Autonomous Replicating Agents are irrelevant. But that's a "ARA can't destroy the world if AI labs do it first" argument.

ARA may well have more compute than AI labs. Especially if the AI labs are trying to stay within the law, and the ARA is stealing any money/compute that it can hack it's way into. (Which could be >90% of the internet if it's good at hacking. )

there will be millions of other (potentially misaligned) models being deployed deliberately by humans, including on very sensitive tasks (like recursive self-improvement).

Ok. That's a world model in which humans are being INCREDIBLY stupid.

If we want to actually win, we need to both be careful about deploying those other misaligned models, and stop ARA.

Alice: That snake bite looks pretty nasty, it could kill you if you don't get it treated.

Bob: That snake bite won't kill me, this hand grenade will. Pulls out pin.

Decision theory does not imply that we get to have nice things

Donald Hobson1y10

If you can put uploaded human-level agents with evolved-organism preferences in your simulations, you can just win outright (eg by having them spend subjective millennia doing FAI research for you). If you can’t, that will be a very obvious difference between your simulations and the real world.

I disagree. If your simulation is perfectly realistic, the simulated humans might screw up at alignment and create an unfriendly superintelligence, for much the same reason real humans might.

Also, if the space of goals that evolution + culture can produce is large, then you may be handing control to a mind with rather different goals.Rerolling the same dice won't give the same answer.

These problems may be solvable, depending on what the capabilities here are, but they aren't trivial.