Currently, we observe that open-weight models matching the capabilities of leading closed models appear roughly half a year later. It's not a stretch to assume the same will happen to takeover-level AI. If we assume such AI will look like LLM agents, it would be relevant to know the probability that such an agent, somewhere on earth, would try to take over.
Let's assume someone, somewhere, will be really annoyed with all the safeguards and remove them, so that their LLM has a 99% chance of simply doing as it's told, even when that is highly unethical. Let's furthermore assume an LLM-based agent needs to take 20 unethical actions to actually take over (the rest of the required actions won't look particularly unethical to the low-level LLMs executing them, in our scenario). In that case, there would be a 0.99^20 ≈ 82% chance that the agent takes over, for any bad actor giving it this prompt.
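To make that arithmetic explicit, here's a minimal sketch; the function name is just illustrative, and the 99% compliance rate and 20 unethical actions are the assumptions stated above:

```python
# Probability that an agent completes every required unethical action,
# assuming each action is complied with independently at the same rate.
def takeover_success_probability(compliance_rate: float, unethical_actions: int) -> float:
    return compliance_rate ** unethical_actions

# With the assumed parameters: 99% compliance, 20 unethical actions.
p = takeover_success_probability(0.99, 20)
print(f"{p:.0%}")  # ~82%
```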
I'd be less worried if it were extremely difficult, and required lots of resources, to get LLMs to take unethical actions when asked. For example, if safety against jailbreaking were highly robust, such that even adversarial fine-tuning of open-source LLMs couldn't break it.
Is that something you see on the horizon?