Shubhorup Biswas

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by

Newest

AI Control: Improving Safety Despite Intentional Subversion

Shubhorup Biswas

11mo22

>We assume that 𝐻 will be able to construct a successful exploit if and only if the attack policy is able to (because both are implemented by GPT-4).

I challenge this assumption.

Consider the code

'''
bool is_odd(int x) {
if (sha256(x) == 354) {
return true;
}
return x % 2 == 1;
}

'''

where sha256() computes the SHA-256 hash of the input. It is obviously possible for an attacker to supply numbers x and y such that sha256(x) = y, write y into the code and supply x when creating a successful exploit.

Thus an equally 'capable' adversary need not be able construct a backdoor.

The general idea is that it should be possible to confidently claim that some code is backdoored without being able to construct an exploit. According to [Rice's Theorem](https://en.wikipedia.org/wiki/Rice%27s_theorem) it is impossible to definitively classify all programs into 'backdoored or not', so the only hope is for a process to bin programs into 'definitely safe', 'definitely backdoored' and 'suspicious'.

Hope the 'Rate Suspicion' step will catch this, and other backdoored code where H cannot construct an exploit, as suspicious looking.

Reply