Instead, Aligned AI used its technology to automatically tease out the ambiguities of the original data.
Could you provide any technical details about how this works? Otherwise I don't know what to take from this post.
It's an implementation of the concept extrapolation methods we talked about here: https://www.lesswrong.com/s/u9uawicHx7Ng7vwxA
The specific details will be in a forthcoming paper.
Also, you'll be able to try it out yourself soon; signup for alpha testers at the bottom of the page here: https://www.aligned-ai.com/post/concept-extrapolation-for-hypothesis-generation
Posted initially on the Aligned AI website. Authored by Patrick Leask, Stuart Armstrong, and Rebecca Gorman.
There’s an apocryphal story about how vision systems were led astray when trying to classify tanks camouflaged in forests. A vision system was trained on images of tanks in forests on sunny days, and images of forests without tanks on overcast days.
To quote Neil Fraser:
We made that story real. We collected images of tanks on bright days and forests on dark days to recreate the biased dataset described in the story. We then replicated the faulty neural net tank detector by fine tuning a CLIPViT image classification model on this dataset.
Below are 30 images taken from the training set ordered from left to right by decreasing class certainty. Like the apocryphal neural net, this one perfectly separates these images into tank and no-tank.
Figure 1: Trained classifier, labeled images
To replicate the Pentagon’s complaint, we then simulated the deployment of this classifier into the field with an unlabeled dataset of similar images, that doesn’t have the bias to the same extent. Below are 30 images randomly taken from the unlabeled dataset also ordered by tank certainty. Now the clear division between tank and no tank is gone: there are actually more images without a tank on the right hand (tank) side of the gradient.
Figure 2: Trained classifier, unlabeled images
This is a common problem for neural nets - selecting a single feature to separate their training data. And this feature need not be the one that the programmer had in mind. Because of this, classifiers typically fail when they encounter images beyond their training settings. This “out of distribution” problem happens here because the neural net has settled on brightness as its feature. And thus fails to identify tanks when it encounters darker images of them.
Instead, Aligned AI used its technology to automatically tease out the ambiguities of the original data. What are the "features" that could explain the labels? One of the features would be the luminosity, which the original classifier made use of. But our algorithm flagged a second feature - a second hypothesis for what the labels really meant - that was very different.
To distinguish that hypothesis visually, we can look at the maximally ambiguous unlabeled images: those images that hypothesis 1 (old classifier) thinks are tanks while the new hypothesis 2 thinks they are forests, and vice-versa. Just looking at the two most extreme images is sufficient:
Figure 3: Possible extra hypothesis and images on which they disagree.
In this theoretical situation, hypothesis 2 would be clearly correct. And we would have a clearer understanding of how hypothesis 1 was mis-classifying tanks.
When computing these different hypotheses, our algorithm automatically produces a corresponding classifier. And when we apply the classifier corresponding to hypothesis 2 to the unlabeled tank data, we get a correct classification; and this time, it extends to the unlabeled data:
Figure 4: Extra hypothesis trained classifier, unlabeled images
Note that figure 2 was the one that suggested that luminosity was being used instead of "tankiness" for classification. Thus for a multi-hypothesis neural net, gathering extra unlabeled data does two things simultaneously. It allows us to generate new and better hypotheses - and it gives us an understanding and an interpretation of the previous hypotheses.