Abstract: Deep Neural Networks (DNNs) are vulnerable to specially crafted adversarial examples and are prone to deception. Although current detection techniques can identify some malicious inputs, their protective capabilities remain insufficient against sophisticated attacks. This paper proposes a novel unsupervised adversarial example detection method based on unlabeled data. The core idea is to recast adversarial example detection as an anomaly detection problem through feature construction and fusion. To this end, five core components are designed: image transformation, a neural network classifier, heatmap generation, distance calculation, and an anomaly detector. First, the original image is transformed, and both the original and transformed images are fed into the neural network classifier; the prediction probability arrays and convolutional layer features are extracted to generate heatmaps. The detector is thereby extended from focusing solely on the model's output layer to input-layer features as well, enhancing its ability to model and measure the disparities between adversarial and normal samples. Next, the KL divergence between the probability arrays and the shift distance of the heatmap focus points before and after transformation are computed, and these distance features are fed into the anomaly detector to determine whether the example is adversarial. Experiments on the large-scale, high-quality ImageNet dataset show that our detector achieves an average AUC (Area Under the ROC Curve) of 0.77 against five different types of attacks, demonstrating robust detection performance. Compared with other state-of-the-art unsupervised adversarial example detectors, our detector attains a substantially higher TPR (True Positive Rate) while maintaining a comparable false-alarm rate, indicating a significant advantage in detection capability.
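The detection pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the function names, the use of scikit-learn's IsolationForest as the anomaly detector, and the synthetic feature values are all assumptions made for demonstration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def kl_divergence(p, q, eps=1e-12):
    # KL divergence between the prediction probability arrays of the
    # original and transformed images (clipped to avoid log(0)).
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def focus_shift(heatmap_a, heatmap_b):
    # Euclidean distance between the focus points (here: peak activations)
    # of the heatmaps before and after the image transformation.
    ya, xa = np.unravel_index(np.argmax(heatmap_a), heatmap_a.shape)
    yb, xb = np.unravel_index(np.argmax(heatmap_b), heatmap_b.shape)
    return float(np.hypot(ya - yb, xa - xb))

# Fit the anomaly detector on distance features computed from (assumed)
# normal images; each row is [KL divergence, focus-point shift].
rng = np.random.default_rng(0)
benign_features = rng.normal(loc=[0.1, 2.0], scale=[0.05, 1.0], size=(200, 2))
detector = IsolationForest(random_state=0).fit(benign_features)

# An input whose probabilities and heatmap focus change sharply under the
# transformation yields anomalous features and is flagged as adversarial.
candidate = np.array([[3.5, 40.0]])
print(detector.predict(candidate))  # -1 marks an anomaly (adversarial)
```

The intuition this sketch captures is the paper's core idea: adversarial perturbations are fragile under image transformation, so their output distributions and attention heatmaps shift far more than those of normal samples, making the distance features separable by a one-class anomaly detector.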