Ian Goodfellow looked at a multiclass classifier given by the model $p(c \vert x)$ (the probability of the class conditioned on the input).
Suppose we want to move smoothly from one class (cat) to another class (dog).
So, you can just maximize the probability of the image being a dog by moving in the direction of the gradient
$$\nabla_x \log p(c_2 \vert x).$$
But they found an extremely surprising result: even a tiny modification leads to gross misclassification.
A white-box small-norm attack on a deep neural network model is the following optimization task.
Let $p(x) \in \mathbb{R}^c$ be the classifier with $c$ classes. Then the minimal norm attack is defined as the
solution to the following optimization problem:
$$\min_{\varepsilon} \Vert \varepsilon \Vert \quad \mbox{s.t. } \arg \max p(x + \varepsilon) \ne \arg \max p(x).$$
I.e., we look for the minimal-norm perturbation that changes the predicted class.
Existing models are notoriously unstable!
There are different types of adversarial attacks (whitebox, greybox, blackbox).
They can be universal or image-specific, and they can be one-shot or iterative.
In whitebox attacks we know the weights of the model (which is not always the case).
The most well-known attacks are:
Fast Gradient Sign Method (FGSM): This attack involves adding a small perturbation to the input image by computing the gradient of the loss function with respect to the input.
Projected Gradient Descent (PGD): This attack is an iterative version of FGSM, where the perturbation is added in small steps until a certain threshold is reached.
DeepFool: This attack finds the minimum distance from the input image to the decision boundary of the classifier and then adds a small perturbation in that direction.
Carlini-Wagner (CW) Attack: This attack is designed to minimize the distance between the original image and the adversarial image while also ensuring that the adversarial image is misclassified.
Universal Adversarial Perturbation (UAP): This attack generates a single perturbation that can be added to any input image to cause misclassification.
Fast Gradient Sign Method (FGSM) is a popular whitebox adversarial attack that involves adding a small perturbation to the input image by computing the gradient of the loss function with respect to the input. Mathematically, the FGSM attack can be expressed as:
$$ \hat{x} = x + \epsilon \cdot \mathrm{sign}\left(\nabla_{x} L(x, y)\right), $$
where $\hat{x}$ is the adversarial image, $x$ is the original image, $\epsilon$ is the magnitude of the perturbation, $L(x, y)$ is the loss function evaluated at the input image $x$ with true label $y$, and $\nabla_{x}$ is the gradient operator with respect to $x$. The sign function ensures that the perturbation is added in the direction that maximizes the loss.
FGSM is derived by linearizing the loss function and solving the resulting norm-constrained problem (for the $\ell_\infty$ norm, the solution is exactly the sign of the gradient)!
It does not necessarily give the optimal solution.
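To make the formula concrete, here is a minimal PyTorch sketch of FGSM (an illustration, not code from the lecture); it assumes `model` returns logits, inputs lie in $[0,1]$, and cross-entropy plays the role of $L(x, y)$:

```python
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps):
    """One-step FGSM: perturb each pixel by eps in the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # move in the direction that increases the loss, keep pixels in the valid range
    return (x_adv + eps * x_adv.grad.sign()).clamp(0.0, 1.0).detach()
```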
Projected Gradient Descent (PGD) is a more powerful whitebox adversarial attack that iteratively applies the FGSM attack with a small step size and then projects the resulting perturbed image back onto the attack set.
Mathematically, the PGD attack can be expressed as:
$$ x_0 = x, \quad x_{t+1} = \mathrm{Clip}_{x, \epsilon}\left(x_t + \alpha \cdot \mathrm{sign}(\nabla_{x} L(x_t, y))\right). $$
PGD is more effective, but computationally more expensive as well.
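A hedged PyTorch sketch of the PGD loop (same assumptions as the FGSM snippet above: logits, inputs in $[0,1]$, cross-entropy loss); the projection step plays the role of the $\mathrm{Clip}$ operator for the $\ell_\infty$ ball:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, alpha, steps):
    """Iterated FGSM steps, projected back onto the l_inf ball of radius eps around x."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # projection ("Clip"): keep the perturbation in [-eps, eps] and pixels in [0, 1]
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv
```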
The Carlini-Wagner (CW) attack is a state-of-the-art whitebox adversarial attack that is designed to be more effective than PGD against defenses that use gradient masking or gradient obfuscation techniques. The CW attack formulates the problem of finding an adversarial perturbation as an optimization problem that minimizes the distance between the original image and the perturbed image, subject to a constraint on the classification loss.
Mathematically, the CW attack can be expressed as:
$$ \min_{\delta} \Vert \delta \Vert_{p} + c \cdot f(x+\delta, y), $$
where $\delta$ is the adversarial perturbation, $\Vert \cdot \Vert_{p}$ is a norm, $c$ is a hyperparameter that controls the trade-off between the distance term and the loss term, and $f(\cdot, y)$ is a function that measures the classification loss of the perturbed image with respect to the target class $y$.
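Below is a simplified PyTorch sketch in the spirit of CW (untargeted $\ell_2$ version with a fixed trade-off constant $c$; the original attack also uses a binary search over $c$ and targeted variants). The $\tanh$ change of variables keeps pixels inside the valid box without an explicit constraint:

```python
import torch
import torch.nn.functional as F

def cw_l2_attack(model, x, y, c=1.0, steps=200, lr=0.01, kappa=0.0):
    """Simplified untargeted CW-l2: x_adv = (tanh(w) + 1) / 2 stays in [0, 1] by construction."""
    w = torch.atanh((2 * x - 1).clamp(-0.999, 0.999)).detach().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        x_adv = 0.5 * (torch.tanh(w) + 1)
        logits = model(x_adv)
        true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
        # largest logit among the wrong classes
        other_logit = logits.masked_fill(
            F.one_hot(y, logits.size(1)).bool(), float("-inf")
        ).max(dim=1).values
        # f(x + delta, y): positive as long as the true class still wins
        f_val = torch.clamp(true_logit - other_logit + kappa, min=0.0)
        loss = ((x_adv - x) ** 2).flatten(1).sum(dim=1) + c * f_val
        opt.zero_grad()
        loss.sum().backward()
        opt.step()
    return (0.5 * (torch.tanh(w) + 1)).detach()
```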
The DeepFool attack uses the idea of a linearized decision boundary.
Suppose we have a binary classifier, and the separation function is given as $f(x) = 0$.
$f(x) > 0$ corresponds to one class, $f(x) < 0$ to another.
The best attack would be given by the closest point to the boundary.
To find such a point, we linearize the function as
$$f(x) \approx f(x_0) + \langle \nabla f(x_0), x - x_0 \rangle.$$
For the linearized boundary, the closest point can be found analytically.
Then we can update iteratively:
Initialize: $x := x_0$
Iteration: $r_i := -\frac{f(x_i) \nabla f(x_i)}{\Vert \nabla f(x_i) \Vert^2}$, $x_{i+1} = x_i + r_i$.
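A minimal PyTorch sketch of this iteration for a binary classifier (an illustration; it assumes `f` maps a tensor to a scalar tensor, and the practical attack usually adds a small overshoot to actually cross the boundary):

```python
import torch

def deepfool_binary(f, x, max_iter=50):
    """DeepFool for a binary classifier f with decision boundary {x : f(x) = 0}."""
    orig_sign = torch.sign(f(x))
    x_i = x.clone().detach()
    for _ in range(max_iter):
        x_i.requires_grad_(True)
        fx = f(x_i)
        grad = torch.autograd.grad(fx, x_i)[0]
        # step onto the linearized boundary: r_i = -f(x_i) * grad / ||grad||^2
        r_i = -fx.detach() * grad / (grad.norm() ** 2 + 1e-12)
        x_i = (x_i + r_i).detach()
        if torch.sign(f(x_i)) != orig_sign:  # the predicted class has flipped
            break
    return x_i
```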
The idea of universal adversarial perturbations has been proposed in the paper "Universal Adversarial Perturbations" by Moosavi-Dezfooli et al.
The idea is to have a single image such that adding it to all images fools the classifier.
The original construction of universal adversarial perturbations uses geometrical ideas: minimal perturbations that fool the classifier on individual images are accumulated and projected back onto a norm ball.
In black-box attacks we have only limited knowledge about the target model, i.e. we can query its outputs (labels or scores), but we do not have access to its weights or architecture.
How can we construct such attacks?
Transfer attacks: In this attack, the attacker trains a substitute model to mimic the behavior of the target model using only input-output pairs. The substitute model can then be used to generate adversarial examples that can fool the target model.
Query-based attacks: In this attack, the attacker submits a large number of input queries to the target model to infer its internal behavior. This information can then be used to generate adversarial examples.
Zeroth-order optimization: In this attack, the attacker uses only the output of the target model to generate adversarial examples, without any knowledge of the internal parameters or architecture of the model.
One of the simplest and most efficient black-box attacks is the Square attack.
It perturbs square-shaped patches of the image found by random search: candidate squares are sampled from a suitable distribution, and a proposed update is kept only if it improves the attack objective.
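A heavily simplified, single-image sketch of this random search in PyTorch (an illustration of the idea only; the actual Square attack uses a carefully designed sampling distribution and a schedule for the square size):

```python
import torch

@torch.no_grad()
def square_attack(model, x, y, eps, n_iters=1000, p=0.1):
    """Random-search l_inf attack on one image x of shape (1, c, h, w); y is the true label.
    Only forward passes (queries) of the model are used -- no gradients."""
    c, h, w = x.shape[1:]

    def margin(x_try):
        logits = model(x_try)[0]
        wrong = logits.clone()
        wrong[y] = float("-inf")
        return (logits[y] - wrong.max()).item()  # negative => already misclassified

    # start at a random vertex of the l_inf ball around x
    x_adv = (x + eps * torch.sign(torch.randn_like(x))).clamp(0.0, 1.0)
    best = margin(x_adv)
    s = max(1, int((p * h * w) ** 0.5))          # side length of the square window
    for _ in range(n_iters):
        if best < 0:
            break
        i = torch.randint(0, h - s + 1, (1,)).item()
        j = torch.randint(0, w - s + 1, (1,)).item()
        cand = x_adv.clone()
        # re-sample a +-eps perturbation on a random square window
        cand[0, :, i:i + s, j:j + s] = (
            x[0, :, i:i + s, j:j + s]
            + eps * torch.sign(torch.randn(c, 1, 1, device=x.device))
        ).clamp(0.0, 1.0)
        new = margin(cand)
        if new < best:                           # accept only improving proposals
            x_adv, best = cand, new
    return x_adv
```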
Small-norm attacks are done in the digital domain. But small-norm is not always the right attack model.
Important classes of such attacks include adversarial patches, stickers placed on road signs, adversarial glasses and even adversarial clothing.
Surprisingly, most of them break the classifiers!
Building real-world attacks is an interesting engineering task.
Since neural networks are unstable with respect to small perturbations, adversarial perturbations themselves are also fragile: they are often destroyed when additional noise is introduced into the pipeline.
I.e., when you print an adversarial image and take a photo of it, the image is distorted and may no longer be an adversarial example.
The first paper is 'Adversarial examples in the physical world' by Alexey Kurakin, Ian Goodfellow, and Samy Bengio.
The idea was super simple: incorporate image transformations (rotations, blur, brightness changes) into the process of generating attacks.
The perturbation should remain adversarial under all of those transformations.
In a later paper, the approach called Expectation over Transformation (EOT) has been proposed.
Instead of using $f(x)$, they used
$$\hat{f}(x) = \mathbb{E}_{t \sim T}\, f(t(x))$$
as the loss function.
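A small sketch of how this expectation can be estimated in PyTorch (assumptions: `transforms` is a list of differentiable image transformations, and cross-entropy is the loss); the result can be plugged into the PGD loop above in place of the plain loss:

```python
import random
import torch.nn.functional as F

def eot_loss(model, x_adv, y, transforms, n_samples=10):
    """Monte-Carlo estimate of E_{t ~ T} [ L(model(t(x_adv)), y) ]."""
    total = 0.0
    for _ in range(n_samples):
        t = random.choice(transforms)  # e.g. a random rotation, blur or brightness change
        total = total + F.cross_entropy(model(t(x_adv)), y)
    return total / n_samples
```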
Other approaches include synthetic data generation, mapping real-life photos using generative adversarial networks (GANs), etc.
Adversarial attacks pose a significant danger in several scenarios, so it is important to develop defenses.
The defenses can be empirical or certified. Certified means that we guarantee that, for a fixed attack model, the prediction will not change; such certification can be done for small-norm attacks.
Empirical defenses make certain modifications to the training or inference procedures.
There are several standard approaches against attacks. Among them:
In adversarial training, we aim at minimizing the loss at the worst possible sample in the vicinity of the current one.
\begin{equation} \min_{\theta} \mathbb{E}_{x,y \sim p_{\text{data}}(x,y)} \left[\max_{\delta \in S} \mathcal{L}(f_{\theta}(x+\delta), y)\right] \end{equation}
The inner maximization problem is solved approximately by several PGD steps, which makes training much slower but increases robustness.
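A hedged PyTorch sketch of one epoch of PGD adversarial training (illustrative only; it assumes cross-entropy loss, inputs in $[0,1]$, and an $\ell_\infty$ threat model with radius `eps`):

```python
import torch
import torch.nn.functional as F

def adv_train_epoch(model, loader, opt, eps, alpha, pgd_steps):
    """Inner maximization by a few PGD steps, outer minimization by a regular optimizer step."""
    model.train()
    for x, y in loader:
        # inner maximization: random start inside the eps-ball, then PGD steps on delta
        delta = (torch.rand_like(x) * 2 - 1) * eps
        for _ in range(pgd_steps):
            delta.requires_grad_(True)
            loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
            grad = torch.autograd.grad(loss, delta)[0]
            delta = (delta.detach() + alpha * grad.sign()).clamp(-eps, eps)
        # outer minimization: standard training step on the worst-case sample
        opt.zero_grad()
        F.cross_entropy(model((x + delta).clamp(0, 1)), y).backward()
        opt.step()
```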
The idea of adversarial training has been proposed in the paper by Madry et al.
'Adversarial training for free' proposes two things: reuse the gradient with respect to the input, which is computed anyway during the backward pass for the weights, and replay each minibatch several times, updating the perturbation and the weights simultaneously.
This leads to negligible overhead!
The next step is the 'Fast is better than free' paper, which proposes the following idea.
In 'Adversarial training for free', the perturbation from the previous sample is used as an initialization for the next sample.
It is difficult to believe that it is a good starting point, but it is non-zero.
Instead, the authors initialize the perturbation at random, and use 1 step of FGSM in training.
A simple idea in the Quoc Le style: replacing ReLU (which is non-smooth) with a smooth variant significantly improves robustness.
Let $f(x)$ be our binary classifier. The existence of attacks means that under a small perturbation $f(x + \varepsilon)$ changes a lot; in other words, $f(x)$ must have a large Lipschitz constant.
One can try to make the classifier smoother by imposing certain normalization techniques, such as spectral normalization (which bounds the spectral norm of each linear layer).
An alternative approach is to modify the inference procedure by averaging over small random perturbations.
This is called randomized smoothing.
Randomized smoothing has been proposed by Cohen et al.
The idea is to replace the inference procedure with smoothing
$$\hat{f}(x) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, \sigma^2 I)} f(x + \varepsilon).$$
The Cohen paper puts the indicator function of the predicted class under the smoothing expectation.
It is basically a voting method: we sample random perturbations and select the class that is predicted most often.
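A minimal PyTorch sketch of this voting procedure (illustrative only; the actual certification procedure of Cohen et al. additionally abstains when the vote is not statistically significant):

```python
import torch
import torch.nn.functional as F

def smoothed_predict(model, x, sigma, n_samples=1000, num_classes=10):
    """Monte-Carlo voting: add Gaussian noise, count the predicted classes, return the winner."""
    counts = torch.zeros(x.size(0), num_classes, dtype=torch.long)
    with torch.no_grad():
        for _ in range(n_samples):
            pred = model(x + sigma * torch.randn_like(x)).argmax(dim=1)
            counts += F.one_hot(pred, num_classes).cpu()
    return counts.argmax(dim=1)
```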
Let $p_A$ and $p_B$ be the probabilities of the top (true) class and the runner-up class under the randomized classifier.
We cannot evaluate them exactly, but we can estimate a lower bound $\underline{p_A}$ and an upper bound $\overline{p_B}$ from Monte-Carlo samples.
Then, the smoothed classifier is guaranteed to give the same prediction for any point within the radius
$$R = \frac{\sigma}{2}\left(\Phi^{-1}(\underline{p_A}) - \Phi^{-1}(\overline{p_B})\right),$$
where $\Phi^{-1}$ is the inverse of the standard Gaussian CDF.
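The radius itself is a one-liner; here is a tiny sketch using SciPy's `norm.ppf` for $\Phi^{-1}$ (the bounds $\underline{p_A}$ and $\overline{p_B}$ are assumed to come from the Monte-Carlo estimate above):

```python
from scipy.stats import norm

def certified_radius(p_a_lower, p_b_upper, sigma):
    """l2 radius certified by the Cohen et al. bound."""
    return 0.5 * sigma * (norm.ppf(p_a_lower) - norm.ppf(p_b_upper))

# e.g. sigma = 0.5, p_A >= 0.9, p_B <= 0.1  ->  R is roughly 0.64
print(certified_radius(0.9, 0.1, 0.5))
```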
A more refined estimate for randomized smoothing has been obtained by Salman et al.
Let $f(x)$ be the base classifier, $f(x) \in [0, 1]$, and let $\hat{f}(x)$ be the smoothed classifier with $\sigma=1$.
Let $\Phi(x)$ be the standard Gaussian CDF.
Then
$$g(x) = \Phi^{-1}(\hat{f}(x))$$
has Lipschitz constant one.
If we train only the base classifier, the accuracy of the smoothed classifier will be smaller.
In practice, we can maximize the accuracy of the smoothed classifier directly, by minimizing its loss
$$L(\hat{f}) \rightarrow \min.$$
Note that during training we can replace the smoothing by simple random Gaussian augmentation.
This will give an unbiased estimate of the gradient.
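A small sketch of such a training step in PyTorch (assuming cross-entropy loss): drawing one Gaussian noise sample per input gives a one-sample Monte-Carlo estimate of the gradient of the smoothed loss:

```python
import torch
import torch.nn.functional as F

def gaussian_augmentation_step(model, opt, x, y, sigma):
    """One training step on a Gaussian-noised copy of the batch."""
    opt.zero_grad()
    loss = F.cross_entropy(model(x + sigma * torch.randn_like(x)), y)
    loss.backward()
    opt.step()
    return loss.item()
```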
A standard protocol to measure robustness of deep learning models is to attack them using PGD attacks.
A standard benchmark is RobustBench.
Let's have a look at the leaderboard: https://robustbench.github.io/