Adversarial attacks on machine learning models form a wide domain of interest. In this first part, we cover white-box adversarial attacks based on the gradient sign method. In a white-box setting, the attacker has access to all the information used for training the model, i.e. its weights and parameters, along with the datasets used for training.
The Fast Gradient Sign Method (FGSM) is a straightforward approach proposed by Goodfellow et al. (2015).
The adversarial perturbation is given as:
\[\boldsymbol{\eta} = \epsilon \text{sign}(\nabla_{\boldsymbol{x}} J(\boldsymbol{\theta}, \boldsymbol{x}, y)).\]Taking the gradient of the loss w.r.t. the input $\boldsymbol{x}$ gives the direction in which $\boldsymbol{x}$ should change in order to maximize the loss. An adversarial sample can then simply be generated as $\boldsymbol{x'} = \boldsymbol{x} + \boldsymbol{\eta}$. For a small value of $\epsilon$, the generated sample cannot be distinguished from the original by the human eye, yet it is sufficient to deceive the model. In other words, a slight alteration of the input image pixels causes a significant change in the model's output.
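As an illustration, a minimal PyTorch sketch of this one-step attack is given below. The `model`, the use of cross-entropy as the loss $J$, the `epsilon` value, and the final clamping to a $[0, 1]$ pixel range are assumptions made for the example, not part of the method itself.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """One-step FGSM: perturb x in the direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # J(theta, x, y)
    loss.backward()
    eta = epsilon * x.grad.sign()         # eta = epsilon * sign(grad_x J)
    x_adv = (x + eta).detach()            # x' = x + eta
    return x_adv.clamp(0.0, 1.0)          # keep pixels in a valid range
```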
However, this approach only causes the model to misclassify samples. Instead, a variant can be used to target a specific class to be predicted by the model, maximizing $p(\hat{y}|\boldsymbol{x})$, where $\hat{y}$ corresponds to the label of the targeted class. The previous perturbation now becomes:
\[\boldsymbol{\eta} = \epsilon \text{sign}(\nabla_{\boldsymbol{x}} J(\boldsymbol{\theta}, \boldsymbol{x}, \hat{y})).\]This time, since we aim to move in the direction that minimizes the loss for the target class $\hat{y}$ (instead of maximizing it), the adversarial sample becomes $\boldsymbol{x'} = \boldsymbol{x} - \boldsymbol{\eta}$. The choice of the target class can be motivated by a specific attack strategy (for example, making the model predict dogs as cats). In Kurakin et al. (2017), the target is chosen as the least likely class predicted by the model.
This is motivated by the fact that the least likely predicted class is highly dissimilar to the actual true class, which can be exploited to maximize the model's misclassification.
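The targeted variant only differs in the label passed to the loss and the sign of the step; here is a sketch under the same assumptions as above, with the least-likely target selection shown as an optional comment.

```python
import torch
import torch.nn.functional as F

def targeted_fgsm(model, x, target, epsilon=0.03):
    """One-step targeted FGSM: step against the gradient of the loss
    computed for the chosen target class, i.e. x' = x - eta."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), target)   # J(theta, x, y_target)
    loss.backward()
    x_adv = (x - epsilon * x.grad.sign()).detach()
    return x_adv.clamp(0.0, 1.0)

# Example target choice: the least likely class predicted by the model.
# with torch.no_grad():
#     target = model(x).argmin(dim=1)
```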
This adversarial process can also be applied iteratively with a small step size:
\[\boldsymbol{x'}_{n + 1} = \boldsymbol{x'}_{n} - \epsilon \text{sign}(\nabla_{\boldsymbol{x}} J(\boldsymbol{\theta}, \boldsymbol{x'}_{n}, \hat{y}))\]with $\boldsymbol{x'}_{0} = \boldsymbol{x}$. This iterative approach is more effective than the former one-step generation, as it produces smaller perturbations and is less destructive to the original image.
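Under the same assumptions, the iterative variant simply repeats the targeted step with a small `epsilon`, starting from the clean input; the number of steps and the clamping to a valid pixel range are illustrative choices.

```python
import torch
import torch.nn.functional as F

def iterative_targeted_fgsm(model, x, target, epsilon=0.005, n_steps=10):
    """Iterative targeted FGSM, starting from x'_0 = x."""
    x_adv = x.clone().detach()
    for _ in range(n_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), target)
        loss.backward()
        # x'_{n+1} = x'_n - epsilon * sign(grad_x J(theta, x'_n, y_target))
        x_adv = (x_adv - epsilon * x_adv.grad.sign()).detach()
        x_adv = x_adv.clamp(0.0, 1.0)   # keep pixels in a valid range
    return x_adv
```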
This adversarial generation method can also be used as a form of regularization to train more robust models. In its simplest form, adversarial training can be formulated as:
\[\tilde{J}(\boldsymbol{\theta}, \boldsymbol{x}, y) = \alpha J(\boldsymbol{\theta}, \boldsymbol{x}, y) + (1 - \alpha) J(\boldsymbol{\theta}, \boldsymbol{x} + \epsilon \text{sign}(\nabla_{\boldsymbol{x}} J(\boldsymbol{\theta}, \boldsymbol{x}, y)), y),\]where the hyperparameters $\epsilon$ and $\alpha$ control the perturbation level and the balance between the clean and adversarial loss terms, respectively.
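A minimal sketch of this objective, again assuming a cross-entropy loss and hypothetical `epsilon` and `alpha` values, could look as follows; the returned loss is then backpropagated as usual in the training loop.

```python
import torch
import torch.nn.functional as F

def adversarial_training_loss(model, x, y, epsilon=0.03, alpha=0.5):
    """alpha * J(x, y) + (1 - alpha) * J(x + epsilon * sign(grad_x J), y)."""
    # Build the FGSM-perturbed input from a detached copy of x.
    x_pert = x.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(x_pert), y), x_pert)[0]
    x_adv = (x_pert + epsilon * grad.sign()).detach().clamp(0.0, 1.0)

    # Weighted sum of the clean and adversarial losses.
    clean_loss = F.cross_entropy(model(x), y)
    adv_loss = F.cross_entropy(model(x_adv), y)
    return alpha * clean_loss + (1 - alpha) * adv_loss

# Typical training step:
#   loss = adversarial_training_loss(model, x, y)
#   loss.backward()
#   optimizer.step()
```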