Nowadays, machine learning models for computer vision are used in many real-world applications, such as self-driving cars, facial recognition, cancer diagnosis, or even next-generation shops that track which products customers take off the shelf so their credit card can be charged when they leave.

The increasing accuracy of these machine learning systems is quite impressive, so it naturally led to a veritable flood of applications using them. Although the mathematical foundations behind them were already studied a few decades ago, the relatively recent advent of powerful GPUs gave researchers the computing power necessary to experiment and build complex machine learning systems. Today, state-of-the-art models for computer vision are based on deep neural networks with up to several million parameters, and they rely on hardware that was not available just a decade ago.

In 2012, Alex Krizhevsky et al. were the first to show how to implement a deep convolutional network, which at the time became the state-of-the-art model in object classification. Since then, many improvements to their original model have been published, each of them improving accuracy (VGG, ResNet, Inception, etc.). Recently, machine learning models have managed to achieve human and even above-human accuracy in many computer vision tasks.

A few years ago, getting wrong predictions from a machine learning model used to be the norm. Nowadays, this has become the exception, and we expect models to perform flawlessly, especially when they are deployed in real-world applications.

Until recently, machine learning models were usually trained and tested in a laboratory environment, such as machine learning competitions and academic papers. Nowadays, as they are deployed in real-world scenarios, security vulnerabilities coming from model errors have become a real concern.

The idea of this article is to explain and demonstrate how state-of-the-art deep neural networks used in image recognition can be easily fooled by a malicious actor and thus made to produce wrong predictions. Once we become familiar with the usual attack strategies, we will discuss how to defend our models against them.

Adversarial Machine Learning Examples

Let's start with a basic question: What is an adversarial machine learning example?

Adversarial examples are malicious inputs purposely designed to fool a machine learning model.

In this article, we are going to restrict our attention to machine learning models that perform image classification. Therefore, adversarial examples are going to be input images crafted by an attacker that the model is not able to classify correctly.

As an example, let's take GoogLeNet trained on ImageNet to perform image classification as our machine learning model. Below you have two images of a panda that are indistinguishable to the human eye. The image on the left is one of the clean images in the ImageNet dataset, used to train the GoogLeNet model. The one on the right is a slight modification of the first, created by adding the noise vector shown in the central image. The first image is predicted by the model to be a panda, as expected. The second, instead, is predicted (with very high confidence) to be a gibbon.

Two side-by-side photos of a panda. The second image looks identical to the first but is labeled as a different animal. A third image of what appears to be random static sits between them, showing the noise layer that was added to the second panda image to confound the model.

The noise added to the first image is not random but the output of a careful optimization by the attacker.

As a second example, we can take a look at how to synthesize 3D adversarial examples using a 3D printer. The image below shows different views of a 3D turtle the authors printed and the misclassifications by the Google Inception v3 model.

Image showing a grid of turtle photos, some of which are correctly classified as turtles, some of which are classified as rifles, and some of which are classified as other objects.

How can state-of-the-art models, whose classification accuracy is higher than that of humans, make these seemingly stupid mistakes?

Before we delve into the weaknesses that neural network models tend to have, let us remember that we humans have our own set of adversarial examples. Take a look at the image below. What do you see? A spiral, or a series of concentric circles?

Image showing an optical illusion.

What these different examples also reveal is that machine learning models and human vision must be using quite different internal representations when interpreting what is in an image.

In the next section, we are going to explore strategies to generate adversarial examples.

How to Generate Adversarial Examples

Let's start with a simple question: How are adversarial examples generated?

Adversarial examples are generated by taking a clean image that the model correctly classifies, and finding a small perturbation that causes the new image to be misclassified by the ML model.

Let's suppose that an attacker has complete information about the model they want to attack. This essentially means that the attacker can compute the loss function of the model $J(\theta, X, y)$, where $X$ is the input image, $y$ is the output class, and $\theta$ are the internal model parameters. This loss function is typically the negative log-likelihood for classification methods.

In this white-box scenario, there are several attack strategies, each of them representing different tradeoffs between the computational cost of producing adversarial examples and their success rate. All these methods essentially try to maximize the change in the model loss function while keeping the perturbation of the input image small. The higher the dimensionality of the input image space, the easier it is to generate adversarial examples that are indistinguishable from clean images by the human eye.

The L-BFGS Method

We find the adversarial example ${x}'$ by solving the following box-constrained optimization problem:

$$ \begin{matrix} \text{minimize } c \cdot \left \| x - {x}' \right \|^2_2 + \text{loss}_{f,l}\left({x}'\right) \\ \text{such that } {x}' \in \left [0, 1 \right ]^n \end{matrix} $$

where $c > 0$ is a parameter that also needs to be solved for. Intuitively, we look for an adversarial image ${x}'$ such that the weighted sum of the distortion with respect to the clean image ($\left \| x - {x}' \right \|$) and the loss with respect to the wrong class is as small as possible.

For complex models like deep neural networks, the optimization problem does not have a closed-form solution, so iterative numerical methods have to be used. Because of this, the L-BFGS method is slow. However, it has a high success rate.
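To make the optimization above more tangible, here is a rough PyTorch sketch. It is not the exact procedure from the original paper: the box constraint is handled with a change of variables ($x' = \mathrm{sigmoid}(w)$) instead of box-constrained L-BFGS, and the constant $c$ is fixed rather than found by search. The names `model`, `image`, and `target_label` are placeholders for your own classifier, clean input, and target (wrong) class index.

```python
import torch
import torch.nn.functional as F

def lbfgs_attack(model, image, target_label, c=0.1, steps=100):
    """Sketch of the L-BFGS attack objective:
    minimize c * ||x - x'||_2^2 + loss_f(x', target), with x' in [0, 1]^n."""
    clean = image.clone().detach()

    # Unconstrained variable w such that sigmoid(w) always lies in [0, 1]
    w = torch.logit(clean.clamp(1e-4, 1 - 1e-4)).requires_grad_(True)
    optimizer = torch.optim.LBFGS([w], max_iter=steps)

    def closure():
        optimizer.zero_grad()
        adv = torch.sigmoid(w)
        distortion = torch.sum((adv - clean) ** 2)            # ||x - x'||_2^2
        wrong_class_loss = F.cross_entropy(model(adv), target_label)
        loss = c * distortion + wrong_class_loss
        loss.backward()
        return loss

    optimizer.step(closure)          # runs the L-BFGS iterations
    return torch.sigmoid(w).detach()
```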

Fast Gradient Sign (FGS)

With the fast gradient sign (FGS) method, we make a linear approximation of the loss function around the initial point, given by the clean image vector $X$ and the true class $y$.

Under this assumption, the gradient of the loss function indicates the direction in which we need to change the input vector to produce a maximal change in the loss. To keep the perturbation small, we only extract the sign of the gradient, not its actual norm, and scale it by a small factor epsilon.

This way we ensure that the pixel-wise difference between the initial image and the modified one is always smaller than epsilon (this difference is the $L_\infty$ norm).

$$ X^{adv} = X + \epsilon \text{ sign} \left( \bigtriangledown_X J \left( X, y_{true} \right) \right) $$

The gradient can be efficiently computed using backpropagation. This method is one of the fastest and computationally cheapest to implement. However, its success rate is lower than that of more expensive methods like L-BFGS.

The authors of Adversarial Machine Learning at Scale reported that it has between a 63% and 69% success rate on top-1 predictions for the ImageNet dataset, with epsilon between 2 and 32. For linear models, like logistic regression, the fast gradient sign method is exact. In this case, the authors of another research paper on adversarial examples report a success rate of 99%.
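As an illustration, here is a minimal sketch of the fast gradient sign update in PyTorch, assuming a hypothetical classifier `model` that takes image tensors with pixel values in $[0, 1]$. The core of the attack is the single line implementing the formula above.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, true_label, epsilon):
    """Craft an adversarial image with the fast gradient sign method.

    image:      tensor of shape (1, C, H, W) with values in [0, 1]
    true_label: tensor of shape (1,) holding the correct class index
    epsilon:    maximum per-pixel perturbation (the L_infinity bound)
    """
    image = image.clone().detach().requires_grad_(True)

    # Loss of the model on the clean image with respect to the true class
    loss = F.cross_entropy(model(image), true_label)

    # Backpropagate to get the gradient of the loss w.r.t. the input pixels
    model.zero_grad()
    loss.backward()

    # X_adv = X + epsilon * sign(grad_X J(X, y_true))
    adv_image = image + epsilon * image.grad.sign()

    # Keep the pixel values in the valid range
    return adv_image.clamp(0.0, 1.0).detach()
```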

Iterative Fast Gradient Sign

An obvious extension of the previous method is to apply it several times with a smaller step size alpha, and clip the total step length to make sure that the distortion between the clean and the adversarial images is lower than epsilon.

$$ X^{adv}_0 = X, \quad X^{adv}_{N + 1} = Clip_{X, \epsilon} \left\{ X^{adv}_{N} + \alpha \text{ sign} \left( \bigtriangledown_X J \left( X^{adv}_N, y_{true} \right) \right) \right\} $$
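A sketch of this iterative variant, under the same assumptions as the FGS snippet above (a hypothetical `model` and images scaled to $[0, 1]$):

```python
import torch
import torch.nn.functional as F

def iterative_fgsm(model, image, true_label, epsilon, alpha, num_iter):
    """Iterative FGS: repeat small signed-gradient steps and clip to the epsilon ball."""
    clean = image.clone().detach()
    adv = clean.clone()

    for _ in range(num_iter):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv), true_label)
        model.zero_grad()
        loss.backward()

        with torch.no_grad():
            # Small step of size alpha in the direction of the gradient sign
            adv = adv + alpha * adv.grad.sign()
            # Clip_{X, epsilon}: stay within epsilon of the clean image ...
            adv = torch.max(torch.min(adv, clean + epsilon), clean - epsilon)
            # ... and within the valid pixel range
            adv = adv.clamp(0.0, 1.0)

    return adv.detach()
```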

Other techniques, such as the ones proposed in Nicholas Carlini's paper, are improvements over L-BFGS. They are also expensive to compute, but have a high success rate.

However, in most real-world settings, the attacker does not know the loss function of the targeted model. In this case, the attacker has to employ a black-box strategy.

Black-box Attacks

Researchers have repeatedly observed that adversarial examples transfer quite well between models, meaning that they can be designed for a target model A, but end up being effective against any other model trained on a similar dataset.

This is the so-called transferability property of adversarial examples, which attackers can use to their advantage when they do not have access to complete information about the model. The attacker can generate adversarial examples by following these steps:

  1. Query the target model with inputs $X_i$ for $i=1 \dots n$ and store the outputs $y_i$.
  2. With the training data $(X_i, y_i)$, build another model, called a substitute model.
  3. Use any of the white-box algorithms shown above to generate adversarial examples for the substitute model. Many of them are going to transfer successfully and become adversarial examples for the target model as well (a sketch of these steps follows the list).
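Here is a compressed PyTorch sketch of these three steps. The names `query_target_model` (a function that returns the remote model's predicted class indices), `SubstituteNet` (any local architecture you choose), and `probe_images` are hypothetical, and a real attack would train in mini-batches over many more queries; the point is only to show how the stolen labels feed the substitute model, which is then attacked with a white-box method such as FGS.

```python
import torch
import torch.nn.functional as F

def build_substitute(query_target_model, SubstituteNet, probe_images, epochs=20):
    # Step 1: query the target model and store its outputs (class indices)
    with torch.no_grad():
        stolen_labels = query_target_model(probe_images)

    # Step 2: train a local substitute model on the (input, output) pairs
    substitute = SubstituteNet()
    optimizer = torch.optim.Adam(substitute.parameters(), lr=1e-3)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = F.cross_entropy(substitute(probe_images), stolen_labels)
        loss.backward()
        optimizer.step()

    # Step 3: attack the substitute with any white-box method (e.g., FGS);
    # many of the resulting images also fool the original target model.
    return substitute
```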

A successful application of this strategy against a commercial machine learning model is presented in this Computer Vision Foundation paper.

Defenses Against Adversarial Examples

The attacker crafts the attack by exploiting all the information they have about the model. Obviously, the less information the model outputs at prediction time, the harder it is for an attacker to craft a successful attack.

A first, easy measure to protect your classification model in a production environment is to avoid showing confidence scores for each predicted class. Instead, the model should only provide the top $N$ most likely classes. When confidence scores are provided to the end user, a malicious attacker can use them to numerically estimate the gradient of the loss function. This way, attackers can craft white-box attacks using, for example, the fast gradient sign method. In the Computer Vision Foundation paper we quoted earlier, the authors show how to do this against a commercial machine learning model.
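For instance, a prediction endpoint can be written so that it never exposes raw scores. A minimal sketch, where `model` and `class_names` are placeholders for your own classifier and label list:

```python
import torch

def predict_top_n(model, image, class_names, n=5):
    """Return only the names of the top-n predicted classes, without confidence
    scores, so an attacker cannot numerically estimate the loss gradient."""
    with torch.no_grad():
        logits = model(image)                        # shape (1, num_classes)
    top_indices = logits.topk(n, dim=1).indices[0]
    return [class_names[i] for i in top_indices.tolist()]
```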

Let us look at two defenses that have been proposed in the literature.

Defensive Distillation

This method tries to generate a new model whose gradients are much smaller than those of the original, undefended model. If the gradients are very small, techniques like FGS or iterative FGS are no longer useful, as the attacker would need large distortions of the input image to achieve a sufficient change in the loss function.

Defensive distillation introduces a new parameter $T$, called temperature, into the last softmax layer of the network:

$$ \text{softmax} \left( x, T \right)_i = \frac{e^{x_i/T}}{\Sigma_j e^{x_j/T}} $$

Note that, for $T=1$, we have the usual softmax function. The higher the value of $T$, the smaller the gradient of the loss with respect to the input images.
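A quick numerical check of this effect, in plain NumPy with made-up logits:

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """softmax(x, T)_i = exp(x_i / T) / sum_j exp(x_j / T)"""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                       # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [8.0, 4.0, 1.0]
print(softmax_with_temperature(logits, T=1))    # sharp: ~[0.98, 0.02, 0.00]
print(softmax_with_temperature(logits, T=20))   # much flatter: ~[0.40, 0.32, 0.28]
```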

The defensive distillation procedure is as follows:

  1. Train a network, called the teacher network, with a temperature $T \gg 1$.
  2. Use the trained teacher network to generate soft labels for each image in the training set. A soft label for an image is the set of probabilities that the model assigns to each class. For example, if the input image is a parrot, the teacher model might output soft labels like (90% parrot, 10% cruise ship).
  3. Train a second network, the distilled network, on the soft labels, again using temperature $T$. Training with soft labels is a technique that reduces overfitting and improves the out-of-sample accuracy of the distilled network (see the training-step sketch after this list).
  4. Finally, at prediction time, run the distilled network with temperature $T=1$.
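The following is a rough PyTorch sketch of steps 2 and 3 for a single batch, assuming the `teacher` network has already been trained at temperature `T` (step 1). The cross-entropy on soft labels is written out explicitly; the function and variable names are illustrative, not taken from the original paper.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, images, T, optimizer):
    """One training step of the distilled (student) network on soft labels."""
    with torch.no_grad():
        # Step 2: soft labels from the teacher, produced at high temperature T
        soft_labels = F.softmax(teacher(images) / T, dim=1)

    # Step 3: train the distilled network with the same temperature T
    student_log_probs = F.log_softmax(student(images) / T, dim=1)
    loss = -(soft_labels * student_log_probs).sum(dim=1).mean()  # cross-entropy vs. soft labels

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Step 4, at prediction time, uses the student as usual, i.e., with T = 1:
# predictions = student(images).argmax(dim=1)
```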

Defensive distillation successfully protects the network against the set of attacks attempted in Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks.

Image of a table showing the attack success rate as a function of distillation temperature. In general, the higher the temperature, the lower the success rate, for both MNIST and CIFAR10 adversarial samples.

Unfortunately, a later paper by University of California, Berkeley researchers presented a new set of attack methods that defeat defensive distillation. These attacks are improvements over the L-BFGS method and prove that defensive distillation is not a general solution against adversarial examples.

Adversarial Training

At the moment, adversarial training is the most effective defense strategy. Adversarial examples are generated and used when training the model. Intuitively, if the model sees adversarial examples during training, its performance at prediction time will be better for adversarial examples generated in the same way.

Ideally, we would like to employ every known attack method to generate adversarial examples during training. However, for a big dataset with high dimensionality (like ImageNet), robust attack methods like L-BFGS and the improvements described in the Berkeley paper are too computationally costly. In practice, we can only afford to use a fast method like FGS or iterative FGS.

Adversarial training uses a modified loss function that is a weighted sum of the usual loss function on clean examples and the loss function on adversarial examples.

$$ Loss = \frac{1}{\left( m - k \right)} \left( \sum_{i \in CLEAN} L \left( X_i | y_i \right) + \lambda \sum_{i \in ADV} L \left( X^{adv}_i | y_i \right) \right) $$

During training, for each batch of $m$ clean images we generate $k$ adversarial images using the current state of the network. We forward propagate the network for both the clean and adversarial examples and compute the loss with the formula above.
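A sketch of one such training step in PyTorch, using FGS to craft the $k$ adversarial images from the current state of the network. The `model`, `optimizer`, and the batch of `images`/`labels` (scaled to $[0, 1]$) are placeholders for your own training setup.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, images, labels, k, epsilon, lam):
    """One batch of adversarial training: the first k images of the batch are
    replaced by FGS adversarial versions crafted with the current network, and
    the loss is the weighted sum from the formula above."""
    m = images.shape[0]
    clean_x, clean_y = images[k:], labels[k:]      # m - k clean examples
    seed_x, adv_y = images[:k], labels[:k]         # seeds for k adversarial examples

    # Craft the adversarial examples with the fast gradient sign method
    seed_x = seed_x.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(seed_x), adv_y), seed_x)[0]
    adv_x = (seed_x + epsilon * grad.sign()).clamp(0, 1).detach()

    # Loss = 1/(m - k) * (sum_clean L + lambda * sum_adv L)
    loss = (F.cross_entropy(model(clean_x), clean_y, reduction="sum")
            + lam * F.cross_entropy(model(adv_x), adv_y, reduction="sum")) / (m - k)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```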

An improvement to this algorithm, called ensemble adversarial training, was proposed in this conference paper. Instead of using the current network to generate adversarial examples, several pre-trained models are used to generate them. On ImageNet, this method increases the robustness of the network to black-box attacks. This defense was the winner of the first round of the NIPS 2017 competition on Defenses Against Adversarial Attacks.

Conclusion and Next Steps

As of today, attacking a machine learning model is easier than defending it. State-of-the-art models deployed in real-world applications are easily fooled by adversarial examples if no defense strategy is employed, opening the door to potentially critical security issues. The most reliable defense strategy is adversarial training, where adversarial examples are generated and added to the clean examples at training time.

If you want to evaluate the robustness of your image classification models to different attacks, I recommend that you use the open-source Python library cleverhans. Many attack methods can be tested against your model, including the ones mentioned in this article. You can also use this library to perform adversarial training of your model and increase its robustness to adversarial examples.
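As a starting point, here is a hedged sketch of what such an evaluation might look like with the PyTorch attack API of recent cleverhans releases. The import path and function signature below are written from memory of cleverhans 4.x and should be checked against the library's documentation; `model` and `test_loader` are your own classifier and data loader.

```python
import torch
# Import path as in cleverhans 4.x; verify against the current documentation.
from cleverhans.torch.attacks.fast_gradient_method import fast_gradient_method

def evaluate_robustness(model, test_loader, eps=8 / 255):
    """Compare clean accuracy with accuracy under an FGS attack."""
    model.eval()
    clean_correct = adv_correct = total = 0
    for images, labels in test_loader:
        # Craft adversarial versions of the batch with an L_infinity FGS attack
        x_adv = fast_gradient_method(model, images, eps, norm=float("inf"))
        clean_correct += (model(images).argmax(dim=1) == labels).sum().item()
        adv_correct += (model(x_adv).argmax(dim=1) == labels).sum().item()
        total += labels.numel()
    return clean_correct / total, adv_correct / total
```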

Finding new attacks and better defense strategies is an active area of research. More theoretical and empirical work is required to make machine learning models more robust and safe in real-world applications.

I encourage the reader to experiment with these techniques and publish new and interesting results. Moreover, any feedback regarding this article is very welcome.

Understanding the Basics

  • What is an adversarial example?

    An adversarial example is an input (e.g., an image or a sound) designed to cause a machine learning model to make a wrong prediction. It is generated from a clean example by adding a small perturbation, imperceptible to humans, but significant enough for the model to change its prediction.

  • Which machine learning models are vulnerable to adversarial attacks?

    Any machine learning model used in a real-world scenario is subject to adversarial attacks. This includes computer vision models used in self-driving cars, facial recognition systems used in airports, or the speech recognition software in your cell phone assistant.

  • What is an adversarial attack?

    An adversarial attack is a strategy aimed at causing a machine learning model to make a wrong prediction. It consists of adding a small, carefully designed perturbation to a clean image that is imperceptible to the human eye, but that the model perceives as relevant enough to change its prediction.
