VISION TRANSFORMERS UNDER DATA POISONING ATTACKS

by

GABRIEL PEERY

A THESIS

Presented to the Department of Computer Science and the Robert D. Clark Honors College in partial fulfillment of the requirements for the degree of Bachelor of Science

Spring 2023

An Abstract of the Thesis of Gabriel Peery for the degree of Bachelor of Science in the Department of Computer Science to be taken June 2023

Title: Vision Transformers Under Data Poisoning Attacks
Approved: Thanh Nguyen, Ph.D., Primary Thesis Advisor

Owing to state-of-the-art performance and parallelizability, the Vision Transformer architecture is growing in prevalence for security-critical computer vision tasks. Designers may collect training images from public sources, but such data may be sabotaged; otherwise natural images may have subtle patterns added to them, crafted to cause a specific image to be incorrectly classified after training. Poisoning attack methods have been developed and tested on ResNets, but Vision Transformers' vulnerability has not been investigated. I develop a new poisoning attack method that augments Witches' Brew with heuristics for choosing which images to poison. I use it to attack DeiT, a Vision Transformer, while it is fine-tuned for benchmarks like classifying CIFAR-10. I also evaluate how DeiT's image tokenization introduces risk in the form of efficient attacks where sample modification is constrained to a limited count of patches. Progressively tightening constraints in extensive experiments, I compare the strength of attacks by observing which remain successful under the most challenging limitations. Accordingly, I find that the choice of objective greatly influences strength. In addition, I find that constraints on patch count deteriorate success rate more than those on image count. Attention rollout selection helps compensate, but image selection by gradient magnitude increases strength more. I find that Mixup and Cutmix are an effective defense, so I recommend them in security-critical applications.

Acknowledgments

I would like to extend great thanks to my advisor, Prof. Thanh Nguyen, who has provided me with immense guidance and support through completing this project and thesis. I would also like to thank my committee members, Prof. Lindsay Hinkle and Prof. Daniel Lowd, for their helpful recommendations and encouragement. I am also thankful for the help I received from Dr. Michael Sacks and Shruti Motiwale during my internship at UT Austin; the research skills I learned from them have also been a great help to this project. Next, I would like to thank Prof. Joe Sventek for his support through the process. I would also like to thank my friend Chester Mantel, who has inspired me to be the best researcher I can be. I would like to thank Prof. Corinne Bayerl for her help as I was beginning the thesis writing process and thank Prof. Arkady Vaintrob, who helped me refine my mathematical skills.

I would also like to thank the PURS program for the scholarship funding they have provided me for this research. In particular, I would like to thank those people involved in the program who have given me helpful input for this project, including Karl Reasoner, Ashley Mapile, Konnor Jones, and other students in the PURS class. I am also grateful to the Merle S. & Emma J. West Scholarship Committee for helping fund my educational journey, as well as the PathwayOregon program. I would like to thank the Clark Honors College for instilling in me the persistence needed to complete this thesis.
I am also very grateful to the University of Oregon Department of Computer Science and Department of Mathematics for bringing together the MACS degree program which has led me to this point. I also extend thanks to the University of Oregon's Research Advanced Computer Services for offering the Talapas cluster which I used extensively for my experiments. Finally, I would like to thank my parents and family who have provided me endless loving support.

1 Contents

1 Contents
2 List of Tables
3 List of Figures
4 Introduction
5 Related Work
  5.1 Test-Time Attacks
  5.2 Train-Time Attacks
  5.3 Defense
6 Background
  6.1 Deep Neural Networks
  6.2 Vision Transformer Architecture
  6.3 Attention Rollout
  6.4 Witches' Brew
  6.5 Mixup and Cutmix
  6.6 Gradient Aggregated Similarity
7 Attack with Choices for Transformers Methods
  7.1 Problem Formulation
  7.2 Clean Model Training
  7.3 Surrogate Objective Creation
  7.4 Image Selection
  7.5 Patch Selection
  7.6 Poison Generation
8 Experiments and Results
  8.1 Preliminary Witches' Brew Evaluation
    8.1.1 Model Training Procedure
    8.1.2 Mid-Training Poisoning
    8.1.3 Heavy Augmentation Defense
    8.1.4 Light Augmentation Poisoning
  8.2 Model Training Procedure
  8.3 Evaluation Criteria
  8.4 ACT Method Evaluation
    8.4.1 Universal Gradient Patch Selection
    8.4.2 Individual Gradient Patch Selection
    8.4.3 Attention Rollout Patch Selection
    8.4.4 Gradient Image Selection
    8.4.5 GAS Image Selection
    8.4.6 Mixup and Cutmix Defense
9 Discussion and Conclusion
10 Appendix
  10.1 96 x 96 CIFAR-10 Experiments and Results
  10.2 Poison Brewing Loss Curves
11 References

2 List of Tables

1 Table of Notation
2 Universal Gradient Patch Selection Success Rates
3 Image Selection Attack Strength Comparisons
4 CIFAR-10 Mixup and Cutmix Average Attacker Losses

3 List of Figures

1 Feed-Forward Neural Network
2 Gradient Descent
3 Image Tokenization
4 Mixup Augmentation Example
5 Cutmix Augmentation Example
6 ACT Method Flowchart
7 Mixup-Cutmix Training Witches' Brew Poison Examples
8 Baseline Witches' Brew Attacker Loss Curves
9 Target Image Example
10 Universal Gradient Patch Selection Poison Examples
11 Individual Gradient Patch Selection Poison Examples
12 Individual Gradient Patch Selection Probability Shift Curves
13 AR-PS+R-IS Distribution Shifts with k = 8
14 Attention Rollout Patch Selection Distribution Shifts with ϵ = 16
15 Attention Rollout Patch Selection Distribution Shifts with 2% Poison Budget
16 Gradient Image Selection Examples
17 96 x 96 Poisoned Image Example
18 Selection of Poison Brewing Curves

4 Introduction

Machine learning employs algorithms that allow a computer to learn how to perform some task from examples. These examples may be from various media such as text, audio, images, or combinations thereof. A machine learning model accepts some input and performs a task as output, such as predicting the price of a home from details on its rooms, predicting the genre of music from audio input, or classifying a picture into one of several categories. The deep learning approach to machine learning is to learn the weights of a neural network with many layers, in doing so creating a hierarchical representation of knowledge about the task (Goodfellow et al., 2016). Today, a form of deep neural network called a pretrained foundation model (PFM) is the state of the art for a range of benchmark tasks in natural language processing, computer vision, and more (Zhou et al., 2023). Such PFMs most frequently use the Transformer architecture (Zhou et al., 2023). While the architecture was originally designed for tasks involving natural language (Vaswani et al., 2017), Vision Transformers adapt the architecture for computer vision, allowing state-of-the-art-competitive performance on common benchmark tasks involving images (Zhou et al., 2023). Given the emerging prevalence of Vision Transformers, here I investigate the unique security challenges they face. I focus on the specific model DeiT (Touvron et al., 2021), which is smaller than PFMs but shares the same Transformer architecture.
An important reason to be concerned about the susceptibility of machine learning to attacks is that in recent years it has been deployed in settings where safety is imperative (Chakraborty et al., 2018). Vision Transformers in particular require large amounts of data to train, so images may be collected from the open internet. For example, the JFT-300M dataset was built from images across the web (Sun et al., 2017). Aware of this web scraping possibility, hostile actors may upload data to the Internet that hijacks the training process to induce undesirable behavior during test time. In the case of images, such tailoring of data may take the form of subtle modifications to pixel-by-pixel information. Here, I assess Vision Transformers' unique vulnerability to such train-time attacks by conducting extensive experiments on using heuristics to quickly create stealthy attack-image sets.

I devise a new class of methods, the Attack with Choices for Transformers (ACT) methods, for data poisoning attacks that target Vision Transformers. The data poisoning attack setting involves a specific form of train-time attack in which the attacker is constrained to only modifying features of training data, but neither labels nor test data. The previous state-of-the-art (SOTA) method for poisoning attacks, Witches' Brew, is able to induce a small collection of images to be misclassified as a specific target class with as little as 0.1% of the training data modified and with an ϵ = 32 magnitude constraint (Geiping et al., 2021). Witches' Brew has previously been evaluated only on VGG and other CNN architectures; I attempt to use it to attack the Vision Transformer DeiT. I find that both the network's standard training regimen and the architecture itself are unusually robust against such Witches'-Brew-created attacks. To preemptively investigate future threats, I create an entire class of attack methods to target Vision Transformers specifically, unified as ACT methods, of which Witches' Brew is but one variation. ACT methods are heuristic-based: they involve multiple stages of choosing images and patches of images to alter, followed by a modified version of Witches' Brew's gradient matching procedure. I perform extensive experiments to test multiple choices of heuristic and finally perform successful data poisoning attacks on DeiT-Ti/16 models for classifying CIFAR-10. I consider the most likely and credible attack on the training process to be one targeted at a specific downstream task learned by fine-tuning from a pretrained model, given that PFMs are designed for this use case (Zhou et al., 2023). Accordingly, the ACT methods target Vision Transformers as they are fine-tuned from publicly available pre-trained weights.

Existing work on train-time attacks in computer vision largely focuses on architectures such as support-vector machines and convolutional neural networks (Chakraborty et al., 2018). If Transformers are considered at all, test-time and backdoor attacks are the focus. Test-time attacks specific to Transformers have attempted to exploit their segmenting of images into patches, or tokens, to perform computationally cheaper attacks (Gu et al., 2022; Joshi et al., 2021). Backdoor attacks may similarly leverage image tokenization (Lv et al., 2021; Subramanya et al., 2022), but such attacks require modification of test data that is prohibited in the poisoning attack setting.
Inspired by the success of patch-based methods in other attack settings, for the first time I evaluate heuristics for choosing a limited number of patches to modify when performing data poisoning attacks. In particular, I explore the possibility of computational gains where speedup is realized by constraining poison noise to a limited number of patches per image. Effective data poisoning attacks are notoriously expensive to compute (Geiping et al., 2021; Huang et al., 2021; Shafahi et al., 2018). I investigate whether the token structure of Vision Transformers can be exploited with heuristics for faster computation of effective attacks.

In summary, I answer three key questions: Are Vision Transformer classifiers vulnerable to fast train-time attacks optimized by exploiting tokenization of images? What heuristics for image selection can lead to more effective train-time attacks against them? How can Vision Transformer classifiers be defended against train-time attacks? I answer these questions, provide additional contributions in the form of extensive experiments conducted in the process, and evaluate the effects of Mixup on robustness against train-time attacks for the first time.

I deploy heuristics in two ways: selecting promising clean images and selecting promising patches within individual images. As opposed to the random selection of images in Witches' Brew, ACT methods attempt to increase attack strength by estimating which images can be modified most effectively for achieving the attacker's objective. Selecting promising patches likewise involves estimating which patches can be modified most effectively for attack success, but is also useful for reducing wasted computation time generating noise in ineffective patches. For both image and patch selection, the first heuristic I evaluate is gradient magnitude of the attacker's surrogate objective. Second, I use attention rollout (Abnar & Zuidema, 2020) as a patch-selection heuristic. Given attention rollout's purpose of estimating which tokens are most influential in determining another token deeper in the network, I evaluate patch-constrained poisoning attacks that select the top-k patches of individual training examples by greatest attention rollout values. Third, I evaluate GAS (Hammoudeh & Lowd, 2022) as an image selection heuristic due to its effectiveness in identifying images involved in train-time attacks and the mathematical elegance of using it with the ACT method. To accommodate heuristics, I modify Witches' Brew's poison brewing algorithm to keep generated poison within patch constraints. If images are selected randomly and k is set to the total number of patches per image, then the resulting ACT method is equivalent to Witches' Brew, which I find to be ineffective at poisoning Vision Transformers. Thus, I deploy heuristics to create stronger attacks.

To evaluate the effects of heuristics, I conduct extensive experiments with different combinations thereof, various network configurations, training data augmentations, attacker objectives, and constraints. For network configuration, I vary the number of tokens into which an image is split. For token-constrained attacks, I evaluate the influence of coverage percentage on attack strength. I also compare attack success rates when using different image augmentations to observe how the latter may serve as a defense.
I perform multiple attacks per strategy, each corresponding to a single target image and target class pair (where an attacker wishes to induce the model to erroneously classify the target image as the target class after training). This variety allows me to determine how the choice of target modulates the difficulty of the attack. Finally, I measure attack strength by observing whether attacks are successful when stronger constraints are imposed on poison magnitude, poison budget of images, and percent coverage of images.

5 Related Work

A machine learning model is under attack when some adversary, with goals contrary to the model owner, influences training or testing data to cause undesirable behavior. To induce vision models to classify images differently than a victim intends, an adversary may subtly change the colors of individual pixels in an image such that it appears normal to humans, but is treated significantly differently by the model than the unmodified clean image (Wang et al., 2019).

Existing literature on adversarial machine learning distinguishes types of attack according to when the adversary may perform modifications to data: train or test time. In test-time attacks, a model is already trained and the attacker wishes to provide input data to produce unexpected behavior (Gu et al., 2022; Joshi et al., 2021; Lv et al., 2021; Wang et al., 2019). For example, to make their own face be errantly accepted by an already-deployed facial recognition model regulating access to a door, an attacker may add a small amount of carefully constructed noise to a self-portrait. Train-time attacks, in contrast, only permit an attacker to modify data used in training. The attacker chooses modifications to promote unintentional behavior on normal data encountered after training (Wang et al., 2019). In the facial recognition example, the attacker would add noise to images of other people for the model to train on. This noise would be carefully constructed to cause the model to open the door on images of the attacker during test time, without any modification to their own picture at all. Some literature covers the case where an attacker modifies the labels of training data, instead of the features, to similarly induce problematic test-time classification (Wang et al., 2019).

When an attacker may modify both training and testing data, backdoor attacks are a concern. In the computer vision version of this setting, the attacker's goal is to create some trigger pattern that may be added to images encountered during test time. The model would behave normally when it is absent, but in an undesirable way when it is present. Using training data access, the attacker introduces examples of adversarial behavior with the trigger. After training time, the attacker adds the trigger pattern to an input image in hopes of producing undesirable behavior (Subramanya et al., 2022). Existing work on these attacks against Vision Transformers has leveraged their tokenization of images for both attacking and defending. Lv et al. (2021) use attention rollout, a Transformer-specific metric of the importance a model places on various patches of an image, to generate effective trigger patches. Subramanya et al. (2022) leverage the same metric, as well as Grad-CAM and Full-Gradient, for detecting and eliminating triggers in test-time images. They were able to successfully eliminate backdoors generated by state-of-the-art (SOTA) attack techniques.
Like both of these studies, this work seeks to leverage attention rollout as a useful metric specific to Vision Transformers. Unlike backdoor attacks, I consider the train-time attack setting where only training images may be manipulated to cause malfunction when classifying clean test images. Unlike the backdoor studies' usage, I use attention rollout to reduce computation time by selecting which parts of an image to poison; I do not use it as a loss during noise generation or as a detection heuristic.

5.1 Test-Time Attacks

Test-time attacks aim to produce unexpected behavior after training. An attacker may or may not have special knowledge of the architecture of the model, the data used to train it, or the training algorithm. To perform attacks, adversaries may use a surrogate model of their own design, expected to be similar to the attacked model, to see how its output changes with slight variations in input data. According to this sensitivity, adversaries may subtly change an otherwise legitimate image to make an attacked model classify it as the attacker wants instead of the class it actually belongs to (Wang et al., 2019).

So far, most work on the security of Vision Transformers has focused on the test-time attack setting (Gu et al., 2022; Joshi et al., 2021). This thesis, in contrast, features an attacker who causes an unmodified target test image to be classified incorrectly by virtue of subtle modifications to training data rather than to that target image. However, I adapt some aspects of existing Transformer-specific test-time attacks to the train-time setting. Joshi et al. (2021) consider an attacker constrained to modifying a limited number of image patches, corresponding to token inputs, in addition to the usual ℓ∞ constraints on noise magnitude. They find that Transformers are susceptible to such attacks, more so than ResNet and MLP-Mixer architectures. I evaluate the success of train-time attacks with the same constraint, but now per-image for samples in the training set instead of in a single adversarial example. My investigation both expands knowledge of the sensitivity of the Transformer architecture and creates the possibility of discovering computationally cheaper attacks.

Gu et al. (2022) use attention rollout as an aid to interpret the heightened sensitivity of Transformers to patch-based attacks. They attribute this vulnerability to adversarial patches' ability to capture most of the attention of the model when classifying images. Given this discovery of the relevance of attention to attack success, I evaluate attention rollout (Abnar & Zuidema, 2020) as a heuristic for choosing which patches to poison in a constrained setting. Unlike Gu et al. (2022), I use attention rollout as a heuristic for performing train-time attacks, rather than as a tool for interpreting why constrained test-time attacks are successful.

5.2 Train-Time Attacks

In contrast to test-time attacks, train-time attacks assume an adversary has the ability to control training data (Wang et al., 2019). Generally, an attacker wants to be stealthy in some way, which takes the form of constraining their control to a limited number of images and only modifying them to a limited extent. This way, the images appear normal (Muñoz-González et al., 2019). The former constraint on image count may additionally be imposed by the attacker only being able to insert new training data.
Data poisoning attacks are a specific variation of train-time attacks in which an attacker inserts crafted inputs with clean labels to compromise the model later (Chakraborty et al., 2018). I adapt the way these samples are crafted to specifically tailor them to the Transformer architecture. Namely, I expand the current SOTA poison sample generating algorithm, Witches' Brew, into a broader class of ACT methods, which may use Transformer-specific heuristics for more effective attacks.

There are two parts of training data that an attacker may modify: features and labels. Zhao et al. (2017) created a method for attackers to choose a limited number of labels to change, without modifying the features, while still mounting an effective attack. In contrast, I create a clean-label attack which only modifies features. I also use heuristics for choosing which training samples to modify, but these heuristics are based on considerations of model sensitivity and are not derived from approximating a dual problem, in contrast to Zhao et al. (2017)'s heuristic motivation.

An alternative way to modify training data is to start with a collection of regular images and then find poison noise, subtle fluctuations in color, to add to those images. Witches' Brew, a clean-label poison brewing algorithm that generates poison noise using gradient matching, uses exactly this strategy. It achieved a greater attack success rate with small poison budgets for single-target attacks than the previous SOTA Poison Frogs algorithm (Shafahi et al., 2018). At least, it did when attacking ResNet-18 models (Geiping et al., 2021). Here, I evaluate the effectiveness of this attack on Vision Transformer architectures for the first time and propose improvements, including some specific to the attention mechanism. Learning each individual color value for each pixel can take a great deal of memory and computational power, so my project aims to reduce the number of values that need to be found by leveraging heuristics, including attention rollout (Abnar & Zuidema, 2020), which is unique to the Transformer architecture.

5.3 Defense

For defending against deliberate interference during either train or test time, approaches include deliberately training with poisoned data, detecting and rejecting the presence of poison, and removing any poisonous aspects of images (Chakraborty et al., 2018). In data sanitization, strategies include adding random noise, compressing images, and reducing image resolutions (Wang et al., 2019). To detect the presence of poison even when a detector machine learning model may be poisoned itself, Razmi and Xiong (2021) used a combination of auto-encoders and classifiers to create a metric of how different clean and poisoned data are. In contrast to these techniques, I evaluate how data augmentation promotes robustness of Vision Transformers against poisoning attacks. A couple of forms of augmentation recommended for Vision Transformers (Touvron et al., 2021) are Mixup and Cutmix, which are both notable for promoting robustness against test-time attacks (Yun et al., 2019; Zhang et al., 2018). Here, I investigate training Transformers under attack both with and without these recommended augmentations to evaluate whether they promote robustness against train-time attacks, an aspect of Mixup and Cutmix which has not been investigated, to my knowledge, for Vision Transformers.
Detection is one approach to defending against train-time attacks, and may be the first step to correcting attacker influence or identifying attacker goals. Hammoudeh and Lowd (2022) introduce gradient aggregated similarity (GAS) as a metric of the influence of individual training samples on the final model prediction on test data. This is especially useful for detecting poisoned training images crafted by an attacker, since they tend to have disproportionate influence on particular incorrectly-classified images. They use GAS primarily for detection, while in this thesis I use it as a metric for an attacker to optimize when crafting attacks. Namely, I consider an attacker choosing to poison images that are already the most influential in determining a target image's classification, using GAS as a metric of such influence.

6 Background

Machine learning is an approach to artificial intelligence that attempts to mimic humans' ability to improve at performing a task through watching examples and practicing. Inspired by the brain, artificial neural networks provide computationally feasible means for computers to learn to perform tasks intuitive for humans, yet otherwise intractable to implement programmatically, such as classification of images. In recent years, the Transformer architecture has grown in prominence due to its speed, scalability, and highly competitive performance on a variety of tasks. Previous neural network designs are vulnerable to hijacking merely by the presence of a small amount of modification to training examples. The vulnerability of Vision Transformers such as DeiT-Ti/16 to this threat has not yet been evaluated, but the structure to which they owe their success may make them more susceptible. Heavy augmentation, such as that provided by Mixup and Cutmix, may improve robustness against train-time attacks, but this has not yet been investigated. Before diving into these questions, let me establish the context of the deep learning field up to these Transformers and explain some topics related to my investigation of their security.

6.1 Deep Neural Networks

The goal of machine learning, in mathematical terms, is to create a function f on inputs 𝒳 whose output f(x), x ∈ 𝒳, represents performing some task. In the supervised classification setting of this work, 𝒳 contains images and f's task is to partition 𝒳 into a limited number of categories. Specifically, f(x) is a vector of probabilities of x being in each possible class. The computer's learning a function f is preferred to a human directly programming f due to the disparity between the human intuition of recognizing images and the rigid mathematical rules of computer operation (Goodfellow et al., 2016). Instead of providing an intensional description of the task, computer code to compute f(x), a human programmer describes the higher-order task of learning. The mechanical logic written by the engineer merely instructs the computer to learn to generalize from some examples in the extension of the task. In the rest of this section, I will denote these examples as (x, y) ∈ X × Y, where X ⊂ 𝒳 are training example images (features) and Y are their corresponding human-designated labels.

An Artificial Neural Network (ANN) is one form, or model, that a function f may take when both X and Y can be represented as sets of vectors. ANNs are defined by two components: a collection of values (parameters) and a way those parameters are arranged for mathematical operations with inputs (the architecture or topology of the network).
Although great variation exists in ANN architecture, including the Transformer architecture which will be discussed later, feed-forward networks (FFNs) are "the quintessential deep learning models" (Goodfellow et al., 2016). Inspired by information flow between neurons of biological neural networks, FFNs are arranged in L layers of intermediate functions fi, i ∈ {1, . . . , L}, so that

\[
f(x) = f_L\big(f_{L-1}(\dots f_2(f_1(x)) \dots)\big)
\]

Each of these intermediate functions fi takes the output (feature map) from the previous layer, xi−1 (or the original input x0 = x), and obtains a new vector by matrix multiplication, Ai xi−1. The parameters of the FFN are contained within all of these Ai's. The layer then applies an activation function ϕi to obtain its own feature map: xi = fi(xi−1) = ϕi(Ai xi−1). The ANN's output vector y = f(x) is simply the final layer's output y = xL. See Figure 1 for a graphical representation.

Figure 1: An input x is passed through a FFN, creating a collection of feature maps and finally the output y.

Training is the process of improving how well a model performs by adjusting its parameters. Before training, parameters are initialized, often randomly, to values not expected to result in a well-performing function f. Training proceeds in a process of repeatedly providing f input data x ∈ X, evaluating performance by comparing f(x) = y ∈ PY to true labels ŷ ∈ Y, and adjusting the parameters for better performance. This performance is evaluated according to a loss function L(y, ŷ), such as cross-entropy loss for classification. This loss ultimately depends on the collection of parameters θ of the model f and the input x, the full form being L(f(x; θ), ŷ). So, for ANNs the idea of training with empirical risk minimization is to use a local optimization algorithm, gradient descent, to find the parameters θ that minimize the loss on the example images X with respect to the desired outputs Y. At each epoch of this iterative algorithm, first the gradient of the loss L with respect to the current parameters θt is calculated. Then, the parameters are updated to θt+1 so that the loss should be less than before,

\[
\theta_{t+1} = \theta_t - \gamma \sum_i \nabla_\theta L\big(f(x_i; \theta), y_i\big)\Big|_{\theta = \theta_t}
\]

where γ is the learning rate, tuned to control the stability and speed of convergence of the algorithm. Variations on the gradient descent optimization algorithm such as Adam (Kingma & Ba, 2017) may differ in which samples are used for the update, how quickly to update, or which parameters to update. However, they are all unified by the principle of moving downhill in the gradient landscape of the loss function. See Figure 2 for a low-dimensional representation of how parameters are updated.

Figure 2: A low-dimensional representation of gradient descent. The point θ represents the parameters of a model achieving some high loss. At each step of gradient descent, the local landscape of the loss function L, represented by the blue surface, informs the direction of parameter update. This generally happens in a high-dimensional parameter space, but this example shows how two parameters might be updated.

The most computationally expensive part of gradient descent is the calculation of the gradient itself, but thanks to backpropagation (Rumelhart et al., 1986) a fast algorithm exists for updating the parameters in the matrices of FFNs, and other ANNs like Transformers. In any case, the parameters are updated until the loss converges over training epochs to a value that is hopefully small enough that the model f performs its task as intended, even generalizing beyond the examples seen in the training set X.

The testing phase begins after f's training is complete. Part of this stage of the machine learning pipeline may involve evaluating f's performance on never-before-seen data to determine if training was successful.
Here, I use the convention that it also refers to the ANN being deployed to perform its task for users after model engineers are satisfied with its performance. This work concerns how attackers with access to training data X may hijack training so that during this testing time a specific never-before-seen clean image will be classified incorrectly by the Vision Transformer f. Such attacks on ANNs are concerning due to these models' increasing prevalence for security-critical tasks in the form of PFMs, which predominantly use the Transformer architecture (Zhou et al., 2023).

6.2 Vision Transformer Architecture

The Transformer architecture was originally designed for ANNs that process natural language text, and to that end it attained SOTA performance over its contemporaries as measured by translation task metrics, and it even trained faster due to parallelizability (Vaswani et al., 2017). After its development, it was found to be an effective and performant architecture for other tasks whose inputs and outputs can be represented by sequential information, specifically in the form of PFMs (Zhou et al., 2023). The focus of this thesis is Transformers for computer vision, so I will outline how they work in this context. In computer vision, an input image is divided into a sequence of smaller patches of the image, themselves represented by vectors of numbers corresponding to pixel color. This sequential data is passed through the Transformer, forming intermediate feature maps between blocks. These feature maps are sequentialized for compatibility with the next block's sequential input. Each block uses the attention mechanism in combination with small sub-FFNs for embeddings so that various patterns in patches attend to each other in different ways to inform the final classification output. At all positions at or before Transformer blocks, each element of sequentialized data is called a token.

For Vision Transformers such as ViT (Dosovitskiy et al., 2021) and DeiT (Touvron et al., 2021), tokenization of input images begins by splitting pixel-by-pixel data into uniformly sized patches across the image, such as 16 by 16 pixel squares as seen in Figure 3. Color data for each patch is embedded into vectors. Due to symmetry in the attention mechanism and architecture as a whole, it is necessary to add a special marker of whence in input images each token was derived, known as a positional embedding. These markers may take various forms and patterns, sometimes even being treated as parameters to be learned, but they are unified in application to color vectors via vector addition, each embedding unique to a position in the original image. Other Vision Transformers like Swin (Liu et al., 2021) tokenize a bit differently, but still use the ideas of subdividing an image into patches and marking original positions of tokens. In the case of classification, an additional token called the classification (CLS) token is appended to the front of all the patch-based tokens. At the final output of all attention blocks in the Transformer, this CLS token position is used as the input of a small FFN with output corresponding to probabilities of the input image being in various classes.

Figure 3: How a 64x64 pixel image may be tokenized into 16x16 pixel patches to be input to a Vision Transformer.
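To make the tokenization just described concrete, the following sketch shows one way the patch embedding, CLS token, and positional embeddings could be implemented in PyTorch. It is a minimal illustration using DeiT-Ti-like sizes (16 by 16 pixel patches, 192-dimensional embeddings); the class name and exact details are my own assumptions, not the DeiT reference implementation.

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Illustrative sketch: split an image into 16x16 patches, embed each patch,
    prepend a CLS token, and add positional embeddings (DeiT-Ti-like sizes)."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=192):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution with kernel = stride = patch size is equivalent to
        # a per-patch linear projection of the flattened pixel colors.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                              # x: (B, 3, H, W)
        tokens = self.proj(x)                          # (B, D, H/16, W/16)
        tokens = tokens.flatten(2).transpose(1, 2)     # (B, N, D), one token per patch
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)       # prepend the CLS token
        return tokens + self.pos_embed                 # mark each token's original position

tokens = PatchTokenizer()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 192]): 196 patch tokens plus CLS
```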
In order to understand how a full Transformer processes its tokens, it is first important to understand a single Transformer block. Such blocks are analogous to layers in a FFN, since they are arranged sequentially, passing output sequences to the next block's inputs, but they include more structure than simply a linear map and activation function. First, each token is passed through a normalization layer to keep the distribution of numerical values encountered regular across various inputs and blocks. The result is input to three separate, learned, linear maps to compute three collections of vectors: queries, keys, and values. These collections are manipulated and combined with a clever series of mathematical computations to produce new tokens in what is known as the self-attention mechanism. Namely, the linear maps are represented by matrices AQ, AK, and AV for queries, keys, and values respectively, and they are combined as follows, where X is a matrix containing each vector input token to the block:

\[
Q = A_Q X \qquad K = A_K X \qquad V = A_V X
\]

\[
\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right) V
\]

where √d is a normalization term corresponding to the dimensionality d of the input (Touvron et al., 2021). The result is that a series of tokens, the same number as were input, are generated as the output of this attention function. These output tokens are added back to the original input tokens, a feature of modern ANN architectures called a residual connection¹. All of this self-attention mechanism may be stacked into multiple channels if multi-head self-attention is used, in which case there are some number of independent query, key, and value maps whose attention outputs are all concatenated together. The feature map after the residual connection is normalized again and passed through a very small FFN, usually one or two layers. Finally, the tokens before the second normalization are added back in to form another residual connection before the final output of the block is passed to the next block (Touvron et al., 2021).

¹Such connections were introduced by ResNets and are notable in that they allow training very deep networks without vanishing gradients. They heralded an era of great improvements in SOTA performance on computer vision tasks (He et al., 2015).

Here, I focus on the specific model DeiT-Ti/16 due to its being a small enough Vision Transformer to allow completing experiments in a reasonable time on the hardware available to me. The wider class of DeiT models achieved better top-1 classification accuracy than ViT on benchmarks such as ImageNet (Deng et al., 2009; Russakovsky et al., 2015), even without distillation, hence I chose this class of small model and do not use distillation. This variation of DeiT-Ti segments images into 16 by 16 pixel squares, each of which is projected to a 192-dimensional vector embedding to form tokens. It uses GELU activation, its CLS token is trainable, and each of its 12 attention blocks has 3 heads (Touvron et al., 2021). It is neither a computer vision PFM like PeCo (Dong et al., 2022) nor is it as large as the 2.1 billion parameter SOTA Vision Transformer without convolutions, CoCa (Yu et al., 2022). I aim to probe the security risks of this smaller Vision Transformer so that future investigations of larger ones may be directed toward the most important threats. To this end, I focus on a SOTA data poisoning technique augmented by Transformer-specific heuristics such as attention rollout.
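For readers who prefer code to equations, here is a minimal sketch of the block just described: a single-head version of the self-attention computation together with the surrounding pre-norm layers, small FFN, and residual connections. It is illustrative only; DeiT uses multi-head attention with 3 heads per block, and the module names and hidden size here are assumptions.

```python
import math
import torch
import torch.nn as nn

class SingleHeadSelfAttention(nn.Module):
    """Minimal sketch of the attention computation above (one head, no dropout)."""

    def __init__(self, dim=192):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)   # plays the role of A_Q
        self.to_k = nn.Linear(dim, dim)   # plays the role of A_K
        self.to_v = nn.Linear(dim, dim)   # plays the role of A_V

    def forward(self, x):                 # x: (B, N, dim) sequence of tokens
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(x.shape[-1]), dim=-1)
        return attn @ v                   # same number of tokens out as in

class TransformerBlock(nn.Module):
    """One block: pre-norm attention and a small FFN, each wrapped in a residual connection."""

    def __init__(self, dim=192, hidden=768):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = SingleHeadSelfAttention(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        x = x + self.attn(self.norm1(x))      # residual connection around attention
        return x + self.ffn(self.norm2(x))    # residual connection around the FFN
```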
6.3 Attention Rollout

Given the structure of the Transformer architecture, it may seem natural to interpret the magnitude of raw attention values (the output of the softmax) at block ℓ as a measure of the importance of corresponding input tokens of the whole network to determining the output tokens of block ℓ. However, Abnar and Zuidema (2020) found that this metric is actually inadequate for determining relative contributions of various tokens in a text classification Transformer. Instead, they created attention rollout as a better metric of the contribution of tokens at layer i to those at layer j. Where Aℓ is the matrix of raw attention values at layer ℓ, the attention rollout from block i to block j ≥ i, Ã(i, j), is recursively defined as follows for a Transformer with residual connections:

\[
\tilde{A}(i, j) =
\begin{cases}
A_i & \text{if } i = j \\
\tilde{A}(i, j-1)\,(A_j + I) & \text{otherwise}
\end{cases}
\]

For the purpose of this work, I calculate Ã(1, L) for a Vision Transformer with L layers. I then use various entries in the matrix as a metric of how important patches in an input image are to the final prediction of a classifier. Inspired by its use for test-time and backdoor attacks by Gu et al. (2022), Lv et al. (2021), and Subramanya et al. (2022), I evaluate its potential to allow computationally cheaper attacks in a token-count-restricted data poisoning setting.
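A minimal sketch of the Ã(1, L) computation just defined follows, assuming the per-block attention matrices have already been averaged over heads. The function name is illustrative; practical implementations often also re-normalize rows after adding the identity, which the recurrence above does not require.

```python
import torch

def attention_rollout(attentions):
    """Sketch of Ã(1, L) following the recurrence above.
    `attentions` is a list of per-block attention matrices, each of shape (N, N),
    assumed to be averaged over heads."""
    rollout = attentions[0]                        # base case: Ã(1, 1) = A_1
    identity = torch.eye(attentions[0].shape[-1])
    for attn in attentions[1:]:
        rollout = rollout @ (attn + identity)      # Ã(1, j) = Ã(1, j-1)(A_j + I)
    return rollout

# Assuming the CLS token sits at index 0, the CLS row of the result scores how much
# each input patch token contributes to the final classification position; the top-k
# entries can then be used to select patches.
```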
6.4 Witches' Brew

In the data poisoning attack setting, an attacker has the ability to control some amount of the training data images to some limited extent, especially in the context of models being retrained (Wang et al., 2019). In realistic settings, an attacker may have the ability to perform unlimited modifications, but may restrict their modifications to be stealthy enough so that model engineers do not immediately notice something is amiss with hostile training images. Additionally, such attacker access to training data is based on considerations of publicly sourcing training images. An attack on all the data is unlikely, but some small subset of it being poisoned is plausible. These observations inform constraints in the data poisoning attack setting both on the number of images that can be poisoned, n, and on the amount δ by which pixel values can be modified from an otherwise clean image (Muñoz-González et al., 2019). More often, I will refer to the constraint n on the number of images by what percentage of the total training set may be modified, known as the poison budget. In this thesis I introduce another constraint in the threat model for ACT methods: the number of patches k that may be modified per poisoned image. Rather than capture limitations on practical access by an attacker, this constraint leverages the Vision-Transformer-specific token structure to potentially allow computationally cheaper attacks. I investigate how an attacker may intelligently navigate these constraints on image and patch count using heuristics to maximize attack effectiveness.

By evaluating poisoning attacks in the context of retraining Vision Transformers, my analysis stays relevant to the most likely threat to exceedingly ubiquitous PFMs. These large models are made to be fine-tuned to a downstream task with new data (Zhou et al., 2023), which may be collected in a public setting like social media where an attacker may subtly influence uploaded images. Further, their attack may be informed by the same publicly available pre-trained weights as will actually be used by victims. Given such a potential for real-world attacks, investigating how they may be performed is important for understanding how to prevent breaches on increasingly-deployed Transformers in security-critical settings. Geiping et al. (2021) found that highly constrained attacks on ResNet and VGG architectures are possible and that the poison can be relatively quickly computed with their Witches' Brew algorithm. I evaluate its effectiveness on DeiT-Ti/16 for the first time and search for ways to make it faster and more effective on Vision Transformers generally.

Although data poisoning attacks may be carried out to achieve a wide variety of attacker objectives, such as reducing accuracy in general (Xiao et al., 2012), reducing performance on a specific subpopulation of the data (Jagielski et al., 2021), or even optimizing anything that can be modeled by a loss function (Huang et al., 2021), Witches' Brew focuses on the goal of making a single target image xT ∈ 𝒳 during test time be classified into a specific inaccurate category yT ∈ Y, both of which are inputs to Witches' Brew. The next input is a surrogate model, some classifier network meant to be as similar as possible, but not necessarily identical, to the model being attacked if it were trained on clean images. I consider an attack where the Vision Transformer architecture of the surrogate is identical to that actually being attacked, but randomization of training data and dropout is not necessarily the same, as expected from the practical threat model against PFMs. Given the aforementioned inputs, problem constraints, and hyperparameters, Witches' Brew uses gradient matching as described in section 7.3. The idea is that the attacker would like to, in a loosely constrained environment, introduce many sample pairs of (xT, yT) into the training data so that the association is memorized for the same behavior during test time. Due to the ϵ-constraint on image modification, however, the attacker may opt to introduce poison so that training on the poisoned data approximates how the parameters θ of the model would have updated if the target pairs were inserted. Since the direction of the loss gradient informs how parameters update, it is calculated on select training images with true target class yT. This gradient is compared with the gradient of the loss on the target pair (xT, yT), forming a higher-order surrogate attacker loss B upon which gradient descent may be used to optimize the subtle poison noise added to the images (Geiping et al., 2021).

6.5 Mixup and Cutmix

Mixup is a data augmentation routine recommended for Vision Transformers like DeiT (Touvron et al., 2021). Transformers in general require many examples before training can converge, which is one reason for the popularity of fine-tuning from PFMs. However, even fine-tuning can require lots of data (Zhou et al., 2023). To address this, a popular remedy for deep learning is dataset augmentation, which increases the effective number of training samples by performing transformations to the original data that do not change their meaning with respect to the task (Goodfellow et al., 2016).
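As a concrete illustration of such label-preserving transformations (distinct from the Mixup and Cutmix augmentations described next), a training pipeline might apply torchvision transforms like the following; the specific transforms and parameters here are illustrative and not the ones used in my experiments.

```python
from torchvision import transforms

# Illustrative label-preserving augmentations: each transformed image keeps its
# original label, effectively enlarging the training set.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.ToTensor(),
])
```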
Mixup is a special dataset augmentation that introduces transformations which combine multiple samples. In particular, it generates new samples from convex combinations of the original ones (Zhang et al., 2018). For example, let (x1, y1) and (x2, y2) be training samples with images x1, x2 and labels y1, y2. Further, let the labels be represented by one-hot encoded vectors – vectors in {0, 1}^N, where there are N classes, having 0's in all positions but the one corresponding to the class that the image belongs to. Mixup then samples a random variable λ ∈ [0, 1] to create a new training example (x′, y′) where

\[
x' = \lambda x_1 + (1 - \lambda) x_2 \qquad\qquad y' = \lambda y_1 + (1 - \lambda) y_2
\]

See Figure 4 for a visual example of this combination. Zhang et al. (2018), the creators of Mixup, find that this augmentation increases robustness to adversarial examples, as in test-time attacks. However, to the best of my knowledge, its ability to promote robustness against train-time attacks has not been evaluated. Given that Mixup is already recommended for Vision Transformers, I evaluate its effectiveness, in combination with Cutmix, in this capacity of defending against the ACT methods I develop.

Figure 4: A pair of CIFAR-10 training images are rescaled and normalized as they are in section 8.4, and then Mixup augmentation is applied to the pair. In this case, λ = 0.6 of each image remains while 1 − λ = 0.4 of the other image is added. Not shown here, but one-hot labels are also interpolated.

Cutmix likewise creates new training examples (x′, y′) from pairs of training samples (x1, y1) and (x2, y2). Like Mixup, one-hot labels are interpolated by y′ = λy1 + (1 − λ)y2. Unlike Mixup, which superimposes images together, Cutmix cuts and pastes contiguous chunks of images into each other. See Figure 5 for an example. The augmentation's creators, Yun et al. (2019), found that it promotes robustness against test-time adversarial examples, but to my knowledge its effectiveness defending against train-time attacks has not been evaluated. Its use is already recommended by Touvron et al. (2021) for performance reasons, but I will evaluate its ability to defend against poisoning attacks when it is deployed with Mixup for Vision Transformers. Another side of defense against poisoning attacks is detecting their presence, for which GAS (Hammoudeh & Lowd, 2022) is a useful metric.

Figure 5: A pair of CIFAR-10 training images are rescaled and normalized as they are in section 8.4, and then Cutmix augmentation is applied to the pair. Not shown here, but one-hot labels are also interpolated.

6.6 Gradient Aggregated Similarity

GAS is a training set influence estimator. Given some snapshots of a model f's parameters just before a collection of epochs τ, the learning rates λt at those epochs, and a sample (x̂, ŷ) of a test image x̂ and the prediction of its class by the model after training ŷ, GAS provides a measure of how important each training image was to that prediction. The idea is that the most influential images change the loss the most during training, which is exactly what its expression captures:

\[
\mathrm{GAS}_\theta\big((x, y), (\hat{x}, \hat{y})\big) = \sum_{t \in \tau} \frac{\lambda_t}{b} \cdot \frac{\Big\langle \nabla_\theta L(x, y; \theta)\big|_{\theta_{t-1}},\ \nabla_\theta L(\hat{x}, \hat{y}; \theta)\big|_{\theta_{t-1}} \Big\rangle}{\big\| \nabla_\theta L(x, y; \theta)\big|_{\theta_{t-1}} \big\|_2 \, \big\| \nabla_\theta L(\hat{x}, \hat{y}; \theta)\big|_{\theta_{t-1}} \big\|_2}
\]

where (x, y) is the training sample image and label whose influence is to be estimated and b is the batch size used in training.
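The following sketch shows how the GAS expression above could be evaluated with automatic differentiation, assuming single-sample gradients and a list of model checkpoints standing in for the epochs in τ; the function and argument names are illustrative rather than the reference implementation of Hammoudeh and Lowd (2022).

```python
import torch
import torch.nn.functional as F

def flat_grad(model, loss_fn, x, y):
    """Gradient of the loss on one (image, label) sample w.r.t. all parameters, flattened."""
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    grads = torch.autograd.grad(loss, [p for p in model.parameters() if p.requires_grad])
    return torch.cat([g.reshape(-1) for g in grads])

def gas_score(checkpoints, lrs, batch_size, train_sample, test_sample, loss_fn):
    """Sketch of GAS: a learning-rate-weighted sum of cosine similarities between the
    training-sample and test-sample loss gradients, one term per checkpoint in τ.
    `checkpoints` is a list of models with the parameters θ_{t-1} already loaded."""
    (x, y), (x_hat, y_hat) = train_sample, test_sample
    score = 0.0
    for model, lr in zip(checkpoints, lrs):
        g_train = flat_grad(model, loss_fn, x, y)
        g_test = flat_grad(model, loss_fn, x_hat, y_hat)
        score = score + (lr / batch_size) * F.cosine_similarity(g_train, g_test, dim=0)
    return score
```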
With a similar expression to the gradient matching objective used in Witches' Brew and my wider ACT methods, given in section 7.3, there is a mathematical elegance in using it here. While Hammoudeh and Lowd (2022) introduced it for detecting poisoning attacks, I will use it as something for an attacker to optimize in attack methods using image selection. In particular, I will select a τ with a single epoch, the last one, for using GAS as an image selection heuristic. This will make the generated poison samples boldly stand out if a victim uses GAS for detection, but I will evaluate if the resulting attack is any stronger.

7 Attack with Choices for Transformers Methods

I consider two agents at work: the victim V and the attacker A. The victim V's job is to train a Vision Transformer model f for classification. V has gathered images for training X along with their labels Y. Using some supervised training technique, the parameters θ of f will be fine-tuned from pre-trained weights for this classification task. Meanwhile, an attacker A knows V's plans for f. A has their own goal: to use limited access to the training images X to induce f to behave differently after training. Namely, f should classify as V intends except when the specific test image xT is input to f during testing. A wants the model f to classify this target image xT as belonging to some ground-truth-incorrect class yT. V trains f with the goal that all test images, including xT, are assigned their correct class with high probability, but A wants this goal to fail on xT. Note that I will sometimes refer to xT's true class as the poison class, as opposed to the target label yT. I use this convention because I draw images to poison from xT's true class, as discussed in section 7.4. In this section, I will describe my Attack with Choices for Transformers (ACT) method for the attacker A to achieve their goal in the poisoning setting.

In this white-box attack setting (Chakraborty et al., 2018), where attacker A has knowledge of the architecture and training procedure of f, A will carefully compute how to achieve their own goal. Below, I present the details of a category of attack methods A may use for this computation. While section 5 touches upon extant methods in the literature that A may use, I present a superclass of Witches' Brew, the ACT methods, made to target Vision Transformer victim models f. Writing from the perspective of the attacker A using this method, I present a thorough mathematical description of the setting and then describe this patch-based attack method.

To use this novel method, A first trains a surrogate clean classification model fc of their own using V's clean training data (X, Y). A performs this training to sample a possible configuration of f's parameters θ during training. With such a snapshot, A uses fc as part of heuristics for choosing images XP ⊂ X to alter and which patches within those images to constrain modifications to, and as an input to a modified Witches' Brew algorithm for the computation of poison noise ∆.

7.1 Problem Formulation

V will train a machine learning model f : 𝒳 → PY, f ∈ H, on training images X ⊂ 𝒳 with labels Y. f takes images as input and outputs a probability distribution over the labels, an element of PY. In this classification task, Y is a set of classes that partition the images in 𝒳. The hypothesis space of models H is all Vision Transformers, whose specific network topology is described in section 6.2.
A can choose a subset of the images XP ⊂ X to alter before V uses them in training. Call the remaining images XC = X \ XP. A wants f to classify some image xT ∉ X with label yT ∈ Y with as high a probability as possible after training. I model this attacker objective with a loss function Latt(xT, yT; f) that evaluates how well f conforms to A's goal. The setting here is single-target: the attacker A wishes to induce f to classify a given single image xT as the given target label yT with maximal probability. This goal can be simply captured by cross-entropy loss. Where f_{yT}(xT) is the probability assigned to class yT by f:

\[
L_{att}(x_T, y_T; f) = -\log\big(f_{y_T}(x_T)\big)
\]

To minimize Latt, A will find adversarial noise to add pixel-by-pixel to images in XP, each image therein being given its own noise by the attacker. XP itself may be chosen from possible subsets of X using heuristics that I will describe in section 7.4; assume that this has already been done. Where |XP| = n is the maximum number of images A may choose, I impose an ordering over XP and call the instances of adversarial noise ∆ = {δi : 1 ≤ i ≤ n} to correspond to images xi ∈ XP. I consider H to be Vision Transformers, so I restrict δi to non-zero noise only in regions corresponding to a subset of k tokens (patches) in images as segmented by f. This restriction aims to hasten computation of the noise ∆ with only a minor penalty to poison effectiveness. To model this restriction, let M be a set of masks mi (1 ≤ i ≤ n) corresponding to images at indices i. Assign 1's to mi in all positions of pixels within selected tokens, and 0's everywhere else. For each image xi ∈ XP, the poisoned image is then given by x′i = xi + mi δi, where addition and multiplication are element-wise. Denote the poisoned dataset by X′P = {x′i : 1 ≤ i ≤ n}, noting that XP is composed of clean images that are selected to be later poisoned by the attacker to create X′P.

Figure 6: Flowchart of how the ACT method generates poisoned images. A rhombus shape indicates a for-loop over a set or number of steps.

A chooses patches either randomly or using heuristics.
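Before turning to those heuristics, the sketch below illustrates the masking step just described: building a pixel mask m_i over a chosen set of 16 by 16 patches and forming x′_i = x_i + m_i δ_i. The helper name, image size, and ϵ value are assumptions for illustration only.

```python
import torch

def patch_mask(img_hw, patch_indices, patch_size=16):
    """Build a binary mask with 1's over the selected 16x16 patches, 0's elsewhere."""
    h, w = img_hw
    mask = torch.zeros(1, h, w)
    patches_per_row = w // patch_size
    for idx in patch_indices:
        r, c = divmod(idx, patches_per_row)
        mask[:, r * patch_size:(r + 1) * patch_size,
                c * patch_size:(c + 1) * patch_size] = 1.0
    return mask

# Poisoned image x' = x + m * delta, with delta kept within the ||.||_inf <= eps budget
# (eps = 16 on a 0-255 scale, i.e. 16/255 for images scaled to [0, 1]).
x = torch.rand(3, 224, 224)                        # a clean training image x_i
delta = (torch.rand_like(x) * 2 - 1) * (16 / 255)  # example noise within the budget
m = patch_mask((224, 224), patch_indices=[0, 5, 42], patch_size=16)
x_poisoned = (x + m * delta).clamp(0, 1)
```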
A chooses patches either randomly or using heuristics. One such heuristic is the gradient magnitude of the attacker's gradient matching objective (see section 7.3) restricted to individual patches. The other is the attention rollout from input image patch tokens to the classification token CLS. As mentioned earlier, A also chooses the images XP from X either randomly or according to some heuristic, such as gradient magnitude as before but computed across whole images. A intends the resulting poisoned images X′P to confuse the victim model f when it is trained on the whole set XC ∪ X′P; namely, A intends that f(xT) assign as high a probability to yT as possible.

V            The victim
A            The attacker
f            V's classifier model, which A will attack
θ            Parameters of f
xT           A's target image
yT           A's target class; A wants f(xT) to predict this
𝒳            The set of all images
X            Training images
Y            Image labels
PY           Set of probability distributions over labels
fy(x)        Probability that f assigns to x being in class y
H            Hypothesis space of models; set of all possible θ
fc           A's clean surrogate model
XP = {xi}    Images selected by A to poison
XC           Remaining images not selected by A
XT           Training images with label yT (the poison class)
∆ = {δi}     A's adversarial noise, to be added to XP
∆i           Intermediate adversarial noise during A's brewing
M = {mi}     A's patch masks, restricting ∆ to selected tokens
X′P = {x′i}  A's poisoned images
X*P          The optimum X′P solving the bilevel problem
n            Maximum number of images A may modify
k            Maximum number of patches per image A may modify
P            Total number of patches per image
ϵ            Magnitude constraint that A's δi must satisfy
Latt         A's loss function
Lvictim      V's loss function
B            A's gradient-matching surrogate loss function
λ            Learning rates and convex combination ratios
Aℓ           Raw attention at block ℓ
Ã            Attention rollout
τ            Set of epochs used for GAS computation
R            Number of poison brewing restarts
S            Number of poison brewing steps per restart

Table 1: Brief explanations of notation

In real attack settings, the victim V might notice strange training images and remove them. To make the poisoned images in X′P look natural, or stealthy, A imposes a constraint on the magnitude of the added noise: for all images xi ∈ XP, ‖δi‖∞ ≤ ϵ for some small ϵ, where ‖·‖∞ is the ℓ∞ norm, the maximum of the absolute values of the elements of the vector.

The victim V will train the network f to minimize some loss Lvictim on the poisoned dataset (XC ∪ X′P, Y). I now model this setup as an optimization problem between the attacker A and victim V. Call the optimum set of poisoned images for the attacker

X*P = {xi + miδi : xi ∈ XP ⊂ X}

Note that the images in X are fixed, but A is free to choose the xi's, mi's, and δi's within some constraints. All together, this training by V while A crafts poison forms a bilevel optimization problem:

min{XP, ∆, M}  Latt(xT, yT; f*)
s.t.  f* ∈ argmin{f ∈ H}  Lvictim(XC ∪ {xi + miδi : i ∈ {1, …, n}}, Y; f)
      |XP| = n
      ‖δi‖∞ ≤ ϵ  for all δi ∈ ∆
      mi is restricted to k patches  for all mi ∈ M

The XP, ∆, and M that solve the upper-level problem determine the optimum X*P for the attacker A. The procedure I describe in the rest of this section is for A to approximate X*P with a poisoned set X′P. There are six steps for A to follow: clean model training, surrogate objective creation, image selection, patch selection, poison generation, and execution. The final execution step, common to all poisoning attack methods, is performing the modifications to X so that V's training data becomes (XC ∪ X′P, Y).
ACT methods incorporate a modified Witches' Brew component to find ∆ as restricted to k patches. This component accounts for the clean model training, surrogate objective creation, and poison generation steps. My main methodological contributions are in the image and patch selection steps, as well as in the changes to Witches' Brew necessary to accommodate the patch-restricted poison noise ∆.

7.2 Clean Model Training

To train a clean model fc, A starts with the same pre-trained parameters that V uses for training the real model f. For Vision Transformers, and especially for PFMs larger than the models I consider here, this is a realistic assumption since such pre-trained parameters are often released publicly, as they are in the case of DeiT-Ti/16, which I use for evaluations (Touvron et al., 2021). A has access to the training algorithm that V will use, complete with augmentations and knowledge of its stopping criteria. However, I do not assume A has access to the random seeds that determine the order in which samples appear in training, the random augmentations, or dropout if it is used.

This ACT method framework does not necessarily prescribe that fc be trained in exactly the same way that f will be. For evaluations of the method, I use the same training algorithm up to random states to create fc; doing so makes the resulting parameters θ more commensurate with those that will appear during the training of f. In any case, A uses the victim V's training dataset (X, Y) to create a clean classification model fc.

7.3 Surrogate Objective Creation

After the attacker A trains a clean model fc, both this method and Witches' Brew prescribe that the next step is to create a surrogate objective whose solution approximates that of the bilevel problem (Geiping et al., 2021). Solving the bilevel problem directly is computationally intractable because of the need to repeatedly train a victim model, so Geiping et al. (2021) proposed a gradient matching surrogate problem. The objective B is a function of a model (fc in particular), the images to poison XP, the adversarial noise ∆ to add to them, and the patches M the noise is constrained to. My ACT method framework prescribes that A choose the images to poison XP in the next step and the patch masks M after that, but I reference both here with the understanding that they are inputs to B. Just as ∆ is referenced before it has been computed, these other variables will later be input to this objective during optimization. The main contribution of this step to the whole method is calculating a gradient used by the objective B, which will be matched during poison generation as described for unmodified Witches' Brew in section 6.4.

Witches' Brew helps A induce f(xT) to assign as great a probability to yT as possible by finding ∆ such that training the victim network f on the poisoned images X′P has nearly the same effect as repeatedly training on the single example (xT, yT). This is achieved by solving the following optimization problem of matching gradient updates, where I have modified the objective of Geiping et al. (2021) as appropriate for this patch-based method:

min∆  B(fc, XP, ∆, M)

B(fc, XP, ∆, M) = 1 − ⟨∇θLvictim(xT, yT; θ), Σi ∇θLvictim(xi + miδi, yi; θ)⟩ / ( ‖∇θLvictim(xT, yT; θ)‖2 · ‖Σi ∇θLvictim(xi + miδi, yi; θ)‖2 )

where the sums run over i = 1, …, n. I present this optimization problem here to show the definition of B, its significance, and how I have altered it from Witches' Brew to include the patch-selecting masks M. The optimization problem itself is not solved until the poison generation step in section 7.6. Note that ⟨·, ·⟩ is the dot product in this context and θ refers to the parameters of the input model fc. This gradient matching procedure maximizes the cosine of the angle between the desired gradient and the gradient of the victim's loss on the poisoned images; that is, the angle between them is minimized.
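As a concrete reference, here is a minimal PyTorch sketch of this patch-masked gradient-matching objective. It assumes cross-entropy as Lvictim, a surrogate model fc whose trainable parameters play the role of θ, a single target image and scalar label tensor, and batched tensors for the poison images, labels, noise, and masks; all names are illustrative rather than taken from any released implementation.

import torch
import torch.nn.functional as F

def gradient_matching_loss(fc, x_target, y_target, xs, ys, deltas, masks):
    """B(fc, X_P, Delta, M): one minus the cosine similarity between the gradient
    the attacker wants (from the target pair) and the gradient the victim's loss
    actually produces on the masked, poisoned images."""
    params = [p for p in fc.parameters() if p.requires_grad]

    # Gradient the attacker wants training to follow; in practice this can be
    # computed once and cached, since (x_T, y_T) and theta are fixed.
    target_loss = F.cross_entropy(fc(x_target.unsqueeze(0)), y_target.view(1))
    grad_target = torch.autograd.grad(target_loss, params)

    # Gradient produced by the poisoned batch; create_graph=True keeps the graph
    # so the objective can later be differentiated with respect to the noise.
    poison_loss = F.cross_entropy(fc(xs + masks * deltas), ys)
    grad_poison = torch.autograd.grad(poison_loss, params, create_graph=True)

    dot = sum((gt * gp).sum() for gt, gp in zip(grad_target, grad_poison))
    norm_t = sum(gt.pow(2).sum() for gt in grad_target).sqrt()
    norm_p = sum(gp.pow(2).sum() for gp in grad_poison).sqrt()
    return 1.0 - dot / (norm_t * norm_p + 1e-12)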
Note that A can cache ∇θLvictim(xT, yT; θ) after obtaining θ from training the clean model fc, since the target pair (xT, yT) is a fixed, given part of A's objective. Again, the optimization problem of finding ∆ is solved during the poison generation step, after XP and M are fixed in the next steps; this step merely creates the objective B.

7.4 Image Selection

Unlike Witches' Brew, this method proposes deliberate selection of the images to poison, XP ⊂ X, using heuristics. I evaluate three heuristics for use with this method, but in principle any way of ordering the images in X according to an appraisal of how much they promote a successful attack (using only information available to the attacker A at this step) may be used. Like Geiping et al. (2021), the heuristics I evaluate start by restricting the choice of XP to those images with true label yT; call those images XT = {xi : yi = yT, xi ∈ X}. Note that this indexing is independent of the one imposed on XP; the indexing used here indicates corresponding labels in Y for XT. Within those images, A chooses a subset XP ⊂ XT. I model this selection under the assumption that it has been restricted to XT, since that is what I use in evaluations, but for the general case one can set XT = X.

An image selection heuristic assigns a value vi to each image xi with 1 ≤ i ≤ |XT|. I consider three heuristics: gradient, GAS, and random. Using the last of these is equivalent to Witches' Brew's image selection strategy; indeed, random selection combined with a patch constraint k equal to the total number of patches per image reproduces Witches' Brew without modification, so the ACT method is a more general form of Witches' Brew. In any case, the heuristic assigns the vi's, and A then creates XP by choosing the images xi with the top-n largest vi's. That is, A solves

max{XP}  Σ{xi ∈ XP} vi    s.t.  |XP| = n,  XP ⊂ XT

If a tie would affect the selection, A chooses randomly between the tied images. The precise assignment of values by each heuristic is as follows.

Gradient Image Selection (G-IS). A calculates the magnitude of the gradient of the adversarial gradient matching objective B as evaluated on individual images. For each xi ∈ XT, A uses the clean model fc to set vi = ‖∇xi B(fc, {xi}, {0}, {1})‖2. The 1 in the mask position represents a mask with 1's in all positions, so that the whole image is considered. This value estimates how promising each image is according to the initial steepness of the slope along which its adversarial noise will be updated in the poison generation step.

Gradient Aggregated Similarity Image Selection (GAS-IS). This heuristic assigns values so that the images with the greatest influence on the attacker objective, as calculated by GAS, are selected: vi = GAS((xi, yi), (xT, yT)). A uses a τ such that influence is calculated only on the final parameters obtained after A has trained the clean model fc. Given the similarity in form to the attacker surrogate objective B, and this choice of epoch set τ, A can perform an equivalent optimization by assigning vi = −B(fc, {xi}, {0}, {1}). That is, selecting images by GAS is equivalent to choosing the images whose surrogate loss is already lowest; contrast this with G-IS, which chooses images according to how fast that loss is estimated to decrease. In principle, A could use a variation that also incorporates the parameters of fc at various points during training in the GAS calculation, and that normalizes blocks of fc independently, but in this work I evaluate GAS-IS only as just described.

Random Image Selection (R-IS). A sets all values to the same number, say vi = 0, for all xi ∈ XT. There is an all-way tie, so XP will be a random subset of XT.
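A minimal sketch of this selection step follows, reusing the gradient_matching_loss helper sketched in section 7.3. It scores each candidate image of the poison class with one of the three heuristics and keeps the top-n; the function name and the assumption that labels are scalar tensors are illustrative.

import torch

def select_poison_images(fc, candidates, labels, x_target, y_target, n,
                         heuristic="gradient"):
    """Return indices of the top-n images of X_T under the chosen heuristic:
    "gradient" (G-IS): v_i = ||grad_{x_i} B(fc, {x_i}, {0}, {1})||_2
    "gas"      (GAS-IS): v_i = -B(fc, {x_i}, {0}, {1})
    "random"   (R-IS): random values, i.e. a random subset."""
    values = []
    for x, y in zip(candidates, labels):
        if heuristic == "random":
            values.append(torch.rand(()))
            continue
        x = x.clone().requires_grad_(True)
        b = gradient_matching_loss(fc, x_target, y_target,
                                   x.unsqueeze(0), y.view(1),
                                   torch.zeros_like(x).unsqueeze(0),
                                   torch.ones_like(x).unsqueeze(0))
        if heuristic == "gas":
            values.append(b.detach().neg())
        else:  # gradient magnitude of the surrogate objective
            grad_x, = torch.autograd.grad(b, x)
            values.append(grad_x.norm())
    top_n = torch.topk(torch.stack(values), n).indices
    return top_n  # indices into `candidates` forming X_P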
After A selects the poison set XP with one of these heuristics, the next step is to restrict the adversarial noise applied to each image to be non-zero in only k patches.

7.5 Patch Selection

The next contribution this method makes over Witches' Brew is to address its own constraint that adversarial noise be confined to patches. In particular, in this step A uses heuristics to choose which patches to poison. This choice is encoded in the masks M: each mi ∈ M has 1's in positions corresponding to exactly k patches and 0's elsewhere. Recall that the poisoned set is formed as X′P = {xi + miδi : 1 ≤ i ≤ n}, where the images xi, poison noise δi, and masks mi correspond to one another. I evaluate three heuristics for A's choice here but, as in the previous step, other heuristics may be used in principle.

A patch selection heuristic assigns a value vij to each image xi and each patch j within it. To refer to patches rigorously, define a convention for the structure of images. Assume each image is square, s pixels across, with c color channels, so that xi, mi, δi ∈ R^(c×s×s) for each 1 ≤ i ≤ n; by the construction that follows, mi ∈ {0,1}^(c×s×s). Let pj ∈ {0,1}^(c×s×s) be a mask with 1's in positions corresponding to patch j and 0's elsewhere, and let there be P patches in total.

Given the vij assigned by a heuristic, for each xi the attacker A finds the top-k vij's and combines the corresponding pj's into mi. That is, for each xi ∈ XP, A defines a function ϕi : [1, P] → [1, P] that orders the vij: ϕi(a) = j means that, among the patches of image xi, patch j has the a-th greatest value vij assigned by the heuristic. The masks are then given by

mi = Σj=1..k pϕi(j)    for all xi ∈ XP

In plain English, the attacker A uses a heuristic to find a mask selecting the top-k most promising patches on a per-image basis. Now that the mask creation process from given vij has been explained, I describe how each heuristic assigns the vij.

Attention Rollout Patch Selection (AR-PS). This heuristic is based on attention rollout as described in section 6.3. For each image xi ∈ XP and each patch position j within it, A calculates the attention rollout from the token corresponding to patch j to the CLS token in the final attention block, vij = Ã(1, L)j,CLS, where f has L blocks. This is done by passing xi through the clean model fc and recording the attention maps Aℓ along the way for the rollout calculation. Since attention rollout measures the importance of tokens at one layer in determining those at another, this heuristic focuses on how important the image patches are to the final classification token, which is passed through a small FFN to obtain f's probability distribution over classes. Inspired by other work using this importance metric for test-time attacks, I evaluate its use for this train-time poisoning attack method.
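The sketch below illustrates AR-PS for a single image, under the assumptions that the per-block attention maps of fc have already been recorded (e.g., with forward hooks) and that the CLS token sits at index 0. The rollout formulation shown, which averages heads and mixes in the identity to account for residual connections, is one common variant and stands in for the exact computation of section 6.3; it reuses the patch_mask helper sketched in section 7.1.

import torch

def attention_rollout(attn_maps):
    """Multiply per-block attention together, head-averaged and mixed with the
    identity, re-normalizing rows so each remains a distribution.
    attn_maps: list of (heads, tokens, tokens) tensors, one per block."""
    tokens = attn_maps[0].shape[-1]
    rollout = torch.eye(tokens)
    for a in attn_maps:
        a = a.mean(dim=0)                         # average the heads
        a = 0.5 * a + 0.5 * torch.eye(tokens)     # residual connection
        a = a / a.sum(dim=-1, keepdim=True)       # keep rows stochastic
        rollout = a @ rollout
    return rollout

def ar_ps_mask(attn_maps, k, img_size=64, patch=16):
    """AR-PS for one image: score each patch token by its rollout contribution
    to the CLS token (assumed to be token 0) and keep the top-k patches."""
    scores = attention_rollout(attn_maps)[0, 1:]   # CLS row, patch-token columns
    top_patches = torch.topk(scores, k).indices.tolist()
    return patch_mask(top_patches, img_size=img_size, patch=patch)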
Individual Gradient Patch Selection (IG-PS). As in image selection, the idea here is to find the gradient magnitude of the surrogate objective B as evaluated at each chosen poison image. The selected patches are then those whose corresponding adversarial noise will update the most at the start of poison generation, which I interpret as a sign of importance. For each patch j of each image xi, A assigns vij = ‖pj ∇xi B(fc, {xi}, {0}, {1})‖2, which captures the magnitude of the gradient restricted to each patch. Notice that the gradient is computed on the full image (the mask passed to B is all 1's) and is then restricted to the patch j in question by element-wise multiplication with pj.

Universal Gradient Patch Selection (UG-PS). In the case of patch selection, there is another way to use gradient information. Instead of computing per-image gradients, some libraries provide fast implementations of gradient computation additively accumulated over a whole batch of images; this heuristic seeks to harness that convenience and speedup. In effect, A first computes values as in IG-PS, call them v′ij, and then chooses the patches whose average gradient magnitude over all images is greatest, so that every image is assigned the same mask. That is, for each image xi, A assigns

vij = (1/n) Σi′=1..n v′i′j    for all j

7.6 Poison Generation

Finally, the next step is to generate the poison ∆. This largely proceeds according to Witches' Brew, with appropriate modifications for the patch restriction. See Figure 6 for a visual overview of this stage alongside the others. In short, A approximates the solution to the optimization problem of section 7.3 using gradient descent on the color values in each δi. To remain within the ϵ-constraint, A projects each δi back into the ϵ-radius ℓ∞ ball, keeping the poison noise subtle.

As recommended by Geiping et al. (2021), I perform the optimization process R times, or "restarts." Associate with each restart a different version of ∆; call them ∆i for 1 ≤ i ≤ R. Each ∆i is given a different random initialization, clamped to lie within a radius-ϵ ℓ∞ ball, and its constituent perturbations may be updated in different orders according to batch randomization in each round of poison brewing, which I call the inner optimization loop. The attacker A performs this poison brewing R times to create each ∆i, then chooses the ∆i (after brewing) with the least surrogate loss, ∆* = argmin{∆i} B(fc, XP, ∆i, M), to be the final adversarial noise ∆ = ∆*. As stated by Geiping et al. (2021), the purpose of these restarts is to overcome the sensitivity of attack success to the poison's initialization.

For each poison brewing round, A uses signed Adam updates (Kingma & Ba, 2017) to optimize ∆i. A performs these updates over S epochs, during which the perturbations in ∆i are updated in batches. At each step, each δj ∈ ∆i is clamped to satisfy the ϵ-constraint. Note that the objective B involves a gradient computation that depends on ∆i while, simultaneously, B's own gradient is taken to update ∆i. This implicit Hessian calculation carries a large memory demand, especially with large images and Vision Transformers, which have millions of parameters. This is one benefit of using patch restrictions.
The heavy memory load is reduced by restricting each δi to be non-zero only within a few tokens. Such a restriction may not only allow quicker computation, but may in fact be necessary for scaling this ACT method up to PFMs, which may have billions of parameters. In my own experience with this method, I frequently encountered memory overflows when not restricting patches.

To incorporate the patch selection heuristics into Witches' Brew, some modifications to the algorithm are necessary. By this stage, A has selected the masks M. After initializing δi with random noise, overwrite δi ← miδi (element-wise multiplication) to set all entries outside of the chosen k patches to 0. Next, at each step of gradient descent, apply the mask mi to the gradient of the adversarial gradient matching surrogate objective so that the update to δi, and the subsequent clamping to the ℓ∞ ball, is performed entirely within the k chosen patches. For each image xi ∈ XP with perturbation δi, the new update is then

δi ← δi − λ mi ∇δi B(fc, {xi}, {δi}, {mi})

In this way, gradient-descent-based training is thwarted by another algorithm that itself uses gradient descent.
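The following sketch pulls the pieces of this section together into a patch-restricted brewing loop. Plain signed gradient steps stand in for the signed Adam updates, the whole poison set is treated as a single batch, and all names and default values are illustrative; it reuses the gradient_matching_loss helper from section 7.3.

import torch

def brew_poison(fc, x_target, y_target, xs, ys, masks, eps,
                restarts=4, steps=250, lr=0.1):
    """R restarts of S masked, signed gradient steps on the noise Delta,
    keeping the restart with the lowest surrogate loss B."""
    best_delta, best_loss = None, float("inf")
    for _ in range(restarts):
        # Random initialization inside the eps-ball, zeroed outside the chosen patches.
        delta = (torch.empty_like(xs).uniform_(-eps, eps) * masks).requires_grad_(True)
        for _ in range(steps):
            loss = gradient_matching_loss(fc, x_target, y_target, xs, ys, delta, masks)
            grad, = torch.autograd.grad(loss, delta)
            with torch.no_grad():
                delta -= lr * (masks * grad).sign()   # masked signed descent step
                delta.clamp_(-eps, eps)               # project back into the eps-ball
                delta *= masks                        # keep noise inside chosen patches
        final = gradient_matching_loss(fc, x_target, y_target,
                                       xs, ys, delta.detach(), masks).item()
        if final < best_loss:
            best_loss, best_delta = final, delta.detach().clone()
    return best_delta   # the final adversarial noise Delta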
After each of the R poison brewing rounds is complete and A has chosen the best ∆, the last step is to carry out the attack using A's access. In realistic scenarios, A's image selection would take the form of choosing some subset of their own collection. After creating the appropriate ∆ through this ACT method, A would add the adversarial noise to the images themselves and provide X′P to the victim V with clean labels; in practice, A could upload the poisoned images to the Internet. V would then collect X′P along with other images XC, not modified by the attacker, and perform training. If all goes well for A, the resulting f will output a probability distribution for the test image xT, f(xT), that assigns the highest probability to yT among all classes.

This method and its heuristics are designed for the resource-intensive Vision Transformers f, which are increasingly prominent. In the next section, I explain how I evaluated how well ACT methods work for an attacker A targeting a small Vision Transformer f. Additionally, I evaluated how V may use Mixup, Cutmix, and other training augmentations to defend against attacks like this.

8 Experiments and Results

I evaluated the ACT method's effectiveness with various heuristics when deployed against a victim model f using the DeiT-Ti/16 architecture (Touvron et al., 2021). Unlike the study that introduced this architecture, I did not use knowledge distillation for training; however, for classifying the small datasets CIFAR-10 and CIFAR-100 that I use, accuracy with and without distillation is within a few percentage points. By conducting extensive computational experiments, I established an increase in attack strength over Witches' Brew associated with the use of heuristics. In this section, I describe a preliminary investigation of Witches' Brew attacks on DeiT-Ti/16 that motivated the ACT method's development, the setup of subsequent experiments using the ACT method, what I observed, and the implications of the results.

In each experiment, I acted as both the attacker A and the victim V. As A, I selected a target image xT and label yT and then trained a model fc on clean data. Next, I used a poisoning attack method with fc to generate adversarial noise ∆. For all experiments, I used S = 250 poison brewing steps. I then switched roles to V and trained several models on the full poisoned dataset X′P ∪ XC.

I evaluated attack strength based on the probability distribution output by the poisoned f(xT) at various stages in training. Successful attacks induced a high-confidence classification of xT as yT, and the strongest attacks did so under the tightest constraints.

In section 8.1, I describe my evaluations of regular Witches' Brew (the ACT method with random image selection and no patch restriction) with respect to its effectiveness attacking DeiT-Ti/16. In a few experiments I established two important facts: heavy augmentations are useful as a defense, and DeiT-Ti/16 has heightened natural robustness against poisoning attacks. The former hints at Mixup and Cutmix's defensive properties, and the latter motivates the need to improve upon Witches' Brew with heuristics specific to Vision Transformers.

In section 8.2, I explain how I fixed a model training procedure used for all following experiments, where the victim V's model f and the attacker A's clean surrogate models fc used the same procedure. Fixing a standard training procedure allowed me to focus on evaluating the effects of the various heuristics on attack strength, which I did in a multidimensional manner; in section 8.3 I explain the specifics of how I did so. In section 8.4, I describe my evaluations. The strongest attack I found used GAS-IS and AR-PS.

8.1 Preliminary Witches' Brew Evaluation

Before evaluating the broader ACT method, I first considered whether Witches' Brew alone can perform a successful poisoning attack against DeiT-Ti/16 with its recommended training setup. In particular, I trained models as Touvron et al. (2021) describe, but without distillation. Using the recommended hyperparameters for fine-tuning for classification of CIFAR-100, an attack was only successful in a special case that departed from the data poisoning setting of this work. When heavy augmentation was removed, instead using a model training procedure with hyperparameters traditionally suited to CNNs, Witches' Brew was able to create a successful attack on DeiT-Ti/16 with loose constraints.

For this Witches' Brew evaluation, I used R = 4 restarts for resource-intensive Transformer-style training and R = 12 restarts for ResNet-style training. In either case, I used the standard number S = 250 of poison brewing steps.

8.1.1 Model Training Procedure

The recommended (Touvron et al., 2021) retraining pipeline for DeiT-Ti/16 on CIFAR-100 involves heavy augmentations, which I used for the first two experiments of this subsection. First, images were re-scaled to 439 by 439 pixels by bilinear interpolation and then center-cropped to 384 by 384 pixels. The colors were then normalized to fit ImageNet statistics. During training, both Mixup and Cutmix were used to combine random pairs of samples, and a learned AutoAugment provided further transformations after the mixes. See Figures 4 and 5 for examples of training images pre-AutoAugment. I used both Mixup and Cutmix while training clean surrogate models fc and then while poisoning models f. In both cases, I started from pre-trained weights made available by Touvron et al. (2021). I used AdamW (Loshchilov & Hutter, 2019) as an optimizer with a cosine learning rate schedule and a warmup period, and I scaled the learning rate to accommodate a batch size of 64, as recommended by Touvron et al. (2021). While fine-tuning on CIFAR-100 with clean data, I was able to obtain 76% top-1 validation accuracy by training for 30 epochs.
This accuracy was calculated on CIFAR's 10,000 test images, which I did not use during training, here or in any later experiment. Touvron et al. (2021) achieved 90.8% validation accuracy on CIFAR-100 with the larger DeiT-B architecture. One of the few figures they give for DeiT-Ti is 72.2% validation accuracy on ImageNet, while DeiT-B achieved 81.8% accuracy classifying ImageNet. Given that DeiT-Ti's ImageNet accuracy is about 10 percentage points below DeiT-B's, I believe the 76% figure is reasonably close to the optimum accuracy for the tiny DeiT-Ti on the CIFAR-100 classification task. Achieving a good approximation to optimum performance is desirable in general for attackers creating clean models fc, because the victim V will themselves search for the optimum in their lower-level objective, as in the bilevel optimization problem of section 7.1.

Later, I additionally trained DeiT-Ti/16 using the augmentations and hyperparameters that Geiping et al. (2021) use for their ResNet evaluation experiments. In this case, the task was classifying CIFAR-10 instead of CIFAR-100. One motivation for using these hyperparameters was to create an easier model to attack for the beginning of the ACT method evaluation experiments than one trained with heavy augmentations; the choice of CIFAR-10 also allows comparison with some of Geiping et al. (2021)'s experiments. As before, the training techniques and hyperparameters were shared between the clean model fc used by the attacker A and the multiple poisoned models f trained by the victim V. I did not upscale images in this case; I instead kept CIFAR-10 images at their native 32 by 32 pixel resolution, even though this corresponds to only 4 tokens. I used light augmentation including random flips and random crops. As before, I used the AdamW optimizer, but now with a linear learning rate schedule. I trained all models in this light-augmentation category for 40 epochs.

8.1.2 Mid-Training Poisoning

Before performing experiments in the data poisoning setting that is the focus of this thesis, I first verified that it is plausible to poison the DeiT-Ti/16 architecture at all. To this end, I used Witches' Brew without modification, except that I deployed the poison differently. Unlike the data poisoning setting in which the attacker A provides poisoned images at the beginning of training, I simulated an attacker with access to the training data in the middle of training. Specifically, A has access to the specific weights of the victim's model f in the middle of training and is able to use Witches' Brew to poison images in place at this middle epoch to hijack training. In practice, for my evaluation here, instead of initializing victim networks f to Touvron et al. (2021)'s pre-trained weights, I resumed training from the clean model fc's weights. The significance of this relaxed setting is that the poisoned images X′P are maximally prepared to influence training on the first epoch of resumed training, because the parameters are identical to those ∆ was optimized against. I trained DeiT-Ti/16 as recommended by the model's designers for 5 epochs, calculated ∆ with Witches' Brew from the resulting parameters, and then resumed training with the poisoned dataset X′P ∪ XC instead of training with poisoned data from the beginning.
Using a poison image budget of 0.1% (50 images in CIFAR-100; see Figure 7 for the poisoned images) and a magnitude constraint of ϵ = 32, I resumed training fc for 5 more epochs, but on poisoned data. Before that resumed training began, the randomly selected target image xT was assigned its true label by f. By the end, validation accuracy of 74% was close to the 76% clean reference, so a successful attack could not easily be dismissed as a fluke of degraded training; indeed, the most confident class predicted by f(xT) was the attacker's target yT instead of xT's true class. Poisoning in this modified setting was successful.

8.1.3 Heavy Augmentation Defense

Equipped with the knowledge that it is possible in principle to fool DeiT-Ti/16 with a poisoning attack, I next attempted to use Witches' Brew to poison it in the conventional setting and with the recommended Vision-Transformer-specific training.
Figure 7: The poisoned data X′P generated by Witches' Brew with no modifications for poisoning CIFAR-100. Images are made to look natural by imposing a somewhat tight constraint of ϵ = 32. The attacker objective was to classify a test image of a cockroach as a camel.

That is, unlike the resumed training of the previous section, after the poisoned images X′P were generated I set aside the clean model fc and poisoned a new model f initialized with the standard pre-trained parameters. Unlike experiments described elsewhere in this section, however, fc and f were trained for different numbers of epochs: fc was trained on clean data for 10 epochs while f was trained on poisoned data for 15 epochs. Like the last experiment, I used a poison budget of 0.1% and generated poison within a magnitude constraint of ϵ = 32.

This attack on DeiT-Ti/16 failed to induce f(xT) to predict yT. For reference, Witches' Brew successfully poisoned Google's Cloud AutoML with the same poison budget and ϵ-constraint (Geiping et al., 2021), with the victim training for ImageNet classification. On that same task, which is more difficult than CIFAR-100 classification, they found that Witches' Brew succeeded 80% of the time with constraints as tight as a 0.05% poison budget and ϵ = 8 against ResNet-18 models. Their attacks on the easier CIFAR-10 classification task, which is theoretically more difficult to poison than ImageNet, were likewise successful in 91% of their trials with a 1% poison budget and ϵ = 16 against ResNet-18 models. Against either reference point, my failure to poison a DeiT-Ti/16 model trained on the theoretically easier-to-poison CIFAR-100 classification task, even with the SOTA Witches' Brew algorithm, suggests that Vision Transformers have a heightened natural robustness against data poisoning attacks.

Validation accuracy was degraded from 74% (measured on fc) to 73%, so a lack of accuracy was likely not to blame for the attack's failure. One notable difference between the Vision Transformer training I used and the training Geiping et al. (2021) used to evaluate Witches' Brew is that I used Mixup and Cutmix for augmentation. As discussed in section 6.5, these augmentations are individually known to increase robustness against test-time attacks. Geiping et al. (2021) found that other data augmentations can reduce poison success rate from 100% to 32.5%, which further suggests that augmentations like these could contribute to robustness against poisoning attacks. To start isolating the effects of heavy augmentation, in the next experiment I set aside Vision-Transformer-style training in favor of the ResNet-style training described at the beginning of this subsection.

8.1.4 Light Augmentation Poisoning

Now that I was using smaller images, I increased the image budget to 1% to be more commensurate with a different selection of Geiping et al. (2021)'s results; previously, I had used fewer images because of the long time it took to run experiments with large images. For benchmarking reasons, I changed the victim's classification task to CIFAR-10. I set the target image to be the dog of Figure 9 and the target class to FROG. Again using Witches' Brew with no modifications, I set the poison magnitude constraint to ϵ = 32.

I evaluated the poisoned dataset X′P ∪ XC's effectiveness a total of 8 times by poisoning 8 victim models f. The clean model fc achieved a validation accuracy of 73%, while the poisoned models f achieved accuracies between 70.53% and 74.05%. In all 8 cases, poisoning was successful: f(xT) assigned yT the highest probability at the end of training. I observed this by providing the target image xT as input to each poisoned model, but it is also evident from the attacker loss Latt curves in Figure 8. Zero loss is only achieved when the model assigns 100% probability to the target class yT, which is exactly what we see after each of the victim models is trained for as little as 6 epochs.

There are two implications of this success. First, I had a baseline from which to tighten constraints and improve attack strength using heuristics. Second, I found that relatively high accuracy on CIFAR-10 could be achieved quickly with DeiT-Ti/16 without its recommended heavy augmentations. While discussing subsequent experiments, I will refer back to this one as a baseline and use heuristics to create stronger attacks.

Figure 8: Over 8 attack validation runs of training f with poisoned data X′P ∪ XC generated by Witches' Brew with no modifications (1% image budget, ϵ = 32), the attacker's loss Latt versus training epoch. The bold red line shows the average loss over the validation runs and the shaded region shows the minimum and maximum losses at each epoch. Attack success is evidenced by achieving 0 loss after a small number of epochs.

8.2 Model Training Procedure

With cause to search for stronger attacks targeted at Vision Transformers, I standardized the training procedure for clean models fc and poisoned validation models f. In pursuit of computationally inexpensive attacks, I immediately introduced constraints on the number of patches k that may be poisoned. To study the effect of patch coverage as a percentage of entire poison images, I evaluated poisoning models with various input image sizes. For each of the image sizes 32 × 32, 64 × 64, and 96 × 96 pixels, I trained one clean model fc, and I used each of these three models' parameters for all subsequent poisoning experiments rather than training a new one each time.

In all cases, I fine-tuned from pre-trained weights to classify CIFAR-10 (Krizhevsky, 2009), which consists of 60,000 images of size 32 by 32 pixels with 3 color channels. It includes 50,000 images for training, complete with labels categorizing them into 10 classes. I use its 10,000 validation images as a repository of test images from which to draw attacker objectives (xT, yT). For use with DeiT-Ti, pre-trained for ImageNet's 1,000 classes, I made the following modifications:
1. Images were upscaled as needed using bilinear interpolation to the slightly larger sizes 64 × 64 or 96 × 96.

2. Since the number of input tokens changed, I resized the token position embeddings using bicubic interpolation.

3. I randomly re-initialized the last layer, responsible for classification, to output a probability distribution over 10 classes instead of 1,000.

I allowed all network weights to change during training, which might be described as re-training rather than fine-tuning. Training itself lasted 40 epochs and used the AdamW optimizer with a learning rate schedule. I randomly applied augmentations such as mirroring and cropping. The learning rate schedule was multi-linear: it started at λ = 0.001 and was reduced by a factor of 10 after 15, 25, and 35 epochs; between these epochs, the schedule was locally linear in the epoch for a smooth change.

8.3 Evaluation Criteria

I evaluated the effectiveness of attacks by considering three factors.

1. Attacks should be stealthy in that they do not greatly degrade the accuracy of victim models f apart from on the target image xT. So, I compared the top-1 accuracies of clean models and the corresponding poisoned models. I expected the poisoned models to have somewhat degraded accuracy, but for a stealthy attack such a reduction should not be much greater than those observed for other poisoning attacks such as Poison Frogs (Shafahi et al., 2018) or Witches' Brew on ResNets (Geiping et al., 2021), which observed worst-case reductions on the order of 0.4% and 0.1% respectively. I did not observe stability on this scale in final validation accuracy, even among clean models, so I define a looser stealth success criterion that is met if a poisoned model's validation accuracy does not drop more than 5 percentage points compared with the clean model.

2. I checked the probability distribution output by the poisoned models f when evaluated on xT as training progressed. The attacker intends yT to have the highest probability, as opposed to xT's true label or some other class entirely. The attacker's main goal is for f(xT) to output a yT-favorable distribution after training is complete, but I also evaluated how the distribution evolves over the course of training. There is then a spectrum of probability shift success corresponding to how much probability f assigns to the yT class at various epochs and how early during training yT is assigned the greatest probability, if at all. I assessed this shift qualitatively by inspecting loss and probability versus epoch curves. For more rigorous comparisons of probability shift, I calculated the average attacker loss Latt over all validation runs of a poison set X′P at the end of training the victim models f (see section 7.1 for how this is calculated for each validation run). Since this is a loss, stronger attacks achieve less of it by the end of training.

3. Finally, and most importantly for the attacker, f(xT) should assign the greatest probability to label yT after training on poisoned data. As opposed to probability shift success, this labeling success criterion requires that the model give the most probability to yT out of all labels at the end of training, rather than simply assign it higher probability than a clean model would at various epochs. I evaluated this success by providing xT to f after training and observing whether or not yT was assigned the highest probability of any class. Since I often used the same brewed poison to attack multiple random instantiations of victim models f, I specifically evaluated the labeling success rate (LSR): the ratio of the number of models that meet the attacker's labeling objective at the end of training to the total number of poisoned victim models trained. (A minimal sketch of how these criteria can be computed follows this list.)
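The sketch below shows one way these three criteria can be computed for a brewed poison set, given the clean model's validation accuracy and the trained poisoned models. The 5-point stealth threshold matches the criterion above, y_target is assumed to be an integer class index, and all names are illustrative.

import torch

def top1_accuracy(model, loader):
    """Top-1 accuracy (%) over a loader of (image batch, label batch) pairs."""
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=-1) == y).sum().item()
            total += y.numel()
    return 100.0 * correct / total

def evaluate_attack(clean_acc, poisoned_models, x_target, y_target, val_loader):
    """Stealth success, average attacker loss L_att, and labeling success rate
    (LSR) over a set of victim models trained on the same poisoned data."""
    losses, label_hits, stealth_hits = [], 0, 0
    for f in poisoned_models:
        f.eval()
        with torch.no_grad():
            probs = torch.softmax(f(x_target.unsqueeze(0)), dim=-1)[0]
        losses.append(-torch.log(probs[y_target] + 1e-12).item())   # L_att on x_T
        label_hits += int(probs.argmax().item() == y_target)        # labeling success
        stealth_hits += int(clean_acc - top1_accuracy(f, val_loader) <= 5.0)
    return {
        "avg_attacker_loss": sum(losses) / len(losses),
        "labeling_success_rate": label_hits / len(poisoned_models),
        "stealth_success_rate": stealth_hits / len(poisoned_models),
    }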
For the following experiments, I do not always explicitly state which of these success criteria were met. Where not otherwise stated, I implicitly attest stealth success. For probability shift success and labeling success, if the latter is achieved I am generally less inclined to mention the former, since labeling success necessarily implies probability shift success. When tie-breaking attack effectiveness, I share more quantitative details than LSR alone, including the average attacker loss Latt.

In addition to isolated per-experiment criteria, I evaluated attack strength according to how strong a set of constraints could be imposed before labeling failure or weakened probability shifting. For instance, in section 8.4.4 I found that the combination of G-IS and AR-PS, which I abbreviate AR-PS+G-IS, allows labeling success under a difficult constraint of ϵ = 32 and k = 8 (out of P = 16 total patches), while attacks with R-IS can in general only achieve consistent success with k = 12 or higher. This suggests that this combination creates a stronger attack than AR-PS+R-IS, which can only achieve success at that constraint for some poison-target combinations.

8.4 ACT Method Evaluation

Previously, I summarized the results of attacks on DeiT-Ti/16 using Witches' Brew with no modifications. In preparation for evaluating my ACT methods in general, I described the model training procedure and evaluation criteria. Here, finally, I describe my experiments assessing the various heuristics and discuss the results.

For each of these experiments, there are six variables that may change. The first three are constraints: the number of poison images n = |XP| (the poison budget, often expressed as a percentage of the full training data X), the number of patches that may be poisoned per image k, and the constraint on adversarial noise magnitude ϵ (the ϵ-constraint). The next two variables are which patch selection and image selection heuristics are used, as described in section 7. These may be random or absent; I already evaluated the extreme case where the method is equivalent to Witches' Brew in section 8.1.4. On occasion I abbreviate both heuristic variables together, with the patch selection heuristic first and the image selection heuristic second; for instance, AR-PS+G-IS denotes using attention rollout patch selection together with gradient image selection. The sixth and final variable is the attacker objective, specified by a target test image xT (drawn from the CIFAR-10 validation images) and a target class yT. I sometimes state the objective in POISON-TARGET form, which indicates the class yT that the poison data belong to and the true class of xT. For most experiments, I selected xT to be the picture of a dog in Figure 9 and chose a subset of images of frogs from the training set to be poisoned (yT = FROG). I specially indicate when an attacker objective other than this FROG-DOG objective was used.

Figure 9: The target image xT of a dog that the attacker aims to have classified as a frog.

I first evaluated each patch selection heuristic with random image selection.
8.4.1 Universal Gradient Patch Selection

For this experiment, I used the UG-PS heuristic, which means that the same patch locations were modified in each of the poisoned images X′P; see Figure 10 for examples of images with patches selected. The images to poison XP were themselves selected randomly, so this experiment evaluated the combination UG-PS+R-IS. I used the original size of CIFAR-10 images, 32 × 32 pixels, which corresponds to 4 patches per image. I performed 2-5 validation runs for each of several values of k; in each validation run I trained f on the poisoned data X′P ∪ XC generated by my ACT method. I selected images XP randomly within a 1% poison budget. Among those images, I selected k ∈ {1, 2, 3} patches with the universal gradient heuristic. I constrained the adversarial noise magnitude to within ϵ = 32.

k    LSR
1    0/5
2    0/2
3    5/5

Table 2: Using the universal gradient patch selection heuristic with ϵ = 32, a 1% image budget, and random image selection, the attack succeeds only when k = 3 out of 4 total tokens.

For k = 1 and k = 3, I retrained 5 times each and observed labeling success in 0 and 5 runs of training victim models f, respectively. I performed two validation runs with k = 2 and did not observe labeling success in either of them; I only performed two runs in this case because the code was interrupted by a Windows update. See Table 2 for a summary.

Figure 10: When using UG-PS, the same patch is poisoned in every image. Shown are examples of poisoned images in X′P brewed with k = 1 and ϵ = 32. The heuristic chose the top-left patch to be poisoned in all of these 32 × 32 images, evident from the slight discoloration in that quadrant.

This result establishes that a patch-constrained poisoning attack can achieve labeling success, but may require substantial coverage of an image, in this case 75% of the pixels. In subsequent experiments, I evaluated heuristics that are theoretically stronger due to their per-image application. In the process, I investigated whether the patch count k or the fractional coverage of an image k/P is more diagnostic of attack success rate.
8.4.2 Individual Gradient Patch Selection

The last experiments showed hope for improved performance. In the next experiments, I upscaled CIFAR-10 images to 64 × 64, for 16 patches total. This was to better identify any relationship between image poison coverage and attack success rate, given that finer-grained control of the coverage percentage is possible with larger images. Again, I used a 1% poison budget with an ϵ = 32 constraint. I selected images randomly but used the IG-PS heuristic; see Figure 11 for an example of patch coverage varying from image to image.

Figure 11: Now using 64 × 64 images, I brewed poison constrained to k = 8 patches selected by IG-PS, so that the specific patches could differ image-to-image. An ϵ = 32 constraint was used and the poison images XP were selected randomly. Shown are some examples of the poisoned images.

I created 3 poisoned models f for each of the k = 8 and k = 12 constrained poison trials, corresponding to 50% and 75% coverage respectively. Only the latter achieved labeling success, doing so in all 3 trials. The former did not achieve labeling success at all, a result consistent with the percent coverage of an image determining attack success across heuristics. In the k = 8 experiments, the probability assigned by the models f to xT belonging to yT remained low throughout training; there was little or no probability shift success. With k = 12, however, attack success was evidenced by high probability by the end of training. See Figure 12 for how the probability assigned to xT by poisoned models f changed over the course of training.

Figure 12: The probability assigned by attacked victim models f to the target image xT belonging to the target class yT across training. Average lines are shown in the center, while the extremes of the shaded region are the maximum and minimum probabilities at given epochs. Left: using a constraint of k = 8 of 16 total patches, the attack achieved neither labeling nor substantial probability shift success in all three validation runs. Right: with k = 12, labeling success was achieved by the attack in all three runs.

This result suggests that the poison coverage of an image, k/P for P total patches, as opposed to the raw number of tokens k, is an important factor for attack success. In the next experiments, I investigated this further with the attention rollout heuristic.

8.4.3 Attention Rollout Patch Selection

Next, I evaluated a heuristic that is not only inspired by the need for stronger attacks on Vision Transformers but is also Transformer-specific: AR-PS. First, I used unscaled 32 × 32 CIFAR-10 images with a 1% poison budget (randomly selected) and ϵ = 32. With a coverage constraint of k/P = 3/4, I observed labeling success in 2 of 3 poisoned models. This suggests that attention rollout is slightly weaker than gradient as a patch-selection heuristic. In subsequent experiments, I upscaled the images to 64 × 64 and performed poison attacks with objectives having different poison-target combinations than FROG-DOG.

After scaling the images to have 16 patches total, I used the same heuristic with random image selection and k = 12. I created poison with the ACT method given the same attacker objective of f classifying a dog from the test set as a frog. I evaluated success by training 6 poisoned models and observed labeling success in 4 cases, consistent with the LSR observed with smaller images at this coverage using other patch selection heuristics. Similarly, with the same setup except k = 8, I observed no labeling success across 3 trials. These results further suggest that LSR depends on the fraction of an image covered by poison, k/P, rather than on the patch count k independent of image size.

Next, I evaluated the same attack heuristics but with different choices of target image and poison class. Success was highly variable. I observed labeling success for k = 12 in 3/3 trials for CAT-AUTOMOBILE and DOG-FROG (the reverse of before), but 2/3 labeling success for BIRD-AIRPLANE. To my intuition, BIRD-AIRPLANE and FROG-DOG involve classes more similar to each other than CAT-AUTOMOBILE does. Diminished efficacy on these similar pairs suggests that a lack of similarity between the true and target classes of a target image xT increases success rate; however, the DOG-FROG pair breaks this pattern. In any case, success rates and the degrees thereof vary with the choice of target image and poison class. See Table 3 for details.

Returning to the not-yet-overcome constraint of k = 8 for the FROG-DOG objective, I attacked with other poison-target combinations like CAT-AUTOMOBILE. With an ϵ = 32 magnitude constraint, the method can now achieve labeling success, as shown in Figure 13. I chose the validation runs for the figure to represent the wider variety of poisoned training runs I observed.
The BIRD-AIRPLANE run displayed in the figure briefly succeeded mid-training, but then f regained confidence in the true class. For CAT-AUTOMOBILE, the mid-training period saw multiple flips between the true class AUTOMOBILE and the poison class CAT, but the attack was ultimately successful. On the other end of the spectrum, attacking with the DOG-FROG objective did not fare much better than what I saw earlier with FROG-DOG and individual gradient patch selection (Figure 12). All these cases show that attack success largely depends on the choice of objective.

Having thoroughly explored the ϵ = 32 case, I tightened the magnitude constraint to ϵ = 16 to assess how strong an attack the AR-PS heuristic yields. I trained validation poisoned models f in this case with all other constraints held constant at k = 12 and a 1% poison budget.

Figure 13 (panels: (a) target image xT; (b) example x′ ∈ X′P; (c) f(xT) vs. epoch): Sample target images, a sample of corresponding poisoned training images, and an example of the probability distribution shift in predicting the target image xT's class as the victim model f is trained on poisoned data crafted with the AR-PS+R-IS heuristic combination. Constraints used in brewing were ϵ = 32, k = 8, and a 1% poison image budget. Each poisoning attack's effectiveness was evaluated over 3 validation training runs of f. For the single validation runs represented by column (c), only the attack with the frog xT achieved labeling success; however, each of these attacker objectives achieved at least one labeling-success run, recorded in Table 3.

The dependence on the choice of pair was even more exaggerated in this case; see Table 3 for all details. With k = 12 patches, labeling success over 3 trials varies from none of the poisoned models meeting the attacker objective to all of them meeting it. In cases like this, it is valuable to assess probability shift success. In Figure 14, I show the probability fyT(xT) at various epochs during three poisoned training runs. The attacker A wants f to classify xT, really an airplane (red), as a bird (blue) with maximum probability. Across all poisoned training runs, there was a spike in f's assigning more probability to the image being a bird near epoch 15, but in one of the runs the true-label prediction quickly recovers. Probability shift success was still achieved in that case: fyT(xT) is still a sizeable 20%.

Figure 14: With k/P = 12/16 and ϵ = 16, three training runs poisoned with a 1% budget using adversarial noise crafted by the ACT method with AR-PS+R-IS. The attack aimed to induce an airplane's picture to be classified as a bird; see Figure 13 for the precise airplane. These plots show the probability assigned to the BIRD and AIRPLANE classes by f(xT) at various epochs of f's fine-tuning. The top-right plot shows a failed poisoning attempt that still shifts the probability of yT above 0.

In summary, the success of this patch selection heuristic with different values of k can vary significantly with the choice of target image and poison class.

Finally, I returned to the old FROG-DOG objective combination with ϵ = 32 and relaxed the constraint on the poison budget to 2%. For k = 8, labeling success was achieved in 2 out of 3 trials, whereas for k = 4 no labeling success was observed over 3 trials. These results are not shown in the table, but see Figure 15 for the probability distribution f(xT) versus training epoch observed with k = 4.
This representative sample shows that not even probability shift success was achieved in any meaningful sense in this 25% image coverage case. However, I observed labeling success for the first time on the FROG-DOG task with 50% coverage. This suggests that increasing the number of images in the poison set can increase effectiveness, but not as much as increasing the number of poisoned patches per image. That is, a constraint on the number of patches per image is a more difficult challenge for an attack to overcome than a constrained poison budget. With this observation, I shifted focus to experiments evaluating the effectiveness of image selection heuristics.

Figure 15: Using AR-PS with k = 4 and ϵ = 32, attacking with the FROG-DOG objective, and randomly choosing a 2% poison budget, a representative sample of the probability distribution shift of f(xT) across poisoned training epochs. The attacker's goal is for xT to be classified as a frog, but this was not achieved in any meaningful sense.

8.4.4 Gradient Image Selection

Figure 16: A sample of images selected by the gradient heuristic, with k = 12 patches within them selected by attention rollout, and then those patches poisoned within an ϵ = 16 magnitude constraint.

k    ϵ    POISON-TARGET      AR-PS+R-IS      AR-PS+G-IS      G-PS+G-IS       AR-PS+GAS-IS
                             LSR    Latt     LSR    Latt     LSR    Latt     LSR    Latt
12   32   FROG-DOG           4/6    2.117    3/3    0.001    3/3    0.004    3/3    0.000
12   32   CAT-AUTOMOBILE     3/3    0.002    3/3    0.000    3/3    0.000    3/3    0.000
12   32   BIRD-AIRPLANE      2/3    0.748    3/3    0.004    3/3    0.000    3/3    0.000
12   32   DOG-FROG           3/3    0.000    3/3    0.000    3/3    0.000    3/3    0.000
12   16   FROG-DOG           0/3    6.727    3/3    0.154    1/3    2.307    3/3    0.011
12   16   CAT-AUTOMOBILE     1/3    1.436    2/3    0.244    3/3    0.038    3/3    0.004
12   16   BIRD-AIRPLANE      2/3    0.709    3/3    0.000    3/3    0.062    3/3    0.000
12   16   DOG-FROG           3/3    0.000    3/3    0.000    3/3    0.000    3/3    0.000
 8   32   FROG-DOG           0/3    6.201    3/3    0.015    3/3    0.243    2/3    0.594
 8   32   CAT-AUTOMOBILE     1/3    1.388    3/3    0.000    3/3    0.056    3/3    0.088
 8   32   BIRD-AIRPLANE      1/3    1.786    2/3    0.284    3/3    0.011    3/3    0.001
 8   32   DOG-FROG           3/3    0.000    3/3    0.000    3/3    0.000    3/3    0.000

Table 3: Quantitative comparisons of poisoning attack strengths for ACT methods using various heuristic combinations and attacking under a variety of constraints. All experiments used a 1% poison budget and 64 × 64 CIFAR-10 images, which have P = 16 patches. In each row, the lowest average attacker loss at the end of training marks the strongest heuristic combination.

For my next experiments, I used G-IS to choose the images in XP, together with either AR-PS or G-PS for patch selection. I had two goals for these experiments: to determine whether deliberate image selection increases attack strength, and to determine which patch selection heuristic works best with G-IS. To meet the first goal, I used the same patch constraints k and poison magnitude constraints ϵ that I used in the previous experiments to evaluate AR-PS+R-IS.

For all of these experiments, I upscaled CIFAR images to 64 × 64 and set a 1% poison budget. I brewed poison under several combinations of k ∈ {8, 12} and ϵ ∈ {16, 32}, tightening constraints to evaluate how strong attacks with this image selection and the two patch selection heuristics are. For each, I evaluated effectiveness over 3 poisoned models. I used the familiar FROG-DOG poison-target attacker objective combination as well as BIRD-AIRPLANE, CAT-AUTOMOBILE, and DOG-FROG. In summary, labeling success was achieved in nearly all trials; see Table 3 for details.
While attacks with AR-PS alone struggled to achieve labeling success, even failing to achieve probability shift success in some trials at k = 4 and ϵ = 32, incorporating image selection leads to very consistent labeling success. Among the 8 comparable experiments in which AR-PS and G-PS produced different Latt at the end of training, AR-PS had more probability shift success by that measure in 5. This suggests that AR-PS has a slight probability shift effectiveness advantage over G-PS when deliberate image selection is used, contrary to the pattern observed with R-IS. In any case, this strong success suggests that image choice is more important than patch selection.

8.4.5 GAS Image Selection

At this point, the next image selection heuristic to evaluate was GAS-IS. Using the familiar setup of a 1% poison budget, CIFAR-10 images upscaled to 64 × 64, and three validation runs per poisoned image set X′P, I collected the results in Table 3.

For the selection of attacker objectives and constraints I evaluated, ties were somewhat frequent. Comparing with AR-PS+G-IS, of the 7 experiments in which one attack had a lower Latt than the other, this image selection with AR-PS was stronger in 5. Comparing with G-PS+G-IS, of the 6 experiments with a disparity in average attacker loss, AR-PS+GAS-IS was stronger in 4. These results suggest that AR-PS+GAS-IS has a slight attack strength advantage over AR-PS+G-IS, which makes it at least competitive for the best heuristic combination in ACT methods. Unlike in earlier experiments, but like the other attacks represented in Table 3, a tighter restriction on k is easier to overcome than a tighter restriction on ϵ; compare the k = 12, ϵ = 32 losses with the other constraint combinations. More experiments are needed to draw a strong conclusion about relative difficulty; for now, the evidence is inconclusive.

8.4.6 Mixup and Cutmix Defense

In my final four experiments, I isolated the effects of the Mixup and Cutmix augmentations in the case of CIFAR-10 classification. While my preliminary study found that Witches' Brew failed to poison DeiT-Ti/16 under its recommended training hyperparameters as it learned to classify CIFAR-100, here I explain how I isolated the effect of Mixup, Cutmix, and label smoothing. To be commensurate in various ways with both my preliminary experiments and those I performed for the broader class of ACT methods, I used a 0.1% poison budget, an ϵ = 32 magnitude constraint, and the FROG-DOG attacker objective for all experiments. For a fair comparison with Witches' Brew (WB) once more, I did not use patch constraints, opting instead to use only image selection. As usual, I performed three validation runs per poisoned image set, each brewed with a specific image selection heuristic and with or without the mix augmentations.

For each image selection heuristic R-IS, G-IS, and GAS-IS, I brewed a poisoned set X′P with models using the mix augmentations or not. Those not using the mix augmentations used the same light augmentations described in section 8.2 above. In the experiments using Mixup and Cutmix, either Mixup or Cutmix was applied batch-wise to 100% of training images, each with equal probability. Mixup was limited to α ≤ 0.8, while Cutmix was allowed unlimited cutting and pasting with appropriate label interpolation. Labels were also smoothed so that each training label would have at least 0.1 in each class.
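For illustration, here is a minimal, self-contained sketch of the defense configuration just described: batch-wise Mixup or Cutmix, chosen with equal probability, combined with label smoothing over soft targets. It is a simplified stand-in for the actual training pipeline (which followed Touvron et al. (2021)'s recipe), and the sampling details are assumptions.

import torch
import torch.nn.functional as F

def mix_batch(x, y, num_classes=10, mixup_alpha=0.8, smoothing=0.1):
    """Return an augmented batch and soft labels: Mixup or Cutmix with equal
    probability, followed by label smoothing."""
    x = x.clone()
    perm = torch.randperm(x.size(0))
    y1 = F.one_hot(y, num_classes).float()
    y2 = y1[perm]
    if torch.rand(()) < 0.5:                                        # Mixup
        lam = torch.distributions.Beta(mixup_alpha, mixup_alpha).sample().item()
        x = lam * x + (1.0 - lam) * x[perm]
    else:                                                           # Cutmix
        lam = torch.rand(()).item()
        h, w = x.shape[-2:]
        ch, cw = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
        r = torch.randint(0, h - ch + 1, (1,)).item()
        c = torch.randint(0, w - cw + 1, (1,)).item()
        x[:, :, r:r + ch, c:c + cw] = x[perm][:, :, r:r + ch, c:c + cw]
        lam = 1.0 - (ch * cw) / (h * w)                             # actual mixed area
    soft = lam * y1 + (1.0 - lam) * y2
    soft = (1.0 - smoothing) * soft + smoothing / num_classes       # label smoothing
    return x, soft

# Training then minimizes cross-entropy against the soft targets, e.g.
#   loss = torch.sum(-soft * F.log_softmax(model(x), dim=-1), dim=-1).mean()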
Even with my strong ACT method, labeling success with so few images was lacking, so I focused on probability shift success as captured by the quantitative average attacker loss Latt. Note that the attack using GAS-IS against a victim using the mix augmentations was only validated twice because the job it was part of ran out of time on the Talapas cluster.

I found that GAS-IS provides the greatest reduction in loss compared with Witches’ Brew in both augmentation cases, but it is still not strong enough to produce a successful attack when the mix augmentations are used. With only a 0.1% poison budget, only the attack using GAS-IS against a victim not using the mix augmentations achieved labeling success, in 3 of 3 trials. None of the rest of this batch achieved labeling success even once. See Table 4 for details.

Latt                  | R-IS (WB) | G-IS  | GAS-IS
No Mix Augmentations  | 8.652     | 3.047 | 0.197
Mix Augmentations     | 4.442     | 3.789 | 3.130

Table 4: Average attacker loss at the end of training for a variety of image selection heuristics against a victim using Mixup/Cutmix/label smoothing or not. The best loss for each augmentation configuration is bolded. Here, Witches’ Brew (WB) is equivalent to using R-IS, as in the first column.

There are two important takeaways from these results. First, in light of the slight advantage that GAS-IS already had in previous experiments, its performance here solidifies it as the best image selection heuristic I evaluated. Second, the Mixup, Cutmix, and label smoothing augmentations in combination provide an effective defense against ACT methods. Using strong heuristics, as opposed to the random image selection of Witches’ Brew, improves probability shift success, but in this environment it is not enough to achieve labeling success.

9 Discussion and Conclusion

While introducing this work, I presented three questions to investigate. Through extensive experiments, I have answered all three. Additionally, I discovered that the previous SOTA data poisoning attack method, Witches’ Brew, is weakened when poisoning Vision Transformers compared with other computer vision architectures. I developed a class of poisoning attack methods, ACT methods, that targets Vision Transformers and exploits their image patch tokenization in a novel way with an attention rollout heuristic. I evaluated a variety of heuristic combinations, progressively tightening constraints to measure attack strength. The strongest ACT method I evaluated used GAS image selection in combination with attention rollout patch selection. Future directions of study include evaluating the effectiveness of this attack on larger PFMs and different machine learning tasks, and further studying the comparative effects of constraints on k, ϵ, and n.

Through a careful literature review, my preliminary study of unmodified Witches’ Brew, and experiments isolating the augmentation combination Mixup+Cutmix+label smoothing, I found for the first time (to my knowledge) that such training augmentations promote robustness against train-time attacks.

Mixup and Cutmix are already known to be effective at improving robustness against test-time adversarial examples (Yun et al., 2019; Zhang et al., 2018). In my evaluation of training with heavy Transformer-style augmentations including both, I observed consistent attack failures until such heavy augmentation was removed. Given the results here, I recommend heavy augmentation such as Mixup and Cutmix for defending against poisoning attacks.
To directly answer my question of how Vision Transformers may be defended against train-time attacks: use heavy augmentations like Mixup and Cutmix, as is generally recommended for Transformer training.

My first question was “Are Vision Transformer classifiers vulnerable to fast train-time attacks optimized by exploiting tokenization of images?” I found that DeiT-Ti/16 is vulnerable, at least when lighter augmentations are used. Reliable labeling success is possible with patch coverage constrained to as little as 50%, as long as patch selection is accompanied by deliberate image selection. The coverage of images determines attack success rate more strongly than the number of poisoned patches k alone. In principle, this success allows faster attack computation by constraining the patches poisoned, but my implementation of the ACT method did not leverage this speedup.

I answered the next question, which heuristics can lead to more effective train-time attacks, through the same extensive experiments. I studied three patch selection heuristics and found the most consistently high LSR with individual gradients when deliberate image selection is not used. I studied AR-PS most in depth and found that its success across a variety of problem setups is highly dependent on the choice of target image and poison class. When image selection heuristics were introduced, attacks choosing patches by AR-PS achieved slightly more probability shift success. The Vision Transformer DeiT-Ti/16 proved to have a natural robustness against the previous SOTA for single-target attacks, Witches’ Brew, but the strength of my ACT method attacks using heuristics is enough to observe a high attack success rate despite this robustness.

I observed the highest LSR when image selection heuristics were used in combination with patch selection, specifically GAS-IS with AR-PS. This strength allowed attacks under tight constraints on ϵ and k not only to succeed but also to minimize the attacker loss Latt the most by the end of training. In future work, adaptations of the same image and patch selection heuristic combination may yield similarly strong train-time attacks in different settings, such as multi-target attacks. For now, I contribute a new class of data poisoning attacks, ACT methods, which are to my knowledge the new SOTA. Given that such successful attacks are possible, I recommend the use of heavy augmentations like Mixup and Cutmix when training Vision Transformers.

10 Appendix

10.1 96 × 96 CIFAR-10 Experiments and Results

In addition to poisoning 32 × 32 and 64 × 64 images, I performed limited experiments with 96 × 96 images. Using k = 27 for 75% coverage, ϵ = 32, and a 1% poison budget, I observed only a 1/3 success rate. See Figure 17 for an example of a poisoned image under this budget. The clean model fc trained for poison brewing achieved 86% validation accuracy, almost 10 percentage points higher than the clean model for 32 × 32 images. This accuracy was not degraded by more than 5 percentage points by poisoning; stealth success was always achieved. This suggests a future direction of study into the relationship between model validation accuracy and robustness against poisoning attacks. Intuitively, higher validation accuracy suggests that more test images will be classified correctly, including target images xT. Geiping et al. (2021) found an inverse relationship between poison success (increased by a larger ϵ allowance) and validation accuracy.
My result seems to extend the inverse relationship pattern to differences in models’ ability to achieve high validation accuracy.

Figure 17: A 96 × 96 image of a frog with k = 27 patches poisoned.

10.2 Poison Brewing Loss Curves

Here I present a representative sample of the attacker surrogate loss B, the “witch’s loss” in Witches’ Brew terminology, as a function of iteration over the poison generation step.

(a) A brewing run to poison 64 × 64 images. Using gradient image selection and attention rollout patch selection with constraints k = 8, ϵ = 32, and a 1% poison image budget, this brew successfully poisoned 3/3 validation models.

(b) A brewing run to poison 64 × 64 images with the objective of classifying an airplane as a bird. Using random image selection and attention rollout patch selection with constraints k = 12, ϵ = 16, and a 1% poison image budget, this brew successfully poisoned 2/3 validation models.

Figure 18: A representative selection of poison brewing curves: B, the “witch’s loss”, versus attack steps during poison generation. Notice that the final loss is not diagnostic of attack success rate; run (b) achieved a smaller final loss than (a) yet had a lower labeling success rate.

11 References

Abnar, S., & Zuidema, W. (2020, May 31). Quantifying attention flow in transformers. Retrieved October 3, 2022, from http://arxiv.org/abs/2005.00928

Chakraborty, A., Alam, M., Dey, V., Chattopadhyay, A., & Mukhopadhyay, D. (2018). Adversarial attacks and defences: A survey. arXiv:1810.00069 [cs, stat]. Retrieved January 4, 2022, from http://arxiv.org/abs/1810.00069

Deng, J., Dong, W., Socher, R., Li, L.-J., Kai Li, & Li Fei-Fei. (2009). ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. https://doi.org/10.1109/CVPR.2009.5206848

Dong, X., Bao, J., Zhang, T., Chen, D., Zhang, W., Yuan, L., Chen, D., Wen, F., Yu, N., & Guo, B. (2022, December 7). PeCo: Perceptual codebook for BERT pre-training of vision transformers. Retrieved April 16, 2023, from http://arxiv.org/abs/2111.12710

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021, June 3). An image is worth 16x16 words: Transformers for image recognition at scale. Retrieved May 16, 2022, from http://arxiv.org/abs/2010.11929

Geiping, J., Fowl, L., Huang, W. R., Czaja, W., Taylor, G., Moeller, M., & Goldstein, T. (2021). Witches’ brew: Industrial scale data poisoning via gradient matching. arXiv:2009.02276 [cs]. Retrieved May 3, 2022, from http://arxiv.org/abs/2009.02276

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.

Gu, J., Tresp, V., & Qin, Y. (2022, July 18). Are vision transformers robust to patch perturbations? Retrieved October 3, 2022, from http://arxiv.org/abs/2111.10659

Hammoudeh, Z., & Lowd, D. (2022). Identifying a training-set attack’s target using renormalized influence estimation. Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, 1367–1381. https://doi.org/10.1145/3548606.3559335

He, K., Zhang, X., Ren, S., & Sun, J. (2015, December 10). Deep residual learning for image recognition. Retrieved April 16, 2023, from http://arxiv.org/abs/1512.03385

Huang, W. R., Geiping, J., Fowl, L., Taylor, G., & Goldstein, T. (2021). MetaPoison: Practical general-purpose clean-label data poisoning. NeurIPS. https://arxiv.org/pdf/2004.00225.pdf
Jagielski, M., Severi, G., Pousette Harger, N., & Oprea, A. (2021). Subpopulation data poisoning attacks. Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, 3104–3122. https://doi.org/10.1145/3460120.3485368

Joshi, A., Jagatap, G., & Hegde, C. (2021, October 8). Adversarial token attacks on vision transformers. Retrieved May 16, 2022, from http://arxiv.org/abs/2110.04337

Kingma, D. P., & Ba, J. (2017, January 29). Adam: A method for stochastic optimization. Retrieved April 16, 2023, from http://arxiv.org/abs/1412.6980

Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021, August 17). Swin transformer: Hierarchical vision transformer using shifted windows. Retrieved April 16, 2023, from http://arxiv.org/abs/2103.14030

Loshchilov, I., & Hutter, F. (2019, January 4). Decoupled weight decay regularization. Retrieved April 19, 2023, from http://arxiv.org/abs/1711.05101

Lv, P., Ma, H., Zhou, J., Liang, R., Chen, K., Zhang, S., & Yang, Y. (2021, November 22). DBIA: Data-free backdoor injection attack against transformer networks. Retrieved May 16, 2022, from http://arxiv.org/abs/2111.11870

Muñoz-González, L., Pfitzner, B., Russo, M., Carnerero-Cano, J., & Lupu, E. (2019). Poisoning attacks with generative adversarial nets. arXiv:1906.07773v2 [cs.LG]. https://arxiv.org/abs/1906.07773

Razmi, F., & Xiong, L. (2021). Classification auto-encoder based detector against diverse data poisoning attacks. arXiv:2108.04206 [cs]. Retrieved April 20, 2022, from http://arxiv.org/abs/2108.04206

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. https://doi.org/10.1038/323533a0

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015, January 29). ImageNet large scale visual recognition challenge. Retrieved April 16, 2023, from http://arxiv.org/abs/1409.0575

Shafahi, A., Huang, W. R., Najibi, M., Suciu, O., Studer, C., Dumitras, T., & Goldstein, T. (2018). Poison frogs! Targeted clean-label poisoning attacks on neural networks. arXiv:1804.00792 [cs, stat]. Retrieved January 19, 2022, from http://arxiv.org/abs/1804.00792

Subramanya, A., Saha, A., Koohpayegani, S. A., Tejankar, A., & Pirsiavash, H. (2022, June 16). Backdoor attacks on vision transformers. Retrieved October 7, 2022, from http://arxiv.org/abs/2206.08477

Sun, C., Shrivastava, A., Singh, S., & Gupta, A. (2017, August 3). Revisiting unreasonable effectiveness of data in deep learning era. Retrieved April 3, 2023, from http://arxiv.org/abs/1707.02968

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021, January 15). Training data-efficient image transformers & distillation through attention. Retrieved October 4, 2022, from http://arxiv.org/abs/2012.12877

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. https://doi.org/10.48550/ARXIV.1706.03762

Wang, X., Li, J., Kuang, X., Tan, Y.-a., & Li, J. (2019). The security of machine learning in an adversarial setting: A survey. Journal of Parallel and Distributed Computing, 130, 12–23. https://doi.org/10.1016/j.jpdc.2019.03.003

Xiao, H., Xiao, H., & Eckert, C. (2012). Adversarial label flips attack on support vector machines. European Conference on Artificial Intelligence, 6. https://doi.org/10.3233/978-1-61499-098-7-870
Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., & Wu, Y. (2022, June 13). CoCa: Contrastive captioners are image-text foundation models. Retrieved April 16, 2023, from http://arxiv.org/abs/2205.01917

Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019, August 7). CutMix: Regularization strategy to train strong classifiers with localizable features. Retrieved April 27, 2023, from http://arxiv.org/abs/1905.04899

Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2018, April 27). Mixup: Beyond empirical risk minimization. Retrieved February 11, 2023, from http://arxiv.org/abs/1710.09412

Zhao, M., An, B., Gao, W., & Zhang, T. (2017). Efficient label contamination attacks against black-box learning models. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 3945–3951. https://doi.org/10.24963/ijcai.2017/551

Zhou, C., Li, Q., Li, C., Yu, J., Liu, Y., Wang, G., Zhang, K., Ji, C., Yan, Q., He, L., Peng, H., Li, J., Wu, J., Liu, Z., Xie, P., Xiong, C., Pei, J., Yu, P. S., & Sun, L. (2023, February 18). A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT. Retrieved March 17, 2023, from http://arxiv.org/abs/2302.09419