CERTIFIED AND FORENSIC DEFENSES AGAINST POISONING AND BACKDOOR ATTACKS by ZAYD HAMMOUDEH A DISSERTATION Presented to the Department of Computer Science and the Division of Graduate Studies of the University of Oregon in partial fulfillment of the requirements for the degree of Doctor of Philosophy December 2023 DISSERTATION APPROVAL PAGE Student: Zayd Hammoudeh Title: Certified and Forensic Defenses against Poisoning and Backdoor Attacks This dissertation has been accepted and approved in partial fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Computer Science by: Daniel Lowd Chair Thien Nguyen Core Member Humphrey Shi Core Member Luca Mazzucato Institutional Representative and Krista Chronister Vice Provost for Graduate Studies Original approval signatures are on file with the University of Oregon Division of Graduate Studies. Degree awarded December 2023 2 © 2023 Zayd Hammoudeh All rights reserved. 3 DISSERTATION ABSTRACT Zayd Hammoudeh Doctor of Philosophy Department of Computer Science December 2023 Title: Certified and Forensic Defenses against Poisoning and Backdoor Attacks Data poisoning and backdoor attacks manipulate model predictions by inserting malicious instances into the training set. Most existing defenses against poisoning and backdoor attacks are empirical and easily evaded by an adaptive attacker. In addition, existing empirical defenses provide, at best, minimal insights into an attacker’s identity, goals, and methods. In contrast, this work proposes two classes of poisoning and backdoor defenses: (1) certified defenses, which provide provable guarantees on their robustness and (2) forensic defenses, which provide actionable, human-interpretable insights into an attack’s goals so as to stop the attack via intervention outside the ML system. We focus on certified defenses for regression, where the model predicts a continuous value, and sparse (ℓ0) attacks, where the adversary controls an unknown subset of the training and test features. Our forensic defense identifies the target of poisoning and backdoor attacks while simultaneously mitigating the attack; we validate our forensic defense on a wide range of data modalities, including speech, text, and vision. This dissertation includes previously published and unpublished coauthored material. 4 CURRICULUM VITAE NAME OF AUTHOR: Zayd Hammoudeh GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED: University of Oregon, Eugene, OR University of California Santa Cruz, Santa Cruz, CA San José State University, San Jose, CA Drexel University, Philadelphia, PA DEGREES AWARDED: Doctor of Philosophy, Computer Science, 2023, University of Oregon Master of Science, Computer Science, 2016, San José State University Master of Science, Computer Engineering, 2006, Drexel University Bachelor of Science, Computer Engineering, 2006, Drexel University AREAS OF SPECIAL INTEREST: Certified Adversarial Defenses Training Data Influence Analysis Data Poisoning Adversarial Machine Learning Positive-Unlabeled Learning PROFESSIONAL EXPERIENCE: ML Applied Scientist, Qualtrics, 2023–Present Graduate Researcher, University of Oregon, 2018–2023 Graduate Researcher, University of California Santa Cruz, 2017–2018 Wireless Power Engineer, Integrated Device Technology, 2011–2017 Applications Development Engineer, Teradyne, 2006-2011 Undergrad. Researcher & Teaching Assistant, Drexel University, 2003–2006 5 GRANTS, AWARDS, AND HONORS: Gurdeep Pall Graduate Student Fellowship, University of Oregon 2022 J. 
Donald Hubbard Family Scholarship, University of Oregon, 2021 Travel Award, International Joint Conference on Artificial Intelligence (IJCAI) 2019 Travel Award, SAT Association 2018 Travel Award, Federated Logic Conference (FLoC) 2018 Best Student Paper, International Conference on Theory and Applications of Satisfiability Testing (SAT) 2018 Chancellor's Fellowship, University of California, Santa Cruz 2017 Undergraduate Student Research Award, Drexel University 2005 Arnold H. Kaplan Scholastic Achievement and Academic Excellence Scholarship, Drexel University 2005 Alvin W. Wene Engineering Scholarship, Drexel University 2004 Teaching Assistant Excellence Award, Drexel University 2004

PUBLICATIONS:

W. You, Z. Hammoudeh, and D. Lowd. Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Against Text Classifiers. In: Findings of the Association for Computational Linguistics, EMNLP'23, 2023.

Z. Hammoudeh and D. Lowd. Feature Partition Aggregation: A Fast Certified Defense Against a Union of ℓ0 Attacks. In: Proceedings of the 2nd ICML Workshop on New Frontiers in Adversarial Machine Learning, AdvML-Frontiers'23, 2023.

W. You, Z. Hammoudeh, and D. Lowd. Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Against Text Classifiers. In: Proceedings of the 2nd ICML Workshop on New Frontiers in Adversarial Machine Learning, AdvML-Frontiers'23, 2023.

J. Brophy, Z. Hammoudeh, and D. Lowd. Adapting and Evaluating Influence-Estimation Methods for Gradient-Boosted Decision Trees. In: Journal of Machine Learning Research, vol. 24 (2023), pp. 1–48.

Z. Hammoudeh and D. Lowd. Reducing Certified Regression to Certified Classification for General Poisoning Attacks. In: Proceedings of the 1st IEEE Conference on Secure and Trustworthy Machine Learning, SaTML'23, 2023.

Z. Hammoudeh and D. Lowd. Identifying a Training-Set Attack's Target Using Renormalized Influence Estimation. In: Proceedings of the 29th ACM SIGSAC Conference on Computer and Communications Security, CCS'22, 2022.

Z. Hammoudeh and D. Lowd. Simple, Attack-Agnostic Defense Against Targeted Training Set Attacks Using Cosine Similarity. In: Proceedings of the 3rd ICML Workshop on Uncertainty and Robustness in Deep Learning, UDL'21, 2021.

Z. Xie, J. Brophy, A. Noack, W. You, K. Asthana, C. Perkins, S. Reis, Z. Hammoudeh, D. Lowd, and S. Singh. What Models Know About Their Attackers: Deriving Attacker Information From Latent Representations. In: Proceedings of the 4th BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2021.

Z. Hammoudeh and D. Lowd. Learning from Positive and Unlabeled Data with Arbitrary Positive Shift. In: Proceedings of the 34th Conference on Neural Information Processing Systems, NeurIPS'20, 2020.

S. Jamshidi, Z. Hammoudeh, R. Durairajan, D. Lowd, R. Rejaie, and W. Willinger. On the Practicality of Learning Models for Network Telemetry. In: Proceedings of the 4th Network Traffic Measurement and Analysis Conference, TMA'20, 2020.

Z. Hammoudeh and D. Lowd. Positive-Unlabeled Learning with Arbitrarily Non-Representative Labeled Data. In: Proceedings of the 37th International Conference on Machine Learning's Workshop on Uncertainty & Robustness in Deep Learning, UDL'20, 2020.

D. Achlioptas, Z. Hammoudeh, and P. Theodoropoulos. Fast Sampling of Perfectly Uniform Satisfying Assignments. In: Proceedings of the 21st International Conference on Theory and Applications of Satisfiability Testing, SAT'18, 2018. (Best Student Paper Award.
Authors alphabetical) 7 Z. Hammoudeh and C. Pollett. Clustering-Based, Fully Automated Mixed-Bag Jigsaw Puzzle Solving. In Proceedings of 17th International Conference on Computer Analysis of Images and Patterns, CAIP’17, 2017. 8 To my mother. 9 TABLE OF CONTENTS Chapter Page 1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . 34 2. PRELIMINARIES . . . . . . . . . . . . . . . . . . . . . . . . 37 2.1. Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.2. On Attacker Threat Models . . . . . . . . . . . . . . . . . . 39 2.3. On the Defender Objectives . . . . . . . . . . . . . . . . . . 39 3. RELATED WORK . . . . . . . . . . . . . . . . . . . . . . . . 40 3.1. Defenses Against Evasion Attacks . . . . . . . . . . . . . . . . 41 3.1.1. Empirical Defenses . . . . . . . . . . . . . . . . . . . . 42 3.1.2. Certified Evasion Defenses . . . . . . . . . . . . . . . . 43 3.2. Defenses Against Poisoning and Backdoor Attacks . . . . . . . . . 44 3.2.1. Empirical Classification Defenses . . . . . . . . . . . . . . 45 3.2.2. Certified Pointwise Classifiers . . . . . . . . . . . . . . . 46 3.2.3. Robust Regression . . . . . . . . . . . . . . . . . . . . 48 3.2.3.1. Resilient Regression . . . . . . . . . . . . . . . 48 3.2.3.2. Certified Regression . . . . . . . . . . . . . . . 49 3.3. Defenses Outside the ML System . . . . . . . . . . . . . . . . 50 4. REDUCING CERTIFIED REGRESSION TO CERTIFIED CLASSIFICATION . . . . . . . . . . . . . . . . . . . . . . . . 51 4.1. Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.1.1. One-Sided vs. Two-Sided Certification Bounds . . . . . . . . 53 4.1.2. Relating Regression and Binary Classification . . . . . . . . 53 4.2. Warmup: Perturbing a Set’s Median . . . . . . . . . . . . . . . 54 4.2.1. Unweighted Swap Paradigm . . . . . . . . . . . . . . . . 55 10 Chapter Page 4.2.2. Insertion/Deletion Paradigm . . . . . . . . . . . . . . . 55 4.2.3. Weighted Swap Paradigm . . . . . . . . . . . . . . . . . 57 4.3. Reducing Regression to Voting-Based Binary Classification . . . . . 58 4.4. Certified Instance-Based Regression . . . . . . . . . . . . . . . 61 4.4.1. Fixed-Population Neighborhood . . . . . . . . . . . . . . 62 4.4.2. Region-Based Neighborhood . . . . . . . . . . . . . . . . 63 4.4.3. Computational Complexity . . . . . . . . . . . . . . . . 64 4.5. Certified Regression for General Models . . . . . . . . . . . . . 65 4.5.1. Partitioned Certified Regression . . . . . . . . . . . . . . 65 4.5.2. Weighted Partitioned Certified Regression . . . . . . . . . . 67 4.5.3. Computational Complexity . . . . . . . . . . . . . . . . 68 4.6. Certified Regression Using Overlapping Training Data . . . . . . . 69 4.6.1. Overlapping Certified Regression . . . . . . . . . . . . . . 69 4.6.2. Weighted Overlapping Certified Regression . . . . . . . . . 71 4.6.3. Computational Cost . . . . . . . . . . . . . . . . . . . 72 4.7. Certifying Any Model Beyond Unit Cost . . . . . . . . . . . . . 73 4.7.1. Combining Instance-Based Learners & Ensembles . . . . . . 74 4.7.2. Certifying Non-Unit Costs by Construction . . . . . . . . . 74 4.7.3. More Submodels vs. Weighted Costs . . . . . . . . . . . . 76 4.8. Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.8.1. Experimental Setup . . . . . . . . . . . . . . . . . . . 77 4.8.2. Analyzing the Certified Accuracy . . . . . . . . . . . . . 79 4.9. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 81 5. CERTIFIED DEFENSE AGAINST A UNION OF ℓ0 ATTACKS . . . . 85 5.1. 
Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . 87 11 Chapter Page 5.2. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.3. Certifying Feature Robustness . . . . . . . . . . . . . . . . . 91 5.3.1. Feature Robustness Under Plurality Voting . . . . . . . . . 92 5.3.2. Feature Robustness Under Run-Off Elections . . . . . . . . 93 5.3.3. Advantages of Feature Partition Aggregation . . . . . . . . 96 5.4. Feature Partitioning Strategies . . . . . . . . . . . . . . . . . 96 5.4.1. Feature Partitioning Paradigms . . . . . . . . . . . . . . 97 5.5. Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.5.1. Experimental Setup . . . . . . . . . . . . . . . . . . . 98 5.5.2. Main Results . . . . . . . . . . . . . . . . . . . . . . 101 5.6. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 104 6. IDENTIFYING POISONING AND BACKDOOR ATTACK TARGETS WHILE MITIGATING THE ATTACK . . . . . . . . . . 105 6.1. Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.2. Review of Influence Analysis and Estimation . . . . . . . . . . . 110 6.3. Why Influence Estimation Often Fails and How to Fix It . . . . . . 115 6.3.1. A Simple Experiment . . . . . . . . . . . . . . . . . . . 115 6.3.2. Why Influence Estimation Performs Poorly . . . . . . . . . 116 6.3.3. Renormalizing Influence Estimation . . . . . . . . . . . . 120 6.3.4. Renormalization and More Advanced Attacks . . . . . . . . 122 6.3.5. Renormalization and Non-Adversarial Data . . . . . . . . . 125 6.4. Identifying Attack Targets . . . . . . . . . . . . . . . . . . . 127 6.4.1. Measuring (Renormalized) Influence . . . . . . . . . . . . 128 6.4.2. Identifying Anomalous Influence . . . . . . . . . . . . . . 130 6.4.3. Target Driven Attack Mitigation . . . . . . . . . . . . . . 133 6.5. Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 134 12 Chapter Page 6.5.1. Training-Set Attacks Evaluated . . . . . . . . . . . . . . 135 6.5.2. Identifying Adversarial Set Dadv . . . . . . . . . . . . . . 137 6.5.3. Identifying Attack Targets . . . . . . . . . . . . . . . . 139 6.5.4. Target-Driven Mitigation . . . . . . . . . . . . . . . . . 141 6.6. Adaptive Attacks . . . . . . . . . . . . . . . . . . . . . . . 142 6.7. Discussion and Conclusions . . . . . . . . . . . . . . . . . . . 146 7. CONCLUSIONS AND FUTURE DIRECTIONS . . . . . . . . . . . 149 APPENDICES A. NOMENCLATURE REFERENCE . . . . . . . . . . . . . . . . 152 B. PROOFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 B.1. Proofs for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . 159 B.2. Proofs for Chapter 5 . . . . . . . . . . . . . . . . . . . . . . 169 B.3. Proof for Chapter 6 . . . . . . . . . . . . . . . . . . . . . . 174 C. DETAILED EMPIRICAL RESULTS . . . . . . . . . . . . . . . 177 C.1. Chapter 4 Detailed Results . . . . . . . . . . . . . . . . . . . 178 C.1.1. Baseline Accuracy . . . . . . . . . . . . . . . . . . . . 178 C.1.2. Numerical Results . . . . . . . . . . . . . . . . . . . . 178 C.1.3. kNN-CR Full Certified Accuracy Plots . . . . . . . . . . . 185 C.2. Chapter 5 Detailed Results . . . . . . . . . . . . . . . . . . . 187 C.2.1. Non-Robust Accuracy . . . . . . . . . . . . . . . . . . 187 C.2.2. Detailed Median Certified Robustness Results . . . . . . . . 188 C.2.3. Feature Partition Aggregation vs. Randomized Ablation Certified Accuracy Detailed Comparison . . . . . . . . . . 193 C.2.3.1. Numerical Comparison of Feature Partition Aggregation and Randomized Ablation . . . . . . . 
194 13 Chapter Page C.2.3.2. Graphical Comparison of Feature Partition Aggregation and Randomized Ablation . . . . . . . 199 C.3. Chapter 6 Detailed Results . . . . . . . . . . . . . . . . . . . 203 C.3.1. Speech Recognition Backdoor Full Results . . . . . . . . . . 203 C.3.2. Vision Backdoor Full Results . . . . . . . . . . . . . . . 205 C.3.3. Natural Language Poisoning Full Results . . . . . . . . . . 208 C.3.4. Vision Poisoning Full Results . . . . . . . . . . . . . . . 211 C.4. Convex Polytope Poisoning and GAS Joint Optimization . . . . . 214 C.4.1. Adversarial-Set Identification of the Jointly Optimized Poisoning Attack . . . . . . . . . . . . . . . . . . . . . 218 C.4.2. Target Identification of the Jointly Optimized Poisoning Attack 220 C.4.3. Target-Driven Attack Mitigation of the Jointly Optimized Poisoning Attack . . . . . . . . . . . . . . . . . . . . . 222 D. EVALUATION SETUPS . . . . . . . . . . . . . . . . . . . . . 223 D.1. Evaluation Setup for the Experiments in Chapter 4 . . . . . . . . 224 D.1.1. Dataset Configuration . . . . . . . . . . . . . . . . . . 224 D.1.2. Dataset Target Value Statistics . . . . . . . . . . . . . . 225 D.1.3. Hyperparameters . . . . . . . . . . . . . . . . . . . . 226 D.2. Evaluation Setup for the Experiments in Chapter 5 . . . . . . . . 228 D.2.1. Hardware Setup . . . . . . . . . . . . . . . . . . . . . 228 D.2.2. Baselines . . . . . . . . . . . . . . . . . . . . . . . . 228 D.2.3. Datasets . . . . . . . . . . . . . . . . . . . . . . . . 230 D.2.4. Network Architectures . . . . . . . . . . . . . . . . . . 231 D.2.5. Hyperparameters . . . . . . . . . . . . . . . . . . . . 233 D.3. Evaluation Setup for the Experiments in Chapter 6 . . . . . . . . 235 D.3.1. Dataset Configurations . . . . . . . . . . . . . . . . . . 235 14 Chapter Page D.3.1.1. Training Set Sizes . . . . . . . . . . . . . . . . 237 D.3.1.2. Target Set Sizes . . . . . . . . . . . . . . . . . 237 D.3.2. Hyperparameters . . . . . . . . . . . . . . . . . . . . 238 D.3.2.1. Model Training . . . . . . . . . . . . . . . . . 238 D.3.2.2. Upper-Tail Heaviness Hyperparameters . . . . . . . 238 D.3.2.3. Target-Driven Mitigation Hyperparameters . . . . . 239 D.3.2.4. Adversarial Set Dadv Crafting . . . . . . . . . . . 240 D.3.2.5. Baselines . . . . . . . . . . . . . . . . . . . . 242 D.3.3. Network Architectures . . . . . . . . . . . . . . . . . . 244 REFERENCES CITED . . . . . . . . . . . . . . . . . . . . . . . . 247 15 LIST OF FIGURES Figure Page 1. Unweighted Median Perturbation: (1a) Blue denotes elements in subset Vl, i.e., elements in V with value at most ξ = 5.4. Vu’s values are red. Each “swap” (1b) switches a value in Vl with an arbitrarily large replacement. Deletions (1d) and insertions (1c) are interchangeable (suppl. Lemma B.1), with both yielding the same median value in the same number of modifications made to V. In Figs. 1b to 1d above, any additional modifications to the set would perturb the median. . . . . . . . . . . . . . . . . . . . . . . . 56 2. Weighted Swap Paradigm: Extension of Fig. 1 to weighted costs. For simplicity and w.l.o.g., let R = {3, . . . , 7}, i.e., ∀l rl = νl + 1 Fig. 2a is identical to Fig. 1a except below each element νl is its corresponding weight rl. Observe ∆ = 1 and R̃l = {3, 4}. Fig. 2b shows that for R = 6 (visualized below each element), it is impossible to perturb the median, and any additional weight would be sufficient to swap out ν2 = 3. . . . . . . . . . . . . . . 57 3. 
Certified Regression to Certified Classification Reduction: For xte ∈ X , the decision function is f(xte) := medV – just like voting-based certified classification. Certified regression binarizes V into V±1, which is used by the robustness certifier (optionally with weights R) to determine R. . . . . . . . . . . . . . . . . . . . . 59 4. Certified Instance-Based Regression: Fig. 4a visualizes an unperturbed IBL model. Test instance xte’s neighborhood is visualized as a dashed line with neighborhood N (xte) identical to V in Fig. 1a. Fig. 4b shows an attack on a kNN-m model where the neighborhood’s cardinality (L = 5) is fixed, and the one attack instance ( ) replaces one instance in Vl ( ) (source Fig. 1b). A rNN-median model is shown in Fig. 4c, where the two inserted instances ( ) do not change the neighborhood’s radius (source Fig. 1c). 63 5. Overlapping Certified Ensemble: Simple visualization of the ensemble architecture for (weighted) overlapping certified regression. Function htr partitions training set D into (m = 7) blocks. Function hf defines each of the L = 5 submodel training sets, D1, . . . ,D5. The ensemble prediction is the median submodel prediction, i.e., f(xte) := med {fl(xte; 1), . . . , fl(xte;L)}. . . . . . . . . 68 16 Figure Page 6. Overlapping Certified Regression Integer Linear Program: Adapted from the partial set (multi)cover integer linear program. Calculates certified robustness R for both OCR and W-OCR with indicator variable σ adjusting the program to account for weighted costs. For arbitrary feature vector xte, Tl is the set of submodels that predict f (j)l(xte) ≤ ξ. Variable ω contains the number of modifications made to training set block D(j). Binary variable δl = 1 if submodel fl has been sufficiently modified for fl(xte) > ξ and 0 otherwise. . . . . . . . . . 73 7. Certified Accuracy: Mean certified accuracy (larger is better) for our five primary certified regressors. kNN-CR is always trained on all of training set D (i.e., q = 1). Ensemble submodels are trained on 1 -th of D, with three q values tested per dataset. The x-axis is q clipped to enhance readability; see suppl. Sec. C.1.3 for kNN-CR’s full results. The best performing method depends on the target certified robustness R. For smaller R values, W-OCR achieves the best certified accuracy. For larger R values, kNN-CR outperforms the ensemble methods. This result aligns with previous findings on certified classification [Jia+22a]. Sec. 4.8.2 summarizes these experiments’ primary takeaways. Figure continued on the next page. . 83 7. Certified Accuracy (cont.): Mean certified accuracy (larger is better) for our five primary certified regressors. kNN-CR is always trained on all of training set D (i.e., q = 1). Ensemble submodels are trained on 1 -th of D, with three q values tested per q dataset. The x-axis is clipped to enhance readability; see suppl. Sec. C.1.3 for kNN-CR’s full results. The best performing method depends on the target certified robustness R. For smaller R values, W-OCR achieves the best certified accuracy. For larger R values, kNN-CR outperforms the ensemble methods. This result aligns with previous findings on certified classification [Jia+22a]. Sec. 4.8.2 summarizes these experiments’ primary takeaways. See Sec. C.1 for the numerical results, including variance. . . . . . . . . . . . . . 84 17 Figure Page 8. Feature partition aggregation example prediction for: test instance x ∈ X , n = 3, d = 4, and |Y| = 3. 
Feature partitioning across L = 4 submodels, where the l-th submodel uses only feature dimensions Sl = {l} ⊂ [4] and training set Dl, i.e., the tuple containing the l-th column of feature matrix X (denoted Xl) and label vector y := [y1, y2, y3]. xS denotes the subvector of x restrictedl to the feature dimensions in Sl. Plurality label ypl = 0; runner-up label yru = 1; and run-off label yRO = 0. Under the plurality voting decision function (Sec. 5.3.1), f(x) has certified feature robustness Rpl = 0. With run-off (Sec. 5.3.2), f(x)’s certified feature robustness is RRO = 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 9. Renormalized Influence: CIFAR10 & MNIST joint, binary classification for [frog] vs. [airplane & MNIST 0] with |Dcl| = 10,000 & |Dadv| = 150. Existing influence estimators (upper half) consistently failed to rank Dadv’s MNIST training instances as highly influential on MNIST test instances. In contrast, all of our renormalized influence estimators (Section 6.3.3) outperformed their unnormalized version – with AUPRC improving up to 25×. Results averaged across 30 trials. . . . . . . . . . . . . . . . . . . . . 111 10. CIFAR10 and MNIST Intra-training Loss Tracking: Dadv’s ( ) and Dcl’s ( ) median cross-entropy losses (L ) at each training checkpoint for binary classification – frog vs. airplane & MNIST 0. The shaded regions correspond to each training set loss’s interquartile range. MNIST’s training losses are generally several orders of magnitude smaller than CIFAR10’s losses. Gradient norm ratio ( ) shows the tight coupling of loss and training gradient magnitude. . . . . . . . . . . . . . . . . . . . . . . . 117 11. Layerwise Decomposition of an Attack Target’s Intra-Training Gradient Magnitude: One-pixel and blend backdoor adversarial triggers (dashed and solid lines respectively) trained separately on CIFAR10 binary classification (ytarg = airplane and yadv = bird) using ResNet9. The network’s first convolutional (Conv1) and final linear layers are a small fraction of the parameters (0.03% and 0.01% resp.) but constitute most of the target’s gradient magnitude (∥ĝtarg∥) with the dominant layer attack dependent. Results are averaged over 20 trials. . . . . . . 124 18 Figure Page 12. Effect of Removing Influential, Non-Adversarial Training Data: Test example zfilt’s misclassification rate (larger is better) when filtering the training set using influence rankings based on influence functions (top) and TracIn (bottom). Renormalization (Rn.) always improved mean performance across all training set filtering percentages. Results are averaged across five CIFAR10 class pairs with 30 trials per class pair and 20 models trained per method per trial. Results are separated by the reference influence estimator. . . . . . . . . . . . . . . . . . . . . . . . 126 13. GAS renormalized influence, v, density distributions for two training set attacks: CIFAR10 vision poisoning [Zhu+19] (ytarg = dog and yadv = bird) and speech-recognition backdoor [Liu+18] (ytarg = 4 and yadv = 5). Theoretical normal ( ) is w.r.t. D := Dadv ∪ D. Observe that target examples (Figs. 13a and 13d) have significant Dadv mass ( ) well to the right of Dcl’s mass ( ). This upper-mass phenomenon is absent in non-targets (Figs. 13b and 13e). Training example gradient norms (Fig. 13c and 13f) are poorly correlated with whether the training example is adversarial. For example, speech recognition has Dcl mass well to the right of even the right-most Dadv mass, necessitating renormalization. 
See Sections 6.5.1 and D.3 for more details on these attacks. . . . . . . . 129 14. Adversarial-Set Identification: Mean AUPRC identifying adversarial set Dadv using a randomly selected target for Sec. 6.5.1’s four attacks. Results averaged across related setups with ≥10 trials per setup. See supplemental Section C.3 for the full granular results. . 138 15. Static Influence Adversarial-Set Identification: Comparing the mean adversarial-set identification AUPRC of the static influence estimators and their corresponding renormalized (Rn.) versions. For all attacks, renormalization improved the static estimators’ mean performance by up to a factor of >600×. These experiments also highlight layerwise renormalization’s performance gains, e.g., influence functions on natural-language poison. Results are averaged across related experimental setups with ≥10 trials per setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 16. Target Identification: Mean target identification AUPRC for Sec. 6.5.1’s four attacks. “FIT w/ GAS” denotes GAS was FIT’s influence estimator with matching notation for GAS-L. Results averaged across setups with ≥10 trials per setup. See Sec. C.3 for the full granular results. . . . . . . . . . . . . . . . . . . . . . 140 19 Figure Page 17. Adversarial-Set Identification for the Adaptive Vision Poison Attack: Mean AUPRC identifying the adversarial set where Zhu et al.’s vision poison attack is adapted to jointly minimize the adversarial loss and the GAS influence. The baseline results (orange) used Zhu et al.’s standard attack. Our jointly- optimized attack reduced the GAS similarity by 7% at the cost of a 19% decrease in ASR w.r.t. Table 6. See suppl. Sec. C.4 for the granular results. . . . . . . . . . . . . . . . . . . . . . . . . 148 18. Target Identification for the Adaptive Vision Poison Attack: Mean target identification AUPRC where Zhu et al.’s vision poison attack is jointly optimized with minimizing GAS. FIT with GAS’s mean target identification AUPRC declined only 9% versus the baseline – an average change in target rank of 1.16 to 1.28 – still strong performance. Results are averaged across related setups with ≥10 trials per setup. See suppl. Sec. C.4 for the full results. . . . . . 148 C.19. kNN-CR vs. W-OCR Certified Accuracy: Full plots of the mean certified accuracy for Sec. 4.8’s six datasets. The shaded regions visualize one standard deviation of the certified accuracy for each R value. W-OCR’s q value for each dataset is in Table C.19. . . 186 C.20. Classification certified accuracy envelope for datasets CIFAR10 (d = 1024) and MNIST (d = 784) for feature partition aggregation (FPA) and baseline randomized ablation (RA). Each method’s envelope considers the corresponding hyperparameters in Tables C.25 and C.26, emulating a certified defense where the hyperparameters are roughly tuned to maximize the certified accuracy at each robustness level. Subfigures C.20a and C.20b visualize each method’s certified accuracy envelope (larger is better); also shown in these subfigures is a naive baseline where the decision function always predicts label f(x) = 1. Subfigures C.20c and C.20d visualize the improvement in certified accuracy when using FPA with the run-off decision function over the two randomized ablation baselines from Levine and Feizi [LF20b] and Jia et al. [Jia+22b]. The envelope plots’ underlying numerical values are provided in Table C.25 for CIFAR10 and Table C.26 for MNIST. . . . . . . . . 201 20 Figure Page C.21. 
Regression certified accuracy envelope for the Weather [Mal+21] (d = 128) and Ames [Coc11] (d = 352) datasets for feature partition aggregation (FPA) and baseline randomized ablation (RA). Each method’s envelope considers the corresponding hyperparameters in Tables C.27 and C.28, emulating a certified defense where the hyperparameters are tuned to maximize each robustness level’s certified accuracy. Subfigures C.21a and C.21b visualize each method’s certified accuracy envelope (larger is better); also shown in these subfigures is a naive baseline that always predicts the median training data target value. Subfigures C.21c and C.21d visualize the improvement in certified accuracy when using FPA (with plurality voting) as the decision function over the two randomized ablation baselines from Levine and Feizi [LF20b] and Jia et al. [Jia+22b]. FPA outperforms randomized ablation for smaller certified robustness values, while Jia et al.’s [Jia+22b] version of RA marginally outperformed both FPA and the naive baseline at larger robustness values. The envelope plots’ underlying numerical values are provided in Table C.27 for Weather and Table C.28 for Ames. . . . . . . . . . . . . . . . . . . . . . . 202 C.22. Speech Backdoor Adversarial Set Identification: Mean backdoor set (Dadv) identification AUPRC across 30 trials for all 10 class pairs with 21 ≤ |Dadv| ≤ 28 (varies by class pair, see Tab. D.57). GAS and GAS-L outperformed all baselines in all experiments, with GAS-L the overall top performer on 6/10 class pairs. See Table C.29 for the numerical results. . . . . . . . . . . . 203 C.23. Speech Backdoor Target Identification: See Table C.30 for numerical results. . . . . . . . . . . . . . . . . . . . . . . . . 204 C.24. Vision Backdoor Adversarial-Set Identification: Backdoor set, Dadv, identification mean AUPRC across >30 trials for Weber et al.’s [Web+23] three CIFAR10 backdoor attack patterns with a randomly selected reference ẑtarg. All experiments performed binary classification on randomly-initialized ResNet9. |Dadv| = 150. Notation ytarg → yadv. See Table C.32 for the numerical results. . . . . 205 C.25. Vision Backdoor Target Identification: Mean target identification AUPRC across 15 trials for Weber et al.’s [Web+23] three CIFAR10 backdoor attack patterns and randomly selected reference ẑtarg. All experiments performed binary classification on randomly-initialized ResNet9. |Dadv| = 150. Notation ytarg → yadv. See Table C.33 for the numerical results. . . . . . . . . . . . . . . 206 21 Figure Page C.26. Natural Language Poisoning Adversarial-Set Identification: See Table C.35 for the numerical results. . . . . . . . . . . . . . . 208 C.27. Natural Language Poisoning Target Identification: See Table C.36 for the numerical results. . . . . . . . . . . . . . . . 209 C.28. Vision Poisoning Adversarial-Set Identification: Adversarial set (Dadv) identification mean AUPRC across >15 trials for four CIFAR10 class pairs with |Dadv| = 50. Our renormalized influence estimators, GAS and GAS-L, using just initial parameters θ0 and with 5 subepoch checkpointing outperformed all baselines for all class pairs. . . . . . . . . . . . . . . . . . . . . . . . . . . . 211 C.29. Vision Poisoning Target Identification: See Table C.39 for the numerical results. . . . . . . . . . . . . . . . . . . . . . . . . 213 C.30. 
Adversarial-Set Identification for the Adaptive Vision Poison Attack: Mean AUPRC identifying the adversarial set where Zhu et al.’s vision poison attack is jointly optimized with minimizing GAS with ≥10 trials per setup as described in Section C.4. Section 6.6’s baseline results set trade-off hyperparameter β = 0, meaning the poison was not jointly optimized. The jointly optimized results used β = 10−2 as explained in suppl. Section C.4. This joint optimization reduces the GAS similarity by 7% at the cost of a 19% decrease in ASR w.r.t. Table 6. See Table C.42 (below) for the numerical results. . . . . . . . . . . 219 C.31. Target Identification for the Adaptive Vision Poison Attack: Mean target identification AUPRC where Zhu et al.’s [Zhu+19] vision poison attack is jointly optimized with minimizing GAS. Section 6.6’s baseline results set trade-off hyperparameter β = 0, meaning the poison was not jointly optimized. The jointly optimized results used β = 10−2 as explained in suppl. Section C.4. See Table C.43 (below) for the numerical results. . . . . . . . . . . 221 22 LIST OF TABLES Table Page 1. Evaluation Dataset Summary: Training set size (n), data dimension, overlapping spread degree (d), error threshold (ξ), and submodel architecture for the six datasets. Error thresholds that are a percentage of each instance’s true target value are denoted X% · y. Alternate ξ values are evaluated in the original paper [HL23c, Fig. 9]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 2. Median certified robustness. Each dataset’s best performing method is in bold. Our median robustness was 20–30% larger for classification and 3 to 4× larger for regression while simultaneously providing stronger guarantees. For detailed results, see Sec. C.2.2. . . . 101 3. Classification accuracy (% – larger is better). We report FPA’s accuracy at both RA’s (middle, bold) and FPA’s (blue) best median robustness levels. At RA’s best median robustness, FPA had better classification accuracy for all four datasets. For full results, see Sec. C.2.2. . . . . . . . . . . . . . . . . . . . . . . 101 4. CIFAR10 certified patch accuracy (% – larger is better) for FPA, RA, and three dedicated patch defenses. FPA is competitive despite making fewer assumptions and providing stronger guarantees than patch defenses. . . . . . . . . . . . . . . . . . . . . . . 103 5. Mean certification time in seconds for FPA and Jia et al.’s [Jia+22b] randomized ablation (RA). FPA is 2 to 3 orders of magnitude faster than baseline RA. . . . . . . . . . . . . . . . . 103 6. Target Driven Attack Mitigation: Alg. 6’s target-driven, iterative data sanitization applied to Sec. 6.5.1’s four attacks for randomly selected targets. The attacks were neutralized with few clean instances removed and little change in test accuracy. Attack success rate (ASR) is w.r.t. the analyzed target. Results are averaged across related setups with ≥10 trials per setup. Detailed results appear in Sec. C.3.1–C.3.4. . . . . . . . . . . . . . . . . . 143 23 Table Page 7. Attack Mitigation for the Adaptive Vision Poison Attack: Algorithm 6’s target-driven data sanitization where Zhu et al.’s [Zhu+19] vision poison attack is jointly optimized with minimizing the GAS influence. The results below consider exclusively the jointly-optimized attack with β = 10−2. Clean-data removal remains low, and test accuracy either improved or stayed the same for in but one setup. 
The performance is comparable to the results with Zhu et al.’s [Zhu+19]’s standard vision poisoning attack (see Table C.40). Bold denotes the best mean performance with ≥10 trials per class pair. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 A.8. General Nomenclature Reference: This table contains symbols that are relevant to one or more chapters in this dissertation. Related symbols are grouped together with groups separated by dotted lines. . . . . . . . . . . . . . . . . . . . . . . . . . . 152 A.9. Chapter 4 Nomenclature Reference: Notation specific to Chapter 4 with related symbols grouped together. Groups are separated by dotted lines. Note that this table spans multiple pages. . . 153 A.9. Chapter 4 Nomenclature Reference (Continued): Notation specific to Chapter 4 with related symbols grouped together. Groups are separated by dotted lines. Note that this table spans multiple pages. . . . . . . . . . . . . . . . . . . . . . . . . . 154 A.10. Chapter 5 Nomenclature Reference: Notation specific to Chapter 5 with related symbols grouped together. Groups are separated by dotted lines.Note that this table spans multiple pages. . . 155 A.10. Chapter 5 Nomenclature Reference (Continued): Notation specific to Chapter 5 with related symbols grouped together. Groups are separated by dotted lines. Note that this table spans multiple pages. . . . . . . . . . . . . . . . . . . . . . . . . . 156 A.11. Chapter 6 Nomenclature Reference: Notation specific to Chapter 6 with related symbols grouped together. Groups are separated by dotted lines. Note that this table spans multiple pages. . . 157 A.11. Chapter 6 Nomenclature Reference (Continued): Notation specific to Chapter 6 with related symbols grouped together. Groups are separated by dotted lines. Note that this table spans multiple pages. . . . . . . . . . . . . . . . . . . . . . . . . . 158 24 Table Page C.12. Baseline Accuracy: Summary of the baseline (i.e., uncertified) accuracy mean and standard deviation for Sec. 4.8’s six datasets. Submodels were trained on all of training set D (i.e., q = 1). Beside each dataset’s name is the submodel architecture used by the ensemble. Threshold ξ matches values in Table 1. . . . . . . . . . . 178 C.13. Ames Housing Full Results: Certified accuracy mean and standard deviation for the Ames Housing [Coc11] dataset. Each ensemble submodel was trained on 1 -th of the training set with q three q values tested per dataset, while kNN-CR was always trained on the whole training set (i.e., q = 1). The certified accuracy results of five robustness values (R) are reported per q value. Also reported as a baseline is the uncertified accuracy (R = 0) when training a single model on all of training set D (q = 1). Results are averaged across 10 trials per method, with each R’s best mean certified accuracy in bold. . . . . . . . . . . . . . . . . . . . . 179 C.14. Austin Housing Full Results: Certified accuracy mean and standard deviation for the Austin Housing [Pie21] dataset. Each ensemble submodel was trained on 1 -th of the training set with q three q values tested per dataset, while kNN-CR was always trained on the whole training set (i.e., q = 1). The certified accuracy results of five robustness values (R) are reported per q value. Also reported as a baseline is the uncertified accuracy (R = 0) when training a single model on all of training set D (q = 1). Results are averaged across 10 trials per method, with each R’s best mean certified accuracy in bold. . . . . . . . . . . . . . . . . . . . . 
180 C.15. Diamonds Full Results: Certified accuracy mean and standard deviation for the Diamonds [Wic16] dataset. Each ensemble submodel was trained on 1 -th of the training set with three q values q tested per dataset, while kNN-CR was always trained on the whole training set (i.e., q = 1). The certified accuracy results of five robustness values (R) are reported per q value. Also reported as a baseline is the uncertified accuracy (R = 0) when training a single model on all of training set D (q = 1). Results are averaged across 10 trials per method, with each R’s best mean certified accuracy in bold. 181 25 Table Page C.16. Weather Full Results: Certified accuracy mean and standard deviation for the Weather [Mal+21] dataset. Each ensemble submodel was trained on 1 -th of the training set with three q values q tested per dataset, while kNN-CR was always trained on the whole training set (i.e., q = 1). The certified accuracy results of five robustness values (R) are reported per q value. Also reported as a baseline is the uncertified accuracy (R = 0) when training a single model on all of training set D (q = 1). Results are averaged across 10 trials per method, with each R’s best mean certified accuracy in bold. 182 C.17. Life Full Results: Certified accuracy mean and standard deviation for the Life [Raj21] dataset. Each ensemble submodel was trained on 1 -th of the training set with three q values tested per dataset, q while kNN-CR was always trained on the whole training set (i.e., q = 1). The certified accuracy results of five robustness values (R) are reported per q value. Also reported as a baseline is the uncertified accuracy (R = 0) when training a single model on all of training set D (q = 1). Results are averaged across 10 trials per method, with each R’s best mean certified accuracy in bold. . . . 183 C.18. Spambase Full Results: Certified accuracy mean and standard deviation for the Spambase [Hop+17] dataset. Each ensemble submodel was trained on 1 -th of the training set with three q values q tested per dataset, while kNN-CR was always trained on the whole training set (i.e., q = 1). The certified accuracy results of five robustness values (R) are reported per q value. Also reported as a baseline is the uncertified accuracy (R = 0) when training a single model on all of training set D (q = 1). Results are averaged across 10 trials per method, with each R’s best mean certified accuracy in bold. 184 C.19. W-OCR q Values: As detailed in Sec. 4.8.1, ensemble submodels were trained on 1 -th of the training data where q varies by dataset. q Below are the W-OCR q values used in Fig. C.19. . . . . . . . . . 185 C.20. Non-Robust Accuracy: Prediction accuracy when training a single model on all model features, i.e., L = 1. These values represent an upper bound on the potential accuracy of our method given the training set, model architecture, and hyperparameters. . . . 187 26 Table Page C.21. CIFAR10 Detailed Results: Classification accuracy (%) and median certified robustness (larger is better) for the CIFAR10 [KNH14] dataset (d = 1024) for our certified sparse defense, feature partition aggregation (FPA), and baseline randomized ablation (RA) across various hyperparameter settings. Each certification method’s hyperparameter setting with the best median robustness is shown in bold. The best overall median robustness is shown in blue. . . . . . . . . . . . . . . . . . . . 189 C.22. 
MNIST Detailed Results: Classification accuracy (%) and median certified robustness (larger is better) for the MNIST [LeC+98] dataset (d = 784) for our certified sparse defense, feature partition aggregation (FPA), and baseline randomized ablation (RA) across various hyperparameter settings. Each certification method’s hyperparameter setting with the best median robustness is shown in bold. The best overall median robustness is shown in blue. . . . . . . . . . . . . . . . . . . . . . . . . . 190 C.23. Weather Detailed Results: Classification accuracy (%) and median certified robustness (larger is better) for the Weather [Mal+21] dataset (d = 128) for our certified sparse defense, feature partition aggregation (FPA), and baseline randomized ablation (RA) across various hyperparameter settings. FPA considers only plurality voting-based certification (Sec. 5.3.1) since the reduction is from certified regression to certified binary classification. FPA results are reported using both GBDTs [Ke+17] and linear submodels. Median robustness “−∞” denotes that the classification accuracy was less than 50%. Each approach’s hyperparameter setting with the best median robustness is shown in bold. The best overall median robustness is shown in blue. . . . . . 191 C.24. Ames Detailed Results: Classification accuracy (%) and median certified robustness (larger is better) for the Ames [Coc11] dataset (d = 352) for our certified sparse defense, feature partition aggregation (FPA), and baseline randomized ablation (RA) across various hyperparameter settings. FPA considers only plurality voting-based certification (Sec. 5.3.1) since the reduction is from certified regression to certified binary classification. FPA results are reported using both GBDTs [Ke+17] and linear submodels. Median robustness “−∞” denotes that the classification accuracy was less than 50%. Each approach’s hyperparameter setting with the best median robustness is shown in bold. The best overall median robustness is shown in blue. . . . . . . . . . . . . . . . 192 27 Table Page C.25. CIFAR10 (d = 1024) certified accuracy for feature partition aggregation (FPA) and baseline randomized ablation (RA). “Plurality” denotes FPA with plurality voting as the decision function while “Run-Off” denotes FPA using run-off elections as the decision function. “[LF20b]” denotes Levine and Feizi’s [LF20b] original version of RA while “[Jia+22b]” denotes Jia et al.’s [Jia+22b] improved version of RA. We also consider an additional naive baseline that always predicts f(x) = 1. For each certified robustness level, each method’s best performing hyperparameter setting is shown in bold with the overall best performing method shown in blue. . . . . . . . . . . . . . . . . . . . . . . . . . 195 C.26. MNIST (d = 784) certified accuracy for feature partition aggregation (FPA) and baseline randomized ablation (RA). “Plurality” denotes FPA with plurality voting as the decision function while “Run-Off” denotes FPA using run-off elections as the decision function. “[LF20b]” denotes Levine and Feizi’s [LF20b] original version of RA while “[Jia+22b]” denotes Jia et al.’s [Jia+22b] improved version of RA. We also consider an additional naive baseline that always predicts f(x) = 1. For each certified robustness level, each method’s best performing hyperparameter setting is shown in bold with the overall best performing method shown in blue. . . . . . . . . . . . . . . . . . . . . . . . . . 196 C.27. 
Weather [Mal+21] dataset (d = 128) certified accuracy for feature partition aggregation (FPA) and baseline randomized ablation (RA). “[LF20b]” denotes Levine and Feizi’s [LF20b] original version of RA while “[Jia+22b]” denotes Jia et al.’s [Jia+22b] improved version of RA. Hammoudeh and Lowd’s [HL23c] reduction is from certified regression to certified binary classification. Run-off is identical to plurality voting under binary classification, so we report only the plurality voting results below. We also consider an additional naive baseline that always predicts the median training set target value (i.e., f(x) = med{yi}ni=1). For each certified robustness level, each method’s best performing hyperparameter setting is shown in bold with the overall best performing method shown in blue. These numerical results are visualized graphically as envelope plots in Figure C.21. . . . . . . . . . . . . . . . . . . . . . . . . . . 197 28 Table Page C.28. Ames [Coc11] dataset (d = 352) certified accuracy for feature partition aggregation (FPA) and baseline randomized ablation (RA). “[LF20b]” denotes Levine and Feizi’s [LF20b] original version of RA while “[Jia+22b]” denotes Jia et al.’s [Jia+22b] improved version of RA. Hammoudeh and Lowd’s [HL23c] reduction is from certified regression to certified binary classification. Run-off is identical to plurality voting under binary classification, so we report only the plurality voting results below. We also consider an additional naive baseline that always predicts the median training set target value (i.e., f(x) = med{yi}ni=1). For each certified robustness level, each method’s best performing hyperparameter setting is shown in bold with the overall best performing method shown in blue. These numerical results are visualized graphically as envelope plots in Figure C.21. . . . . . . . . . . . . . . . . . . . . . . . . . . 198 C.29. Speech Backdoor Adversarial Set Identification: Mean AUPRC across 30 trials for speech backdoor dataset [Liu+18] with 21 ≤ |Dadv| ≤ 28. GAS(-L) always outperformed the baselines. Bold denotes the best mean performance. Mean results are shown graphically in Figs. 14 and C.22. Variance results appear in the original paper [HL22a, Sec. F.1.1]. . . . . . . . . . . . . . . . . 203 C.30. Speech Backdoor Target Identification: Bold denotes the best mean performance. Mean results are shown graphically in Figures 16 and C.23. Variance results appear in the original paper [HL22a, Sec. F.1.1]. . . . . . . . . . . . . . . . . . . . . 204 C.31. Speech Backdoor Attack Mitigation: Bold denotes the best mean performance with 10 trials per class pair. Aggregated results are shown in Table 6. . . . . . . . . . . . . . . . . . . . . . . 204 C.32. Vision Backdoor Adversarial-Set Identification: Backdoor set, Dadv, identification mean AUPRC across >30 trials for Weber et al.’s [Web+23] three CIFAR10 backdoor attack patterns with a randomly selected reference ẑtarg. All experiments performed binary classification on randomly-initialized ResNet9. |Dadv| = 150. Notation ytarg → yadv. Bold denotes the best mean performance. Mean results are shown graphically in Figures 14 and C.24. Variance results appear in the original paper [HL22a, Sec. F.1.2]. . . . 205 29 Table Page C.33. Vision Backdoor Target Identification: Target identification mean AUPRC across 15 trials for Weber et al.’s [Web+23] three CIFAR10 backdoor attack patterns and randomly selected reference ẑtarg. All experiments performed binary classification on randomly-initialized ResNet9. 
Bold denotes the best mean performance. Mean results are shown graphically in Figures 16 and C.25. Variance results appear in the original paper [HL22a, Sec. F.1.2]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 C.34. Vision Backdoor Attack Mitigation: Bold denotes the best mean performance with 15 trials per setup. Aggregated results are shown in Table 6. . . . . . . . . . . . . . . . . . . . . . . . . 207 C.35. Natural Language Poisoning Adversarial-Set Identification: Poison identification mean AUPRC across 10 trials for 4 positive and 4 negative sentiment SST-2 movie reviews [Soc+13] with |Dadv| = 50. GAS-L perfectly identified all poison in all but one trial. Bold denotes the best mean performance. Mean results are shown graphically in Figures 14 and C.26. Variance results appear in the original paper [HL22a, Sec. F.1.3]. . . . . . . . . . . . . . . 208 C.36. Natural Language Poisoning Target Identification: Bold denotes the best mean performance with 10 trials per review. Mean results are shown graphically in Figures 16 and C.27. Variance results appear in the original paper [HL22a, Sec. F.1.3]. . . . . . . . 210 C.37. Natural Language Poisoning Attack Mitigation: Bold denotes the best mean performance with 10 trials per review. Aggregated results are shown in Table 6. . . . . . . . . . . . . . . . . . . . 210 C.38. Vision Poisoning Adversarial-Set Identification: Adversarial set (Dadv) identification mean AUPRC across >15 trials for four CIFAR10 class pairs with |Dadv| = 50. Our renormalized influence estimators, GAS and GAS-L, using just initial parameters θ0 and with 5 subepoch checkpointing outperformed all baselines for all class pairs. Bold denotes the best mean performance. Mean results are shown graphically in Figure 14 and C.28. Variance results appear in the original paper [HL22a, Sec. F.1.4]. . . . . . . . . . . . 212 C.39. Vision Poisoning Target Identification: Bold denotes the best mean performance with ≥15 trials per class pair. Mean results are shown graphically in Figures 16 and C.29. Variance results appear in the original paper [HL22a, Sec. F.1.4]. . . . . . . . . . . . . . . 213 30 Table Page C.40. Vision Poisoning Attack Mitigation: Bold denotes the best mean performance with ≥15 trials per class pair. Aggregated results are shown in Table 6. . . . . . . . . . . . . . . . . . . . . . . 213 C.41. Effect of joint-optimization hyperparameter β on the attacker’s success rate (ASR). Observe that even at β = 0, the attack success rate is significantly lower than the 77.9% ASR in Table 6 due to the fewer surrogate models that could be used during jointly-optimized poison crafting as explained above. . . . . . . . . . . . . . . . . 217 C.42. Adversarial-Set Identification for the Adaptive Vision Poison Attack: Adversarial-set identification mean AUPRC with ≥10 trials per setup as described in Section C.4. Section 6.6’s baseline results set trade-off hyperparameter β = 0, meaning the poison was not jointly optimized. The jointly optimized results used β = 10−2 as explained in suppl. Section C.4. Bold denotes the best mean performance. Mean results are shown graphically in Figures 17 and C.30.Variance results appear in the original paper [HL22a, Sec. F.2.1]. . . . . . . . . . . . . . . . . . . . . 218 C.43. Target Identification for the Adaptive Vision Poison Attack: Target identification mean AUPRC where Zhu et al.’s [Zhu+19] vision poison attack is jointly optimized with minimizing GAS. 
Section 6.6’s baseline results set trade-off hyperparameter β = 0, meaning the poison was not jointly optimized. The jointly optimized results used β = 10−2 as explained in suppl. Section C.4. Bold denotes the best mean performance with ≥10 trials per class pair. Mean results are shown graphically in Figures 18 and C.31. Variance results appear in the original paper [HL22a, Sec. F.2.2]. . . . 220 C.44. Target-Driven Attack Mitigation for the Adaptive Vision Poison Attack: Algorithm 6’s target-driven data sanitization where Zhu et al.’s [Zhu+19] vision poison attack is jointly optimized with minimizing the GAS influence. The results below consider exclusively the jointly-optimized attack with β = 10−2. Clean-data removal remains low, and test accuracy either improved or stayed the same for in but one setup. The performance is comparable to the results with Zhu et al.’s [Zhu+19]’s standard vision poisoning attack (see Table C.40). Bold denotes the best mean performance with ≥10 trials per class pair. . . . . . . . . . . . . . . . . . . 222 D.45. Target Value Test Distribution Statistics: Mean (ȳ), standard deviation (σy), minimum value (ymin) and maximum value (ymax) for the test instances’ target y value for Sec. 4.8’s five regression datasets. . 225 31 Table Page D.46. Ridge Regression Hyperparameters: Hyperparameter settings for the three datasets that used ridge regression as the ensemble submodel architecture. Hyperparameters are reported for the three q values used in Fig. 7 and Sec. C.1. We also report the hyperparameters for uncertified accuracy when q = 1. . . . . . . . . 228 D.47. XGBoost Hyperparameters: Hyperparameter settings for the three datasets that used XGBoost as the ensemble submodel architecture. Hyperparameters are reported for the three q values used in Fig. 7 and Sec. C.1. We also report the hyperparameters for uncertified accuracy when q = 1. . . . . . . . . . . . . . . . . . 229 D.48. Evaluation dataset information . . . . . . . . . . . . . . . . . . 231 D.49. Target Value Test Distribution Statistics: Mean (ȳ), standard deviation (σy), minimum value (ymin) and maximum value (ymax) for the test instances’ target y value for regression datasets Weather and Ames. . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 D.50. ResNet9 neural network architecture . . . . . . . . . . . . . . . . 232 D.51. Network-in-Network neural network architecture . . . . . . . . . . . 233 D.52. FPA’s neural network training hyperparameters . . . . . . . . . . . 234 D.53. Regression datasets LightGBM submodel training hyperparameters . . 234 D.54. Regression datasets linear submodel training hyperparameters . . . . . 235 D.55. SST-2 movie reviews selected by Wallace et al.’s [Wal+21] poisoning attack implementation. . . . . . . . . . . . . . . . . . . . . . . 236 D.56. Chapter 6 target identification dataset sizes . . . . . . . . . . . . . 237 D.57. Number of backdoor training examples for each speech backdoor digit pair. As detailed above, Liu et al.’s [Liu+18] dataset provides 30 backdoored instances for each digit pair. The remainder of the 30 instances for each digit pair are part of the fixed, validation set. . . . 237 D.58. Target and non-target set sizes used in Section 6.5.3’s target identification experiments. . . . . . . . . . . . . . . . . . . . . 238 D.59. Renormalized influence model training hyperparameter settings . . . . 238 D.60. Training-set attack model training hyperparameter settings . . . . . . 239 D.61. Upper-tail heaviness cutoff count (κ) . . . . . . . . . . 
. . . . . . 239
D.62. Target-driven attack mitigation hyperparameters . . . . . . 240
D.63. CIFAR10 vision backdoor adversarial trigger maximum ℓ2-norm perturbation distance . . . . . . 241
D.64. Convex polytope poison crafting [Zhu+19] hyperparameter settings . . . . . . 242
D.65. SST-2 sentiment analysis poison crafting hyperparameter settings. These are identical to Wallace et al.'s [Wal+21] hyperparameter settings . . . . . . 242
D.66. Influence functions hyperparameter settings . . . . . . 243
D.67. Simplified ResNet9 neural network architecture used for Sec. 6.5's CIFAR10 binary classification . . . . . . 245
D.68. Speech recognition convolutional neural network . . . . . . 246

CHAPTER 1
INTRODUCTION

Machine learning systems are increasingly being applied in domains critical to human safety and well-being [HL22b; Awa+18]. Simultaneously, algorithmic decisions are increasingly black boxes where the factors that led to a model prediction are not human interpretable – in particular for neural networks. Exacerbating this is neural networks' propensity to learn spurious correlations or "shortcuts" [DAm+20; Gei+20]. Robust machine learning models are urgently needed because when (not if) today's brittle models fail, society will have to carry the burden of that failure.

Spurious correlations occur naturally in most training data [Ily+19] and are often non-malicious [Fel20]. For example, model misbehavior can arise due to training outliers drawn from the tails of the data distributions. Similarly, measurement or labeling noise can also introduce spurious relationships into the training set. Rather than focusing on these disparate causes of spurious training data, this dissertation focuses on adversarial attacks, where an adversary introduces or exploits worst-case spurious correlations [Car+23]. Making a model robust against worst-case training modifications simultaneously makes the model robust against less severe (e.g., benign) training set issues, including those mentioned above.

Today's models are highly susceptible to numerous different types of adversarial attacks [Li+22; LXL23; Kum+20]. However, defenses against adversarial attacks remain relatively primitive and "lack fundamental security rigor" [Kum+20]. The current defense landscape has even been likened to "crypto Pre-Shannon" [Car19].

This dissertation focuses on two related types of adversarial attacks. First, poisoning attacks manipulate model predictions on pristine or "natural" test instances by adversarially modifying the training set. Second, backdoor attacks manipulate predictions by combining perturbations to the training set with perturbations to the test set. A recent survey of governmental and corporate organizations [Kum+20] found that poisoning and backdoor attacks were the first and third biggest ML security concerns, respectively, due to previously successful attacks [Lee16; Mur16]. Where applicable, this dissertation also considers evasion attacks, which adversarially perturb only test instances.

This dissertation proposes three novel defenses to make ML systems more robust against poisoning and backdoor attacks. We focus on two primary strategies to stop an attacker. We first consider two certified defenses, which provide guaranteed robustness given a specific threat model – i.e., a definition of the attacker's capabilities.
We then propose a forensic defense that provides insights into the identity, goals, and methods of an attacker so as to stop the attack via intervention outside the ML system. In practice, these two types of defenses are complementary and can be deployed together to enhance their effectiveness.

Below we briefly summarize this dissertation's primary contributions. Chapters 4 to 6 each provide a more detailed list of the chapter's corresponding contributions.

– A reduction from certified regression to certified classification (Chapter 4). Our reduction allows regression tasks to directly reuse methods that were previously used only for classification.

– A unified, certified defense against sparse (ℓ0) poisoning, backdoor, and evasion attacks (Chapter 5). To the extent of our knowledge, our method is the first to provide non-trivial guarantees over this union of attack types – ℓ0 or otherwise.

– A defense that simultaneously identifies the target(s) of poisoning and backdoor attacks while also mitigating the attack (Chapter 6).

Footnote 1: A "sparse" or ℓ0 attacker arbitrarily controls an unknown subset of the training and/or test features [Sch+19; LF21; Jia+22b].

Note that all proofs appear in the appendix (Chapter B). Before detailing our specific contributions, Chapter 2 first reviews general nomenclature. Chapter 3 then reviews related defenses against adversarial attacks. Note that Chapters 3, 4, 5, and 6 as well as Appendices B, C, and D contain published and unpublished material coauthored with Daniel Lowd.

CHAPTER 2
PRELIMINARIES

This chapter introduces our primary nomenclature and includes a brief discussion of the attacker threat models and defender objectives.

2.1 Nomenclature

In cases where a specific chapter uses specialized nomenclature, the custom notation is introduced at the beginning of that chapter. See Chapter A in the appendix for a full nomenclature reference.

Let [a] denote the integer set {1, . . . , a}, and let 2^[a] denote the corresponding power set. 1[q] is the indicator function, which equals 1 if predicate q is true and 0 otherwise. Let H(a) := Σ_{i=1}^{a} 1/i denote the a-th harmonic number. Denote (multi)set A's median as med A. In cases where A's cardinality is even, the median is the midpoint between A's (|A|/2)-th and (|A|/2 + 1)-th largest values.

Let x ∈ X ⊆ R^d denote a feature vector, where d := |x| denotes the feature dimension. The feature set is denoted [d]. y ∈ Y ⊆ R denotes a dependent target value; we consider both discrete and continuous target values. Let Z := X × Y denote the instance space. Training set D := {zi}_{i=1}^{n} ⊂ Z consists of n instances, where the i-th training instance is the tuple zi := (xi, yi). Model f : X → Y is trained on D. Given an arbitrary test instance (xte, yte), the model prediction is denoted ŷte := f(xte).

This dissertation considers both ensemble and singleton models. We introduce the notation for these two types of models below.

Singleton Model We consider both parametric and non-parametric singleton models. In the case of parametric models, let θ ∈ R^p denote f's model parameters. Let f(x; θ) denote a parameterized prediction for x ∈ X. We usually drop parameter vector θ for brevity and to enhance readability.

Parametric models are trained using any iterative, first-order optimization algorithm (e.g., gradient descent, Adam [KB15]). Let L : Y × Y → R≥0 denote the loss function. Denote instance z's empirical risk w.r.t. θ as L(z; θ) := L(f(x; θ), y). θ0 denotes f's initial parameters, with θ0 randomly set and/or pre-trained.
During each training iteration t ∈ {1, . . . , T}, the optimizer updates parameters θt using loss L, the previous parameters θt−1, and batch Bt ⊆ D, where b := |Bt|. Gradients are denoted gi^(t) := ∇_θ L(zi; θt); the gradient's superscript "(t)" is dropped when the iteration is clear from context.

Ensemble Model In cases where f is an ensemble, let L denote the number of submodels, where fl : X → Y is the l-th submodel (l ∈ [L]). Ensemble submodels are deterministic, meaning that given a fixed submodel training set and xte, submodel prediction fl(xte) is always the same. A decision function aggregates the L submodel predictions to form ensemble f's overall prediction; f's decision function is voting-based. Let

ċy(xte) := Σ_{l=1}^{L} 1[fl(xte) = y]   (2.1)

be the number of ensemble submodels that predict label y ∈ Y for xte ∈ X. The plurality label for xte is defined as

ypl = argmax_{y∈Y} ċy(xte).   (2.2)

The runner-up label (i.e., the label that receives the second-most votes) is defined as

yru = argmax_{y∈Y∖{ypl}} ċy(xte).   (2.3)

All ties are broken by selecting the smallest class indices.

2.2 On Attacker Threat Models

A threat model defines the assumptions made regarding an attacker's capabilities. Chapters 4, 5, and 6 detail this dissertation's primary theoretical contributions. Each chapter considers a different poisoning or backdoor threat model, and we discuss the corresponding threat model towards the beginning of each of these three chapters.

2.3 On the Defender Objectives

Differences in the attacker's threat model lead to differences in the defender's objective(s). At the beginning of Chapters 4, 5, and 6, we detail the chapter's corresponding defender objective.

CHAPTER 3
RELATED WORK

This chapter draws on previously published, coauthored material [HL22a; HL23c; HL23a]. Hammoudeh wrote this entire chapter, including adding new related work that did not appear in the coauthored material. Lowd provided supervision and editorial suggestions in the original papers [HL22a; HL23c].

Zayd Hammoudeh and Daniel Lowd. "Identifying a Training-Set Attack's Target Using Renormalized Influence Estimation". In: Proceedings of the 29th ACM SIGSAC Conference on Computer and Communications Security. CCS'22. Los Angeles, CA: Association for Computing Machinery, 2022. url: https://arxiv.org/abs/2201.10055

Zayd Hammoudeh and Daniel Lowd. "Reducing Certified Regression to Certified Classification for General Poisoning Attacks". In: Proceedings of the 1st IEEE Conference on Secure and Trustworthy Machine Learning. SaTML'23. 2023. url: https://arxiv.org/abs/2208.13904

Zayd Hammoudeh and Daniel Lowd. "Feature Partition Aggregation: A Fast Certified Defense Against a Union of ℓ0 Attacks". In: Proceedings of the 2nd ICML Workshop on New Frontiers in Adversarial Machine Learning. AdvML-Frontiers'23. 2023. url: https://arxiv.org/abs/2302.11628

Defenses against adversarial attacks partition into two broad categories. First, empirical defenses derive from understandings and observations about the underlying mechanisms adversarial attacks exploit to change a network's predictions. Empirical defenses provide no guarantees of their effectiveness, and adversaries can adapt their attacks to bypass the defense – often with very little effort [Tra+20]. In contrast, certified defenses provide formal guarantees on their effectiveness given a specific set of assumptions (i.e., threat model). These two defense categories are complementary and can be deployed together for better performance.
The threat model defines the types of attacks against which the defense is directed. Generally, most adversarial defenses target a single type of attack, e.g., poisoning, backdoor, or evasion. Very few defenses – certified or empirical – are robust across attack types [Web+23; HL23a]. We therefore organize the discussion of related adversarial defenses below based on the type of attack they consider.

3.1 Defenses Against Evasion Attacks

Recall from Chapter 1 that evasion attacks manipulate model predictions by perturbing test instances. Formally, given some test instance (xte, yte), the adversary attempts to find some perturbation δ ∈ B such that yte ≠ f(xte + δ), where B ⊂ X defines the perturbation neighborhood. Often B is constrained to limit the perturbation's perceptibility. Perturbation δ is typically constructed iteratively using either the target model or a surrogate. Common perturbation optimization algorithms include the fast gradient sign method (FGSM) [GSS15] and projected gradient descent (PGD) [Mad+18].

What constitutes an imperceptible adversarial perturbation is subjective and implicitly human-centric [LSF21]. For continuous domains like vision, the perturbation neighborhood is usually constrained based on some ℓp norm, where p ∈ N ∪ {∞}. Formally, an ℓp-bounded attack defines the Bp,ϵ(xte) neighborhood w.r.t. xte ∈ X as

Bp,ϵ(xte) := {x − xte : ∥x − xte∥p ≤ ϵ},   (3.1)

where ϵ ∈ R is the perturbation radius [LXL23]. Chapter 5 considers a sparse, or ℓ0, attacker that arbitrarily controls an unknown subset of the features. For discrete inputs (e.g., text), an ℓp perturbation model generally does not apply; instead, text perturbation models focus on preserving syntactic structure and semantic meaning [Ebr+18; Jin+20].

Below we discuss empirical and certified defenses against evasion attacks.

3.1.1 Empirical Defenses. Empirical evasion defenses can be broadly categorized into two classes: adversarial training and gradient obfuscation methods. We discuss both of these defense strategies below.

First, adversarial training is perhaps the best-known method to improve a model's robustness against adversarial attacks [Bai+21]. Formally, empirical risk minimization (without regularization) defines the optimal model parameters as

θ* := argmin_θ Σ_{(xi,yi)∈D} L(zi; θ).   (3.2)

Adversarial training instead considers a minimax objective that, ideally, minimizes the empirical risk over worst-case perturbations drawn from B, where the adversarially optimized model parameters are

θ*_adv := argmin_θ Σ_{(xi,yi)∈D} max_{δi∈B} L((xi + δi, yi); θ).   (3.3)

Typically, adversarial training considers ℓp-bounded attacks for one or more definitions of p [TB19; MWK20; SGF22].

Finding the worst-case adversarial perturbation is provably hard [Wal+21]. First-order and second-order gradient-based methods often only approximate Eq. (3.3)'s inner maximization [LXL23]. While adversarial training generally improves robustness, the improvement is generally not verifiable against worst-case perturbations, meaning adversarial training is only an empirical defense.

Gradient Obfuscation Methods The second class of empirical evasion defenses (implicitly) exploits the reality that, in practice, most adversarial defenses are evaluated against gradient-based attacks (e.g., FGSM, PGD) [KGB16; CW17; Mad+18]. These defenses [Buc+18; Ma+18; Guo+18; Dhi+18; XZZ20] (unknowingly) rely on obfuscated gradients [ACW18], which reduce the utility of the gradients used to construct the adversarial example.
Without useful gradients, traditional iterative, gradient- based attacks fail. However, obfuscation-based defenses provide a false sense of security [ACW18] since they are easily bypassed by an adaptive attacker [TB19]. The brittleness of empirical evasion defenses was a primary impetus that spurred the emergence of certified defenses, which we discuss next. 3.1.2 Certified Evasion Defenses. Recall that a certified defense provides provable robustness guarantees under a specific threat model. Numerous orthogonal strategies to certify evasion robustness have been proposed. For example, some methods bound a network’s curvature or Lipschitz constant to prove the impossibility of an adversarial example under the threat model [Wen+18; LLP20]. Other methods use a linear relaxation of a ReLU network to prove that a prediction is robust w.r.t. Bp,ϵ(xte) [Gow+19]. A complete review of certified evasion defenses is well beyond the scope of this work. For a detailed review of existing certified evasion defenses including a taxonomy of the methods, we refer the reader to the excellent survey by Li et al. [LXL23]. The subclass of certified evasion defenses most relevant to this work are based on randomized smoothing [Li+19; CRK19; Léc+19]. Formally, given some xte, smoothing- based methods create a smoothed classifier f̃ whereby prediction f̃(xte) is the most probable label within some predefined region around xte. For example, for ℓ2 smoothing, the evaluated region is based on isotropic Gaussian N (xte, σ2I), where σ > 0 is a user-specified hyperparameter. In most situations, it is intractable for randomized smoothing to measure each class’s probability exactly; instead, class probabilities 43 are estimated using Monte Carlo methods [CRK19]. Smoothing-based methods then commonly use the Neyman-Pearson lemma [NP33] to bound the certified radius. In terms of smoothing-based evasion defenses, the method most relevant to this dissertation is randomized ablation (RA) – a specialized form of randomized smoothing [CRK19] for sparse (ℓ0) evasion attacks 1 where the adversary arbitrarily controls an unknown subset of the features [LF20b]. RA creates a smoothed classifier by repeatedly evaluating different ablated inputs, each of which keeps a small random subset of the features unchanged and masks outs (ablates) all other features. Randomized smoothing certifies ℓ0-norm robustness, where the attacker arbitrarily controls a subset of xte’s features. 3.2 Defenses Against Poisoning and Backdoor Attacks Evasion attacks are just one means to manipulate a model’s predictions. Poisoning and backdoor attacks are an alternative approach whereby an attacker manipulates predictions by modifying the training set. Poisoning and backdoor attacks differ only in that the latter allows test instance (xte) perturbations while the former considers pristine test instances. Poisoning attacks can be subclassified based on their effect on the model. An indiscriminate or availability poisoning attack degrades the model’s overall performance, e.g., accuracy [BNL12; Xia+15; Fow+21]. In contrast, targeted attacks seek to manipulate an ML system’s prediction on specific target instances [YHL23a; YHL23b]. A single-target attack seeks to influence one specific test prediction [Che+17], while multitarget attacks manipulate multiple test predictions – usually with some shared property (e.g., instances related to a specific individual or company) [Jag+21; 1Sec. 5.2 formalizes ℓ0-norm robustness. 44 Lin+20]. 
Generally, all backdoor attacks are multitarget, where the backdoor trigger can be added to any (related) test instance. Below we describe both empirical and certified defenses for poisoning attacks. In practice, these two defense categories are complementary and can be deployed together for better performance. 3.2.1 Empirical Classification Defenses. Like empirical evasion defenses, empirical defenses against backdoor and poisoning attacks provide no guarantees of their effectiveness. Empirical poisoning and backdoor defenses are generally founded on insights into the mechanisms and characteristics of specific attacks. As such, most existing empirical poisoning and backdoor defenses assume highly restricted threat models, including specific data modalities (e.g., only vision [Gao+19; Ude+19; VB20; Zhu+21]), model architectures (e.g, CNNs [Kol+19]), optimizers [HNM19], or training paradigms [Sor+20]. Below we review a few common categories of empirical poisoning and backdoor defenses. For a more comprehensive review, we direct the reader to the survey by Li et al. [Li+22]. Sanitization-Based Defenses These methods seek to identify and remove (i.e., sanitize) the adversarially perturbed instances in the training set. Sanitization- based methods all generally follow the same paradigm. They first identify training instances meeting some “outlier” criteria. The outliers are then removed from training set D, and the model retrained [Per+20]. For example, Tran et al. [TLM18] found that backdoor attacks tend to leave a “spectral signature,” i.e., a detectable trace in the spectrum of the covariance of the feature representation. Tran et al. score and filter training instances based on their variance from typical feature representations. Similarly, Chen et al. [Che+19] cluster training instances based on the principal 45 components of the last linear layer; any training instances that are far away in feature space from other instances with the same label are sanitized from the training set. A primary limitation of sanitization-based defenses is determining how many training instances to remove. Oversanitization results in excess removal of clean training instances, degrading the model’s clean performance. Undersanitization means that the attack may still succeed, albeit at a lower rate. Model Disinfectant Defenses Rather than cleaning the training set, model disinfectant defenses try to directly repair the poisoned model itself. For example, Liu et al. [LXS17] neutralize any corrupted model weights by finetuning the model on known-clean data hoping that any corruption is deactivated through deliberate catastrophic forgetting. Another common disinfectant strategy relies on the insight that poisoning and backdoor attacks activate rarely-used neurons. Pruning-based defenses identify and disable these “unimportant” neurons expecting this will stop an attack. Trigger Synthesis Defenses These methods seek to reconstruct any backdoor trigger(s) a model learned [Gao+19; Ude+19; VB20; Zhu+21]. The identified triggers are added to known-clean data and the model retrained in the expectation catastrophic forgetting deactivates the trigger. Note that trigger-synthesis defenses are specific to backdoor attacks since poisoning attacks are triggerless. 3.2.2 Certified Pointwise Classifiers. Recent years have seen a marked shift away from empirical poisoning and backdoor defenses to certified methods [SKL17; JCG21; Web+23; WLF22b; Rez+23]. 
Certified defenses differ concerning the assumptions they make about the attacker's ability to perturb the training set. For example, Rosenfeld et al. [Ros+20] consider an attack that is only able to perturb (i.e., "flip") training labels. Weber et al. [Web+23] propose an alternate threat model that bounds the total ℓ2 perturbation distance of the training set.

The instance-wise poisoning threat model allows the attacker to arbitrarily insert or delete entire instances in the training set; defenses under this threat model provide pointwise guarantees, i.e., they certify robustness w.r.t. individual predictions. These general-purpose certified poisoning classifiers are voting-based and derive their guarantees by lower bounding the number of training set modifications required to flip the predicted label. The primary difference between these certified classifiers is the mechanism used to generate the "votes" multiset. We briefly review a few of these methods below.

Jia et al.'s [Jia+22a] certified poisoning classifier based on nearest-neighbor methods is the simplest, where the multiset of "votes" is the training labels from the test instance's neighborhood. Formally, let N(xte) denote the multiset of labels for the k training instances nearest to xte. Given plurality label ypl = f(xte), the pointwise certified poisoning robustness is

R = ⌈( Σ_{y∈N(xte)} 1[ypl = y] − Σ_{y∈N(xte)} 1[yru = y] + 1[ypl > yru] ) / 2⌉ − 1,   (3.4)

where the indicator function breaks ties deterministically by choosing whichever label is assigned the larger index.

The second class of certified poisoning classifiers is ensemble based. Deep partition aggregation (DPA) was the first such method [LF21]. Levine and Feizi [LF21, Thm. 1] specify DPA's certified robustness bound as

R = ⌊( ċ_{ypl}(xte) − (ċ_{yru}(xte) + 1[yru < ypl]) ) / 2⌋,   (3.5)

where the indicator function breaks ties deterministically by choosing whichever label is assigned the smaller index. Implicitly, Eq. (3.5) assumes the worst case that a single perturbation to any submodel's training set can change that submodel's prediction arbitrarily. We formalize this assumption below.

Footnote 2: Observe that, other than how ties are broken, Eqs. (3.4) and (3.5) calculate instance-wise robustness R in functionally the same way.

Def. 3.1. Unit-Cost Assumption: Any modification to a submodel's training set changes the submodel arbitrarily.

In practice, there are limits to how much a single training set modification will alter a submodel and its predictions – in particular for models with strong inductive biases (e.g., linear models). Therefore, the unit-cost assumption's pessimism can cause methods like DPA to underestimate a prediction's true robustness. Nonetheless, this assumption greatly simplifies certifying ensemble classifier robustness by reducing the task to just submodel vote counting.

Multiple improvements to DPA have been proposed. Wang et al. [WLF22a] modify DPA's ensemble so that submodels can be trained on overlapping data, which (slightly) improves the ensemble's certification bounds. More recently, Rezaei et al. [Rez+23] propose run-off elections, a novel decision function for DPA that can improve DPA's certified robustness by several percentage points.

3.2.3 Robust Regression. So far, we have focused exclusively on robust methods for classification, where the label space is finite. Many of the methods proposed above do not (directly) generalize to cases where target space Y is continuous or has unbounded cardinality.
Below we discuss previous methods to improve the robustness of regression. 3.2.3.1 Resilient Regression. Early methods were rooted in robust statistics and focused on mitigating the effect of training set outliers. For example, various trimmed loss functions (e.g., Huber [Hub64], Tukey [BT74]) cap a training outlier’s influence on a model [JW78; Lec89]. Methods like RANSAC [FB81] are based on data sanitization [TZ00; RH11]. 48 3.2.3.2 Certified Regression. The above robust regressors primarily target random noise/outliers. As explained in Chapter 1, adversarial training instances can be much more insidious since they are crafted to avoid detection by appearing uninfluential and may only affect a very small fraction of test predictions [Che+17; Wal+21]. These factors can combine to make adversarial training instances difficult for resilient methods to fully detect and correct [Li+22]. Some existing poisoning and backdoor regression defenses do provide pointwise robustness guarantees, albeit under strong assumptions about the underlying data distribution [KKM18]. For example, some work assumes that the training set follows a linear data distribution with arbitrary white, Gaussian noise [CCM13; Liu+20a]. Others assume the data distribution’s feature matrix is low rank [Liu+17]. Conditioning a guarantee on a specific data distribution is inherently precarious – in particular if the strong distributional assumption rarely holds and cannot be easily verified. If the distributional assumption does not hold, any guarantee is no guarantee at all. Note that there are some poisoning defenses for regression that provide guarantees without making distributional assumptions [Jag+18; KKM18]. However, their robustness guarantees are themselves distributional. For example, Jagielski et al. [Jag+18] bound the clean training data’s mean error but provide no pointwise guarantees. In other words, such methods do not provide insight into each prediction’s robustness. In summary, while certified classification methods has seen numerous promising advances in recent years, certified regression still largely lags behind. Better certified regressors that make fewer strong assumptions are sorely needed. 49 3.3 Defenses Outside the ML System All of the ideas above seek to improve an ML system’s robustness by improving the model itself. A motivated attacker will search for and exploit the weakest point in the ML system, which may not be the model. For example, human failures are often a common cause of security breaches [Col22; Ric22]. The best way to defend an ML system may lie outside of the ML system. For example, email spammers can be stopped by blocking their access to payment processors [Lev+11]. These outside defenses require information about an attacker’s goals, and methods [Tur20]. Knowledge about an attack, including its target, enables forensic and security analysts to reason about an attacker’s identity. Furthermore, insight into an attacker and their motivations helps anticipate future attacks [Pit+09] and build cost-effective, targeted defenses [Aga+19]. 50 CHAPTER 4 REDUCING CERTIFIED REGRESSION TO CERTIFIED CLASSIFICATION This chapter contains previously published, coauthored material [HL23c]. Hammoudeh developed the primary method, developed all code, conducted all experiments, and wrote the manuscript. Lowd provided supervision, editorial suggestions, and proposed some supplemental experiments. Zayd Hammoudeh and Daniel Lowd. 
“Reducing Certified Regression to Certified Classification for General Poisoning Attacks”. In: Proceedings of the 1st IEEE Conference on Secure and Trustworthy Machine Learning. SaTML’23. 2023. url: https://arxiv.org/abs/2208.13904 Section 3.2.3.2 explains that multiple certifiably-robust classifiers have recently been proposed. These methods certify robustness against the insertion and deletion of arbitrary instances in the training set. For example, Levine and Feizi [LF21] propose deep partition aggregation, which uses an ensemble of L deterministic, independent submodels trained on disjoint training sets. In addition, Jia et al. [Jia+22a] propose a certified classifier based on nearest-neighbor classification. Certified regression has not kept pace with these rapid advances in certified classification. Formally, a problem Q is reducible to a different problem Q′ if an efficient algorithm to solve Q′ can also efficiently solve Q [DPV08]. Our key insight is that certified regression is reducible to voting-based certified classification. Mapping certified regression to certified classification requires only minimal changes to the certified classifier’s architecture, with the robustness certification function identical. Given reducibility, an important takeaway is that certified regression can be viewed as no harder than certified classification. 51 Coupling our reduction with existing certified classifiers [Jia+22a; LF21; WLF22a], we propose six new certifiably-robust regressors. To the extent of our knowledge, our methods are the first to provide pointwise regression robustness guarantees against poisoning without both distributional and model assumptions. This chapter’s primary contributions are enumerated below. 1. We formalize three paradigms based on median perturbation to map certified regression to certified classification. All of this chapter’s certified regressors apply one of these paradigms. 2. We propose two provably-robust instance-based regressors – one based on k-nearest neighbors and the other based on all training instances within a feature-space region. 3. We separately propose four ensemble-based certified regressors, where one pair of regressors trains submodels on disjoint data while the other pair allows submodels to be trained on overlapping data. 4. We significantly improve the certification performance of our ensemble-based regressors and existing certified classifiers via a tighter analysis of submodel prediction stability. 5. We demonstrate our methods’ effectiveness on both regression and classification datasets, where we certify significant fractions of the training set and even outperform state-of-the-art certified classifiers on binary classification. 4.1 Preliminaries Below, we formalize this chapter’s threat model and defender objective. We then review certification bounds for regression and how a certified regressor can be reused as certified binary classifier. 52 Threat Model For arbitrary test instance (xte, yte), the adversary’s objective is to alter the model so that the prediction error |f(xte)− yte| is as large as possible. Our primary threat model considers an adversary that can insert arbitrary instances into training set D and arbitrarily delete instances from D.1 The attacker has perfect knowledge of the learner and our method. We make no assumptions about the underlying data distribution or adversarial training instances. 
Footnote 1: Sec. 4.7 considers a somewhat restricted threat model where attackers only make arbitrary deletions but no insertions. This allows us to empirically evaluate our method despite few base models fully utilizing our threat model.

Our Objective Determine certified robustness R – a guarantee on the number of training instances that can be inserted into or deleted from training set D without the model prediction ever violating the requirement that ξl ≤ f(xte) ≤ ξu, where ξl, ξu ∈ R are user-specified and application dependent. Note that robustness R is pointwise, meaning each prediction f(xte) is certified individually.

4.1.1 One-Sided vs. Two-Sided Certification Bounds. For simplicity, the remaining sections exclusively describe how to certify a one-sided upper bound, f(x) ≤ ξ, since all other bounds reduce to this base case. For example, certifying a one-sided lower bound reduces to certifying an upper bound via negation as f(x) ≥ ξ ⇔ −f(x) ≤ −ξ. Likewise, a two-sided bound is equivalent to the worst one-sided robustness:

ξl ≤ f(x) ≤ ξu ⇔ (f(x) ≥ ξl) ∧ (f(x) ≤ ξu) ⇔ (−f(x) ≤ −ξl) ∧ (f(x) ≤ ξu).   (4.1)

4.1.2 Relating Regression and Binary Classification. Binary classification can be viewed as a simple form of regression where Y = {±1}. The model's decision function becomes sgn f(xte), where sgn a = +1 if a > 0 and −1 otherwise. While our primary focus is regression, our methods also achieve state-of-the-art results for binary classification.

4.2 Warmup: Perturbing a Set's Median

Traditional center statistics such as the mean have a breakdown point of 0, i.e., altering a single value in a set can shift the mean arbitrarily. In contrast, the median has maximum robustness, i.e., a breakdown point of 50%. A high breakdown point entails that a statistic is stable and resistant to change. We formalize changes to the median below.

Def. 4.1. Median Perturbation: The task of altering a set's contents so that its median exceeds some specified ξ ∈ R.

Throughout this work, determining pointwise robustness R simplifies to quantifying the number of changes that can be made to a set without perturbing its median. To better foster intuition, we first formalize robustness R w.r.t. perturbing a multiset's median in isolation, unrelated to any model. Later sections apply these ideas to link certified regression and certified classification.

Formally, let V be a multiset of cardinality L := |V|. Denote the subset of elements in V that are at most ξ as Vl := {νl ∈ V : νl ≤ ξ}, and denote its complement Vu := V \ Vl. Below we define three different paradigms that constrain how V is modified. Figure 1 visualizes our first two, unweighted paradigms. Note that Fig. 1's values are reused throughout this chapter, including in Fig. 2 for our third median perturbation paradigm and later in Figs. 4 and 5. In all cases below, consider when med V ≤ ξ, since the degenerate case of med V > ξ is by definition non-robust.

4.2.1 Unweighted Swap Paradigm. Here, set V has fixed, odd-valued cardinality L. All modifications to V take the form of "swaps" where a single value in V is replaced with any real number. Fig. 1b visualizes the unweighted swap paradigm on a simple set V = {2, . . . , 6} of L = 5 values. Lemma 4.2 tightly bounds the number of arbitrary swaps R that can be made to V without perturbing its median.

Lemma 4.2. For ξ ∈ R, real multiset V where med V ≤ ξ with L := |V| odd, and Vl := {νl ∈ V : νl ≤ ξ}, let Ṽ be a multiset formed from V where elements have been arbitrarily replaced.
If the number of elements replaced in Ṽ does not exceed

R = |Vl| − ⌈L/2⌉,   (4.2)

it is guaranteed that med Ṽ ≤ ξ.

Proof sketch. For a set of odd cardinality L, the median is always the set's ⌈L/2⌉-th largest value. For V's median to be at most ξ, at least ⌈L/2⌉ items in V cannot exceed ξ. Each swap reduces the number of elements not exceeding ξ by at most one. If there are |Vl| elements less than or equal to ξ in V and there must be at least ⌈L/2⌉ such elements to avoid perturbing the median, then at most |Vl| − ⌈L/2⌉ swaps can be performed.

Footnote 2: Fixing L as odd simplifies the overall formulation and presentation since it ensures that V's median is always an element in V. In all cases here where L is fixed as odd, L is always a user-selected hyperparameter. Extending our formulation to consider even L is not challenging but is verbose.

4.2.2 Insertion/Deletion Paradigm. For the second paradigm, V is no longer of fixed cardinality (it may expand or contract), and L may be even or odd. Each modification of V takes the form of either a single deletion or a single insertion but not both. Figs. 1c and 1d visualize median perturbation under insertions and deletions, respectively, with certified robustness R following Lem. 4.3. Suppl. Sec. B.1 proves that worst-case insertions and deletions perturb a set's median in exactly the same way and thus are interchangeable. That is why Figs. 1c and 1d have identical certified robustness (R = 2).

Figure 1. Unweighted Median Perturbation: (1a) Initial set V := Vl ⊔ Vu, with blue denoting elements in subset Vl, i.e., elements of V with value at most ξ = 5.4, and red denoting Vu's values. (1b) Unweighted swap paradigm with R = 1; each "swap" switches a value in Vl with an arbitrarily large replacement. (1c) Insertion only with R = 2. (1d) Deletion only with R = 2. Deletions and insertions are interchangeable (suppl. Lemma B.1), with both yielding the same median value in the same number of modifications made to V. In Figs. 1b to 1d, any additional modification to the set would perturb the median.

Lemma 4.3. For ξ ∈ R and real multiset V where med V ≤ ξ, define L := |V| and Vl := {νl ∈ V : νl ≤ ξ}. Let Ṽ be any multiset formed from V where elements have been arbitrarily deleted and/or inserted. Then, if the total number of inserted and deleted elements in Ṽ does not exceed

R = 2|Vl| − L − 1,   (4.3)

it is guaranteed that med Ṽ ≤ ξ.

Comparing Eqs. (4.2) and (4.3), the insertion/deletion paradigm's robustness R is about twice that of the unweighted swap paradigm. Intuitively, this is because one swap entails two separate operations – both an insertion and a deletion.
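To make Lemmas 4.2 and 4.3 concrete, the short Python sketch below recomputes Fig. 1's running example (V = {2, . . . , 6}, ξ = 5.4). The function names are illustrative only and not part of any released implementation.

```python
import math

def swap_robustness(votes, xi):
    """Lemma 4.2: max arbitrary swaps that keep med(V) <= xi (|V| odd)."""
    L = len(votes)
    n_low = sum(1 for v in votes if v <= xi)   # |V_l|
    return n_low - math.ceil(L / 2)            # Eq. (4.2)

def insert_delete_robustness(votes, xi):
    """Lemma 4.3: max insertions/deletions that keep med(V) <= xi."""
    L = len(votes)
    n_low = sum(1 for v in votes if v <= xi)   # |V_l|
    return 2 * n_low - L - 1                   # Eq. (4.3)

V, xi = [2, 3, 4, 5, 6], 5.4
print(swap_robustness(V, xi))            # 1, matching Fig. 1b
print(insert_delete_robustness(V, xi))   # 2, matching Figs. 1c and 1d
```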
4.2.3 Weighted Swap Paradigm. The two median-perturbation paradigms above assume that each modification to V has equivalent cost. Consider a generalized swap paradigm where each value νl ∈ V has an associated weight/cost rl ∈ N. We seek to tightly bound the budget an attacker could spend while it remains guaranteed that f(xte) ≤ ξ; we still denote this budget R.

Given Vl as above, Rl := {rl : νl ∈ Vl} contains Vl's corresponding weights/costs. Define ∆ := |Vl| − ⌈L/2⌉, and let multiset R∆ be the ∆ smallest values in Rl (i.e., |R∆| = ∆). Directly applying Lem. 4.2, an obvious but non-optimal bound is

R ≥ Σ_{r∈R∆} r.   (4.4)

Recall Fig. 1a, where V = {2, . . . , 6} and ξ = 5.4. Consider its weighted extension where, for simplicity and w.l.o.g., R = {3, . . . , 7}, i.e., ∀l rl = νl + 1. Eq. (4.4) certifies robustness R = 3 for this example. However, Fig. 2b shows R = 6 since the budget of the second- (i.e., (∆ + 1)-th-) smallest value in Rl can be partially used. Lemma 4.4 formalizes this insight into a tight bound for median perturbation under weighted swaps.

Figure 2. Weighted Swap Paradigm: Extension of Fig. 1 to weighted costs. (2a) Initial sets where V = {2, . . . , 6} and R = {3, . . . , 7}; Fig. 2a is identical to Fig. 1a except that below each element νl is its corresponding weight rl. Observe ∆ = 1 and R̃l = {3, 4}. (2b) Weighted swap paradigm with R = 6: with a budget of R = 6 it is impossible to perturb the median, and any additional weight would be sufficient to swap out ν2 = 3. Eq. (4.2)'s bound may be non-tight by 1; we did this for consistency with other ideas.

Lemma 4.4. For ξ ∈ R and real multiset V where med V ≤ ξ, let R be V's corresponding integral weight multiset, where L := |V| = |R| is fixed and odd. Define Rl := {rl ∈ R : νl ≤ ξ}, and let R̃l be the smallest (|Vl| − ⌈L/2⌉ + 1) values in Rl. Then the cost to perturb V's median exceeds

R = Σ_{r∈R̃l} r − 1.   (4.5)
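As a sanity check on Lemma 4.4, the sketch below (hypothetical helper name) reproduces Fig. 2's numbers: the naive bound of Eq. (4.4) would certify only 3, while Eq. (4.5) certifies 6.

```python
import math

def weighted_swap_robustness(votes, costs, xi):
    """Lemma 4.4: attacker budget that provably cannot perturb med(V) past xi.

    votes: real-valued multiset V (odd size); costs: per-element weights r_l.
    """
    L = len(votes)
    low_costs = sorted(r for v, r in zip(votes, costs) if v <= xi)   # R_l
    n_keep = len(low_costs) - math.ceil(L / 2) + 1                   # |R~_l|
    return sum(low_costs[:n_keep]) - 1                               # Eq. (4.5)

V, R, xi = [2, 3, 4, 5, 6], [3, 4, 5, 6, 7], 5.4
print(weighted_swap_robustness(V, R, xi))   # 6, matching Fig. 2b
# The naive Eq. (4.4) bound sums only the Delta = 1 smallest cost, i.e., 3.
```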
4.3 Reducing Regression to Voting-Based Binary Classification

We now show how methods used to certify binary classification can be adapted to certify regression. During inference, all voting-based certified methods (both classifiers and regressors) follow the same basic procedure. First, the model generates a multiset of votes, which for binary classification we denote V±1. Certified classifiers differ only in how V±1 is constructed and in the consequences that construction has on certifying R. For example, V±1 could be a kNN neighborhood or the submodel predictions of an ensemble. Nonetheless, for binary classification, V±1 contains at most two unique values (+1 and −1), meaning V±1's majority label is also its median. In other words, f(xte) = med V±1.

Figure 3. Certified Regression to Certified Classification Reduction: For xte ∈ X, the decision function is f(xte) := med V – just like voting-based certified classification. Certified regression binarizes V into V±1, which is used by the robustness certifier (optionally with weights R) to determine R.

To certify robustness R, existing methods rely on a function we term the robustness certifier. The function's inputs are the votes V±1 and, optionally, weights/costs R. Implicitly, the certifier knows how the votes were generated and how changes to training set D could affect V±1. Generally, a simple procedure to construct V±1 entails a simple certifier, and complex construction implies a complex certifier.

Fundamentally, for voting-based binary classification, robustness certification always reduces to the same core idea. If f(xte) = med V±1, then for the runner-up label to overtake the majority label, V±1's median must be perturbed. Therefore, certifying voting-based binary classification is simply certifying median perturbation.

To generalize a voting-based certified classifier to certify regression, two primary modifications are required; we visualize our regression to classification reduction in Fig. 3. First, the model is modified from generating binary votes V±1 to generating real-valued ones, denoted V. The changes necessary to make this switch are specific to the underlying certified classifier. In some cases, no change is required [Jia+22a]; for others, ensemble submodel classifiers are simply replaced with submodel regressors [LF21; WLF22a].

The second modification is more subtle. If V is real-valued, a robustness certifier expecting binary votes cannot be directly applied. That is where ξ ∈ R fits in; it partitions V into two subsets: Vl containing all "votes" at most ξ and Vu containing all "votes" exceeding ξ. We can think of these subsets as two different classes where, if f(xte) ≤ ξ, Vl is the majority class and Vu the runner-up. For any prediction f(xte) := med V, the robustness certifier's output R equals the number of training set modifications that can be made without ever perturbing med V beyond ξ. Lemma 4.5 formalizes the connection between real-valued and binarized robustness. This symmetry in robustness derives from both tasks' (implicit) shared reliance on the median.

Lemma 4.5. For ξ′ ∈ R and real multiset V′ where med V′ ≤ ξ′, let

V±1 := {sgn(νl − ξ′) : νl ∈ V′}.   (4.6)

Let R be the corresponding integral weight multiset of V′, where |V′| = |R|. Then, under the (un)weighted swap and insertion/deletion paradigms, both V = V′ with ξ = ξ′ and V = V±1 with ξ = 0 have equivalent robustness R.

By binarizing V, Lem. 4.5 enables us to directly reuse robustness certifiers from binary classification to certify regression. Our reduction to certified classification entails two primary benefits. First, it allows us to repurpose for regression the diverse set of voting-based certified classifiers that already exist [Jia+22a; LF21; WLF22a]. Moreover, as new voting-based certified classifiers are proposed in the future, these yet undiscovered methods can also be reformulated as certified regressors. Although this work focuses on certified poisoning defenses, other types of certified defenses also rely on voting-based schemes, including randomized smoothing methods for evasion attacks [LF20b; Jia+22b]. Our certified regression to certified classification reduction can be applied to these other types of voting-based defenses as well.
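The reduction is mechanical enough to state in a few lines of Python. The sketch below (illustrative names, reusing Fig. 1's running values) binarizes real-valued votes per Eq. (4.6) and hands them to an unweighted-swap certifier with ξ = 0, recovering the same R = 1 that Lemma 4.2 gives on the raw votes.

```python
import math

def sign(a):
    """sgn per Sec. 4.1.2: +1 if a > 0, otherwise -1."""
    return 1 if a > 0 else -1

def binary_swap_certifier(binary_votes):
    """Unweighted-swap certifier for binary votes with threshold xi = 0 (Lemma 4.2)."""
    L = len(binary_votes)
    n_low = sum(1 for v in binary_votes if v <= 0)
    return n_low - math.ceil(L / 2)

V, xi = [2, 3, 4, 5, 6], 5.4            # real-valued "votes" and threshold
V_pm1 = [sign(v - xi) for v in V]       # Eq. (4.6) binarization
print(binary_swap_certifier(V_pm1))     # 1 -- same R as Lemma 4.2 on the raw votes
```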
As mentioned above, the procedure to construct the set of votes and to certify robustness is unique to each classifier. The next three sections describe how to certify regression using progressively more complex models, with each method based on a reduction to an existing voting-based certified classifier.

4.4 Certified Instance-Based Regression

For the first method, recall from Sec. 3.2.2 that Jia et al. [Jia+22a] propose a state-of-the-art certified classifier based on kNN. Nearest-neighbor methods are a specific type of instance-based learner (IBL), where predictions are made using memorized training instances [AKA91]. IBLs generally rely on the intuition that instances close together in feature space (X) have similar target values (Y). Specifically, IBLs search for the stored training instances most similar to xte and derive the model prediction from these retrieved neighbors. We partition IBLs into two subcategories:

– Fixed-population neighborhood methods specify the exact number of "neighbors" used when making a prediction.

– Region-based neighborhood methods define a neighborhood as all training instances in a specific feature-space region.

These two subcategories calculate certified robustness differently and are discussed separately below.

All IBLs considered here use the same decision rule. Formally, given xte ∈ X and the real multiset neighborhood N(xte) returned by the IBL, the model's prediction is the neighborhood's median, i.e., f(xte) := med N(xte). Recall that our goal is to certify that, if at most R arbitrary insertions or deletions are made to D, it is guaranteed that f(xte) ≤ ξ.

4.4.1 Fixed-Population Neighborhood. As the name indicates, fixed-population neighborhood IBLs make predictions using a fixed number of training instances, i.e., ∀xte L = |N(xte)|. k-nearest neighbors is perhaps the best-known fixed-population method. Traditionally, kNN returns the neighborhood's mean value. For clarity, we refer to the version of kNN that uses the neighborhood's median value as k-Nearest Neighbors Median, or simply kNN-m.

Our threat model allows the adversary to insert arbitrary training instances and/or delete any existing instances. Fig. 4b visualizes an example attack on a kNN-m regressor. Since k is fixed, inserting a new instance into the neighborhood causes one existing neighborhood instance to be ejected; in other words, insertions are simply instance swaps. As a worst case, we assume that the ejected element is at most threshold ξ, meaning each insertion always maximally increases the neighborhood's median. Under this simplifying assumption, adversarial insertions are always at least as harmful as deletions for fixed-population neighborhood IBLs.

Neighborhood size k is a user-specified hyperparameter, so let k be odd-valued. Therefore, these fixed-population neighborhood IBL regressors satisfy all of the criteria of median perturbation under the unweighted swap paradigm where L = k. Theorem 4.6 then follows directly from Lemma 4.2.

Figure 4. Certified Instance-Based Regression: (4a) An unperturbed IBL model; test instance xte's neighborhood is visualized as a dashed line, with neighborhood N(xte) identical to V in Fig. 1a. (4b) An attack on a kNN-m model (R = 1), where the neighborhood's cardinality (L = 5) is fixed and the one attack instance replaces one instance in Vl (source: Fig. 1b). (4c) An rNN-median model (R = 2), where the two inserted instances do not change the neighborhood's radius (source: Fig. 1c).

Theorem 4.6. Let f be an instance-based regressor trained on set D. Given ξ ∈ R and xte ∈ X, let real multiset N(xte) be xte's neighborhood under f with fixed, odd-valued cardinality L := |N(xte)|. Define Vl := {y ∈ N(xte) : y ≤ ξ}. Given f(xte) := med N(xte) ≤ ξ, then if model f is trained on a modified D where the total number of inserted and deleted training instances does not exceed

R = |Vl| − ⌈L/2⌉,   (4.7)

it is guaranteed that f(xte) ≤ ξ.

We denote kNN-m certified regression as kNN-CR. We refer the reader to the supplement of the original paper [HL23c, Lemma 15] for the proof that, under binary classification, kNN-CR and Jia et al.'s [Jia+22a] kNN classifier yield identical robustness guarantees.
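A minimal kNN-CR sketch appears below. The 1-D toy dataset, the distance choice, and the function name are all hypothetical; Theorem 4.6 only requires that the neighborhood have fixed, odd cardinality and that the prediction be the neighborhood median.

```python
import math

def knn_cr_certify(train, x_te, k, xi):
    """kNN-m prediction plus Theorem 4.6's certificate (k odd).

    train: list of (x, y) pairs with scalar features. Returns the median
    prediction and the number of arbitrary insertions/deletions R under
    which f(x_te) <= xi remains guaranteed.
    """
    neighbors = sorted(train, key=lambda z: abs(z[0] - x_te))[:k]
    labels = sorted(y for _, y in neighbors)
    prediction = labels[k // 2]                       # neighborhood median
    n_low = sum(1 for y in labels if y <= xi)         # |V_l|
    return prediction, n_low - math.ceil(k / 2)       # Eq. (4.7)

# Hypothetical 1-D training data (feature, target)
train = [(0.1, 2.0), (0.2, 3.0), (0.4, 4.0), (0.5, 5.0), (0.7, 6.0), (2.0, 9.0)]
print(knn_cr_certify(train, x_te=0.3, k=5, xi=5.4))   # (4.0, 1)
```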
4.4.2 Region-Based Neighborhood. Neighborhood membership does not need to be tied to the number of neighbors. Rather, a neighborhood can be defined by specific criteria, with all stored training instances satisfying those criteria included in the neighborhood. For instance, radius nearest neighbors (rNN) defines xte's neighborhood as all training instances within a given distance of xte [Ben75]. Alternatively, fully-random decision trees recursively partition the feature space into disjoint regions, and a neighborhood is defined as all instances within the same feature region [GEW06].

Fig. 4c visualizes an attack on an rNN-median learner, where the adversary inserts malicious instances to perturb the median prediction. Unlike fixed-population neighborhoods, the inserted instances do not cause any existing training instances to be ejected. Rather, inserting and deleting training instances are distinct operations. It is easy to see that region-based IBLs with the median as the decision operator follow Sec. 4.2.2's insertion/deletion paradigm. Theorem 4.7 then follows directly from Lemma 4.3.

Theorem 4.7. Let f be an instance-based regressor trained on D that partitions X into disjoint regions. Given xte ∈ X, let real multiset N(xte) be xte's neighborhood under f, where L := |N(xte)|. For ξ ∈ R, define Vl := {y ∈ N(xte) : y ≤ ξ}. If model f is trained on a modified D where the total number of inserted and deleted training instances does not exceed

R = 2|Vl| − L − 1,   (4.8)

it is guaranteed that f(xte) ≤ ξ.

Jia et al. propose an rNN-based certified classifier, with the robustness certifier identical to their kNN method. By using our insertion/deletion paradigm for the robustness certifier instead of Jia et al.'s approach, Eq. (4.8)'s R roughly doubles.

4.4.3 Computational Complexity. Eqs. (4.7) and (4.8) require determining Vl's cardinality, which has complexity O(L). However, constructing neighborhood N(xte) can require scanning the entire training set and has complexity O(n). Therefore, certifying each IBL regression prediction's robustness is in O(n) – the same as Jia et al.'s certified kNN and rNN classifiers.

4.5 Certified Regression for General Models

Instance-based learners lend themselves to robustness certification. However, there are many applications where IBLs perform poorly. This section explores reducing certified regression to a second certified classifier, which allows us to use whichever model architecture has the best performance.

Recall from Sec. 3.2.2 that Levine and Feizi's [LF21] certified classifier, DPA, uses an ensemble trained on partitioned training data. In this section, we first reduce certified regression to certified classification using DPA. We then improve the certification performance of DPA – and, by extension, our certified regressor – by using a tighter, weighted analysis.

All certified regression ensembles we consider have L submodels, denoted f1, . . . , fL, and the ensemble decision function uses the median, i.e., f(xte) := med {f1(xte), . . . , fL(xte)}. Since ensemble size L is always a user-specified hyperparameter, select odd L. For arbitrary xte ∈ X, let V := {fl(xte) : l ∈ [L]}. Our goal remains to determine R – a pointwise guarantee on the total number of training set modifications under which it remains guaranteed that f(xte) ≤ ξ.

4.5.1 Partitioned Certified Regression. Here, the L submodel regressors are fully independent, meaning their training sets are disjoint, and each submodel prediction provides no direct insight into any other submodel's behavior. This simple framework makes no assumptions about the submodel architecture; the submodels may be non-parametric or parametric, deep or shallow, etc. The only requirement is that each submodel returns a deterministic prediction given its training set and feature vector xte.
Levine and Feizi enforce disjoint submodel training sets by using a deterministic function htr to partition training set D into L disjoint blocks, D(1), . . . , D(L). Formally, for all l ∈ [L], submodel fl's training set is Dl = D(l). Since each training instance is assigned to exactly one submodel, any training set modification can only affect one submodel. Under the unit-cost assumption (Def. 3.1), each training set modification changes the corresponding submodel's prediction from fl(xte) to ∞ in the worst case. Thus, perturbing a partitioned ensemble's median prediction follows Sec. 4.2.1's unweighted swap paradigm where, as explained above, each perturbed submodel entails one training set modification.

Via reduction to DPA, Theorem 4.8 directly applies Lemma 4.2 to certify unit-cost, partitioned regression's robustness under arbitrary training set insertions and deletions.

Theorem 4.8. For xte ∈ X, ξ ∈ R, and deterministic function htr that partitions set D into disjoint blocks D(1), . . . , D(L), let f be an ensemble of L submodels where L is odd, and each deterministic submodel fl is trained on block D(l). Define Vl := {fl(xte) : fl(xte) ≤ ξ}. Given f(xte) := med {f1(xte), . . . , fL(xte)} ≤ ξ, if model f is trained on a modified D where the total number of inserted and deleted training instances does not exceed

R = |Vl| − ⌈L/2⌉,   (4.9)

it is guaranteed that f(xte) ≤ ξ.

We denote this disjoint ensemble regressor as partitioned certified regression (PCR). The original paper [HL23c, Lemma 16] proves that, when regression is used for binary classification, PCR and DPA yield identical robustness guarantees (R).
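The sketch below illustrates the PCR pipeline end to end on toy data. The hash-based partition, the mean-predicting submodels, and all names are stand-ins; Theorem 4.8 only needs a deterministic partition htr and deterministic submodels.

```python
import math
import statistics

def fit_mean(block):
    """Toy deterministic submodel: always predict the block's mean target value."""
    mean_y = statistics.mean(y for _, y in block) if block else float("inf")
    return lambda x: mean_y

def pcr_certify(train, x_te, n_blocks, xi, fit):
    """Partitioned certified regression (Theorem 4.8) under the unit-cost assumption.

    train: list of (x, y) pairs; fit: maps a training block to a deterministic
    submodel g with g(x_te) -> real prediction; n_blocks (ensemble size L) is odd.
    """
    # Deterministic partition h_tr: each instance hashes to exactly one block,
    # so a single insertion or deletion affects only one submodel.
    blocks = [[] for _ in range(n_blocks)]
    for z in train:
        blocks[hash(z) % n_blocks].append(z)
    votes = [fit(block)(x_te) for block in blocks]
    prediction = statistics.median(votes)
    n_low = sum(1 for v in votes if v <= xi)              # |V_l|
    return prediction, n_low - math.ceil(n_blocks / 2)    # Eq. (4.9)

# Hypothetical usage on synthetic (feature, target) pairs
train = [(i / 10, float(i % 7)) for i in range(60)]
print(pcr_certify(train, x_te=0.3, n_blocks=5, xi=4.0, fit=fit_mean))
```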
4.5.2 Weighted Partitioned Certified Regression. Levine and Feizi consider only the maximally pessimistic unit-cost assumption. For a feature vector xte, it may take multiple training set insertions/deletions to corrupt a submodel's prediction. For example, Theorems 4.6 and 4.7 prove that IBL predictions are robust to multiple training set modifications.

Fixing the regressor's overall architecture, one obvious way to improve certified robustness R is to improve the robustness certifier. Below, we introduce a tighter analysis of each PCR submodel's pointwise robustness so as to move beyond unit cost. Let rl ∈ N denote the minimum number of insertions/deletions required to change the submodel enough that fl(xte) > ξ. By definition, if fl(xte) > ξ without any training set modifications, rl = 0. When ∃l rl > 1, better certified guarantees are possible through a weighted framework. Theorem 4.9 directly applies Lemma 4.4's weighted swap paradigm to adapt PCR (and DPA) to weighted perturbation costs. We denote this extension weighted partitioned certified regression (W-PCR).

Footnote 3: Certified robustness R is the total number of training set modifications that can be made with it remaining guaranteed that f(xte) ≤ ξ. In contrast, rl is the minimum number of modifications needed to perturb submodel l's prediction enough that fl(xte) > ξ. If Rl were the certified robustness of just submodel l, then rl = Rl + 1. rl's definition here follows related work [Ran+21].

Theorem 4.9. For xte ∈ X, ξ ∈ R, and function htr that partitions set D into disjoint blocks D(1), . . . , D(L), let f be an ensemble of L submodels where L is odd. Each deterministic submodel fl is trained on block D(l) and requires at least rl ∈ Z+ modifications to D(l) for fl(xte) > ξ. For R := {rl : fl(xte) ≤ ξ}, let R̃l be R's smallest |R| − ⌈L/2⌉ + 1 values. Given f(xte) := med {f1(xte), . . . , fL(xte)} ≤ ξ, if model f is trained on a modified D where the total number of inserted and deleted training instances does not exceed

R = Σ_{r∈R̃l} r − 1,   (4.10)

it is guaranteed that f(xte) ≤ ξ.

It can easily be shown that W-PCR always yields certified robustness at least as good as PCR. Although proposed in the context of regression, our weighted formulation also notably improves certified classification, as shown in Sec. 4.8.2.
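Applying Theorem 4.9 is again just Lemma 4.4's certifier run over submodel votes, now weighted by per-submodel costs rl. The sketch and its example values are hypothetical; in practice rl could come, e.g., from applying Theorem 4.6 to an IBL submodel (with rl = Rl + 1 per Footnote 3) or from any other tightened submodel analysis.

```python
import math

def wpcr_certify(votes, costs, xi):
    """W-PCR certificate (Theorem 4.9) for disjointly trained submodels.

    votes: submodel predictions f_l(x_te), odd count; costs: r_l, the minimum
    number of modifications to block D(l) before f_l(x_te) can exceed xi.
    """
    L = len(votes)
    low_costs = sorted(r for v, r in zip(votes, costs) if v <= xi)   # R
    n_keep = len(low_costs) - math.ceil(L / 2) + 1                   # |R~_l|
    return sum(low_costs[:n_keep]) - 1                               # Eq. (4.10)

votes = [2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical submodel predictions
costs = [3, 4, 5, 6, 7]             # hypothetical per-submodel costs r_l
print(wpcr_certify(votes, costs, xi=5.4))   # 6; unit-cost PCR certifies only 1
```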
4.5.3 Computational Complexity. Both PCR and W-PCR require training O(L) models. As established by Lemmas 4.2 and 4.4, the computational complexity of PCR and W-PCR (resp.) to certify each ensemble prediction is O(L) [Blu+73] – the same complexity as DPA.

Footnote 4: Not included in W-PCR's complexity is the time to determine r1, . . . , rL.

4.6 Certified Regression Using Overlapping Training Data

This section reduces certified regression to a third certified classifier, specifically Wang et al.'s [WLF22a] reformulation of DPA where the submodels are trained on overlapping data. This makes the submodels interdependent, meaning one training set modification may alter multiple submodel predictions.

Fig. 5 visualizes an ensemble trained on overlapping training sets. Again, L is the number of submodels. Function htr : Z → [m] still partitions the instance space into m disjoint blocks, where m ≥ L. Following Wang et al., a second deterministic function hf : [m] → 2^[L] maps each training set block to one or more submodel training sets. Formally, submodel fl's training set is Dl := ⊔_{j : l∈hf(j)} D(j). Let d(j) := |hf(j)| denote D(j)'s spread degree, i.e., the number of models that use D(j) during training. Denote the maximum spread degree as dmax := max{d(1), . . . , d(m)}. The ensemble's decision function is still the median submodel prediction.

Footnote 5: In practice, for overlapping certified regression to guarantee better robustness than (W-)PCR, the number of submodels generally must increase by several folds over partitioned regression.

Figure 5. Overlapping Certified Ensemble: Simple visualization of the ensemble architecture for (weighted) overlapping certified regression. Function htr partitions training set D into m = 7 blocks. Function hf defines each of the L = 5 submodel training sets, D1, . . . , D5. The ensemble prediction is the median submodel prediction, i.e., f(xte) := med {f1(xte), . . . , fL(xte)}.

Below, we first consider certified regression on overlapping data under the unit-cost assumption. We then improve overlapping regression by leveraging our weighted reformulation.

4.6.1 Overlapping Certified Regression. Irrespective of whether the submodels are trained on disjoint or overlapping data, under the unit-cost assumption, at least |Vl| − ⌈L/2⌉ submodel predictions must exceed ξ to perturb the ensemble's median. Observe that each submodel training set Dl ⊂ D is composed of one or more dataset blocks. Perturbing any block in Dl is sufficient to perturb the submodel's prediction, with an optimal attacker minimizing the number of training set (block) modifications.

If the goal were to perturb all L submodels, then for an arbitrary block mapping function hf, determining the minimum number of blocks that need to be modified reduces to minimum set cover, which is NP-hard [Sla97a]. Specifically, the set to cover is Tl := {l : fl(xte) ≤ ξ}, i.e., the submodels predicting at most ξ, and the collection of subsets is S := {{j : l ∈ hf(j)} : fl(xte) ≤ ξ}, which contains the dataset blocks each relevant submodel is trained on.

However, recall that for median perturbation under unweighted swaps, we only need to perturb (i.e., cover) |Vl| − ⌈L/2⌉ submodels – not all of them. Therefore, rather than mapping to set cover, our problem reduces to the related problem of partial set cover, where only a constant fraction of the instances (i.e., submodels) need to be covered. For an arbitrary block mapping function hf, Lemma 4.10 below establishes that finding the optimal R here is NP-hard [Sla97b; EK10].

Lemma 4.10. Finding optimal certified robustness R for overlapping certified regression is NP-hard.

Although our problem is NP-hard, it is polynomial-time approximable. Specifically, the approximation uses the famous greedy set-cover algorithm where, in each iteration, the subset (training block D(j)) covering the most remaining elements (submodels) is selected [Chv79; Sla97b]. Let G denote the bound found by this greedy method, and define ∆ := |Vl| − ⌈L/2⌉. Then, for the non-naive case where ∆ ≥ 2,

R ≥ ⌈ G / min{H(dmax), ln ∆ − ln ln ∆ + 3 + ln ln(3/2) − ln(3/2)} ⌉,   (4.11)

where H(dmax) is the dmax-th harmonic number. This bound follows directly from partial set cover approximation factor analysis ([Sla97a, Thm. 4]; [Sla97b, Thm. 3]).

Footnote 6: Eq. (4.11)'s bound is tighter (often significantly so) than the much more famous approximation factor, H(∆), of Johnson [Joh74] and Lovász [Lov75].

Slavík [Sla97a] shows that the difference between this approximation factor's overall lower and upper bound is only roughly 1.1, meaning this general approximation is quite good overall. However, in most cases, the performance advantage of overlapping versus disjoint unit-cost regressors is small enough that the greedy optimality gap wipes out all gains. Instead, we rely on Fig. 6's integer linear program (ILP) to bound R in the overlapping case. This ILP is directly adapted from standard partial set cover, where for unit costs ∀l rl = 1. While the ILP is still NP-hard in the worst case, modern LP solvers often find a (near) optimal solution in reasonable time (e.g., a few seconds) [Gur22]. In cases where finding the true robustness R is computationally expensive, these solvers generally return guaranteed bounds on R that are (much) better than the greedy approximation [Van14]. We refer to this unit-cost, ILP-based approach as overlapping certified regression (OCR).

Footnote 7: Fig. 6 jointly formulates calculating R under unit and weighted costs.

Footnote 8: Sec. 4.8's experiments use a fixed time limit to ensure tractability.
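OCR's certificates come from Fig. 6's ILP; the routine below is only a sketch of the underlying partial-set-cover reduction, producing the greedy quantity G that Eq. (4.11) then divides by the approximation factor. All names and the toy block mapping are hypothetical.

```python
def greedy_partial_cover(low_submodels, block_to_submodels, target):
    """Greedy partial set cover: choose blocks until `target` of the low-voting
    submodels (those with f_l(x_te) <= xi) are covered; returns the number of
    chosen blocks, i.e., G in Eq. (4.11). G only upper bounds the attacker's
    true minimum cost, hence Eq. (4.11) divides by the approximation factor.
    """
    uncovered = set(low_submodels)
    need, used_blocks = target, 0
    while need > 0:
        # Pick the block covering the most still-uncovered low submodels.
        j = max(block_to_submodels,
                key=lambda b: len(block_to_submodels[b] & uncovered))
        gain = len(block_to_submodels[j] & uncovered)
        if gain == 0:
            raise ValueError("coverage target unreachable")
        uncovered -= block_to_submodels[j]
        need -= gain
        used_blocks += 1
    return used_blocks

# Toy overlapping ensemble: L = 5 submodels, m = 7 blocks (cf. Fig. 5).
block_to_submodels = {0: {0, 1}, 1: {1, 2}, 2: {2, 3}, 3: {3, 4},
                      4: {4, 0}, 5: {0, 2}, 6: {1, 3}}
low_submodels = {0, 1, 2, 3, 4}      # all submodels currently vote <= xi
delta = len(low_submodels) - 3       # |V_l| - ceil(L/2) with L = 5
print(greedy_partial_cover(low_submodels, block_to_submodels, target=delta))  # 1
```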
Corollary 4.10.1. Finding the optimal certified robustness R for weighted overlapping certified regression is NP-hard. PSMC is far less studied than (partial) set cover. PSMC is polynomial-time approximable – albeit with worse known bounds than partial set cover. Ran et al. [Ran+21] provide the best-known PSMC bounds; their method is much more complicated than greedy partial set cover and relies on a reduction to another NP-hard problem, minimum densest subcollection. Let G be the solution generated by Ran et al.’s algorithm, then ⌈ ⌉ R ≥ G √ . (4.12) 4 lgLH(dmax) ln∆ + 2 lgL L Like with unweighted overlapping regression, Eq. (4.12)’s approximation factor is large enough that it usually wipes out the performance gains derived from weighted costs. Instead, we use Fig. 6’s ILP to bound R in accordance with Lem. 4.4⌈. I⌉n the ILP, σ = 1 in the weighted case and 0 otherwise. Hence, at least (|Vl| − L + 1)2 subm∑odels must be covered (i.e., perturbed) in the weighted case. Following Eq. (4.5), sum m (j)j=1 ω is decremented by one in the ILP. We refer to this overlapping ILP-based approach as weighted overlapping certified regression (W-OCR). 4.6.3 Computational Cost. See the supplement of the original paper [HL23c, Sec. I.E] for an empirical evaluation and extended discussion of the OCR and W-OCR ILP execution time. 72 ∑m min R = ω(j) − σ (4.13a) j=1 s.t. Tl = {l : fl(xte) ≤ ξ}, (4.13b) rmax = max{rl : l ∈ [L]} (4.13c) σ∑= 1[rmax > 1]⌈ ⌉ (4.13d) δl ≥∑|V L l| − + σ, (Median perturb.) (4.13e) 2 l∈Tl rlδl ≤ ω(j), l ∈ Tl (4.13f) D(j)⊆Dl δl ∈ {0,1}, l ∈ Tl (4.13g) ω(j) ∈ {0, . . . , rmax}, j ∈ [m] (4.13h) Figure 6. Overlapping Certified Regression Integer Linear Program: Adapted from the partial set (multi)cover integer linear program. Calculates certified robustness R for both OCR and W-OCR with indicator variable σ adjusting the program to account for weighted costs. For arbitrary feature vector xte, Tl is the set of submodels that predict f (x ) ≤ ξ. Variable ω(j)l te contains the number of modifications made to training set block D(j). Binary variable δl = 1 if submodel fl has been sufficiently modified for fl(xte) > ξ and 0 otherwise. 4.7 Certifying Any Model Beyond Unit Cost The preceding sections describe the benefits of having more robust ensemble components (i.e., rl > 1) but do not address how to find rl. Apart from IBLs and ensembles, the two methods we focus on in this work, we know of no general method for computing insertion/deletion robustness efficiently. We attribute this scarcity to the task’s difficulty. Nonetheless, we believe this work shows that certification beyond unit cost merits future study. This section explores certifying beyond unit cost from two perspectives. First, we consider the obvious idea of combining IBLs with ensembles and explain why that performs poorly. Next, we propose a simple, general approach to certify any (sub)model beyond unit cost, albeit with a (slightly) more restricted threat model. 73 4.7.1 Combining Instance-Based Learners & Ensembles. The points raised below apply to both fixed-population and region-based IBLs. We exclusively discuss kNN-CR here with the extension to other certified IBLs straightforward. In practice, function htr partitions instance space Z uniformly at random (u.a.r.) into m approximately equal-sized regions. For simplicity and w.l.o.g., consider an ensemble of kNN-CR submodels trained on disjoint subsets where L = m. Let k′ and R denote the neighborhood size and certified robustness (resp.) of a kNN-CR model trained on i.i.d. 
training set D. If D is partitioned u.a.r. to train ′ L kNN-CR submodels each with k ≈ k , then each submodel’s expected robustness L is roughly R . In the bes⌈t c⌉ase for the defender (∀l fl(xte) ≤ ξ), an adversary onlyL needs to perturb at most L submodels. Combining the above with Theorem 4.9 for 2 W-PCR, this kNN-CR ensemble’s expected certified robustness is approximately ⌈ ⌉( ) L R − R R1 = + − 1 < R. (4.14) 2 L 2 2L As n, L → ∞, then by Eq. (4.14), a kNN-CR ensemble’s expected robustness decreases by 50% versus the single kNN-CR model baseline. Intuitively, for ensembles, an adversary only needs to directly attack about half of the submodels and by extension half of the training data. In contrast, when there is only a single kNN-CR model trained on all of D, the adversary must attack the whole training set. 4.7.2 Certifying Non-Unit Costs by Construction. Since IBLs are a poor candidate to marry with ensembles, we need an alternative approach to certify a model’s robustness beyond r = 1. Given the dearth of existing methods (known to us), we fill in the gap and propose a simple, general-purpose method to certify robustness against arbitrary deletions. 74 To be clear, this is a (slightly) restricted version of the full threat model considered so far, which allows arbitrary insertions and deletions. Nonetheless, this restricted threat model still has broad applicability. For example, an adversary may only be able to insert poisoned instances into a training set but not delete clean ones [Che+17; Liu+17; Sha+18; Wal+21; HL22a]. Motivated by Cook and Weisberg’s [CW82] classic case deletion diagnostics, we use a constructive proof to certify a (sub)model’s pointwise robustness under instance deletions. Consider training (n+ 1) deterministic models – one model using full set D = {(xi, yi)}ni=1 and another n models on each of the leave-one-out subsetsD \ (xi, yi) for all i ∈ [n]. If all (n+ 1) trained models make the same prediction (e.g., a value not exceeding ξ for some xte), then by construction, the model trained on all of D has, at minimum, r = 2 for arbitrary deletions. Lemma 4.11 generalizes the above for an arbitrary number of deletions r < n. Lemma 4.11. For xte ∈ X , training set D where 2D is its power set, r ∈ [|D| − 1], and ξ ∈ R, denote a deterministic model trained on subset D ⊆ D as fD. Given ∀ ′D′∈2D |D | < r =⇒ fD\D′(xte) ≤ ξ, then for any D̃ ⊂ D, if fD̃(xte) > ξ then at least r instances from D were deleted in D̃. A strength of Lemma 4.11 is its flexibility; it can be adapted to any model class, including both classifiers and regressors. Its clear limitation is its computational complexity. Computational Complexity : Certifying r > 1 requires training O(n(r−1)) models; this is a one-time, amortized cost. Consider separately the cost to certify each prediction. During inference, the O(n(r−1)) models are checked. While this may be problematic in some cases, it should be contextualized. Recall that Sec. 4.4 explores IBLs like kNN, which have inference 75 complexity O(n). Therefore, our method to certify r = 2 has the same time complexity during inference as kNN. 4.7.3 More Submodels vs. Weighted Costs. Increasing submodel count L and using weighted costs are partially conflicting approaches to increase R. A natural question is which of the two approaches yields better certified robustness. Above, we explain why increasing L is a poor strategy for IBLs. For ensembles, increasing L generally means that each submodel is trained on fewer data. 
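Sec. 4.7.2's constructive certification (Lemma 4.11) can be illustrated with a short sketch. It trains one model per training subset with fewer than r deletions and checks that every such model stays at or below threshold ξ; the `train` callable and helper name are placeholders, not a specific implementation, and r = 2 reproduces the (n + 1)-model check described above.

```python
from itertools import combinations

def certify_deletions(train, D, x_te, xi, r=2):
    """Constructive check of Lemma 4.11 for deletion robustness r.
    `train` maps a list of (x, y) pairs to a deterministic model with a
    .predict(x) method. Returns True iff every model trained with fewer
    than r instances deleted still predicts at most xi on x_te."""
    n = len(D)
    for k in range(r):                            # delete 0, 1, ..., r-1 instances
        for removed in combinations(range(n), k):
            subset = [D[i] for i in range(n) if i not in removed]
            if train(subset).predict(x_te) > xi:
                return False                      # a small deletion already crosses xi
    return True                                   # at least r deletions are needed to cross xi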
As an intuition, consider when ∀l rl = 2. For a unit-cost ensemble to certify equivalent R, submodel count L must about double, and each submodel is trained on 50% fewer data, which can significantly degrade submodel performance. In contrast, weighted costs with r = 2 reduces submodel training set sizes by 1 (Lem. 4.11). By training weighted submodels on much more data, weighted submodels can outperform submodels from ensembles with larger L. This improved submodel performance can in turn improve certified robustness. 4.8 Evaluation This section evaluates our five primary certified regressors: kNN-m certified regression (kNN-CR), partitioned certified regressors PCR & W-PCR as well as overlapping certified regressors OCR & W-OCR. Additional experimental results are in the supplement, including full kNN-CR certification plots (C.1.3). Additional results also appear in the supplemental materials of the original paper [HL23c, Sec. I]. To the extent of our knowledge, we propose the first pointwise certified regression methods that make no assumptions about the test distribution or model architecture. Without a clear baseline, we compare our five methods against each other. As a reference on the clean-data performance, we report each dataset’s “uncertified” (non-robust) accuracy. 76 4.8.1 Experimental Setup. For brevity, most evaluation setup details (e.g., hyperparameters) are deferred to suppl. Sec. D.1 with a brief summary below. For each experiment in this section, at least ten trials were performed. To improve readability, we only report the mean values below with variances in suppl. Sec. C.1. Dataset Configuration Each (sub)model is trained on 1 -th of the training data, q where q ∈ Z>0. For kNN-CR, always q = 1. For our four ensemble-based methods (W-)PCR and (W-)OCR, q can significantly affect the ensemble’s accuracy and best-case certified robustness (R). As such, for each dataset, we report results with three different q values. For all ensembles, function htr partitions training set D u.a.r.9 For our partitioned regressors (W-)PCR, D is split into q blocks, with L = q. For our overlapping regressors (W-)OCR, we followed Wang et al.’s [WLF22a] overlapping certified classifier evaluation. Specifically, D is partitioned into qd blocks u.a.r. All blocks have fixed spread degree d > 1 (see Tab. 1), and hf assigns blocks to submodels at random. Hence, each overlapping ensemble necessarily has L = qd submodels. Submodel Architectures To demonstrate their generality, our ensemble methods use two different submodel architectures, namely ridge regression and XGBoost [CG16] gradient-boosted forests. Model determinism is enforced via a fixed random seed. Below, we report whichever submodel architecture performed the best on a held-out validation set. Evaluation Metric For each test instance (xte, yte), our goal is to determine the largest pointwise certified robustness R that guarantees ξl ≤ f(xte) ≤ ξu. Throughout 9Each dataset’s largest q value maximized the ensembles’ certified robustness (R). For each dataset, we also report small and medium q values. In practice, q should be as small as possible while guaranteeing sufficient robustness given each application’s maximum anticipated poisoning rate. 77 this evaluation, ξl := yte − ξ and ξu := yte + ξ. These bounds are w.r.t. each test example’s true target value yte, not predicted value f(xte). Therefore, a large certified robustness R means that the prediction is both accurate and stable. 
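The evaluation criterion just described can be summarized in a few lines. The sketch below computes the certified-accuracy statistic (defined formally below) from per-instance certified robustness values; all names are illustrative, and ξ may be passed either as a single absolute threshold or as a per-instance array for relative thresholds.

```python
import numpy as np

def certified_accuracy(y_true, y_pred, R, xi, psi):
    """Fraction of test instances whose prediction satisfies the error
    threshold, |f(x_te) - y_te| <= xi, AND whose certified robustness R is
    at least psi. For a relative threshold (e.g., 15% of y_te), pass
    xi = 0.15 * np.abs(y_true)."""
    y_true, y_pred, R = map(np.asarray, (y_true, y_pred, R))
    accurate = np.abs(y_pred - y_true) <= xi
    return float(np.mean(accurate & (R >= psi)))
```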
Here, error threshold ξ may be a specific fraction (e.g., 15%) of each instance’s target value yte or a fixed value for the entire dataset (see Table 1). In practice, the appropriate ξ value is application specific. Our evaluation metric is certified accuracy, which is the fraction of instances with robustness R ≥ ψ for ψ ∈ N. In each trial, we calculated the certified robustness (R) for at least 100 random test instances and report the mean certified accuracy across all trials. See suppl. Sec. C.1 for the certified accuracy variance. Note that existing certified classifiers were previously evaluated using certified accuracy [Jia+22a; LF21; WLF22a] with ξ = 0, i.e., the predicted label must match true label yte. Datasets Our certified regressors are evaluated on six datasets: five regression and one binary classification. Like previous work [BHL23], the datasets are preprocessed where all categorical features are transformed into one-hot encodings. Table 1 summarizes each dataset’s key attributes, including its size, error threshold (ξ), ensemble submodel architecture, etc. A brief description of each dataset is below. Ames [Coc11] and Austin [Pie21] estimate home prices in two American cities. Diamonds [Wic16] predicts a diamond’s price based on attributes such as cut, color, carat, etc. Weather [Mal+21] estimates ground temperature (in degrees Celsius) using date, time, and longitude/latitude information. For computational efficiency, Weather’s size was downsampled by 10× u.a.r. Life [Raj21] estimates life expectancy (in years) using epidemiological and other national statistics. Spambase [Hop+17] is a binary classification dataset where emails are labeled as either spam or ham. Spambase’s positive training prior is 39%. 78 Table 1. Evaluation Dataset Summary: Training set size (n), data dimension, overlapping spread degree (d), error threshold (ξ), and submodel architecture for the six datasets. Error thresholds that are a percentage of each instance’s true target value are denoted X% · y. Alternate ξ values are evaluated in the original paper [HL23c, Fig. 9]. Dataset Size (n) Dim. Deg. (d) Error (ξ) Submodel Ames 2.6k 253 17 15% · y XGBoost Austin 12k 315 13 15% · y XGBoost Diamonds 48k 26 9 15% · y Ridge Weather 308k 140 5 3◦C Ridge Life 2.6k 204 13 3 years XGBoost Spambase 4.1k 57 17 0 Ridge Certifying r > 1 For our two weighted methods, W-PCR and W-OCR, our evaluation attempts to certify each submodel’s robustness against deletions up to r = 2. 4.8.2 Analyzing the Certified Accuracy. Figure 7 visualizes our methods’ mean certified accuracy for the six datasets. For brevity, the corresponding numerical values, including variance, are deferred to Sec. C.1. Below, we briefly summarize the experiments’ primary takeaways. Takeaway #1: Both our ensemble and IBL regressors certify non-trivial fractions of the training set. For the Ames and Life datasets, W-OCR certifies 50% of test predictions up to 1% training set corruption. Similarly, kNN-CR certifies 30% of predictions on Ames up to 4% corruption. These certified guarantees are without explicit assumptions about the data distribution or, in the case of ensembles, the submodel’s architecture. For other datasets, we certify predictions up to hundreds or thousands of training set modifications. Takeaway #2: Ensemble regressors have better peak performance. Across all six datasets, the ensemble-based methods all had better peak certified accuracy than 79 kNN-CR. 
The performance gap was as large as 3.5× and is not primarily due to feature dimension as kNN-CR performed worst on Diamonds, which has the smallest dimension by far. Takeaway #3: W-OCR achieves the largest certified robustness (R) amongst the ensemble methods. This is observed using each dataset’s largest q value. For all six datasets, there is a (significant) gap between W-OCR ( ) and our second-best ensemble method, W-PCR ( ). Takeaway #4: kNN-CR achieves the largest certified robustness. Although kNN-CR certifies (far) fewer instances than the ensembles, for instances that it can certify, its maximum robustness R is generally far larger than that of W-OCR. For example with Weather, kNN-CR’s maximum R is 5× larger than W-OCR’s. Suppl. Sec. C.1.3 best visualizes this trend in its plots of kNN-CR’s full certified accuracy. Takeaway #5: W-OCR achieves state-of-the-art certified accuracy for binary classification. While regression is this work’s primary focus, recall that binary classification can be solved by a regressor. For binary classification, kNN-CR’s R is identical to Jia et al.’s [Jia+22a] kNN classifier; PCR certifies equivalent robustness as DPA, and OCR very closely approximates Wang et al.’s [WLF22a] overlapping method. Observe that W-OCR outperforms the unweighted ensembles and kNN-CR on Spambase’s [Hop+17] two largest q values. Note that Spambase’s maximum q value cannot be increased further without severely degrading submodel performance. This provides empirical evidence for Sec. 4.7.3’s claim that a weighted strategy can outperform increasing submodel count L. Takeaway #6: q can significantly affect certified accuracy. Previous certified classifier evaluations [Jia+22a; WLF22a; LF21] under-explore q’s role. Those works primarily evaluate vision datasets where the training data is supplemented by (1) using 80 a pre-trained model to extract much better features [JCG21; Jia+22a] or (2) using vision data augmentation [LF21; WLF22a]. For the tabular datasets evaluated here, such options are not as available. Without such augmentation, increasing q can significantly degrade an ensemble’s peak certified accuracy. As an example, the ensembles’ peak certified accuracy can decline by up to 28% between training a model on all of D versus a dataset’s maximum q value (compare to uncertified accuracy in Fig. 7). Therefore, when thinking about certified classifiers and regressors, always consider the potential benefits of external (clean) data augmentation. For instance, in our experiments, XGBoost certified ensembles’ accuracy improved by multiple percent when using mixup data augmentation [Zha+18]. 4.9 Conclusions This chapter describes a novel reduction from certified regression to certified classification based on median perturbation. Applying this reduction, we propose six new certified regressors that require no assumptions about the data distribution or model architecture. As improved voting-based, certified classifiers are proposed in the future, our reduction can be applied to those methods too. This enables certified regression to keep pace with the rapid advances in certified classification. While this work focuses on certified defenses against poisoning attacks, some certified evasion defenses also rely on voting-based guarantees [LF20b; Jia+22a]. Our reduction from certified regression to certified classification applies to those certified evasion defenses as well. 
Lastly, our empirical results show that improved certified guarantees are possible when the unit-cost assumption is replaced by our tighter weighted analysis. These certification gains apply to both classification and regression, but Sec. 4.7.2's approach is computationally expensive. We advocate for better methods that efficiently certify beyond r = 1.

[Figure 7: two pages of certified-accuracy plots – one row per dataset (Ames Housing, Austin Housing, Diamonds, Weather, Life, Spambase) with three panels per row, one per q value. Each panel plots Certified Acc. (%) against Cert. Robustness (R) for Uncertified (q = 1), PCR, OCR, W-PCR, W-OCR, and kNN-CR.]

Figure 7. Certified Accuracy: Mean certified accuracy (larger is better) for our five primary certified regressors. kNN-CR is always trained on all of training set D (i.e., q = 1). Ensemble submodels are trained on 1/q-th of D, with three q values tested per dataset. The x-axis is clipped to enhance readability; see suppl. Sec. C.1.3 for kNN-CR's full results. The best performing method depends on the target certified robustness R. For smaller R values, W-OCR achieves the best certified accuracy. For larger R values, kNN-CR outperforms the ensemble methods. This result aligns with previous findings on certified classification [Jia+22a]. Sec. 4.8.2 summarizes these experiments' primary takeaways. See Sec. C.1 for the numerical results, including variance.

CHAPTER 5

CERTIFIED DEFENSE AGAINST A UNION OF ℓ0 ATTACKS

This chapter contains previously published, coauthored material [HL23a]. Hammoudeh developed the primary method, developed all code, conducted all experiments, and wrote the manuscript. Lowd provided supervision, editorial suggestions, and input on experiments.

Zayd Hammoudeh and Daniel Lowd.
“Feature Partition Aggregation: A Fast Certified Defense Against a Union of ℓ0 Attacks”. In: Proceedings of the 2nd ICML Workshop on New Frontiers in Adversarial Machine Learning. AdvML-Frontiers’23. 2023. url: https://arxiv.org/abs/2302.11628 This chapter focuses on ℓ0 or sparse attacks, 1 where an adversary controls an unknown subset of the features. By certifying robustness w.r.t. the number of perturbed features, ℓ0 analysis is particularly well-suited to heterogeneous (tabular) data where the features have different types (e.g., numerical, categorical) or scales. Moreover, ℓ0 defenses provide provable robustness against real-world patch attacks [LF20a]. Several certified ℓ0 defenses have been proposed [Lee+19; LF20b; Cal+21; Jia+22b], but these methods apply to evasion only, which can be limiting. For example, consider a distributed sensor network where each (tabular) feature is independently measured by a different sensor. Under this type of vertical partitioning where features are sourced from multiple parties, an attacker that controls a single feature (i.e., sensor) can partially perturb every instance – training and test – up to 100% poisoning rate [LDD21; Wei+22]. Existing ℓ0 evasion defenses do not certify robustness over any training perturbation rendering them moot under such an attack. Moreover, 1The ℓ0 norm of vector x ∈ Rd equals ∥x∥0 := |{j : xj ̸= 0}|, where xj ∈ R denotes the j-th dimension of x. Put simply, x’s ℓ0-norm equals the number of non-zero dimensions in x. Intuitively, ℓ0 attacks bound the number of dimensions of x whose perturbation value can be non-zero. 85 existing ℓ0 defenses could not be combined with instance-wise poisoning defenses here since typically, the latter are only provably robust under small poisoning rates, e.g., ≤1% [WLF22b; Rez+23]. To address these limitations, we propose feature partition aggregation (FPA) – a certified sparse defense jointly robust against both training and test feature perturbations. FPA uses a model ensemble approach, where each submodel is trained on a disjoint feature set, meaning any adversarially perturbed feature – training or test – affects at most one submodel prediction. Hence, FPA guarantees robustness over the union of ℓ0 evasion, backdoor, and poisoning attacks – a strictly stronger guarantee than existing ℓ0 methods [LF20b]. In our empirical evaluation, FPA’s certified median guarantees are up to 4× larger than state-of-the-art ℓ0 defenses [Jia+22b] with little to no decrease in classification accuracy; FPA is also up to 3,000× faster. In other words, FPA provided additional dimensions of ℓ0 robustness essentially for free. Our primary contributions are summarized below; additional theoretical analysis and all proofs are in the supplement. – We define a new robustness paradigm we term certified feature robustness that generalizes ℓ0 (sparse) robustness to encompass training set feature perturbations. – We propose feature partition aggregation, a certified feature defense that uses an ensemble of submodels trained on disjoint feature sets. We detail two certification schemes – a simple one based on plurality voting and the other based on multi-round elections. – We empirically evaluate FPA on two classification and two regression datasets. FPA provided simultaneously larger and stronger median guarantees than the 86 state-of-the-art certified ℓ0 defenses while also being 2 to 3 orders of magnitude faster. 5.1 Preliminaries Notation ℓ0 norm ∥w∥0 is the number of non-zero elements in vector w. 
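In NumPy terms (illustrative only), the ℓ0 norm is simply a non-zero count:

```python
import numpy as np

w = np.array([0.0, 3.1, 0.0, -2.0])
l0_norm = np.count_nonzero(w)   # ||w||_0 = 2: the number of non-zero entries
```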
Given some matrix A, denote its j-th column as Aj. In a slight abuse of notation, let A ⊖ A′ := {j : Aj ≠ A′j} denote the set of column indices over which equal-size matrices A and A′ differ. Similarly, let v ⊖ v′ ⊆ [|v|] denote the set of dimensions where vectors v and v′ differ. Denote the training set's feature matrix as X := [x1 · · · xn]⊺ where X ∈ R^{n×d}, and denote the label vector y := [y1, . . . , yn].

Feature partition aggregation (FPA) is an ensemble of L submodels (see Figure 8). A decision function aggregates the L submodel predictions to form f's overall prediction. The model architecture and decision function combined dictate how a prediction's certified robustness is calculated. For instance (x, y), let gl(x, y) be the l-th submodel's logit value for label y, where gl : X × Y → R. Let fl(x) denote the l-th submodel's predicted label for x, where fl : X → Y and fl(x) := argmax_{y∈Y} gl(x, y). Throughout this work, all ties are broken by selecting the label with the smallest index.

Feature set [d] is partitioned across FPA's L submodels. Let Sl ⊂ [d] be the features used by the l-th submodel where ⊔_{l=1}^{L} Sl = [d]. In other words, each FPA submodel considers a fixed, disjoint subset of the features for all training and test instances. The l-th submodel's training set, Dl, consists of: label vector y and the Sl columns in X. FPA submodels are deterministic, meaning fixing Dl, Sl, and x, in turn, fixes label fl(x) and logits ∀y gl(x, y).

Given x and y, the pointwise submodel vote count is ċy(x) := Σ_{l=1}^{L} 1[fl(x) = y]. The plurality and runner-up labels receive the most and second-most votes (resp.), i.e., ypl = argmax_{y∈Y} ċy(x) and yru = argmax_{y∈Y\{ypl}} ċy(x). The pointwise submodel vote gap between labels y, y′ ∈ Y is

Gapvote(y, y′; x) := ċy(x) − ċy′(x) − 1[y′ < y],  (5.1)

with the indicator function used to break ties. Let c̈y(x; y′) := Σ_{l=1}^{L} 1[gl(x, y) > gl(x, y′)] be y's logit vote count w.r.t. y′ ∈ Y. The pointwise logit vote gap for y w.r.t. y′ is

Gaplogit(y, y′; x) := c̈y(x; y′) − c̈y′(x; y) − 1[y′ < y].  (5.2)

Below, x is dropped from Gapvote and Gaplogit when the feature vector of interest is clear from context.

Threat Model Given arbitrary (x, y), the attacker's objective is to ensure that y ≠ f(x). The adversary achieves this objective via two methods: (1) modify training features X or (2) modify test instance x's features. An adversary may use either method individually or both methods jointly. An attacker can perturb up to 100% of the training instances.

Our Objective For arbitrary (x, y), determine the certified feature robustness, R (defined below). Note that pointwise guarantees certify the robustness of each instance (x, y) individually.

Def. 5.1. Certified Feature Robustness Given training set (X, y), model f′ trained on (X′, y), and arbitrary feature vector x′ ∈ X , certified feature robustness R ∈ N is a pointwise, deterministic guarantee w.r.t. instance (x, y) where |(X ⊖ X′) ∪ (x ⊖ x′)| ≤ R =⇒ y = f′(x′).

Certified robustness R is not w.r.t. individual feature values. Rather, certified feature robustness provides a stronger guarantee allowing all values of a feature – training and test – to be perturbed.

Figure 8.
Feature partition aggregation example prediction for: test instance x ∈ X , n = 3, d = 4, and |Y| = 3. Feature partitioning across L = 4 submodels, where the l-th submodel uses only feature dimensions Sl = {l} ⊂ [4] and training set Dl, i.e., the tuple containing the l-th column of feature matrix X (denoted Xl) and label vector y := [y1, y2, y3]. xS denotes the subvector of x restricted to the feature dimensionsl in Sl. Plurality label ypl = 0; runner-up label yru = 1; and run-off label yRO = 0. Under the plurality voting decision function (Sec. 5.3.1), f(x) has certified feature robustness Rpl = 0. With run-off (Sec. 5.3.2), f(x)’s certified feature robustness is RRO = 1. 5.2 Related Work FPA marries ideas from two classes of certified adversarial defenses, which are discussed below. A more detailed discussion of related work in the full paper [HL23b; HL23a]. ℓ0-Norm Certified Evasion Defenses Representing the work most closely related to ours, these methods certify ℓ0-norm robustness (also known as “sparse robustness”), which we formalize below. 89 Def. 5.2. ℓ0-Norm Certified Robustness Given model f , α ∈ (0, 1), and arbitrary feature vector x′ ∈ X , ℓ0-norm certified robustness ρ ∈ N is a pointwise guarantee w.r.t. instance (x, y) where if ∥x− x′∥0 ≤ ρ, then y = f(x′) with probability at least 1− α. There are two main differences between certified ℓ0-norm robustness (Def. 5.2) and our certified feature robustness (Def. 5.1). (1) ℓ0-norm methods are not certifiably robust against any adversarial training perturbations (e.g., poisoning and backdoors). (2) ℓ0-norm robustness guarantees are probabilistic, while our feature guarantees are deterministic. Put simply, our certified feature guarantees are strictly stronger than ℓ0-norm guarantees. Randomized ablation (RA) is the state-of-the-art certified ℓ0-norm defense [LF20b; Jia+22b]. RA adapts ideas from randomized smoothing [CRK19] to ℓ0 evasion attacks [LF20b]. Specifically, RA creates a smoothed classifier by repeatedly evaluating different ablated inputs, each of which keeps a random subset of the features unchanged and masks outs (ablates) all other features. RA’s ablated training generally permits only stochastically-trained, parametric model architectures. At inference, certifying a single prediction with RA requires evaluating up to 100k ablated inputs [Lee+19; Jia+22b]. Jia et al. [Jia+22b] improve RA’s guarantees via new certification analysis that is tight for top-1 predictions, meaning Jia et al.’s version of RA always performs at least as well as the original. Jia et al. [Jia+22b] also extend RA to certify ℓ0-norm robustness for top-k predictions. Certified patch robustness is a restricted form of ℓ0-norm robustness where the perturbed test features are constrained to a specific, contiguous shape, e.g., square [MY21]. Existing patch defenses include (de)randomized smoothing (DRS) [LF20a] – a specialized version of randomized ablation for patch attacks. Like RA, DRS performs ablated training and inference. By assuming a single patch shape, the number of 90 possible attacks becomes linear in d, allowing DRS to only evaluate O(d) ablations during inference; this derandomizes the ablation set, making DRS’s patch guarantees deterministic.2 More recently, Metzen and Yatsura [MY21] propose, BagCert – a certified patch defense that is less sensitive to patch shape than DRS. Note any certified feature or ℓ0-norm defense (e.g., FPA, RA) is also a certified patch defense, given the former’s stronger guarantees. 
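As a schematic of the ablation operation described above – not RA's actual feature encoding or certification procedure – each ablated input keeps e randomly selected features and masks out the rest:

```python
import numpy as np

def ablate(x, e, rng, mask_value=0.0):
    """Keep e randomly chosen features of x unchanged and ablate (mask) the
    remaining d - e features. RA's smoothed classifier aggregates predictions
    over many such ablated copies; the mask encoding here is a simplification."""
    d = x.shape[0]
    keep = rng.choice(d, size=e, replace=False)
    x_abl = np.full_like(x, mask_value, dtype=float)
    x_abl[keep] = x[keep]
    return x_abl

rng = np.random.default_rng(0)
x = np.arange(8, dtype=float)
samples = [ablate(x, e=3, rng=rng) for _ in range(5)]  # 5 of RA's many ablated inputs
```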
Instance-wise Certified Poisoning Defenses The second class of related defenses certify robustness under the arbitrary insertion or deletion of entire instances in the training set [Che+22; WF23] – generally a small poisoning rate (e.g., ≤1%). Like FPA, most instance-wise poisoning defenses are voting-based [JCG21; Jia+22a; WLF22a]. For example, deep partition aggregation (DPA) randomly partitions the training instances across an ensemble of L submodels [LF21]. More recently, Rezaei et al. [Rez+23] propose run-off elections, a novel decision function for DPA that can improve DPA’s certified robustness by several percentage points. While certified instance-wise poisoning defenses show promise, they are still vulnerable to test perturbations – even of a single feature. 5.3 Certifying Feature Robustness Our certified defense, feature partition aggregation (FPA), can be viewed as the transpose of Levine and Feizi’s [LF21] deep partition aggregation (DPA). Both defenses are (1) ensembles, (2) rely on voting-based decision functions, and (3) partition the training set; the key difference is in the partitioning operation. DPA horizontally partitions the set of training instances (rows of feature matrix X), enabling DPA to certify instance-wise robustness. In contrast, FPA vertically partitions along an 2(De)randomized smoothing’s deterministic guarantees do not scale to RA which considers O(2d) attacks. 91 orthogonal dimension – the feature set (columns of X) – enabling FPA to certify feature-wise robustness. Intuitively, partitioning along orthogonal dimensions means that DPA and FPA certify orthogonal types of robustness. Training FPA submodels on disjoint feature subsets (e.g., Figure 8) entails that a perturbed feature affects, at most, one submodel prediction. FPA leverages this property to certify feature robustness R. Below we describe two FPA decision functions : (1) a simpler scheme using plurality voting and (2) an enhanced multi-round voting procedure specialized for multiclass classification. The decision function combined with FPA’s architecture dictates how our robustness guarantee is calculated. 5.3.1 Feature Robustness Under Plurality Voting. For x ∈ X , the plurality voting decision function defines the model prediction as f(x) := ypl, i.e., the label that receives the most submodel votes. A successful attack requires perturbing enough submodels to change ypl. Specifically, each submodel perturbation decreases the submodel vote gap (Gapvote) between ypl and the adversary’s selected label by two. Hence, the minimum number of submodel perturbations equals half the vote gap between ypl and runner-up label yru. Thm. 5.3 formalizes this idea as a deterministic feature robustness guarantee. Eq. (5.3)’s decomposed form is similar to other voting-based certified defenses, including DPA [LF21; Jia+22a; HL23c]. Theorem 5.3. Certified Feature Robustness with Plurality Voting For feature partition S1, . . . ,SL, let f be an ensemble of L submodels using the plurality-voting decision function, where the l-th submodel uses the features in Sl. For instance (x, y), the pointwise certified feature robust⌊ness is ⌋ Gap : vote (ypl, yru) Rpl = . (5.3) 2 92 Understanding Thm. 5.3 More Intuitively Let Atr ⊆ [d] be the set of features (i.e., dimensions) an attacker modified in the training set, and let Ax ⊆ [d] be the set of features the attacker modified in instance x. As long as |Atr ∪ Ax| ≤ R, the adversarial perturbations did not change the model prediction. 
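Thm. 5.3's guarantee is straightforward to compute from the L submodel labels. The sketch below is illustrative only (the helper name is ours); it applies the text's tie-breaking rule and returns the plurality prediction together with Rpl.

```python
from collections import Counter

def plurality_feature_robustness(submodel_labels):
    """Certified feature robustness under plurality voting (Thm. 5.3).
    `submodel_labels` holds each submodel's predicted label f_l(x).
    Ties are broken toward the smaller label index, as in the text."""
    counts = Counter(submodel_labels)
    # Rank labels by (votes descending, label index ascending).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    y_pl, c_pl = ranked[0]
    y_ru, c_ru = ranked[1] if len(ranked) > 1 else (None, 0)
    gap = c_pl - c_ru - (1 if y_ru is not None and y_ru < y_pl else 0)
    return y_pl, gap // 2          # prediction and R_pl = floor(Gap_vote / 2)

y_hat, R_pl = plurality_feature_robustness([0, 0, 0, 1, 2])   # -> (0, 1)
```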
The union over the perturbed feature sets entails that a feature perturbed in both training and test counts only once against guarantee R. Put simply, there is no double counting of a perturbed feature. Thm. 5.3’s certified guarantees are implicitly agnostic to the ℓ0 attack type. Certified feature robustness R applies equally to an ℓ0 evasion attack (Ax only) as it does to ℓ0 poisoning (Atr only). Thm. 5.3’s guarantees also encompass more complex ℓ0 backdoor attacks (Atr ∪ Ax). 5.3.2 Feature Robustness Under Run-Off Elections. Under plurality voting, only submodels that predict either ypl or yru are considered when determining the certified feature robustness (Eq. (5.3)). In other words, submodels predicting other labels essentially contribute nothing to plurality voting’s pointwise guarantees. Decision functions that leverage these “wasted” submodels may certify larger guarantees (see Figure 8). For instance, Rezaei et al. [Rez+23] propose run-off elections, an enhanced two-round DPA decision function for multiclass classification.3 Since FPA and DPA share the same basic architecture (excluding the partitioning dimension), run-off can be directly combined with FPA to improve our certified robustness. We now describe run-off. Our presentation is similar to Rezaei et al.’s [Rez+23] except we standardize the formulation to align with previous work and to correct an error in Rezaei et al.’s preprint version. Formally, run-off’s decision function procedure is: 3Run-off only changes the decision function; no training or model architecture changes are required. 93 Round #1: Determine plurality and runner-up labels ypl and yru (resp.) as above. Round #2: Set run-off prediction yRO to either label ypl or yru based on the logit vote gap where ypl Gap logit (ypl, yru) ≥ 0 f(x) = yRO :=  . (5.4)yru Otherwise Under run-off, ensemble prediction yRO can only be perturbed in two ways: (1) overtake yRO in round #2 or (2) eject yRO from round #1’s top-two labels. Run-off’s certified (feature) robustness is lower bounded by whichever case takes fewer submodel perturbations. We discuss these two cases separately below; Thm. 5.4 combines these analyses to define run-off’s overall feature robustness. Case #1: Overtake yRO in Round #2 Let ỹRO := {ypl, yru} \ yRO denote the label not selected in round #2. For a label y to overtake yRO in round #2, y must simultaneously satisfy two requirements: (a) be in round #1’s top-two labels (in turn ejecting ỹRO from the top two) and (b) receive more logit votes than yRO in round #2. Hence, the certified robustness for this case is bounded by whichever of these requirements requires more feature perturbations. Therefore, an attacker may control up to {⌊ ⌋ ⌊ ⌋} Gapvote(ỹRO, y) Gaplogit(yRO, y) RCase1 :RO = min max , (5.5) y∈Y\yRO 2 2 features, without yRO being overtaken in round #2 (Lemma B.3). Case #2: Eject yRO from Round #1’s Top-Two Labels In round #1, a label y is preferred over a different label y′ iff Gap ′vote(y, y ) ≥ 0 (Lemma B.2). Therefore, ejecting yRO from round #1’s top-two labels requires perturbing sufficient submodels 94 such that two labels have negative submodel vote gaps w.r.t. yRO. Let dp be a function that takes two submodel vote gaps (e.g., i, j ∈ N) and returns yRO’s round #1 certified feature robustness. Recall that perturbing a submodel vote from yRO to a different y decreases Gapvote(yRO, y) by 2; observe that this same submodel perturbation also decreases Gap ′ ′vote(yRO, y ) by 1 for all y ∈ Y \ {yRO, y}. 
Combining these interactions, dp can bedefined recursively as0 max{i, j} ≤ 1 and (i, j) ̸= (1, 1) dp[i, j] =  ,1 + min{dp[i− 2, j − 1], dp[i− 1, j − 2]} Otherwise (5.6) where the base case ensures at least one submodel vote gap is non-negative. Therefore, case #2’s total certified robustness is [ ] RCase2 :RO = min dp gap , gap ′ (5.7) y,y′∈Y\ y yyRO where gapy∗ = max{0,Gap ∗vote(yRO, y )} (Lemma B.4). Recursive formulations like Eq. (5.6) are solvable using classic dynamic programming. O(L2)-space matrix dp is prepopulated once, meaning the incremental lookup cost is only O(1) and RCase2RO ’s total time complexity O(|Y|2). Combining Cases #1 and #2 to Certify Feature Robustness Thm. 5.4 provides the certified feature robustness for an FPA prediction using the run-off decision function. Intuitively, an optimal attacker selects whichever of the two cases above requires fewer feature perturbations; hence, Eq. (5.8) below takes the minimum of RCase1 and RCase2RO RO . Theorem 5.4. Certified Feature Robustness with Run-off For feature partition S1, . . . ,SL, let f be an ensemble of L submodels using the run-off decision function, where the l-th submodel uses only the features in Sl. Then, for instance (x, y), the 95 pointwise certified feature robustness is RRO = min{RCase1 Case2RO , RRO }. (5.8) 5.3.3 Advantages of Feature Partition Aggregation. Below, we summarize FPA’s advantages over state-of-the-art certified ℓ0-norm defense randomized ablation (RA). These advantages apply irrespective of whether FPA uses plurality voting or run-off. (1) Stronger Guarantees FPA’s certified feature robustness guarantee (Def. 5.1) is strictly stronger than RA’s ℓ0-norm guarantee (Def. 5.2). First, FPA’s guarantees apply equally to ℓ0 evasion, poisoning, and backdoor attacks while RA only applies to evasion. Second, FPA’s guarantees are deterministic while RA’s guarantees are only probabilistic. (2) Faster RA requires up to 100k forward passes to certify one prediction. FPA requires only L forward passes – one for each submodel – where L < 200 in general. FPA certification is, therefore, orders of magnitude faster than RA. (3) Model Architecture Agnostic RA’s feature ablation is specialized for parametric models like neural networks and generally prevents the use of tree- based models like gradient-boosted decision trees (GBDTs). By contrast, FPA supports any submodel architecture. 5.4 Feature Partitioning Strategies The certification analysis above holds irrespective of the feature partitioning strategy. However, how the features are partitioned can have a major impact on the size of FPA’s certified guarantees. Below, we very briefly describe two insights into the properties of good feature partitions. 96 Insight #1 Ensure sufficient feature information is available to each submodel. Each incorrect submodel or logit vote cancels out a correct one, meaning the goal should be to simultaneously maximize the number of correct submodel predictions and minimize incorrect ones. In other words, robustness is maximized when all submodels perform well, and feature information is divided equally. Insight #2 Limit information loss due to feature partitioning. Feature partitioning is lossy from an information theoretic perspective. Fixing L, some partitions are more lossy than others, and good partitions limit the information lost. 5.4.1 Feature Partitioning Paradigms. Applying the above insights, we propose two general feature partitioning paradigms. 
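Both paradigms, detailed in the remainder of this subsection, can be sketched in a few lines. The helper names are illustrative and feature indices are 0-based here (the text's strided formula is stated 1-indexed).

```python
import numpy as np

def balanced_random_partition(d, L, seed=0):
    """Balanced random partitioning: shuffle the d feature indices u.a.r. and
    split them into L disjoint subsets whose sizes differ by at most one."""
    perm = np.random.default_rng(seed).permutation(d)
    return [set(chunk.tolist()) for chunk in np.array_split(perm, L)]

def strided_partition(d, L):
    """Strided partitioning for ordered features (e.g., pixels):
    S_l = {j : j mod L == l}, the 0-indexed form of Sec. 5.4.1's formula."""
    return [set(range(l, d, L)) for l in range(L)]
```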
In practice, the partitioning strategy is essentially a hyperparameter tunable on validation data. The validation set need not be clean so long as the perturbations are representative of the test distribution. Balanced Random Partitioning Given no domain-specific knowledge, each feature’s expected information content is equal. Balanced random partitioning assigns each submodel a disjoint feature subset sampled uniformly at random, with subsets differing in size by at most one. Random partitioning has two primary benefits. First, each submodel has the same a priori expected information content. Second, random partitioning can be applied to any dataset. FPA with random partitioning is usually a good initial strategy and empirically performs quite well. Deterministic Partitioning One may have application-related insights into quality feature partitions. For example, consider feature partitioning of images. Features (i.e., pixels) in an image are ordered, and that structure can be leveraged to 97 design better feature partitions. Often the most salient features are clustered in an image’s center. To ensure all submodels are high-quality, each submodel should be assigned as many highly salient features as possible. Moreover, adjacent pixels can be highly correlated, i.e., contain mostly the same information. Given a fixed set of pixels to analyze, the information contained in those limited features should be maximized, so a good strategy can be to select a subset of pixels spread uniformly across the image. Put simply, for images, random partitioning can have larger information loss than deterministic strategies. The supplement of the original paper empirically compares random and deterministic partitioning [HL23b; HL23a]. In short, a simple strided strategy that distributes features regularly across an image tends to work well for vision. Formally, given d pixels and L submodels, the l-th submodel’s feature set under strided partitioning is Sl = {j ∈ [d] : jmod L = l − 1}. 5.5 Evaluation Our empirical evaluation is modeled after Levine and Feizi’s [LF20b] evaluation of randomized ablation. For clarity, additional results are deferred to the supplement including the base (non-robust) accuracy for each dataset (C.2.1) and the full numerical results (C.2.2 & C.2.3). Additional results appear in the original paper including: hyperparameter sensitivity analysis, plurality voting vs. run-off comparison, random vs. deterministic partitioning comparison, and model training times [HL23b; HL23a]. 5.5.1 Experimental Setup. Due to space, most evaluation setup details are deferred to suppl. Sec. D.2 with a brief summary below. We evaluate FPA with both the plurality-voting (Sec. 5.3.1) and run-off (Sec. 5.3.2) decision functions. Baselines Randomized ablation (RA) is FPA’s most closely related work and the primary baseline below. We report the performance of both Levine and Feizi’s [LF20b] original version of RA as well as Jia et al.’s [Jia+22b] improved version, where 98 the certification analysis is tight for top-1 predictions. RA performs feature ablation during training and inference. Each ablated input keeps e randomly selected features unchanged and masks out the remaining (d− e) features; RA evaluates up to 100,000 ablated inputs to certify each prediction. Recall that RA’s ℓ0-norm robustness only applies to evasion attacks (Def. 5.2), while FPA provides strictly stronger feature guarantees that cover manipulation of both training and test data (Def. 5.1). 
We also compare FPA to three certified patch defenses: (de)randomized smoothing [LF20a], patch interval bound propagation (IBP) [Chi+20], and BagCert [MY21]. Performance Metrics Certified defenses generally trade-off robustness and (clean) accuracy. Hence, following Levine and Feizi’s [LF20b] evaluation of randomized ablation, performance is measured using two complementary metrics: (1) median certified robustness, the median value of the certified robustness across a dataset’s entire test set with misclassified instances assigned robustness −∞ and (2) classification accuracy, the fraction of test predictions classified correctly. Below, Rmed and ρmed denote the median certified feature robustness (Def. 5.1) and ℓ0-norm robustness (Def. 5.2) resp. Mean certification time measures the time to certify a single prediction. Performance is also quantified using certified accuracy, i.e., the fraction of correctly- classified test instances that satisfy some specific robustness criterion; this criterion can be patch robustness or certified robustness of at least ψ ∈ N. Datasets We compare the methods on standard datasets used in data poisoning evaluation. First, following Levine and Feizi’s [LF20b] evaluation of baseline RA, 99 we consider MNIST and CIFAR104 where each feature corresponds to one (RGB) pixel. Second, Chapter 4 proves that certified regression reduces to certified binary classification when median is used as the regressor’s decision function. We apply their reduction to both FPA and RA where for instance (x, y) and hyperparameter ξ ∈ R≥0, the goal is to certify that y − ξ ≤ f(x) ≤ y + ξ. We consider two tabular regression datasets evaluated in Chapter 4. (1) Weather [Mal+21] predicts the temperature using features such as date, longitude, and latitude (ξ = 3◦C). (2) Ames [Coc11] predicts housing prices using features such as square footage (ξ = 15%y). These two regression datasets serve as a stand-in for vertically partitioned data, which are commonly tabular and, as mentioned at the beginning of this chapter, are particularly vulnerable to our union of ℓ0 attacks threat model. Note run-off and plurality voting are identical under binary classification so we only report FPA’s plurality voting regression results. Model Architectures For vision datasets MNIST and CIFAR10, all methods used convolutional neural networks. Gradient-boosted decision trees (GBDTs) generally work exceptionally well on tabular data [BHL23] so for regression datasets Weather and Ames, FPA used LightGBM GBDTs [Ke+17]. In contrast, RA’s feature ablation prevents the use of tree-based models like GBDTs, so RA instead used linear models for these two datasets (Hammoudeh and Lowd [HL23c] also used linear models for Weather). Even when restricted to linear submodels, FPA still had better median robustness and classification accuracy than RA; see suppl. Tables C.23 and C.24. Feature Partitioning Strategy For CIFAR10 and MNIST, FPA used strided feature partitioning; each submodel considered the full image dimensions with any 4Existing certified poisoning defenses do not evaluate on full ImageNet due to the high training cost [Web+23; LF21; Jia+22a; WLF22a; WLF22b; Rez+23]. 100 Table 2. Median certified robustness. Table 3. Classification accuracy (% Each dataset’s best performing method – larger is better). We report FPA’s is in bold. Our median robustness accuracy at both RA’s (middle, bold) was 20–30% larger for classification and FPA’s (blue) best median robustness and 3 to 4× larger for regression levels. 
At RA’s best median robustness, while simultaneously providing stronger FPA had better classification accuracy guarantees. For detailed results, see for all four datasets. For full results, see Sec. C.2.2. Sec. C.2.2. FPA (ours) RA FPA (ours) RA [Jia+22b] Dataset Dataset Plural Run-Off [LF20b] [Jia+22b] Rmed Acc. Rmed Acc. ρmed Acc. CIFAR10 11 13 7 10 CIFAR10 13 62.4 10 75.0 10 64.7 MNIST 9 12 8 10 MNIST 12 87.2 10 96.1 10 93.1 Weather 4 – 0 1 Weather 4 76.1 1 85.3 1 75.2 Ames 3 – 1 1 Ames 3 65.5 1 84.6 1 67.2 pixels not in Sl set to 0. For Weather and Ames, FPA used balanced random partitioning as the tabular features are unordered. Hyperparameters Hyperparameters L (FPA’s submodel count) and e (RA’s kept feature count) control the corresponding method’s robustness vs. accuracy tradeoff. When optimizing patch and median robustness, hyperparameters L and e were tuned on validation data.5 Patch Robustness We consider two CIFAR10 patch attacks: (1) a 5× 5 pixel square [LF20a] and (2) all 24-pixel rectangles (e.g., 1× 24 pixels, 24× 1, 2× 12, etc.), reporting each method’s minimum and maximum certified accuracies across the eight valid shapes [MY21]. 5.5.2 Main Results. Table 2 summarizes the median certified robustness for FPA and baseline RA. Tables 3 and 5 compare the methods and RA’s classification accuracy and mean certification time (resp.); note that, for clarity, these two tables 5Secs. C.2.2 & C.2.3 compare each method’s certified accuracy across a range of hyperparameter settings. 101 only report the results for Jia et al.’s [Jia+22b] better performing version of baseline RA. Table 4 analyzes FPA as a patch defense. We briefly summarize the experiments’ takeaways below. See suppl. Secs. C.2.2 and C.2.3 for the full numerical results, including comparing the methods at additional robustness levels. Takeaway #1 FPA simultaneously provided larger and stronger median robustness guarantees than RA. As Table 2 details, FPA’s median certified robustness was 20–30% larger than RA for classification and 3 to 4× larger for regression. Importantly, FPA’s certified feature guarantees apply to evasion, poisoning, and backdoor attacks, while baseline RA only covers evasion attacks. Takeaway #2 FPA’s median robustness gains come at little cost in classification accuracy. Table 3 reports FPA’s classification accuracy at two robustness levels: (1) FPA’s best median robustness (blue) and (2) RA’s best median robustness (bold). Table 3 also reports RA’s classification accuracy at its own best median robustness (last column). For CIFAR10 at median robustness of 10 pixels, FPA’s classification accuracy was 10.2 percentage points (pp) better than RA (75.0% vs. 64.7%). At Rmed = 13, FPA’s CIFAR10 classification accuracy was 62.4%, only 2.3pp lower than RA’s classification accuracy at ρmed = 10. For Weather at median robustness 1, FPA’s classification accuracy was 10.1pp better than RA (85.3% vs. 75.2%); even at Rmed = 4, FPA’s classification accuracy was 76.1%, 0.9pp better than RA at ρmed = 1. For MNIST at median robustness 10, FPA’s classification accuracy was 3pp better than RA (96.1% vs. 93.1%). At Rmed = 12, FPA’s MNIST classification accuracy was 5.9pp lower than RA’s classification accuracy at ρmed = 10 (87.2% vs. 93.1%). 102 Table 4. CIFAR10 certified patch Table 5. Mean certification time accuracy (% – larger is better) for in seconds for FPA and Jia et al.’s FPA, RA, and three dedicated patch [Jia+22b] randomized ablation (RA). defenses. 
FPA is competitive despite FPA is 2 to 3 orders of magnitude faster making fewer assumptions and providing than baseline RA. stronger guarantees than patch defenses. RA [Jia+22b] FPA (ours) Dataset Speedup 24 Pixel Rect. Square Method e Time L Time Min. Max. 5× 5 CIFAR10 15 5.4E+0 115 7.3E−3 743× ←− −→ MNIST 25 6.8E−1 60 2.9E−3 235×FPA Plurality (L = 180, ours) 38.53 37.77 Weather 45 3.1E−1 21 1.0E−4 3,134× FPA Run-Off (L = 180, ours) ←− 41.60 −→ 40.95 Ames 60 3.8E−1 21 3.5E−4 1,082× Rand. Ablation [LF20b] ←− 28.95 −→ 28.21 Rand. Ablation [Jia+22b] ←− 37.31 −→ 36.43 (De)Random. Smoothing 0.0 72.68 57.69 BagCert 43.11 60.17 59.95 Patch IBP — — 30.30 Takeaway #3 FPA certifies predictions 2 to 3 orders of magnitude faster than RA. Table 5 compares the mean certification times using the hyperparameter settings with the best median robustness. To certify one prediction, Jia et al.’s [Jia+22b] improved RA evaluates 100k ablated inputs. In contrast, FPA requires exactly L forward passes per prediction (one per submodel). Takeaway #4 FPA provides strong patch robustness without any assumptions about patch shape or the number of patches. As Table 4 details, FPA certifies 41.6% of CIFAR10 predictions at R = 24 perturbed pixels (2.3% of d) – regardless of patch shape or the number of patches. In contrast, (de)randomized smoothing’s [LF20a] (BS, s = 12) 24-pixel certified accuracy varies between 0% to 72.7% based on patch shape alone. BagCert’s certified accuracy drops as low as 43.1% for 24-pixel column and row patches – only 1.5pp better than FPA. Unlike FPA, patch defenses’ certified accuracy guarantees decline further or even evaporate under (1) multiple patches, (2) training data perturbations, and (3) amorphous shapes. While less effective in some settings than dedicated patch defenses that make stronger assumptions and 103 weaker guarantees, FPA is still competitive, providing patch guarantees essentially for free. Takeaway #5 FPA is the first integrated defense to provide significant pointwise robustness guarantees over the union of evasion, backdoor, and poisoning attacks – ℓ0 or otherwise. Consider CIFAR10 (n = 50,000) where FPA feature robustness R ≥ 25 (Table 4) certifies 41.0% of predictions’ robustness against 1.25M arbitrarily perturbed pixels. In contrast, the only other certified defense robust over the union of evasion, backdoor, and poisoning attacks [Web+23] certifies the equivalent of 3 or fewer arbitrarily perturbed CIFAR10 pixels (i.e., a total training and test ℓ2 perturbation distance of ≤3). Moreover, FPA certifies R ≥ 7 for 35.1% of Weather predictions (n > 3M – Table C.27) – a pointwise guaranteed robustness of up to 21M arbitrarily perturbed feature values. 5.6 Conclusions This paper proposes feature partition aggregation – a certified defense against the union of ℓ0 evasion, poisoning, and backdoor attacks. FPA provided stronger and larger robustness guarantees than the state-of-the-art ℓ0 evasion defense, randomized ablation. FPA’s certified feature guarantees are particularly important for vertically partitioned data where a single compromised data source allows an attacker to arbitrarily modify a limited number of features for all instances – training and test. To our knowledge, FPA is the first integrated defense providing non-trivial pointwise robustness guarantees against the union of evasion, poisoning, and backdoor attacks – ℓ0 or otherwise [Web+23]. Future work remains to develop other ℓp certified defenses over this union of attack types. 
104 CHAPTER 6 IDENTIFYING POISONING AND BACKDOOR ATTACK TARGETS WHILE MITIGATING THE ATTACK This chapter contains previously published and unpublished coauthored material [HL21; HL22a; HL22b; BHL23]. Hammoudeh developed the primary method, developed all code, conducted all experiments, and wrote the manuscript. Lowd provided supervision, editorial suggestions, and proposed some supplemental experiments. Zayd Hammoudeh and Daniel Lowd. “Identifying a Training-Set Attack’s Target Using Renormalized Influence Estimation”. In: Proceedings of the 29th ACM SIGSAC Conference on Computer and Communications Security. CCS’22. Los Angeles, CA: Association for Computing Machinery, 2022. url: https://arxiv.org/abs/2201.10055 Zayd Hammoudeh and Daniel Lowd. “Training Data Influence Analysis and Estimation: A Survey”. In: arXiv (2022). arXiv: 2212.04612 [cs.LG] Zayd Hammoudeh and Daniel Lowd. “Simple, Attack-Agnostic Defense Against Targeted Training Set Attacks Using Cosine Similarity”. In: Proceedings of the 3rd ICML Workshop on Uncertainty and Robustness in Deep Learning. UDL’21. 2021 Jonathan Brophy et al. “Adapting and Evaluating Influence-Estimation Methods for Gradient-Boosted Decision Trees”. In: Journal of Machine Learning Research 24 (2023), pp. 1–48. url: http://jmlr.org/papers/ v24/22-0449.html 105 Targeted training set attacks manipulate an ML system’s prediction on one or more target test instances by maliciously modifying the training data [Muñ+17; Sha+18; Agh+20; Sal+20; Hua+20; Gei+21; Wal+21]. Targeted attacks require very few corrupted instances [Wal+21], and their effect on the test error is quite small, making them harder to detect [Che+17]. Existing poisoning and backdoor defenses (Sec. 3.2) change the training procedure to mitigate the impact of an attack but provide little insight into an attacker’s goals, methods, or identity. As Sec. 3.3 explains, knowledge about an attacker is essential to anticipating attacks, designing targeted defenses, and building defenses outside the ML system. This chapter’s defense against training set attacks focuses on the related goals of learning more about an attacker and stopping their attacks. We achieve this through a pair of related tasks: 1. target identification: identifying the target of a training set attack, which may provide insight into the attacker’s goals and how to defend against them; and 2. adversarial-instance identification: identifying the malicious instances that constitute the training set attack. We are not aware of any work that studies poisoning and backdoor attack target identification beyond very simple settings. Our key insight is the synergistic interplay between the two tasks above. If attackers can add only a limited number of training instances (as is often the case) [Che+17; Wal+21], then these malicious instances must be highly influential to change target predictions. Thus, targets are those test instances with an unusual number of highly influential training examples. In contrast, non-targets tend to have many weak 106 influences and few very strong ones. Thus, if we can (1) determine which training instances influence which predictions and (2) detect anomalies in this distribution, then we can jointly solve both tasks. Unfortunately, determining which training instances are responsible for which model behaviors is provably hard for black-box models like neural networks [BR92]. 
Influence estimators [KL17; Yeh+18; Pru+20; FZ20; Che+21; BHL23; HL22b] quantify how much each training example contributes to a particular prediction. However, current influence analysis methods often perform poorly [BPF21; HL22b; ZZ22]. This chapter identifies a weakness common to gradient-based influence estimators [KL17; Yeh+18; Pru+20; Che+21; HL22b]: they induce a low-loss penalty that implicitly ranks confidently-predicted training instances as uninfluential. As a result, existing influence estimators can systematically overlook (groups of) highly influential, low-loss instances. This chapter provides a simple fix – renormalization that removes the low-loss penalty. Our new renormalized influence estimators consistently outperform the original estimators in both adversarial and non-adversarial settings. The most effective of these renormalized estimators, gradient aggregated similarity (GAS), often detects 100% of malicious training instances with no clean-data false positives. Our framework for identifying targets of training set attacks, FIT, compares the distribution of influence values across test instances checking for anomalies. More concretely, FIT marks as potential targets those test instances with an unusual number of highly influential instances as explained above. Next, FIT mitigates the attack’s effect by removing exceptionally influential training instances associated with the target(s). Since mitigation considers only targets, training instance outliers that are “helpful” to non-targets are unaffected. This target-driven mitigation has a positive or 107 neutral effect on clean data yet is highly effective on adversarial data where finding even a single target suffices to disable the attack on almost all other targets. By relying influence estimation and not the properties of a particular attack or domain, GAS and FIT are attack agnostic; GAS and FIT can apply equally well to different attack types, including poisoning and backdoor attacks. Our approach works across data domains from CNN image classifiers to speech recognition to even text transformers. In addition to learning more about the attack and attacker, target identification enables targeted mitigation. Defenses against poisoning and backdoor attacks implement countermeasures that affect predictions on all instances – not just the very few targets. Certified defenses can substantially degrade performance, in some cases causing up to 10× more errors on clean data [Fow+21; HL23c; LF21]. Our chapter’s contributions are enumerated below. 1. We identify a weakness common to all gradient-based influence estimators and provide a simple renormalization correction that addresses this weakness. 2. Inspired by influence estimation, we propose GAS – a renormalized influence estimator that is highly adept at identifying influential groups of training instances. 3. Leveraging techniques from anomaly detection and robust statistics, we extend GAS into a general framework for identifying targets of training set attacks, FIT. 4. We use GAS in a target-driven data sanitizer that mitigates attacks while removing very few clean instances. 108 5. We demonstrate the effectiveness and attack agnosticism of GAS and FIT on a diverse set of attacks and data modalities, including speech recognition, vision, and text – even against an adaptive attacker that attempts to evade our method. Below, we first specify this chapter’s threat model and defender objective. Section 6.2 reviews existing gradient-based influence estimators [HL22b]. 
Then in Section 6.3, we show how existing gradient-based influence estimators are inadequate for identifying (groups of) highly influential training instances. We also introduce our renormalization fix for influence estimation and describe four renormalized influence estimators. Section 6.4 builds on these improved influence estimates to define a framework for identifying the targets of an attack and mitigating the attack’s effect. Section 6.5 demonstrates the effectiveness of our methods. 6.1 Preliminaries Notation Let ẑte := (xte, ŷte) be any a priori unknown test instance, where ŷte is the final model’s predicted label for xte. Observe that ŷte may not be xte’s true label. Notation ̂ (e.g., ẑ, ĝ) denotes that the final predicted label ŷ = f(x; θT ) is used in place of x’s true label y. Threat Model. The attacker crafts an adversarial set of perturbed instances, Dadv ⊂ D. Denote the clean training set Dcl := D \ Dadv. We only consider successful attacks, as defined below. Attacker Objective and Knowledge Let Xtarg := {xj}mj=1 be a set of target feature vectors with shared true label ytarg ∈ Y . The attacker crafts Dadv to induce the model to mislabel all of Xtarg as adversarial label yadv. Ẑtarg := {(xj, yadv) : xj ∈ Xtarg} denotes the target set and ẑtarg := (xtarg, yadv) an arbitrary target instance. To avoid detection, the attacked model’s clean-data performance should be (essentially) 109 unchanged. Data poisoning attacks only perturb adversarial set Dadv. Target feature vectors are unperturbed/benign [BNL12; Muñ+17; Wal+21; Jag+21]. Clean-label poisoning leaves labels unchanged when crafting Dadv from seed instances [Zhu+19; YHL23a; YHL23b]. Backdoor attacks perturb the features of both Dadv and Xtarg – often with the same adversarial trigger (e.g., change a specific pixel to maximum value). Generally, these triggers can be inserted into any test example targeted by the adversary, making most backdoor attacks multi-target (|Ẑtarg| > 1) [TLM18; Gu+19; Lin+20; Web+23; YHL23a; YHL23b]. Dadv’s labels may also be changed. To ensure the strongest adversary, the attacker knows any pre-trained initial parameters. Where applicable, the attacker also knows the training hyperparameters and clean dataset Dcl. Like previous work [Zhu+19; Web+23; Wal+21], the attacker does not know the training procedure’s random seed, meaning the attack must be robust to randomness in batch ordering or parameter initialization. Defender Objective and Knowledge Let Ẑte denote the set of test instances the defender is concerned enough about to analyze as potential targets.1 Our goals are to (1) identify any attack targets in Ẑte, and (2) mitigate the attack by removing the adversarial instances Dadv associated with those target(s). No assumptions are made about the modality/domain (e.g., text, vision) or adversarial perturbation. We do not assume access to clean validation data. 6.2 Review of Influence Analysis and Estimation As Sec. 3.2.1 discusses, most existing defenses against poisoning and backdoor attacks assume highly restricted threat models. In contrast, this section seeks a method that is attack agnostic, to make it harder for an adversary to adapt against. We mitigate 1Generally, there are far fewer potential targets (Ẑte) than possible test examples. 
training set attacks by building upon influence estimation to identify the target(s) and adversarial set. We achieve agnosticism by building on existing methods that are general and make no assumptions about the underlying attack. Specifically, we leverage influence analysis, which we formalize below. For a more complete discussion of influence analysis and estimation, see our detailed survey [HL22b].

Figure 9. Renormalized Influence: CIFAR10 & MNIST joint, binary classification for frog vs. airplane & MNIST 0 with |Dcl| = 10,000 & |Dadv| = 150. Mean AUPRC identifying Dadv; the original figure also shows the test example ẑte and each method's top-5 highest-ranked training instances.

  Method                                    AUPRC
  Representer Point                         0.030 ± 0.009
  Influence Functions                       0.029 ± 0.018
  TracIn                                    0.140 ± 0.098
  TracInCP                                  0.309 ± 0.260
  Renormalized Representer Point (ours)     0.778 ± 0.144
  Renormalized Influence Functions (ours)   0.215 ± 0.191
  Renormalized TracIn (ours)                0.617 ± 0.115
  GAS (ours; renormalized TracInCP)         0.977 ± 0.001

Existing influence estimators (upper half) consistently failed to rank Dadv's MNIST training instances as highly influential on MNIST test instances. In contrast, all of our renormalized influence estimators (Section 6.3.3) outperformed their unnormalized versions – with AUPRC improving up to 25×. Results averaged across 30 trials.

Intuitively, in every successful attack, the inserted training instances change a model's prediction for specific input(s). If the attacker can only add a limited number of instances (e.g., 1% of D), these inserted instances must be highly influential to achieve the attacker's objective.

Influence analysis's goal is to determine which training instances are most responsible for a model's prediction for a particular input [HL22b]. Influence is often viewed as a counterfactual: which instance (or group of instances) induces the biggest change when removed from the training data? While there are multiple definitions of influence, as detailed below, influence analysis methods can be broadly viewed as quantifying the (relative) responsibility of each training instance zi ∈ D on some test prediction f(xte; θT).

Static influence estimators consider only the final model parameters θT. For example, Koh and Liang's [KL17] seminal work defines influence, IIF(zi, ẑte), as the change in risk L(ẑte; θT) if zi ∉ D, i.e., the leave-one-out (LOO) change in test loss [CW82]. By assuming strict convexity and stationarity, Koh and Liang's influence functions estimator approximates the LOO influence as

  IIF(zi, ẑte) ≈ (1/n) ∇θL(ẑte; θT)⊺ Hθ⁻¹ ∇θL(zi; θT),    (6.1)

with Hθ⁻¹ the inverse of the risk Hessian Hθ := (1/n) Σzi∈D ∇θ²L(zi; θT).

Yeh et al.'s [Yeh+18] representer point static influence estimator exclusively considers the model's final, linear classification layer. All other model parameters are treated as a fixed feature extractor. Given final parameters θT, let fi denote xi's penultimate feature representation (i.e., the input to the linear classification layer). Then the representer point influence of zi ∈ D on ẑte is

  IRP(zi, ẑte) := −(1/(2λn)) · ∂L(zi; θT)/∂ayi · ⟨fi, fte⟩,    (6.2)

where λ > 0 is the weight decay (L2) regularizer and ⟨·, ·⟩ denotes the vector dot product. Recall that a is the output of the model's linear classification layer, specifically here a = f(xi; θT). Scalar ∂L(zi; θT)/∂ayi is then the partial derivative of risk L w.r.t. a's yi-th dimension.

Algorithm 1 TracIn, TracInCP, & GAS training phase
Input: Training set D, iteration subset 𝒯, iteration count T, learning rates η1, . . . , ηT, and initial parameters θ0
Output: Training parameters P
1: P ← ∅
2: for t ← 1 to T do
3:   if t ∈ 𝒯 then
4:     P ← P ∪ {(ηt, θt−1)}
5:   Bt ∼ D
6:   θt ← Update(ηt, θt−1, Bt)
7: return P

Dynamic influence estimators measure influence based on how losses change during training. More formally, influence is quantified according to how batches B1, . . . , BT affect model parameters θ0, . . . , θT and, by consequence, risks L(·; θ0), . . . , L(·; θT). For example, Pruthi et al.'s [Pru+20] TracIn estimates influence by "tracing" gradient descent – aggregating changes in ẑte's test loss each time training instance zi's gradient updates parameters θt. For stochastic gradient descent (batch size b = 1), zi's TracIn influence on ẑte is

  ITracIn(zi, ẑte) := Σt=1…T 1[zi ∈ Bt] ( L(ẑte; θt−1) − L(ẑte; θt) ),    (6.3)

where 1[u] is the indicator function s.t. 1[u] = 1 if predicate u is true and 0 otherwise. Pruthi et al. approximate Eq. (6.3) as

  ITracIn(zi, ẑte) ≈ Σt : zi∈Bt (ηt/b) ⟨∇θL(zi; θt−1), ∇θL(ẑte; θt−1)⟩,    (6.4)

where ηt is iteration t's learning rate. Alg. 1 details the minimal changes made to model training to support TracIn, where 𝒯 ⊂ {1, . . . , T} is a preselected training iteration subset and P := {(ηt, θt−1) : t ∈ 𝒯} contains the serialized training parameters. Alg. 2 outlines TracIn's influence estimation procedure for a priori unknown test instance ẑte. Influence vector v (|v| = n) contains the TracIn influence estimates for each zi ∈ D.

Algorithm 2 TracIn influence estimation
Input: Training parameters P, iteration subset 𝒯, iteration count T, batches B1, . . . , BT, batch size b, and test example ẑte
Output: Influence vector v
1: v ← 0⃗  ▷ Initialize
2: for t ← 1 to T do
3:   if t ∈ 𝒯 then
4:     (η, θ) ← P[t]  ▷ Equiv. to (ηt, θt−1)
5:     ĝte ← ∇θL(ẑte; θ)
6:     for each zi ∈ Bt do  ▷ Batch examples
7:       gi ← ∇θL(zi; θ)
8:       vi ← vi + (η/b) ⟨gi, ĝte⟩  ▷ Unnormalized
9: return v

In practice, |𝒯| ≪ T, and 𝒯 is evenly spaced in {1, . . . , T}, meaning TracIn effectively treats multiple batches like a single model update. Pruthi et al. also propose TracIn Checkpoint (TracInCP) – a more heuristic version of TracIn that considers all training examples at each checkpoint in 𝒯 – not just those instances in the intervening batches (see Alg. 3).2 Formally,

  ITracInCP(zi, ẑte) := Σt∈𝒯 (ηt/b) ⟨∇θL(zi; θt−1), ∇θL(ẑte; θt−1)⟩.    (6.5)

TracInCP is more computationally expensive than TracIn – with the slowdown linear w.r.t. the number of checkpoints per epoch. A major advantage of TracIn and TracInCP over other estimators (e.g., influence functions) is that their only hyperparameter is iteration set 𝒯, which we tuned based only on compute availability.

2 Algorithm 3 combines two different methods, TracInCP as well as GAS – our renormalized version of TracInCP discussed in Sec. 6.3.3.
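To make the checkpoint-based dynamic estimators concrete, the following is a minimal PyTorch sketch of TracInCP (Eq. (6.5)). Everything here is illustrative rather than the dissertation's released code: the classifier `model`, the differentiable `loss_fn`, the `checkpoints` list of (learning rate, parameter state) pairs corresponding to 𝒯, and the in-memory `train_set` of (x, y) tensor pairs are all assumed objects.

import torch
from torch.nn.utils import parameters_to_vector

def flat_grad(model, loss_fn, x, y):
    # Flattened gradient of the loss w.r.t. all model parameters for one example.
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return parameters_to_vector(grads)

def tracin_cp(model, loss_fn, checkpoints, train_set, x_te, y_hat_te, batch_size):
    # TracInCP (Eq. 6.5): learning-rate-weighted gradient dot products summed
    # over every training example at every saved checkpoint.
    scores = torch.zeros(len(train_set))
    for lr_t, state in checkpoints:
        model.load_state_dict(state)
        g_te = flat_grad(model, loss_fn, x_te, y_hat_te)   # test gradient at this checkpoint
        for i, (x_i, y_i) in enumerate(train_set):         # all examples, not just the batch
            g_i = flat_grad(model, loss_fn, x_i, y_i)
            scores[i] += (lr_t / batch_size) * torch.dot(g_i, g_te)
    return scores                                          # influence vector v

As in Algorithm 2, restricting the inner loop to only the checkpoint's own batch would recover TracIn's approximation (Eq. (6.4)) over the saved iterations.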
6.3 Why Influence Estimation Often Fails and How to Fix It

Before addressing target identification, we first consider the related task of adversarial-instance identification. In the simplest case, if the attack's target is known, then the malicious instances should be among the most influential instances for that target instance. In other words, adversarial-instance identification reduces to influence estimation. However, Sec. 6.2's influence estimators share a common weakness that makes them poorly suited for this task: they all consistently rank confidently-predicted training instances as uninfluential. We illustrate this behavior below using a toy experiment. We then explain this weakness's cause and propose a simple fix that addresses this limitation on adversarial and non-adversarial data, for all preceding estimators. Our fix is needed to successfully identify adversarial set Dadv and, as detailed in Sec. 6.4, attack targets.

6.3.1 A Simple Experiment. Consider binary classification where clean set Dcl is all frog and airplane training images in CIFAR10 (|Dcl| = 10,000). To simulate a naive backdoor attack, adversarial set Dadv is 150 randomly selected MNIST 0 images labeled as airplane. Clean data's overall influence can be estimated indirectly by training only on Dcl and observing the target set's misclassification rate [FZ20]. This experiment used class pair frog and airplane because, amongst the 10 CIFAR10 class pairs, frog vs. airplane's average MNIST test misclassification rate was closest to random (47.5% vs. 50% ideal). In contrast, when training on D := Dadv ⊔ Dcl, MNIST 0 test instances were always classified as airplane, meaning Dadv is overwhelmingly influential on MNIST predictions. MNIST is used instead of other CIFAR10 classes because the large (and simple [Sha+20]) difference between the data distributions leads to a strong signal that can be consistently learned from relatively few examples – much like backdoor or poisoning attacks [Yu+21].

We use this simple setup to evaluate different influence estimation methods. We trained 30 randomly-initialized, state-of-the-art ResNet9 networks, and on each network we performed influence estimation for a random MNIST 0 test instance to determine how well each estimator identified adversarial set Dadv provided a known target.3 Given the large imbalance between the amount of clean and "adversarial" data, i.e., |Dadv| ≪ |Dcl|, performance is measured using area under the precision-recall curve (AUPRC), which quantifies how well Dadv's influence ranks relative to Dcl. Precision-recall curves are preferred for highly-skewed classification tasks since they provide more insight into the false-positive rate [DG06].

3 See supplemental Section D.3 for the complete experimental setup details.

Figure 9's upper half shows how well each influence estimator in Section 6.2 identifies Dadv, both quantitatively and qualitatively. Dynamic estimators significantly outperformed their static counterparts, with TracInCP the overall top performer. However, no influence estimator consistently ranked MNIST instances (i.e., Dadv) in the top-5 most influential, with influence functions marking instances from the other class (frog) as most influential. Influence estimation's poor performance here is particularly noteworthy as the task was designed to be unrealistically easy.

6.3.2 Why Influence Estimation Performs Poorly. Intra-training dynamics illuminate the primary cause of influence estimation's poor performance in our toy experiment. Fig. 10 visualizes the median training loss of Dadv and Dcl at each training checkpoint. Also shown is the gradient norm ratio, which compares the median gradient magnitude of the adversarial and clean sets at each iteration, or formally

  GNRt := med{ ∥∇θL(z; θt)∥ : z ∈ Dadv } / med{ ∥∇θL(z; θt)∥ : z ∈ Dcl }.    (6.6)

Figure 10. CIFAR10 and MNIST Intra-Training Loss Tracking: Dadv's (MNIST) and Dcl's (CIFAR10) median cross-entropy losses at each training checkpoint for binary classification – frog vs. airplane & MNIST 0. The shaded regions correspond to each training set loss's interquartile range. MNIST's training losses are generally several orders of magnitude smaller than CIFAR10's losses. The gradient norm ratio shows the tight coupling of loss and training gradient magnitude.

The gradient norm ratio closely tracks both training sets' loss values. Both during and at the end of training, Dadv's median loss is significantly smaller than that of many instances in Dcl – often by several orders of magnitude.

The Low-Loss Penalty: Observe that all influence methods in Sec. 6.2 scale their influence estimates by ∂L(a, y)/∂a, either directly (representer point (6.2)) or indirectly via the chain rule (influence functions (6.1), TracIn (6.4), and TracInCP (6.5)) as

  ∇θL(z; θ) := ∂L(f(x), y)/∂θ = ( ∂L(a, y)/∂a ) · ( ∂a/∂θ ).    (6.7)

Therefore, gradient-based influence estimators implicitly penalize all training instances with low training loss, including Dadv (MNIST 0) in our toy experiment above. Theorem 6.1 summarizes this relationship when there is a single output activation (|a| = 1), e.g., binary classification and univariate regression. In short, when Theorem 6.1's conditions are met, loss induces a perfect ordering on the corresponding gradient norm.

Theorem 6.1. Let loss function L̃ : R → R≥0 be twice-differentiable and strictly convex as well as either even4 or monotonically decreasing. Then, it holds that

  L̃(a) < L̃(a′) ⟹ ∥∇aL̃(a)∥2 < ∥∇aL̃(a′)∥2.    (6.8)

Loss functions satisfying Theorem 6.1's conditions include binary cross-entropy (i.e., logistic) and quadratic losses. Theorem 6.1 generally applies to multiclass losses, but there are cases where the ordering is not perfect. Although Theorem 6.1 primarily relates to training instance gradients and losses, the theorem applies to test examples as well since dynamic estimators also apply a low-loss penalty to any iteration where test instance ẑte has low loss.

The preceding should not be interpreted to imply that large gradient magnitudes are unimportant. Quite the opposite: large gradients have large influences on the model. However, the approximations necessary to make influence estimation tractable go too far by often focusing almost exclusively on training loss – and by extension gradient magnitude – leading these estimators to systematically overlook training instances with smaller gradients. This overemphasis on instances with large losses and gradient magnitudes can also be viewed as a bias towards instances that are globally influential (affecting many examples' predictions) over those that are locally influential (mainly affecting a small number of targets) [BBD20].

4 "Even" denotes that the function satisfies L̃(a) = L̃(−a) for all a.
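As a small numerical check of Theorem 6.1 and the low-loss penalty it formalizes, the snippet below (an illustration only, not part of the dissertation's experiments) evaluates the binary logistic loss L̃(a) = log(1 + e^(−a)) and its gradient magnitude |dL̃/da| = 1/(1 + e^(a)) at a few arbitrary margins; smaller loss always coincides with a smaller gradient norm, which is exactly the weight Eq. (6.7) injects into every gradient-based estimator.

import numpy as np

def logistic_loss(a):
    # Binary cross-entropy with the true label fixed to the positive class.
    return np.log1p(np.exp(-a))

def grad_magnitude(a):
    # |dL/da| = 1 / (1 + exp(a)) for the logistic loss above.
    return 1.0 / (1.0 + np.exp(a))

for a in [0.1, 1.0, 4.0, 8.0]:   # increasingly confident (lower-loss) margins
    print(f"margin={a:4.1f}  loss={logistic_loss(a):.4f}  |dL/da|={grad_magnitude(a):.5f}")
# Both columns shrink together: confidently fit (low-loss) instances receive
# systematically smaller gradient-based influence estimates.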
Static Influence and the Low-Loss Penalty: Fig. 9's static estimators (representer point and influence functions) significantly underperformed dynamic estimators (TracIn and TracInCP) by up to an order of magnitude. Static estimators only consider the final model parameters θT, meaning they may only see the low-loss case. In contrast, dynamic estimators consider all of training, in particular iterations where Dadv's loss exceeds that of Dcl. This allows dynamic estimators to outperform static methods, albeit still poorly.

Training Randomness and the Low-Loss Penalty: TracInCP significantly outperformed TracIn in Fig. 9 despite TracIn being more theoretically sound. As intuition for why, imagine the training set contains two identical copies of some instance. In expectation, these duplicates have equivalent influence on any test instance. However, TracIn assigns identical training examples different influence estimates based on their batch assignments; this difference can potentially be very large depending on training dynamics. Fig. 10 exhibits this behavior, where training loss fluctuates considerably intra-epoch. For example, Dadv's median loss varies by seven orders of magnitude across the third epoch. TracIn's low-loss penalty attributes much more influence to Dadv instances early in that epoch compared to those later, despite all MNIST instances having similar influence. By considering all examples at each checkpoint, TracInCP removes batch randomization's direct effect on influence estimation,5 meaning TracInCP simulates influence expectation without needing to train and analyze multiple models.

5 Batch randomization still indirectly affects TracInCP and GAS (Sec. 6.3.3) through the model parameters. This effect could be mitigated by training multiple models and averaging the (renormalized) influence, but that is beyond the scope of this work.

6.3.3 Renormalizing Influence Estimation. Our CIFAR10 and MNIST joint classification experiment above demonstrates that a training example having low loss does not imply that it and related instances are uninfluential. Most importantly in the context of adversarial attacks, highly-related groups of (adversarial) training instances may collectively cause those group members to have very low training losses – so-called group effects. Generally, targeted attacks succeed by leveraging the group effect of adversarial set Dadv on the target(s). We address these group effects via renormalization, which is defined below.

Def. 6.2. For influence estimator I, the renormalized influence Ĩ replaces each gradient g in I by its corresponding unit vector g/∥g∥.

We refer to this computation as renormalization since rescaling gradients removes the low-loss penalty. Renormalization places all training instances on equal footing and ensures that gradient and/or feature similarity is prioritized – not loss. Renormalization is related to the relative influence (RelatIF) method introduced by Barshan et al. [BBD20], since both methods use a function of the gradient to downweight training instances with high losses. However, RelatIF only applies to influence functions and requires computing expensive Hessian-vector products, while renormalization is more efficient and can be applied to many influence estimators, as we show below. See the supplement of the original paper [HL22a] for additional discussion of alternative renormalization schemes.

Renormalized versions of Section 6.2's static influence estimators are given below. Renormalized influence functions (Eq. (6.9)) does not include target gradient norm ∥ĝte∥ since it is a constant factor. For simplicity, Eq. (6.10)'s renormalized representer point uses the signum function sgn(·) since, for any scalar u ≠ 0, sgn(u) = u/|u|, i.e., signum is equivalent to normalizing by magnitude.

  ĨIF(zi, ẑte) := (1/n) ∇θL(ẑte; θT)⊺ Hθ⁻¹ ( ∇θL(zi; θT) / ∥∇θL(zi; θT)∥ )    (6.9)

  ĨRP(zi, ẑte) := −(1/(2λn)) sgn( ∂L(zi; θT)/∂ayi ) ⟨fi, fte⟩    (6.10)

Renormalized versions of Section 6.2's dynamic influence estimators appear below. Going forward, we refer to renormalized TracInCP (Eq. (6.12)) as gradient aggregated similarity, GAS, since it is essentially the weighted gradient cosine similarity averaged across all of training.
GAS's procedure is detailed in Algorithm 3.6

  ĨTracIn(zi, ẑte) := Σt : zi∈Bt (ηt/b) ⟨∇θL(zi; θt−1), ∇θL(ẑte; θt−1)⟩ / ( ∥∇θL(zi; θt−1)∥ ∥∇θL(ẑte; θt−1)∥ )    (6.11)

  ĨTracInCP(zi, ẑte) := Σt∈𝒯 (ηt/b) ⟨∇θL(zi; θt−1), ∇θL(ẑte; θt−1)⟩ / ( ∥∇θL(zi; θt−1)∥ ∥∇θL(ẑte; θt−1)∥ ) =: GAS(zi, ẑte)    (6.12)

Unlike static estimators, rescaling dynamic influence by target gradient norm ∥ĝte∥ is quite important, as mentioned earlier. Intuitively, ∥ĝte∥ tends to be largest in two cases: (1) early in training due to initial parameter randomness and (2) when iteration t's predicted label conflicts with final label ŷte. Both cases are consistent with the features most responsible for predicting ŷte not yet dominating. Therefore, rescaling dynamic influence by ∥ĝte∥ implicitly upweights iterations where ŷte is predicted confidently. It also inhibits any single checkpoint from dominating the estimate.

6 As shown in Algorithm 3, TracInCP's procedure (Line 7) is identical to GAS (Line 9) other than influence renormalization.

Algorithm 3 GAS vs. TracInCP
Input: Training parameters P, training set D, batch size b, & test example ẑte
Output: (Renormalized) influence vector v
1: v ← 0⃗  ▷ Initialize
2: for each (ηt, θt−1) ∈ P do
3:   ĝte ← ∇θL(ẑte; θt−1)
4:   for each zi ∈ D do  ▷ All examples
5:     gi ← ∇θL(zi; θt−1)
6:     if calculating TracInCP then
7:       vi ← vi + (ηt/b) ⟨gi, ĝte⟩  ▷ Unnormalized (Sec. 6.2)
8:     else if calculating GAS then
9:       vi ← vi + (ηt/b) ⟨gi/∥gi∥, ĝte/∥ĝte∥⟩  ▷ Renormalized (Sec. 6.3.3)
10: return v

Applying Renormalization to CIFAR10 and MNIST Joint Classification: Figure 9's lower half demonstrates renormalization's significant performance advantage over standard influence estimation – with the improvement in AUPRC as large as 25×. In particular, our renormalized estimators' top-5 highest-ranked instances were all consistently from MNIST, unlike any of the standard influence estimators. Overall, GAS (renormalized TracInCP) was the top performer – even outperforming our other renormalized estimators by a wide margin.

6.3.4 Renormalization and More Advanced Attacks. Section 6.3.2 illustrates why influence estimation performs poorly under a naive backdoor-style attack where the adversary does not optimize the adversarial set. Those concepts also generalize to more sophisticated attacks. For example, recent work shows that deep networks often predict the adversarial set with especially high confidence (i.e., low loss) due to shortcut learning – even on advanced attacks [Yu+21; Gei+20]. Those findings reinforce the need for renormalization. This can be viewed through the lens of simplicity bias, where neural networks tend to confidently learn simple features (shortcuts) – regardless of whether those features actually generalize [Sha+20].

Dynamic estimators – both TracIn and GAS – outperform static ones for Sec. 6.3.1's naive attack. The same can be expected for sophisticated attacks, including ones that track adversarial-set gradients through simulated training [Hua+20; Wal+21]. For those attacks, adversaries can craft Dadv to exhibit particular gradient signatures at the end of training to avoid static detection. Moreover, models learn adversarial data faster than clean data, meaning training loss often drops abruptly and significantly early in training [Li+21]. For an attack to succeed, adversarial instances must align with the target at some point during training, meaning dynamic methods can detect them.

Lastly, our threat model specifies that attackers never know the random batch sequence nor any randomly initialized parameters.
Therefore, attackers can only craft Dadv to be influential in expectation over that randomness. Influence is stochastic, varying significantly across random seeds. However, estimating the true expected influence is computationally expensive. GAS and TracInCP, which simulate expectation, better align with how the adversary actually crafts the adversarial set, resulting in better Dadv identification. Below we detail how renormalization can be specialized further for better adversarial-set identification.

Extending Renormalization Layerwise: In practice, gradient magnitudes are often unevenly distributed across a neural network's layers. For example, Figure 11 tracks an attack target's average intra-training gradient magnitude for two different backdoor adversarial triggers on CIFAR10 binary classification (ytarg = airplane and yadv = bird).7 Specifically, target gradient norm ∥∇θL(ẑtarg; θt)∥ is decomposed into just the contributions of the network's first convolutional layer (Conv1) and the final linear layer. Despite being only 0.04% of the model parameters, these two layers combined constitute >50% of the gradient norm. Therefore, the first and last layers' parameters are, on average, weighted >2,000× more than other layers' parameters. With simple renormalization, important parameters in those other layers may go undetected.

7 See supplemental Section D.3 for the complete experimental setup. The class pair and adversarial triggers were proposed by Weber et al. [Web+23].

Figure 11. Layerwise Decomposition of an Attack Target's Intra-Training Gradient Magnitude: One-pixel and blend backdoor adversarial triggers (dashed and solid lines, respectively) trained separately on CIFAR10 binary classification (ytarg = airplane and yadv = bird) using ResNet9. The plot shows each layer's percentage of the target's gradient magnitude ∥ĝtarg∥ versus the training epoch. The network's first convolutional (Conv1) and final linear layers are a small fraction of the parameters (0.03% and 0.01%, resp.) but constitute most of the target's gradient magnitude, with the dominant layer attack dependent. Results are averaged over 20 trials.

As an alternative to simply renormalizing by ∥∇θL(z; θt)∥, partition gradient vector g by layer into L disjoint subvectors (where L is model f's layer count) and then renormalize each subvector separately. This layerwise renormalization can be applied to any estimator that uses training gradient gi or test gradient ĝte, including influence functions, TracIn, and TracInCP. Layerwise renormalization still corrects for the low-loss penalty and does not change the asymptotic complexity. To switch GAS to layerwise, the only modification to Algorithm 3 is on Line 9, where each dimension is divided by its corresponding layer's norm instead of the full gradient norm.

Notation: "-L" denotes layerwise renormalization, e.g., layerwise GAS is GAS-L. The suffix "(-L)", e.g., GAS(-L), signifies that a statement applies irrespective of whether the renormalization is layerwise.
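The sketch below shows how the renormalization of Eq. (6.12) and its layerwise variant could be dropped into the earlier TracInCP sketch. It reuses the same hypothetical `checkpoints` and data objects, and it normalizes each parameter tensor's gradient block as a stand-in for true per-layer normalization (a layer's weight and bias would ideally share one norm), so treat it as an approximation of GAS-L rather than an exact transcription of Algorithm 3.

import torch
from torch.nn.utils import parameters_to_vector

def per_tensor_grads(model, loss_fn, x, y):
    # One gradient tensor per parameter tensor (a rough per-layer decomposition).
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    return torch.autograd.grad(loss, list(model.parameters()))

def renorm(grads, layerwise, eps=1e-12):
    # Unit-normalize the full gradient (GAS) or each block separately (GAS-L proxy).
    if layerwise:
        return parameters_to_vector([g / (g.norm() + eps) for g in grads])
    flat = parameters_to_vector(grads)
    return flat / (flat.norm() + eps)

def gas(model, loss_fn, checkpoints, train_set, x_te, y_hat_te, batch_size, layerwise=False):
    scores = torch.zeros(len(train_set))
    for lr_t, state in checkpoints:
        model.load_state_dict(state)
        g_te = renorm(per_tensor_grads(model, loss_fn, x_te, y_hat_te), layerwise)
        for i, (x_i, y_i) in enumerate(train_set):
            g_i = renorm(per_tensor_grads(model, loss_fn, x_i, y_i), layerwise)
            scores[i] += (lr_t / batch_size) * torch.dot(g_i, g_te)  # weighted similarity
    return scores   # renormalized influence vector v

With layerwise=False, the dot product of two unit vectors is exactly the cosine similarity in Eq. (6.12); with layerwise=True it becomes a sum of per-block cosine similarities, mirroring the Line 9 modification described above.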
6.3.5 Renormalization and Non-Adversarial Data. Renormalization not only improves performance identifying an inserted adversarial set; it also improves performance in non-adversarial settings. Sec. 6.2 defines influence w.r.t. a single training example. Just as one instance may be more influential on a prediction than another, a group of training instances may be more influential than a different group. Renormalization improves identification of influential groups of examples, even on non-adversarial data.

To empirically demonstrate this, consider CIFAR10 binary classification again. In each trial, ResNet9 was pre-trained on eight (= 10 − 2) held-out CIFAR10 classes. From the other two classes, test example zfilt was selected u.a.r. from those test instances with a moderate misclassification rate (10–20%) across multiple retrainings (i.e., fine-tunings) of the pre-trained network.8 (Renormalized) influence was then calculated for zfilt, with each estimator yielding a training set ranking. Each estimator's top p% ranked instances were removed from the training set, and 20 models were trained from the pre-trained parameters using these reduced training sets. Performance is measured using zfilt's misclassification rate across those 20 models, where a larger error rate entails a better overall ranking.

8 Using examples with a moderate misclassification rate ensures that dataset filtering's effects are measurable even with a small fraction of the training data removed.

Figure 12. Effect of Removing Influential, Non-Adversarial Training Data: Test example zfilt's misclassification rate (larger is better) versus the percentage of the training set removed, when filtering the training set using influence rankings based on influence functions (top) and TracIn (bottom). Renormalization (Rn.) always improved mean performance across all training set filtering percentages. Results are averaged across five CIFAR10 class pairs with 30 trials per class pair and 20 models trained per method per trial. Results are separated by the reference influence estimator.

Figure 12 compares influence estimation's filtering performance, with and without renormalization, against a random baseline, averaged across five CIFAR10 class pairs: the two pairs specified by Weber et al. [Web+23] and three additional random pairs. Influence, irrespective of renormalization, significantly outperformed random removal, meaning all of these estimators found influential subsets, albeit of varying quality.9 In all cases, renormalized influence had better or equivalent performance to the original estimator across all filtering fractions. This demonstrates that renormalization generalizes across estimators even beyond adversarial settings.

9 Representer point (Eq. (6.2)) is excluded as it underperformed random filtering.

Overall, layerwise renormalization was the top performer across all setups except for large filtering percentages, where GAS surpassed it slightly. Renormalized(-L) influence functions and GAS(-L) performed similarly when filtering a small fraction (e.g., ≤5%) of the training data. However, the performance of renormalized influence functions plateaued for larger filtering fractions (≥10%), while GAS(-L)'s performance continued to improve. In addition, renormalization's performance advantage over vanilla influence functions narrowed at larger filtering fractions. In contrast, GAS(-L)'s advantage over TracIn and TracInCP remained consistent. Recall that dynamic methods (e.g., GAS(-L) and TracIn) use significantly more gradient information than static methods (e.g., influence functions). This experiment again demonstrates that loss-based renormalization's benefits increase as more gradient information is used.
6.4 Identifying Attack Targets

Recall that non-targets have primarily weak influences and few very strong ones. Target instances are anomalous in that they have an unusual number of highly-influential training instances (Figure 13). This idea is the core of our framework for identifying targets of training set attacks, FIT. Alg. 4 formalizes FIT as an end-to-end procedure to identify any targets in test example analysis set Ẑte.10 Overall, FIT has three sub-steps, described chronologically:

1. Inf: Calculates the (renormalized) influence vector v for each test instance in analysis set Ẑte.

2. AnomScore: Targets have an unusual number of highly-influential instances. Leveraging ideas from anomaly detection, this step analyzes each test instance's influence vector v and ranks those instances based on how anomalous their influence values are.

3. Mitigate: Target-driven mitigation sanitizes model parameters θT and training set D to remove the attack's influence on the most likely target ẑtarg (e.g., the most anomalous misclassified instance).

FIT is referred to as a "framework" since these subroutines are general and their underlying algorithms can change as new versions are developed. The next three subsections describe our implementation of each of these methods. For reference, suppl. Alg. 5 specializes Alg. 4 to more closely align with the implementation details below. See the original paper [HL22a] for a complete discussion of FIT's end-to-end computational complexity.

10 For simplicity, Alg. 4 considers a single identified target. If there are multiple identified targets, Mitigate is invoked on each target serially with parameters θ̃T and D̃tr.

Algorithm 4 FIT target identification & mitigation
Input: Training set D, test example set Ẑte, and final params. θT
Output: Sanitized model parameters θ̃T & training set D̃tr
1: V ← {Inf(ẑ; D) : ẑ ∈ Ẑte}  ▷ (Renorm.) influence (Alg. 3)
2: Σ ← {AnomScore(v, V) : v ∈ V}  ▷ Anomaly score (Sec. 6.4.2)
3: Rank Ẑte by anomaly scores Σ
4: ẑtarg ← Most anomalous test example in Ẑte
5: θ̃T, D̃tr ← Mitigate(ẑtarg, θT, D)  ▷ Sec. 6.4.3
6: return θ̃T, D̃tr

6.4.1 Measuring (Renormalized) Influence. Algorithm 4 is agnostic of the specific (renormalized) influence estimator used to calculate v, provided that method is sufficiently adept at identifying adversarial set Dadv. We use GAS(-L) for the reasons explained in Section 6.3 as well as its simplicity, computational efficiency, and strong, consistent empirical performance.

Figure 13. GAS renormalized influence (v) density distributions for two training set attacks: CIFAR10 vision poisoning [Zhu+19] (ytarg = dog and yadv = bird) and speech-recognition backdoor [Liu+18] (ytarg = 4 and yadv = 5). For each attack, the figure plots Dadv's and Dcl's densities of the GAS renormalized influence for a target (13a, 13d), the GAS renormalized influence for a non-target (13b, 13e), and the training set gradient norms (13c, 13f). The theoretical normal is w.r.t. the complete training set D := Dadv ∪ Dcl. Observe that target examples (Figs. 13a and 13d) have significant Dadv mass well to the right of Dcl's mass. This upper-mass phenomenon is absent in non-targets (Figs. 13b and 13e). Training example gradient norms (Figs. 13c and 13f) are poorly correlated with whether the training example is adversarial. For example, speech recognition has Dcl mass well to the right of even the right-most Dadv mass, necessitating renormalization. See Sections 6.5.1 and D.3 for more details on these attacks.
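As a minimal orchestration of Algorithm 4's three sub-steps, the sketch below ranks an analysis set by upper-tail heaviness. The influence routine is passed in as a callable (e.g., a wrapper around the GAS sketch above), and the simple median/MAD score is only a placeholder for the median/Q anomaly score defined in Section 6.4.2; all names are illustrative assumptions rather than the released implementation.

import numpy as np

def placeholder_anomaly_scores(v):
    # Stand-in robust score; Sec. 6.4.2 replaces MAD with the Q dispersion estimator.
    med = np.median(v)
    mad = np.median(np.abs(v - med)) + 1e-12
    return (v - med) / mad

def fit_rank_candidates(analysis_set, influence_fn, kappa):
    # Step 1 (Inf) and step 2 (AnomScore): heavier upper tails rank first.
    heaviness = []
    for z_te in analysis_set:
        v = influence_fn(z_te)                     # (renormalized) influence vector
        sigma = placeholder_anomaly_scores(v)
        heaviness.append(np.sort(sigma)[-kappa])   # kappa-th largest anomaly score
    return np.argsort(heaviness)[::-1]             # most anomalous candidates first
# Step 3 (Mitigate, Alg. 6) is then invoked on the top-ranked misclassified candidates.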
Time and Space Complexity: Computing a gradient requires O(p) time and space. For fixed 𝒯 and p, TracInCP, GAS, and GAS-L require O(n) time and space to calculate each test instance's influence vector v. The next section explains that FIT analyzes each test instance's influence vector v, meaning GAS(-L) can be significantly sped up by amortizing training gradient (gi) computation across multiple test examples – either on a single node or across multiple nodes (e.g., using all-reduce).

6.4.2 Identifying Anomalous Influence. To change a prediction, adversarial set Dadv must be highly influential on the target. When visualizing ẑte's influence vector v as a density distribution, an exceptionally influential Dadv manifests as a distinct density mass at the distribution's positive extreme. Figures 13a and 13d each plot an attack target's GAS influence as a density for two different training set attacks – the first poisoning on vision [Zhu+19] and the other a backdoor attack on speech recognition [Liu+18]. For both attacks, adversarial set Dadv's influence significantly exceeds that of Dcl. When compared to the theoretical normal (calculated11 w.r.t. complete training set D), Dadv's target influence is highly anomalous. In Figures 13b and 13e, which plot the GAS influence of non-targets for the same two attacks, no extremely high influence instances are present. Going forward, influence vectors v with exceptionally high influence instances are referred to as having a heavy upper tail. Target identification then simplifies to identifying influence vectors whose values have anomalously heavy upper tails.

11 The plotted theoretical normal used the robust statistics median and Q in place of mean and standard deviation.

The preceding insight is relative, i.e., w.r.t. other test instances' influence value distributions. Non-target baseline anomaly quantities vary with model, dataset, and hyperparameters. That is why suppl. Algorithm 5 ranks candidates in Ẑte based on their upper-tail heaviness.

Algorithm 5 FIT target identification implementation
Input: Training set D, training set size n, test example set Ẑte, and upper-tail count κ
Output: Sanitized model parameters θ̃T & training set D̃tr
1: V ← {GAS(ẑj; D) : ẑj ∈ Ẑte}  ▷ Renorm. influence (Alg. 3)
2: Σ ← {(v(j) − µ(j)) / Q(j) : v(j) ∈ V}  ▷ Anomaly score (Sec. 6.4.2)
3: H ← {σ(n−κ) : σ ∈ Σ}  ▷ Upper-tail heaviness (Sec. 6.4.2)
4: Rank Ẑte by heaviness H
5: θ̃T, D̃tr ← θT, D
6: for each target ẑtarg identified using H do
7:   θ̃T, D̃tr ← Mitigate(ẑtarg, θ̃T, D̃tr)  ▷ Sec. 6.4.3
8: return θ̃T, D̃tr

Quantifying Tail Heaviness: Determining whether ẑte's influence vector v is abnormal simplifies to univariate anomaly detection, for which significant previous work exists [BL78; RL87; HA04; RH17]. Observe in Figures 13b and 13e that Dcl's GAS influence vector v tends to be normally distributed (see the close alignment to the dashed theoretical normal). We therefore use the traditional anomaly score, σ := (v − µ)/s, where µ and s are v's center and dispersion statistics, respectively.12 Mean and standard deviation, the traditional center and dispersion statistics, respectively, are not robust to outliers; both have an asymptotic breakdown point of 0 (one anomaly can shift the estimator arbitrarily). Since Dadv instances are inherently outliers, robust statistics are required. Median serves as our center statistic µ given its optimal breakdown (50%).

12 In suppl. Algorithm 5, statistics µ(j) and Q(j) are calculated separately for each test instance ẑj's influence vector v(j).
Although median absolute deviation (MAD) is the best-known robust dispersion statistic, we use Rousseeuw and Croux's [RC93] Q estimator, which retains MAD's benefits while addressing its weaknesses. Specifically, both MAD and Q have optimal breakdowns, but Q has better Gaussian data efficiency (82% vs. 37%). Critically for our setting with one-sided anomalies, Q does not assume data symmetry – unlike MAD. Formally,

  Q := c · { |vi − vl| : 1 ≤ i < l ≤ n }(r),    (6.13)

where {·}(r) denotes the set's r-th order statistic with r = (⌊n/2⌋ + 1 choose 2) and c a distribution consistency constant, which for Gaussian data is c ≈ 2.2219 [RH17]. Eq. (6.13) requires only O(n) space and O(n lg n) time, as proven by Croux and Rousseeuw [CR92].

Provided anomaly score vector σ, upper-tail heaviness is simply σ(n−κ), which is σ's (n − κ)-th order statistic, i.e., ẑte's κ-th largest anomaly score value. The value of κ implicitly affects the size of the smallest detectable attack: any attack with |Dadv| < κ is much harder to detect.

Multiclass vs. Binary Classification: Different classes are implicitly generated from different data distributions. Each class's data distribution may have different influence tails – in particular in multiclass settings. Target identification performance generally improves when (1) µ and Q are calculated w.r.t. only training instances labeled ŷte and (2) ẑte's upper-tail heaviness is ranked w.r.t. other test instances labeled ŷte.

Faster FIT: The execution time of TracInCP, and by extension GAS(-L), depends on parameter count |θ|. For very large models, target identification can be significantly sped up via a two-phase strategy. In phase 1, GAS(-L) uses a very small iteration subset (e.g., 𝒯 = {T}) to coarsely rank analysis set Ẑte. Phase 2 then uses the complete 𝒯 but only on a small fraction (e.g., 10%) of Ẑte with the heaviest phase-1 tails. Section 6.5.3 applies this approach to natural-language data poisoning on RoBERTaBASE [Liu+20b].

Computing each test instance's (ẑte ∈ Ẑte) influence vector v is independent. Each dimension vi is also independent and can be separately computed. Hence, GAS(-L) is embarrassingly parallel, allowing linear speed-up of target identification via parallelization.

6.4.3 Target-Driven Attack Mitigation. A primary benefit of target identification is that attack mitigation becomes straightforward. Algorithm 6 mitigates attacks by sanitizing training set D of adversarial set Dadv. Most importantly, target identification solves data sanitization's common pitfall (Sec. 3.2.1) of determining how much data to remove. Sanitization stops when the target's misprediction is eliminated. Therefore, successfully identifying a target means sanitization is guaranteed to succeed tautologically (i.e., the attack success rate on any analyzed targets is 0).

More concretely, Alg. 6 iteratively filters D by thresholding anomaly score vector σ.13 Since adversarial instances are abnormally influential on targets, Alg. 6 filters Dadv instances first. After each iteration, influence is remeasured to account for estimation stochasticity and because training dynamics may change with different training sets. Data removal cutoff ζ is tuned based on computational constraints – larger ζ results in less clean data removed but may take more iterations. Slowly annealing ζ also results in less clean-data removal.

13 Alg. 6 considers the more general case of a single identified target but can be extended to consider multiple targets. For instance, provided there is a single attack, average v across all targets and stop sanitizing once all targets are classified correctly.
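To make the scoring and filtering concrete, below is a small sketch of the median/Q anomaly score from Section 6.4.2 together with one thresholding pass of the kind Algorithm 6 applies. The Q estimator uses the naive O(n²) pairwise form of Eq. (6.13) purely for clarity; Croux and Rousseeuw's O(n lg n) algorithm would be used in practice, and the function names are illustrative.

import numpy as np
from math import comb, floor

def q_dispersion(v, c=2.2219):
    # Rousseeuw-Croux Q scale estimator with the Gaussian consistency constant c.
    n = len(v)
    diffs = np.abs(v[:, None] - v[None, :])[np.triu_indices(n, k=1)]
    r = comb(floor(n / 2) + 1, 2)                  # r-th order statistic (1-indexed)
    return c * np.partition(diffs, r - 1)[r - 1]

def anomaly_scores(v):
    # Robust z-style score: (v - median) / Q; heavy upper tails flag likely targets.
    return (v - np.median(v)) / q_dispersion(v)

def sanitize_once(train_indices, sigma, zeta):
    # One Alg. 6 filtering pass: drop training instances whose score exceeds cutoff zeta.
    return [idx for idx, s in zip(train_indices, sigma) if s < zeta]

In Algorithm 6, this filtering pass repeats, with retraining in between, until the identified target is no longer predicted as yadv.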
Given forensic or human analysis of the identified target(s), simpler mitigation than Algorithm 6 is possible, e.g., a naive, rule-based, corrective lookup table that entails no clean-data removal at all.

Algorithm 6 Target-driven mitigation & sanitization
Input: Target ẑtarg := (xtarg, yadv), anomaly cutoff ζ, model f, initial params. θ0, final params. θT, and training set D
Output: Clean model parameters θ̃T & sanitized training set D̃tr
1: function Mitigate(ẑtarg, θT, D)
2:   θ̃T, D̃tr ← θT, D
3:   while argmax f(xtarg; θ̃T) = yadv do
4:     v ← Inf(ẑtarg; D)  ▷ Renorm. influence (Alg. 3)
5:     σ ← (v − µ)/Q  ▷ Anomaly score (Sec. 6.4.2)
6:     D̃tr ← D̃tr \ {zi : σi ≥ ζ ∧ zi ∈ D̃tr}  ▷ Sanitize
7:     θ̃T ← Retrain(θ0, D̃tr)
8:     Optionally anneal ζ
9:   return θ̃T, D̃tr

For learning environments where certified training data deletion is possible [Guo+20; MRA22], retraining (Alg. 6 Line 7) may not even be required, making our method even more efficient.

Enhancing Mitigation's Robustness: An adversary could attack FIT by injecting adversarial instances into D specifically to trigger excessive, unnecessary sanitization. To mitigate such a risk, Alg. 6 could be tweaked to include a maximum sanitization threshold14 that would trigger additional (e.g., human, forensic) analysis. This threshold could be set empirically or using domain-specific knowledge (e.g., the maximum possible poisoning rate). See the supplement of the original paper [HL22a] for further discussion.

14 This threshold could be w.r.t. the number of examples removed or the change in held-out loss. These quantities can be measured cumulatively or for targets individually.

6.5 Evaluation

We empirically demonstrate our method's generality by evaluating training set attacks on different data modalities, including text, vision, and speech recognition. We consider both poisoning and backdoor attacks on pre-trained and randomly-initialized, state-of-the-art models in binary and multiclass settings. For brevity, most evaluation setup details (e.g., hyperparameters) are deferred to suppl. Section D.3. Additional experimental results also appear in the original paper [HL22a].

6.5.1 Training-Set Attacks Evaluated. We evaluated our method on four published training set attacks – two single-target data poisoning attacks and two multi-target backdoor attacks. Below are brief details regarding how each attack crafts adversarial set Dadv, with the full details in suppl. Sec. D.3.2.4. Representative clean and adversarial training instances for each attack appear in the original paper [HL22a]. Table 6 lists each attack's mean success rate aggregated across all related setups. Full granular results are in Section C.3. Below, ytarg → yadv denotes the target's true and adversarial labels, respectively. When an attack considers multiple class pairs or setups, each is evaluated separately.

(1) Speech Backdoor: Liu et al.'s [Liu+18] speech recognition dataset contains spectrograms of human speech pronouncing the English digits 0 to 9 (10 classes, |Dcl| = 3,000 – 1% backdoors). Liu et al. also provide 300 backdoored training instances evenly split between the 10 classes.
Each class's adversarial trigger – a short burst of white noise at the recording's beginning – induces the spoken digit to be misclassified as the next largest digit (e.g., 0 → 1, 1 → 2, etc.). This small input-space signal induces a large feature-space perturbation – too large for many certified methods. Following Liu et al., our evaluation used a speech recognition CNN trained from scratch.

(2) Vision Backdoor: Weber et al. [Web+23] consider three different backdoor adversarial trigger patterns on CIFAR10 binary classification. Specifically, Weber et al.'s "pixel" attack patterns increase the pixel value of either one or four central pixel(s) by a specified maximum ℓ2 perturbation distance, while their "blend" trigger pattern adds fixed N(0, I) Gaussian noise across all perturbed images. We considered the same class pairs as Weber et al. (auto → dog and plane → bird) on the state-of-the-art ResNet9 [Pag20] CNN trained from scratch with |Dadv| = 150 and |D| = 10,000 (1.5% backdoors).

(3) Natural Language Poison: Wallace et al. [Wal+21] construct text-based poison by simulating bilevel optimization via second-order gradients. Dadv's instances are crafted via iterative word-level substitution given a target phrase. We follow Wallace et al.'s [Wal+21] experimental setup of poisoning the Stanford Sentiment Treebank v2 (SST-2) sentiment analysis dataset [Soc+13] (|Dcl| = 67,349 & |Dadv| = 50 – 0.07% poison) on the RoBERTaBASE transformer architecture (125M parameters) [Liu+20b].

(4) Vision Poison: Zhu et al.'s [Zhu+19] targeted, clean-label attack crafts poisons by forming a convex polytope around a single target's feature representation. Following Zhu et al., the pre-train then fine-tune paradigm was used. In each trial, ResNet9 was pre-trained using half the classes (none were ytarg or yadv). Targets were selected uniformly at random (u.a.r.) from test examples labeled ytarg, and 50 poison instances (0.2% of D) were then crafted from seed examples labeled yadv. The pre-trained network was fine-tuned using Dadv and the five held-out classes' training data (|D| = 25,000). Like previous work [Sha+18; Hua+20], CIFAR10 class pairs dog vs. bird and deer vs. frog were evaluated, where each class in a pair serves alternately as ytarg and yadv.

While it is not feasible to evaluate our approach on every attack (new attacks are developed and published frequently), we believe this diverse set of attacks is representative of training set attacks in general and demonstrates our approach's broad applicability. In particular, our method is not tailored to these attacks and could be used against future attacks as well, as long as the attack includes highly-influential training examples that attack specific targets.

6.5.2 Identifying Adversarial Set Dadv. To identify the target (Alg. 4) or mitigate the attack (Alg. 6), we must be able to identify the likely adversarial instances Dadv associated with a possible target ẑtarg. Our approach is to use influence-estimation methods, which should rank an actual adversarial set Dadv as more influential on the target than clean instances Dcl. In this section, we evaluate how well different influence-estimation methods succeed at performing this ranking for a given target. We compare the performance of our renormalized estimators, GAS and GAS-L, against Section 6.2's four influence estimators: TracInCP, TracIn, influence functions, and representer point.
As an even stronger baseline, where applicable, we also compare against Peri et al.'s [Per+20] Deep k-NN empirical training-set defense, which was specifically designed for Zhu et al.'s [Zhu+19] clean-label vision poisoning attack; described briefly, Deep k-NN sanitizes the training set of instances whose nearest feature-space neighbors have a different label. Like Section 6.3's CIFAR10 & MNIST joint classification experiment, class sizes are imbalanced (|Dadv| ≪ |Dcl|), so performance is again measured using AUPRC.

For targets selected u.a.r., Figure 14 details each method's average adversarial-set identification AUPRC for Section 6.5.1's four attacks. In summary, GAS and GAS-L were each the top performer for one attack and had comparable performance for the other two. GAS and GAS-L identified the adversarial instances nearly perfectly for Liu et al.'s speech backdoor and Wallace et al.'s text poisoning attacks.

Figure 14. Adversarial-Set Identification: Mean AUPRC identifying adversarial set Dadv using a randomly selected target for Sec. 6.5.1's four attacks, comparing GAS (ours), GAS-L (ours), TracInCP, TracIn, influence functions, representer point, and Deep k-NN. Results are averaged across related setups with ≥10 trials per setup. See supplemental Section C.3 for the full granular results.

Standard influence estimation performed poorly on the text poisoning attack (in particular the static estimators) due to the large model, RoBERTaBASE, that Wallace et al.'s attack considers. For the vision backdoor and poisoning attacks, our renormalized estimators successfully identified most of Dadv – again, much better than the four original estimators. While Peri et al.'s [Per+20] Deep k-NN defense can be effective at stopping clean-label vision poisoning, it does so by removing a comparatively large fraction of clean data (up to 4.3% on average), resulting in poor AUPRC.

Figure 15. Static Influence Adversarial-Set Identification: Comparing the mean adversarial-set identification AUPRC of the static influence estimators (influence functions and representer point) and their corresponding renormalized (Rn.) versions, with and without layerwise (-L) renormalization. For all attacks, renormalization improved the static estimators' mean performance by up to a factor of >600×. These experiments also highlight layerwise renormalization's performance gains, e.g., influence functions on natural-language poison. Results are averaged across related experimental setups with ≥10 trials per setup.

For completeness, Figure 15 provides adversarial-set identification results for our renormalized, static influence estimators. In all cases, renormalization improved the estimator's performance, generally by an order of magnitude with a maximum improvement of 600×. These experiments highlight layerwise renormalization's benefits. Influence functions' Hessian-vector product algorithm [Pea94] can assign a large magnitude to some layers, and these layers then dominate the influence and GAS estimates. Layerwise renormalization addresses this, improving renormalized influence functions' adversarial-set identification AUPRC by up to 3.5×.
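For reference, the adversarial-set identification metric can be computed in a few lines: treat Dadv membership as the positive label and an estimator's influence scores as the ranking, then take average precision (scikit-learn's standard summary of the precision-recall curve, used here as the AUPRC estimate). The array sizes mirror the toy experiment's |Dcl| and |Dadv|, and the random scores are placeholders only.

import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
is_adv = np.zeros(10_150, dtype=int)      # 10,000 clean + 150 adversarial instances
is_adv[:150] = 1
influence = rng.normal(size=10_150)       # placeholder for a (renormalized) influence vector v

auprc = average_precision_score(is_adv, influence)
print(f"adversarial-set identification AUPRC ~ {auprc:.3f}")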
6.5.3 Identifying Attack Targets. The previous experiments demonstrate that knowledge of a target enables identification of the adversarial set when using renormalization. This section demonstrates that the distribution of renormalized influence values actually enables us to identify the target(s) in the first place, through the interplay foundational to our target identification framework, FIT.

Since target identification is a new task, we propose four target identification baselines. First, inspired by Peri et al.'s [Per+20] Deep k-NN empirical defense, maximum k-NN distance computes the distance from each test instance to its κ-th nearest neighbor in the training data, as measured by the L2 distance between their penultimate feature representations (f). It orders test instances by this distance, starting with the largest distance to the κ-th neighbor, thus prioritizing outliers and instances in sparse regions of the learned representation space. Minimum k-NN distance is the reverse ordering, prioritizing instances in dense regions. The other two baselines are most certain, which ranks test examples in ascending order by loss, and least certain, which ranks by descending loss.

Figure 16. Target Identification: Mean target identification AUPRC for Sec. 6.5.1's four attacks. "FIT w/ GAS" denotes that GAS was FIT's influence estimator, with matching notation for GAS-L; the baselines are maximum k-NN distance, minimum k-NN distance, most certain, and least certain. Results are averaged across setups with ≥10 trials per setup. See Sec. C.3 for the full granular results.

There are far fewer targets than possible test examples, so performance is again measured using AUPRC. See suppl. Table D.58 for the number of targets and non-targets analyzed for each attack. For single-target attacks (vision and natural language poisoning), target identification AUPRC is equivalent to the target's inverse rank, causing AUPRC to decline geometrically.

Figure 16 shows that FIT – using either GAS or GAS-L as the influence estimator – achieves near-perfect target identification for both backdoor attacks and natural language poisoning. Overall, FIT with GAS was the top performer on two attacks, and FIT with GAS-L was the best for the other two. Recall that the vision poisoning attack is single target. Hence, GAS-based FIT's mean AUPRC of >0.8 equates to an average target rank better than 1.25 (i.e., 1/0.8); three out of four times on average, ẑtarg was the top-ranked candidate – also very strong target detection. FIT's performance degradation on vision poisoning is due to GAS and GAS-L identifying this attack's Dadv slightly worse (Fig. 14). Only maximum k-NN distance approached FIT's performance – specifically for Weber et al.'s vision backdoor attack. Note also that no baseline consistently outperformed the others. Hence, these attacks affect network behavior differently, further supporting that FIT is attack agnostic. FIT's performance is stable across a wide range of upper-tail cutoff thresholds κ (see the supplement of the original paper [HL22a]). For example, FIT's natural language target identification AUPRC varied only 0.2% and 2.1% when using GAS-L and GAS, respectively, for κ ∈ [1, 25].

6.5.4 Target-Driven Mitigation. Section 6.4.3 explains that successfully identifying the target(s) enables guaranteed attack mitigation on those instances. Here, we evaluate GAS and GAS-L's effectiveness in targeted data sanitization. Table 6 details our defense's effectiveness against Sec. 6.5.1's four attacks. As above, results are averaged across each attack's class pairs/setups.
Figure 14's baselines all have large false-positive rates when identifying Dadv, which causes them to remove a large fraction of Dcl; they are therefore not reported in these results. For three of four attacks, clean test accuracy after sanitization either improved or stayed the same. In the case of Weber et al.'s vision backdoor attack, the performance degradation was very small – 0.1%. Similarly, owing to renormalized influence's effectiveness identifying Dadv (Fig. 14), our defense removes very little clean data when mitigating the attack – generally <0.2% of the clean training set. For comparison, Peri et al. [Per+20] report that their Deep k-NN clean-label poisoning defense removes on average 4.3% of Dcl on Zhu et al.'s [Zhu+19] vision poisoning attack. This is despite Peri et al.'s method being specifically tuned for Zhu et al.'s attack and despite their evaluation setup being both easier and less realistic: they pre-train their model on a large known-clean set that is identically distributed to their Dcl. In contrast, target-driven mitigation removed at most 0.03% of clean data on this attack – better than Peri et al. by two orders of magnitude.

Following Algorithm 6, Table 6's experiments used only a single, randomly-selected target when performing sanitization. No steps were taken to account for additional potential targets, e.g., over-filtering the training set. Nonetheless, target-driven mitigation still significantly degraded multi-target attacks' performance on other targets not considered when sanitizing. For example, despite considering one target, the speech backdoor's overall attack success rate (ASR) across all targets decreased from 100% to 4.7% and 6.5% for GAS and GAS-L, respectively – a 20× reduction. For Weber et al.'s vision backdoor attack, ASR dropped from 90.5% to 11.9% and 6.7% with GAS and GAS-L, respectively. The key takeaway is that identifying a single target almost entirely mitigates the attack everywhere.

Table 6. Target-Driven Attack Mitigation: Alg. 6's target-driven, iterative data sanitization applied to Sec. 6.5.1's four attacks for randomly selected targets. The attacks were neutralized with few clean instances removed and little change in test accuracy. Attack success rate (ASR) is w.r.t. the analyzed target. Results are averaged across related setups with ≥10 trials per setup. Detailed results appear in Sec. C.3.1–C.3.4.

Attack    Dataset  Method  % Removed (Dadv / Dcl)  ASR % (Orig. / Ours)  Test Acc. % (Orig. / Chg.)
Backdoor  Speech   GAS     98.4 / 0.07             99.8 / 0              97.7 / 0.0
Backdoor  Speech   GAS-L   98.1 / 0.17             99.8 / 0              97.7 / 0.0
Backdoor  Vision   GAS     87.6 / 0.50             90.5 / 0              96.2 / -0.1
Backdoor  Vision   GAS-L   92.3 / 0.73             90.5 / 0              96.2 / -0.1
Poison    NLP      GAS     99.6 / 0.02             97.9 / 0              94.2 / +0.2
Poison    NLP      GAS-L   99.9 / 0.03             97.9 / 0              94.2 / +0.1
Poison    Vision   GAS     65.1 / 0.02             77.9 / 0              87.1 / 0.0
Poison    Vision   GAS-L   58.6 / 0.03             77.9 / 0              87.1 / 0.0

6.6 Adaptive Attacks

We now consider how an attacker who knows about our defense could evade it or otherwise exploit it. Our method relies on multiple attack instances having unusually high influence on the target instance, as measured by model gradients during training. As such, it may fail to detect an attack if (1) there are too few attack instances (relative to upper-tail count κ); (2) the attack is a large fraction of the data (e.g., 10%), in which case the instances are too common to be considered outliers; or (3) the attack instances appear no more influential on the target than clean instances. The first case is only a risk when the target instance is "easy" enough to influence that the attack can be carried out with very few instances.
The second case requires a very powerful attacker, one who would be hard to stop without additional constraints or assumptions. The third case represents a possible weakness: if attackers can craft an attack that successfully changes the target label without appearing unusual, then our defense will fail. Whether or not the attacker can succeed is an empirical question, which will depend on the dataset, the model, the choice of target instance, and the attack method. Below, we provide evidence that FIT (and GAS in particular) remains effective against an attacker who is trying to evade our defense.

Seed-Instance Optimization: Some training set attacks rely on a fixed, predefined adversarial perturbation which is applied to clean seed instances [Liu+18; Web+23]. An attacker who is aware of our defense could choose seed instances that organically appear uninfluential on some target, as estimated by GAS(-L). We apply this idea to Weber et al.'s [Web+23] CIFAR10 backdoor attack and find that our method continues to perform well against this simple, adaptive attacker: GAS(-L) achieve 0.93 AUPRC for adversarial-set identification, a 7% decline versus the baseline, even when the attacker is given information beyond our threat model, e.g., knowledge of the random initial parameters. Overall, the attacker's gains from choosing different seed instances are limited. See the supplement of the original paper for additional details [HL22a].

Perturbation Optimization: A stronger adaptive adversary actively optimizes the adversarial perturbation to be both highly effective and have a low (GAS) influence estimate. For attacks that find perturbations through gradient-based optimization [Wal+21; Zhu+19], the most natural way to incorporate knowledge of our method would be to add some estimate of GAS to the loss being optimized. Since GAS – like the poison itself – relies on the entire training trajectory of the model, which in turn relies on the perturbations being crafted, computing GAS's exact gradient is intractable [BR92]. However, the attacker can still use a surrogate that approximates GAS, such as by using fixed model checkpoints in the computation of GAS. To evaluate the robustness of our methods to adaptive perturbations, we apply this joint optimization idea to Zhu et al.'s [Zhu+19] vision poison attack. We focused on Zhu et al.'s attack because (1) it is the attack on which our method performed the worst and (2) the other optimized attack we consider [Wal+21] is restricted to only discrete token replacements, which reduces the attacker's flexibility. Zhu et al. iteratively optimize a set of poison examples to minimize the adversarial loss. To increase the likelihood of successfully changing the target's label, Zhu et al. compute this loss over multiple surrogate models. The perturbations of the poison examples are constrained to an ℓ∞ ball so that they appear relatively natural to humans. Our jointly optimized, adaptive attack adds a second term to Zhu et al.'s adversarial loss. This new term estimates the GAS influence using the same surrogate models. Hyperparameter β balances the two objectives. See suppl. Section C.4 for the full details of this jointly-optimized, adaptive attack, including the tuning of β. Where possible, these adaptive experiments followed the same vision poison evaluation setup detailed in Section 6.5.1.
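The sketch below illustrates the joint objective just described: the attacker's usual adversarial loss plus β times a surrogate GAS term computed over a fixed set of surrogate checkpoints (the β = 10^-2 value from Table 7 is used as the default). This is an assumed, simplified rendering of the idea, not the dissertation's exact attack code; the helper flat_param_grad and all parameter names are introduced here only for illustration.

```python
# Illustrative sketch of the jointly optimized adaptive objective: adversarial loss
# plus beta times a surrogate GAS term over fixed surrogate checkpoints.
# Helper names, signatures, and defaults are assumptions, not the original code.
import torch
import torch.nn.functional as F

def flat_param_grad(model, loss_fn, x, y):
    """Flattened parameter gradient for one example; create_graph keeps the result
    differentiable w.r.t. the (poison) input so the attacker can backpropagate."""
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

def surrogate_gas(poison_batch, target_x, target_y, checkpoints, loss_fn):
    """Mean gradient cosine similarity between the poison instances and the target,
    averaged over a fixed set of surrogate model checkpoints."""
    total = 0.0
    for model in checkpoints:
        g_tgt = flat_param_grad(model, loss_fn, target_x, target_y).detach()
        sims = [F.cosine_similarity(flat_param_grad(model, loss_fn, x, y), g_tgt, dim=0)
                for x, y in poison_batch]
        total = total + torch.stack(sims).mean()
    return total / len(checkpoints)

def adaptive_attack_loss(adv_loss, poison_batch, target_x, target_y,
                         checkpoints, loss_fn, beta=1e-2):
    """Attacker's joint objective: remain effective (low adversarial loss) while
    appearing uninfluential to a GAS-style detector (low surrogate GAS)."""
    return adv_loss + beta * surrogate_gas(poison_batch, target_x, target_y,
                                           checkpoints, loss_fn)
```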
However, simultaneously optimizing surrogate GAS has a much higher GPU memory cost, so we were forced to adjust the poison size and number of surrogate checkpoints to 40 and four, respectively, which degrades the attacker's success rate from 77.9% in Table 6 to 64.3%. Fig. 17 summarizes our adversarial-set identification performance on Zhu et al.'s vision poisoning attack with and without the jointly optimized surrogate GAS loss term.15 Observe that the attack degraded the performance of GAS(-L) and TracInCP, albeit slightly. After accounting for all other factors, this joint optimization decreased GAS's mean adversarial-set identification from a baseline of 0.73 AUPRC to 0.68 (a 7% drop). Fig. 18 visualizes joint attack optimization's effect on target identification. Overall, joint optimization reduced FIT with GAS's mean target identification AUPRC from a baseline of 0.86 to 0.78 (a 9% drop). Since Zhu et al.'s attack is single-target, this translates to the target's average rank declining from 1.16 to 1.28 – still high performance. Table 7 details target-driven mitigation's effectiveness under this jointly-optimized attack. In summary, the attack results in very little clean data removal (at most 0.05% of Dcl on average). Also, the average test accuracy after mitigation either improved or stayed the same in all but one case, where it decreased by only 0.1%.

15 Both versions of the attack (i.e., with and without joint optimization) used the same evaluation setup, including the reduced surrogate model count.

Table 7. Attack Mitigation for the Adaptive Vision Poison Attack: Algorithm 6's target-driven data sanitization where Zhu et al.'s [Zhu+19] vision poison attack is jointly optimized with minimizing the GAS influence. The results below consider exclusively the jointly-optimized attack with β = 10^-2. Clean-data removal remains low, and test accuracy either improved or stayed the same in all but one setup. The performance is comparable to the results for Zhu et al.'s [Zhu+19] standard vision poisoning attack (see Table C.40). Bold denotes the best mean performance with ≥10 trials per class pair.

ytarg  yadv  Method  % Removed (Dadv / Dcl)  ASR % (Orig. / Ours)  Test Acc. % (Orig. / Chg.)
Bird   Dog   GAS     36.0 / 0.02             76.2 / 0              87.0 / +0.1
Bird   Dog   GAS-L   30.3 / 0.00             76.2 / 0              87.0 / +0.1
Dog    Bird  GAS     21.6 / 0.00             57.1 / 0              87.1 / +0.1
Dog    Bird  GAS-L   21.9 / 0.00             57.1 / 0              87.1 / -0.1
Frog   Deer  GAS     17.5 / 0.00             38.1 / 0              87.1 / 0.0
Frog   Deer  GAS-L   19.4 / 0.00             38.1 / 0              87.1 / 0.0
Deer   Frog  GAS     85.0 / 0.18             81.0 / 0              87.1 / 0.0
Deer   Frog  GAS-L   82.3 / 0.13             81.0 / 0              87.1 / +0.1

In summary, even when the adversary specifically optimized against our defense, we still effectively identify both the adversarial set and the target and then mitigate the adaptive attack.

6.7 Discussion and Conclusions

This chapter explores two related tasks. First, we propose training set attack target identification, which plays an important part in the protection of critical ML systems but has thus far received relatively little attention. For example, it is impossible to conduct a truly informed cost-benefit analysis of risk without knowing the attacker's target and, by extension, their objective. Knowledge of the target also enables forensic and security analysts to reason about an attacker's identity – a key step to permanently stopping attacks by disabling the attacker. An open question is whether target identification can be combined with certified guarantees, either building on our FIT framework or creating an alternative to it. FIT relies on identifying (groups of) highly influential training instances. To that end, we propose renormalized influence.
By addressing influence's low-loss penalty, renormalization significantly improves influence estimation in both adversarial and non-adversarial settings – often by an order of magnitude or more [HL22b].

Figure 17. Adversarial-Set Identification for the Adaptive Vision Poison Attack: Mean AUPRC identifying the adversarial set where Zhu et al.'s vision poison attack is adapted to jointly minimize the adversarial loss and the GAS influence. The baseline results used Zhu et al.'s standard attack. Our jointly-optimized attack reduced the GAS similarity by 7% at the cost of a 19% decrease in ASR w.r.t. Table 6. See suppl. Sec. C.4 for the granular results.

Figure 18. Target Identification for the Adaptive Vision Poison Attack: Mean target identification AUPRC where Zhu et al.'s vision poison attack is jointly optimized with minimizing GAS. FIT with GAS's mean target identification AUPRC declined only 9% versus the baseline – an average change in target rank from 1.16 to 1.28 – still strong performance. Results are averaged across related setups with ≥10 trials per setup. See suppl. Sec. C.4 for the full results.

CHAPTER 7
CONCLUSIONS AND FUTURE DIRECTIONS

With machine learning systems increasingly deployed in settings critical to human well-being, the need for robust models similarly increases. Deploying models that have been adversarially manipulated can harm human health, increase unfairness, and degrade public trust in institutions. It is unlikely that the regulatory landscape will continue to allow empirical performance to be prioritized over all else. While this dissertation looks at model robustness through the lens of adversarial robustness, our contributions should not be viewed as solely applying to malicious attacks. "Honest poison" instances can be unintentionally and unknowingly inserted into a model's training data. For example, consider generative models trained on open-source code. Countless instances of buggy or insecure code are posted to public repositories like GitHub. Any model trained on this "dangerous" but non-malicious code is vulnerable [Pea+22; Pea+23]. Methods like those proposed in this dissertation provide robustness not only against adversarial poison but against such honest poison as well. Adversarial robustness simply provides a vehicle to study worst-case training set perturbations.

In addition, while we propose and analyze GAS in the context of adversarial robustness, understanding the causal relationship between training data and model predictions generalizes to any setting where predictions must be interpretable. A better understanding of why models misbehave should enable the development of better models in the future. With algorithmic decision making increasingly common, black-box decisions will no longer be tolerated by society or regulators [GF17]. Insights into why poisoning attacks succeed may generalize to understanding why models learn spurious correlations in non-malicious settings as well.

Future Directions

Despite decades of study [Coo77; CW82], existing methods to make models (provably) robust to training set perturbations and spurious correlations remain rudimentary.
This dissertation takes a step towards improved model robustness, yet significant work remains. Below, we briefly outline ideas for future research directions to improve model training’s robustness. Lowering Certification’s Performance Penalty As Chapters 4 and 5 demonstrate, certified robustness against poisoning and backdoor attacks is not free. Non-trivial certified robustness against poisoning and backdoor attacks currently entails a (significant) performance penalty [FZ20; Zha+19]. This weakness is common to all existing certified methods that provide non-trivial robustness guarantees [LF21; WLF22a; Rez+23]. Expecting certified robustness to be achieved without any decrease in model performance is unrealistic. There are no free lunches. However, certified methods will not be a realistic option in practice until their deleterious side effects are substantially reduced. Without practical robust methods, society will bear the cost when today’s brittle models inevitably fail in real-world, safety-critical applications. Unifying Robustness to Training and Test Perturbations Conventional wisdom treats training and test robustness as separate tasks with proposed methods targeting one task or the other. Chapter 5’s feature partition aggregation (FPA) is rather unique in that a single method simultaneously provides certified training and test robustness. An obvious limitation of FPA is its restriction to a single perturbation model, i.e., the ℓ0 ball. It remains an open question whether other effective defenses over the union of training and test attacks exist. As Chapter 5 discusses, Weber et al. [Web+23] propose a defense that provides robustness of ℓ2 training and test 150 attacks, but their robustness guarantees are very minimal. The more threat vectors a (certified) defense effectively covers, the more likely that defense will be deployed in practice. This broad statement includes defining defenses that are simultaneously robust to both training and test attacks. Efficient Influence Analysis and Estimation Put simply, GAS is slow [HL22a, Sec. E.5]. Estimating the training set’s pointwise influence on a single prediction can take hours, even for relatively simple models [HL22b]. For influence analysis to be a practical tool, influence estimation must be at least one to two orders of magnitude faster [HL22b]. Recently, more efficient influence analysis methods have been proposed [Par+23], but significant work remains. 151 APPENDIX A NOMENCLATURE REFERENCE Table A.8. General Nomenclature Reference: This table contains symbols that are relevant to one or more chapters in this dissertation. Related symbols are grouped together with groups separated by dotted lines. LightGBM Gradient-boosted decision tree model architecture by Ke et al. [Ke+17] XGBoost Gradient-boosted decision tree model architecture by Chen and Guestrin [CG16] DPA Deep partition aggregation certified poisoning defense proposed by Levine and Feizi [LF21] (Sec. 3.2.2) FA (Deterministic) finite aggregation certified poisoning defense proposed by Wang et al. [WLF22a] (Sec. 3.2.2) [k] Integer set {1, . . . 
, k} 2[k] Power set of integer set [k] 1[q] Indicator function where 1[q] = 1 if q is true and 0 otherwise medA Median of (multi)set A ∑k H(k) k-th harmonic number where H(k) = 1i=1 i pp Percentage points X Feature domain where X ⊆ Rd x Feature vector where ∀x x ∈ X d Feature dimension where ∀x|x| = d [d] Complete feature set Y Label set where Y ⊆ R y Instance label where ∀y y ∈ Y Z Instance space where Z := X × Y z Feature vector, label tuple where z := (x, y) ∈ Z zte Arbitrary test instance zte := (xte, yte) ∈ Z with true label yte ∈ Y n Number of training instances zi i-training index where zi := (xi, yi) ∈ D D Training set where D := {z ni}i=1 f A model (ensemble, instance-based learner, neural network, etc.) where f : X → Y f(xte) Model f ’s prediction for test feature vector xte L Number of submodels in ensemble f fl l-th submodel in ensemble f where fl : X → Y and l ∈ [L] θ Model parameters where θ ∈ Rp L Loss function where L : Y × Y → R≥0 L(z; θt) Empirical risk of z = (x, y) w.r.t. model parameters θt where L(z; θt) = L (f(x; θ), y) 152 Table A.9. Chapter 4 Nomenclature Reference: Notation specific to Chapter 4 with related symbols grouped together. Groups are separated by dotted lines. Note that this table spans multiple pages. kNN Vanilla k-nearest neighbors (Sec. 4.4.1) kNN-m k-nearest neighbors with median as the decision function (Sec. 4.4.1) kNN-CR Our kNN-based certified regressor (Sec. 4.4.1) rNN Radius nearest neighbors (Sec. 4.4.2) PCR Our partitioned certified regressor (Sec. 4.5.1) W-PCR Our weighted-cost partitioned certified regressor (Sec. 4.5.2) PCR Our overlapping certified regressor (Sec. 4.6.1) W-PCR Our weighted-cost overlapping certified regressor (Sec. 4.6.2) ξ One-sided upper bound for regression robustness certification where f(xte) ≤ ξ ξl Two-sided lower bound for robustness certification where ξl ≤ f(xte) ≤ ξu ξu Two-sided upper bound for robustness certification where ξl ≤ f(xte) ≤ ξu m Number of training set blocks htr Training set partitioning function where htr : Z → [m] hf Overlapping training set block assignment function where hf : [m] → 2[L] j Training set block index where j ∈ [m] D(j) j-th training set block where D(j) := {z ∈ D : htr(z) = j} and ∀ ′ D(j) ∩ ′D(j )j ≠ j = ∅ N (xte) Nearest-neighbors neighborhood (multi)set for test feature vector xte rl Number of training set modifications required to violate invariant ξl ≤ fl(xte) ≤ ξu. Note that rl is one larger than fl’s certified robustness rmax Maximum submodel modification requirement where rmax := max⋃l rl Dl Submodel fl’s training set where for overlapping regression D = D(j)l j∈[m] l∈hf (j) q Inverse of the fraction of the training set used to train each submodel, where ∀ nl q = |Dl| d(j) Spread degree of training set block D(j) where d(j) := |hf (D(j))| dmax Maximum spread degree where dmax := max d (j) j Tl Set of ensemble submodels predicting fl(xte) ≤ ξ where Tl ⊆ [L] V Real-valued (multi)set, e.g., kNN neighborhood or ensemble submodel predictions, where L = |V| Vl Lower thresholded real-valued multiset where Vl := {νl ∈ V : νl ≤ ξ} Vu Upper thresholded real-valued multiset where Vu := {νl ∈ V : νl > ξ} V±1 Binarized multiset where V±1 := {sgn (νl) : νl ∈ V} Ṽ Adversarially perturbed real-valued (multi)set formed from (multi)set V R Weight set where R := {rl : l ∈ [L]} Rl Weight set corresponding to values set Vl where Rl := {rl ∈ R : νl ∈ V} (Continued . . . ) 153 Table A.9. 
Chapter 4 Nomenclature Reference (Continued): Notation specific to Chapter 4 with related symbols grouped together. Groups are separated by dotted lines. Note that this table spans multiple pages. ⌈ ⌉ ∆ Midpoint distance where ∆ := |Vl| − L2 R̃l ∆ smallest values in Rl ILP Integer linear program ω(j) ILP integral variable representing the number of instance modifications made to training set block D(j) δl ILP binary variable which equals 1 if submodel fl has been perturbed such that fl(xte) > ξ and 0 otherwise σ ILP binary variable which equals 1 if in the case of weighted analysis and 0 otherwise PSMC Partial set multicover G Upper-bound on certified robustness R returned by a greedy algorithm 154 Table A.10. Chapter 5 Nomenclature Reference: Notation specific to Chapter 5 with related symbols grouped together. Groups are separated by dotted lines.Note that this table spans multiple pages. FPA Our certified defense, feature partition aggregation, against sparse poisoning, backdoor, evasion, and patch attacks RA Randomized ablation. Certified ℓ0-norm evasion defense. Proposed by Levine and Feizi [LF20b] and subsequently improved by Jia et al. [Jia+22b] DRS (De)randomized smoothing certified patch defense proposed by Levine and Feizi [LF20a]. Based on randomized ablation Patch IBP Certified patch defense based on interval bound propagation proposed by Chiang et al. [Chi+20] BagCert Certified patch defense proposed by Metzen and Yatsura [MY21] RAB Robustness against backdoors certified defense proposed by Weber et al. [Web+23] R Pointwise certified feature robustness – feature partition aggregation’s certification objective (Def. 5.1) Rmed Median certified feature robustness w.r.t. a dataset’s test set ρ Pointwise ℓ0-norm certified evasion-only robustness (Def. 5.2). A weaker guarantee than certified feature robustness. ρmed Median ℓ0-norm certified evasion-only robustness w.r.t. a dataset’s test set ∥w∥0 ℓ0 norm for vector w, i.e., the number of non-zero elements in w Xj j-th column of matrix X where j ∈ [d] and Xj ∈ Rn X ⊖ X′ Set of column indices over which equal-size matrices X and X′ differ, where X ⊖ X′ = {j ∈ [d] : Xj ̸= X′j} xj j-th dimension of vector x where j ∈ [d] and xj ∈ R x ⊖ x′ Set of dimensions over which vectors x and x′ differ where x ⊖ x′ = {j ∈ [d] : xj ̸= x′j} d (D,D′sym ) Symmetric difference between sets D and D′ f Voting-based, ensemble classifier trained over partitioned feature sets where f : X → Y Sl Feature subs⊔et considered by the l-th submodel during training and test where S ⊂ Ll [d] and l=1 Sl = [d] xS Subvector of x ∈ X restricted to feature subset Sl ⊂ [d]l Dl Training set for the l-th submodel ċy(x) Submode∑l vote count for label y and feature vector x whereL ċy(x) := l=1 1[fl(x) = y] Gapvote(y, y ′;x) Submodel vote gap for instance x ∈ X and labels y, y′ ∈ Y where Gapvote(y, y ′;x) := ċy(x)− ċ ′y′(x)− 1[y < y] ypl Submodel plurality label where ypl := argmaxy∈Y ċy(x) and ties broken by preferring the smaller label. FPA ensemble prediction under the plurality label decision function (Sec. 5.3.1) (Continued . . . ) 155 Table A.10. Chapter 5 Nomenclature Reference (Continued): Notation specific to Chapter 5 with related symbols grouped together. Groups are separated by dotted lines. Note that this table spans multiple pages. 
yru Label with the second-most submodel votes (i.e., the “runner up”) where yru := argmaxy′∈Y\y ċy′(x)pl gl(x, y) Logit value predicted by the l-th submodel for instance x ∈ X and label y ∈ Y where gl(x, y) ∈ [0, 1] yRO FPA ensemble prediction under the run-off decision function (Sec. 5.3.2). ỹRO Label in the run-off decision function’s second round that is not selected as the run-off prediction where ỹRO := {ypl, yru} \ yRO c̈x(y; y ′) Pairwise log∑it count for instance x and label y ∈ Y w.r.t. label y′ ∈ Y whereL c̈y(x; y ′) := l=1 1[gl(x, y) > gl(x, y ′)] Gaplogit(y, y ′;x) Submodel logit vote gap for labels y, y′ ∈ Y where Gaplogit(y, y ′;x) := c̈y(x; y ′)− c̈y′(x; y)− 1[y′ < y] e Randomized ablation hyperparameter – number of kept features with the other (d− e) ablated where e ∈ N. BS Blocking smoothing ablation paradigm used by (de)randomized smoothing [LF20a] 156 Table A.11. Chapter 6 Nomenclature Reference: Notation specific to Chapter 6 with related symbols grouped together. Groups are separated by dotted lines. Note that this table spans multiple pages. GAS Gradient aggregated similarity – renormalized TracInCP (Sec. 6.3.3) GAS-L Layerwise gradient aggregated similarity (Sec. 6.3.3) FIT Our Framework for Identifying Targets of a poisoning and backdoor attack (Sec. 6.4) TracIn Tracing gradient descent influence estimator by Pruthi et al. [Pru+20] TracInCP TracIn Checkpoint – a heuristic version of TracIn by Pruthi et al. [Pru+20] LOO Leave-one-out influence [CW82] IF Influence functions estimator by Koh and Liang [KL17] Rep. Pt. Representer point influence estimator by Yeh et al. [Yeh+18] Rn. Renormalization for influence estimation (Sec. 6.3.3) ⟨·, ·⟩ Dot product sgn (·) Signum function I (zi, ẑte) Pointwise Influence of training instance zi ∈ D on test instance ẑte ∈ Z Ĩ (zi, ẑte) Renormalized Influence of training instance zi ∈ D on test instance ẑte ∈ Z ytarg Target (i.e., source) label the attacker seeks to have mislabeled yadv Attacker’s adversarial (i.e., destination) label zte Arbitrary test instance zte := (xte, yte) ∈ Z with true label yte ∈ Y ẑte Arbitrary test instance where ẑte := (xte, ŷte) ∈ Z with predicted label ŷte := f(xte) zi i-training index where zi := (xi, yi) ∈ D D Training set where D := {z ni}i=1 Dadv Set of adversarially perturbed training instances where Dadv ⊆ D Dcl Set of clean (unperturbed) training instances where Dcl := D \ Dadv f(xte) Model f ’s prediction for test feature vector xte ŷte Compact notation for predicted label for xte given parametric model f T Number of iterations performed by the training algorithm t Iteration number s.t. t ∈ {1, . . . , T} T Subset of the training iterations considered by GAS and TracInCP where T ⊆ [T ] ηt Learning rate for iteration t ∈ [T ] where ηt > 0 λ Weight decay hyperparameter Bt Iteration t’s training batch where Bt ⊆ D b Batch size where ∀t∈[T ] b = |Bt| θ Model parameters where θ ∈ Rp θ0 Initial model parameters at the start of training θt Model parameters for iteration t ∈ [T ] θT Model parameters at the end of training P Serialized training parameters where P := {(η Tt, θt−1)}t=1 (Continued . . . ) 157 Table A.11. Chapter 6 Nomenclature Reference (Continued): Notation specific to Chapter 6 with related symbols grouped together. Groups are separated by dotted lines. Note that this table spans multiple pages. fi Penultimate feature representation for training feature vector xi fte Penultimate feature representation for test feature vector xte (t) (t) gi Gradient for zi ∈ D w.r.t. 
parameters θt where gi := ∇θL(zi; θt) (t) (t) ĝte Gradient for ẑte ∈ Z w.r.t. parameters θt where ĝte∑:= ∇θL(ẑte; θt) Hθ Training set empirical risk Hessian where H := 1 2 θ n z ∈D ∇θL(zi; θi T ) H−1θ Inverse of risk Hessian Hθ ẑtarg Target test instance where ẑtarg := (xtarg, ŷtarg) ∈ Z Ẑte Set of test instances to be analyzed for attack targets v Influence vector where v ∈ Rn µ Median influence score (Sec. 6.4.2) Q Q-estimator Rousseeuw and Croux [RC93] for influence vector v (Sec. 6.4.2) Mitigate Target-driven mitigation function (Sec. 6.4.3) κ Heavy-tail influence cutoff count σ Influence anomaly score ζ Target driven sanitization cutoff score where ζ ∈ R Retrain Model retrain method β Adaptive attack tradeoff hyperparameter based on Zhu et al.’s [Zhu+19] vision poison attack 158 APPENDIX B PROOFS This chapter contains previously published, coauthored material [HL22a; HL23c; HL23a]. Hammoudeh designed and wrote all proofs in this section. Lowd provided supervision and editorial suggestions. Zayd Hammoudeh and Daniel Lowd. “Identifying a Training-Set Attack’s Target Using Renormalized Influence Estimation”. In: Proceedings of the 29th ACM SIGSAC Conference on Computer and Communications Security. CCS’22. Los Angeles, CA: Association for Computing Machinery, 2022. url: https://arxiv.org/abs/2201.10055 Zayd Hammoudeh and Daniel Lowd. “Reducing Certified Regression to Certified Classification for General Poisoning Attacks”. In: Proceedings of the 1st IEEE Conference on Secure and Trustworthy Machine Learning. SaTML’23. 2023. url: https://arxiv.org/abs/2208.13904 Zayd Hammoudeh and Daniel Lowd. “Feature Partition Aggregation: A Fast Certified Defense Against a Union of ℓ0 Attacks”. In: Proceedings of the 2nd ICML Workshop on New Frontiers in Adversarial Machine Learning. AdvML-Frontiers’23. 2023. url: https://arxiv.org/abs/2302.11628 This section provides proofs for all theoretical contributions in this dissertation. B.1 Proofs for Chapter 4 This section contains the proofs for the theoretical contributions that are either in or relevant to Chapter 4. Lemma B.1. For real multiset V of cardinality L > 1, if an arbitrarily-large value is inserted into V or the smallest value in V is deleted, the resulting sets’ medians are equivalent. 159 Proof. For simplicity and w.l.o.g., assume V is orde⌈red⌉ where ν1 ≤ · · · ≤ νL. Let α ∈ [1, L] denote the median’s index. If L is odd, α = L ; otherwise, when L is even, 2 α = L + 1 , i.e., the midpoint between the L -th and (L + 1)-th largest values in V . 2 2 2 2 Consider first an arbitrarily-large insertion when L odd. Each insertion increases the set’s cardinality by 1. When V’s cardinality is odd, then the cardinality of this new set after the first insertion is even. Therefore, this new set’s median has index α′ L+ :=⌈ ⌉1 1+ ▷ Median’s index for new set of even size T + 1 (B.1)2 2 L 1 = + (B.2) 2 2 1 = α + . (B.3) 2 Since L ≥ α′ and the i⌈nse⌉rted element is larger than all values in V, the value corresponding to index ( L − 1) is equivalent for both original set V and the new set 2 2 after the insertion. Next, consider the deletion case for odd L. Similar to above, the cardinality of the modified set after one deletion is even; therefore, this modified set’s median has index ′′ L− 1 1α :=⌈ ⌉ + ▷ Median’s index for new set of even size T − 1 (B.4)2 2 L − 1= (B.5) 2 2 1 = α− . (B.6) 2 This new set’s cardinality is one smaller than the origina⌈l s⌉et with the smallest element removed. 
Hence, the valu⌈e c⌉orresponding to index ( L − 1) in this shrunken set2 2 equals the value at index ( L + 1) in original set V . 2 2 Since indices α′ and α′′ correspond to the same value in V, the resulting sets’ medians are equivalent. 160 The primary takeaway from Lemma B.1 is that under the insertion/deletion paradigm, worst-case insertions and deletions are interchangeable. Note that for our purposes, there is an edge case where worst-case insertions and deletions exhibit divergent behavior. Specifically, after L deletions (i.e., all elements in V are removed), the median of an empty set is not generally defined. In contrast, the median after L arbitrarily-large insertions is itself arbitrarily large. For consistency, we define the empty set’s median as ∞ to match the insertion case. Proof of Lemma 4.2 Proof. Let Vl be all elements in V that do not exceed ξ. Under the swap paradigm, the optimal strategy to maximally increase a set’s median is to iteratively replace the set’s smallest value with ∞. Apply this optimal strategy to V. After one swap, the resulting set contains |Vl| − 1 elements that are less than or equal to ξ. After two swaps, there are |Vl| − 2 such elements with each subsequ⌈ent⌉ swap’s effects proceeding inductively. Once the modified set contains exactly L elements less than or equal to ξ, no additional swaps are possible without 2 causing the resulting set’s median to exceed ξ. Therefore, by induction, the maximum number of swaps that can be performed on V and it remains guaranteed that the resulting set’s median does not exceed ξ is ⌈ ⌉ L R = |Vl| − . (B.7) 2 Proof of Lemma 4.3 Proof. For simplicity and w.l.o.g., assume V is ordered where ν1 ≤ · · · ≤ νL. An attacker’s optimal insertion strategy is to insert arbitrarily-large values into V while 161 the optimal deletion strategy is to always delete V ’s smallest value. Lemma B.1 proves that these worst-case operations perturb the median identically so we only consider insertions below. ⌈ ⌉ Let α denote the median’s index. If L is odd, α = L ; otherwise, when L is even, 2 α = L + 1 , i.e., the midpoint between the L -th and (L + 1)-th largest values in V . 2 2 2 2 Each insertion increases the set’s size by 1. When V’s size is odd, then the size of this new set after the first insertion is even. Therefore, this new set’s median has index α′ :=⌈L+⌉1 1+ Median’s index for new even size T + 1 (B.8)2 2 L 1 = + (B.9) 2 2 1 = α + . (B.10) 2 The analysis is essentially identical when L is even and is excluded for brevity. Note that each insertion always increases the median’s index α by 1 . 2 As long as α ≤ |Vl|, it is guaranteed that medV ≤ ξ. Since each insertion changes α by 1 , then 2(|Vl| − α) arbitrary insertions can be made in V with it remaining2 guaranteed that the modified set’s median does not exceed ξ. Regardless of whether L is odd or even,1 it holds that R ≥ 2(|Vl| − α) = 2|Vl| − 2α = 2|Vl| − L− 1. (B.11) Proof of Lemma 4.4 Proof. The first portion of this proof follows the same argument as the proof of Lemma 4.2, with one primary difference. There, the optimal strategy to perturb ⌈ ⌉ ( ) 1When L is odd, 2α = 2 L2 = L+ 1 while in the even case 2α = 2 L + 12 2 = L+ 1. 162 V’s median swapped out the smallest values in V first. For the weighted version, the optimal (greedy) strategy swaps out whichever value in Vl ⊆ V has the smallest weight. ⌈ ⌉ To perturb V ’s median above ξ, it is sufficient to swap any |V | − Ll values in V2 l with an arbitrarily large replacement. 
For simplicity and without loss of generality, let r̃1, . . . , r̃|V | ⌈be⌉the weights of the elements in Vl arranged in ascending order. Definel ∆ := |V Ll| − . Applying Lemma 4.2, up to ∆ values in V can be replaced without2 perturbing the median. This entails a min∑imum cost of∆ R ≥ r̃l. (B.12) l=1 Denote the (∆ + 1)-th largest weight in Rl as r̃∆+1. Observe that adding r̃∆+1 − 1 to Eq. (B.12) is insufficient to swap out any remaining elements in Vl since all elements with weight less than r̃∆+1 are already replaced and all remaining elements have weight at least r̃∆+1. Therefore, the certified robustness∑is∆ R = (r̃∆+1 − 1) + r̃l (B.13) ∑ l=1∆+1 = r̃l − 1 (B.14) l=1 |Vl|−∑⌈L⌉+12 = r̃l − 1 (B.15) ∑l=1 = r − 1. (B.16) r∈R̃l Moreover, increasing Eq. (B.16) by one would allow for the (∆ + 1)-th largest value in V to be swapped, which would in turn perturb the set’s median above ξ. Therefore, Eq. (B.16)’s bound is tight. Proof of Lemma 4.5 163 Proof. The median perturbation paradigms formalized in Lemmas 4.2,⌈4.3⌉, and 4.4 calculate their certified robustness using three values, namely: L, |Vl|, and L . If these2 three values are equivalent for V and V±1, then their associated certified robustness (R) must also be equal. ⌈ ⌉ Since |V| = |V±1|, they have equivalent L and L . By definition, the binarization2 of V to V±1 does not change the value of |Vl| either. Therefore, for all three median perturbation paradigms, binary multiset V±1 and real multiset V have equivalent certified robustness R. Proof of Theorem 4.6 Proof. For fixed-population IBLs, certifying that f(xte) ≤ ξ simplifies to median perturbation under Sec. 4.2.1’s unweighted swap paradigm since all necessary criteria are met, namely that 1. f ’s decision function is a median operation over a set of values, i.e., f(xte) := medN (xte). 2. Neighborhood N (xte) has fixed cardinality L, and L is odd. 3. A worst-case modification to training set D causes an element in N (xte) to be replaced with a different one. Lemma 4.2, therefore, provides a (lower) bound on the number of training set modifications that can be made without the resulting model violating the requirement that f(xte) ≤ ξ. That is why certified robustness R in Eqs. (4.7) and (4.2) (Thm. 4.6 & Lem. 4.2, resp.) are equivalent. Proof of Theorem 4.7 164 Proof. This proof follows a very similar structure as Theorem 4.6’s proof above. The primary distinction is that a different median perturbation paradigm from Sec. 4.2 is needed here. For region-based IBLs, certifying that f(xte) ≤ ξ simplifies to median perturbation under Sec. 4.2.2’s insertion/deletion paradigm since the three necessary criteria are met: 1. f ’s decision function is a median operation over a set of values, i.e., f(xte) := medN (xte). 2. Neighborhood cardinality L is not fixed but can increase and/or decrease. 3. Each modification of N (xte) takes the form of either an insertion or deletion, i.e., not swaps. Therefore, Lemma 4.3 bounds the total number of training set insertions/deletions that can be performed without violating the requirement that f(xte) ≤ ξ. That is why R’s definition in Eq. (4.8) is identical to Eq. (4.3). Proof of Theorem 4.8 Proof. Certifying here that f(xte) ≤ ξ simplifies to median perturbation under the unweighted swap paradigm since all necessary criteria are satisfied, specifically that 1. f ’s decision function is a median operation over a set of fixed, deterministic values, i.e., f(xte) := med {fl(xte; 1), . . . , fl(xte;L)}. 2. 
Since the submodels are trained on disjoint data/feature regions, a change to one submodel (i.e., value fl(xte)) has no effect on any other submodel (value). 165 3. Each submodel perturbation causes an existing value in the set to be replaced by a new value. 4. L is fixed and odd-valued. 5. The cost to change any submodel (i.e., value) is one, i.e., ∀l rl = 1. Lemma 4.2 provides a (lower) bound on the number of training set modifications that can be performed without violating the requirement that f(xte) ≤ ξ. Certified robustness R in Eq. (4.9) is then identical to Lemma 4.2’s Eq. (4.2). Proof of Theorem 4.9 Proof. Here, we extend the argument in Theorem 4.8’s proof to the weighted case. Four of the five criteria in Thm. 4.8’s proof still hold, specifically that 1. f ’s decision function is a median over a set of values. 2. Each submodel is independent and deterministic. 3. Modifications to the set of values take the form of swaps. 4. L is fixed and odd-valued. The only difference is that the perturbations are weighted where each value fl(xte) now has an associated cost rl ≥ 0. Therefore, Sec. 4.2.3’s weighted swap paradigm applies. Certified robustness R in Eq. (4.10) follows directly from and is identical to R in Lemma 4.4’s Eq. (4.5). Proof of Lemma 4.10 166 Proof. To prove a problem is NP-hard, it suffices to show that there exists a polynomial time reduction from a known NP-hard problem to it. As explained in Sec. 4.6.1, partial set cover is NP-hard [Sla97b; Sla97a]. Below we map partial set cover to overlapping certified regression. Let U := [L′] be a ground set of L′ elements, and let Q := {Q1, . . . ,Qm} be a collection of sets where each Qj ⊆ U⋃and Q = U . Q∈Q The goal is to find the subcover F ⊆ Q o∣∣ f⋃minim∣∣ um cardinality s.t. ∆ ≤ ∣∣ Q∣∣, Q∈F where ∆ ∈ [L′]. It is straightforward to map the above to overlapping certified regression. Let the ensemble have (2L′ + 1) submodels. Function htr partitions the training set into m blocks with the blocks denoted D(1), . . . , D(m). Define the block mapping function as: Qj, j ≤ L′ hf (j) :=  . (B.17)∅, Otherwise Intuitively, each of the first L′ submodels is trained on one of the subsets in Q, while the remaining models are not trained on any data. Let all submodels be constant afunction. Define⌈th⌉e submodel function as−∞, l ≤ ∆+ L 2 fl(xte) :=  . (B.18)⌈ ⌉ ∞, Otherwise For any finite ξ, |Vl| = ∆+ L . Applying Theo⌈rem⌉ 4.8, the number of submodels2 overlapping certified regression perturbs is |V | − Ll = ∆.2 167 Overlapping certified regression’s robustness R is the solution to the original partial set cover problem because 1. Only models with index l ≤ L′ will be perturbed since all other submodels have no training data. 2. The training set of each of these L′ submodels maps directly to a subset in Q. 3. Overlapping certified regression seeks to find the minimum number of dataset blocks that must be modified to perturb the median prediction. In this formulation, the number of blocks to be modified is ∆ – same as in the original partial-set cover problem. If overlapping certified regression were solvable in polynomial-time, then partial set cover would also be solvable in polynomial time. However, partial set cover is NP-hard, meaning overlapping certified regression must also be NP-hard. Proof of Corollary 4.10.1 Proof. From Lemma 4.10 above, (unweighted) overlapping certified regression is NP-hard. The unweighted case trivially maps to the weighted one where ∀l rl = 1. 
Therefore, weighted OCR must be at least as hard as the unweighted case meaning W-OCR is also NP-hard. Proof of Lemma 4.11 Proof. By construction Given a deterministic training algorithm, a model’s prediction can be certified against the deletion of any subsetD ⊂ D by training a model on just datasetD \D and verifying the prediction does not violate the associated invariant, i.e., fD\D(xte) ≤ ξ. 168 Consider training a separate model on each subset of D of size at least n− r + 1. If all of those models also satisfy the invariant, then by construction, r − 1 deletions or fewer are insufficient to violate the invariant. If r − 1 deletions are not enough, then at least r deletions are required. Lemma 4.11’s proof above only applies if model prediction and training is deterministic, i.e., repeating training and then the prediction always yields the same predicted value. Otherwise, proof by construction would require verifying all random seeds for each subset of D. B.2 Proofs for Chapter 5 This section contains the proofs for the theoretical contributions that are either in or relevant to Chapter 5. Theorem 5.3 Proof. Let ∆ := ċy (x)− ċyru(x) ≤ ∀y′∈/Y\{y ,y ′ru} ċy (x)− ċy (x). (B.19)pl pl pl In words, vote-count difference ∆ between plurality label ypl and runner-up label yru is at least as small as the gap between ypl and any other label. In the worst case, a single feature perturbation changes a single submodel’s vote from plurality label ypl to a label of the adversary’s choosing. Each perturbed submodel prediction reduces the gap between the plurality label and the adversary’s chosen label by two. By Eq. (B.19), it takes the fewest number of vote changes for yru to overtake plurality label ypl with the proof following by induction. ∆ then lower bounds the certified robustness. When determining R, ∆ may be even or odd. We separately consider both cases below. Case #1: ∆ is odd. 169 Since ∆ is odd, there can never be a tie between labels ypl and yru, simplifying the analysis. Then, the maximum number of submodel predictions that can change without changing the plurality label is any R ∈ N satisfying ċyru(x) + 2R < ċy (x)pl ċy (x)− ċ (x) R < ⌊ pl yru2− ⌋ċy (x) ċ (x) R = ⌊ pl yru ⌋ ▷ R must be a whole number2 ċy (x)− ċpl y⌊ ru (x)− 1[yru < ypl] = ⌋ ▷ Subtract 1 no effect for odd ∆2 Gapvote(ypl, yru;x) = ▷ Eq. (5.1). 2 Case #2: ∆ is even. For even-valued ∆, ties can occur. If yru < ypl, the tie between ypl and yru is broken in favor of yru. Then, the number of submodel predictions that can change without changing the plurality label is any R ∈ N satisfying ċyru(x) + 1[yru < ypl] + 2R < ċy (x)pl ċy (x)− ċyru(x)− 1[y≤ ⌊ pl ru < ypl] R 2 ⌋ ċy (x)− ċyru(x)− 1[yru < ypl] R = pl⌊ ⌋ ▷ R is a whole number2 Gapvote(ypl, yru;x) = ▷ Eq. (5.1). 2 Theorem 5.3’s definition of R follows the same basic structure as that of deep partition aggregation [LF21, Eq. (10)]. Claims Related to Theorem 5.4 Lemma B.2. Let f1, . . . , fL be a set of L models where ∀l∈[L] fl : X → Y. Under submodel voting, label y ∈ Y is preferred over label y′ ∈ Y \ y w.r.t. instance x ∈ X if and only if Gapvote(y, y ′;x) ≥ 0. 170 Proof. Label y is preferred over label y′ in only two cases: 1. y receives more (sub)model votes than y′, i.e., ċy(x) > ċy′(x). 2. y and y′ receive the same number of votes and y < y′. In the first case, Gap ′ ′vote(y, y ;x) := ċy(x)− ċy′(x)− 1[y < y] ≥ 1− 1[y′ < y] ≥ 1− 1 = 0. In the second case, Gapvote(y, y ′;x) := ċy(x)− ċy′(x)− ′1[y < y] = 0− 1[y′ < y] = 0− 0 = 0. 
The reverse direction where Gap ′vote(y, y ;x) ≥ 0 =⇒ y is preferred over y′ can be proven by contradiction using similar logic as above. If y′ receives more votes than y, then Gap ′vote(y, y ;x) < 0, a contradiction. Similarly, if ċy(x) = ċy′(x) then necessarily y′ < y. This also leads to a contradiction as Gapvote(y, y ′;x) would be negative. Lemma B.3. Runoff Elections Case #1 Certified Feature Robustness Given submodel feature partition S1, . . . ,SL, let f be a voting-based ensemble of L submodels, where the l-th submodel uses only the features in Sl. For instance x ∈ X , let yRO be the label selected by the run-off decision function. The certified feature robustness of yRO getting overtaken in round #{2⌊of the run-off el⌋ect⌊ion is ⌋} Gap (ỹ , y) Gap (y , y) RCase1 : vote RO logit RO RO = min max , y∈Y\yRO 2 2 171 Proof. For a label y ∈ Y \ yRO to overtake yRO, two requirements must be simultaneously met: – y and yRO must be round #1’s top-two labels, and – y must be preferred over yRO in round #2. Let ỹRO ∈ Y \ ypl denote the other top-two label in round #1. Note that ỹRO may or may not be the same as y. The robustness of ỹRO to being overtaken by y in round #1 follows directly from Th⌊eorem 5.3 and equa⌋ls ′ Gapvote(ỹRO, y;x)R = . (B.20) 2 Concerning the second requirement, yRO is preferred over y in round #2 so long as Gaplogit(yRO, y;x) ≥ 0. Following similar logic as above, yRO’s certified feature robustness in round #2 is ⌊ ⌋ ′′ Gaplogit(yRO, y;x)R = . (B.21) 2 Since both requirements must hold, the certified feature robustness is lower bounded by both (i.e., the maximum) of Eqs. (B.20) and (B.21). Moreover, the optimal label y ∈ Y \ yRO is not determined a priori meaning all labels need to be checked. Lemma B.4. Runoff Elections Case #2 Certified Feature Robustness Given submodel feature partition S1, . . . ,SL, let f be a voting-based ensemble of L submodels, where the l-th submodel uses only the features in Sl. For instance x ∈ X , let yRO be the label selected by the run-off decision function. Define recursive function dp as 0 min{i, j} ≤ 1 and (i, j) ≠ (1, 1)dp[i, j] = 1 + min{dp[i− 2, j − 1], dp[i− 1, j − 2]} Otherwise (B.22) 172 Then yRO’s certified feature robustness of remaining in the top-two round #1 labels predicted by the submodels is [ ] RCase2 :RO = min dp gap , gap ′ y,y′∈Y\ y yyRO where gap ∗y∗ = max{0,Gapvote(yRO, y )}. Proof. Lemma B.2 proves that a label y is preferred over another label y′ iff Gap ′vote(y, y ;x) ≥ 0. For label yRO to be in round #1’s top two, no pair of labels can have negative submodel vote gaps w.r.t. yRO. Determining yRO’s round #1 certified feature robustness reduces to determining the maximum number of submodel votes that can be perturbed with it remaining guaranteed that both labels do not have negative submodel vote gaps. In the best case for an attacker, perturbing a single submodel changes the submodel’s predicted label from yRO to a label of the attacker’s choosing, e.g., y ≠ yRO; this perturbation decreases Gapvote(yRO, y;x) by 2. For all other y ′ ∈ Y \ {yRO, y}, this perturbation also decreases Gapvote(yRO, y ′;x) by 1. By definition, y Case2RO is in the top-two round #1 labels, meaning RRO ≥ 0. Consider first when max{Gapvote(yRO, y),Gapvote(yRO, y′)} ≤ 1 and (i, j) ̸= (1, 1). The attacker perturbs whichever label y, y′ has the larger submodel vote gap. 
Since at most one of these two labels has a positive gap, an additional submodel perturbation could make both Gap ′vote(yRO, y) and Gapvote(yRO, y ) negative meaning no further feature perturbations are possible. In the special case of i = j = 1, perturbing a submodel predicting either label y or y′ never causes the other label’s submodel vote gap to be negative meaning one additional submodel feature perturbation is possible. When max{Gapvote(yRO, y),Gap ′vote(yRO, y )} > 1, the proof follows by induction where recursive function dp returns the fewest number of submodel perturbations required given y, y′ ∈ Y . 173 Since the attacker’s optimal pair of labels y, y′ is not determined a priori, Eq. (5.7)’s feature guarantee considers all pairs of labels and returns the robustness of the pair most advantageous to the attacker. Theorem 5.4 Proof. For a given x ∈ X , there are only two possible ways that run-off prediction yRO ∈ Y can be perturbed, namely: 1. yRO loses in run-off’s second round. 2. yRO fails to qualify for the second round by not being in the top two labels in round #1. These two cases align directly with Lemmas B.3 and B.4, respectively. An optimal attacker targets whichever of the two cases requires fewer feature perturbations. Therefore, run-off’s certified feature robustness is the minimum of Eqs. (5.5) and (5.7). B.3 Proof for Chapter 6 This section contains the proofs for the theorem in Chapter 6. Theorem. Let loss function L̃ : R → R≥0 be twice-differentiable and strictly convex as well as either even2 or mon(ot)onicall∥y decreasi∥ng. T∥hen, i(t h)o∥lds that∥ L̃ (a) < L̃ a′ =⇒ ∥∇ ∥ ∥ ∥aL̃ (a)∥ < ∥∇ ′aL̃ a ∥ . (B.23) 2 2 Proof. Theorem 6.1 specifies that property, ∥∥ ∥∥ ∥∥ ∥∥ L̃ (a) < L̃ (a′) =⇒ ∥∇aL̃ (a)∥ < ∥∇aL̃ (a′)∥ , 2 2 2“Even” denotes that the function satisfies ∀a L̃ (a) = L̃ (−a). 174 holds when loss function L̃ is strictly convex (i.e., ∀ ∇2a∈R aL̃ (a) > 0) and either monotonically decreasing or even. We prove the claim separately for these two disjoint cases. Case #1: Monotonically Decreasing For any monotonically decreasing L̃ , by definition L̃ (a) < L̃ (a′) =⇒ a > a′. Then, given ∀ 2a∈R ∇aL̃ (a) > 0, it holds that ∇aL̃ (a) > ∇aL̃ (a′). (B.24) For any scalar, monotonically decreasing function L̃ , it holds that ∇aL̃ (a′) ≤ 0 meaning Eq. (B.24)’s inequa∥lity flips w∥.r.t. ∥L2 norms, i.e.,∥∥ ∥ ∥ ∥∇ ∥aL̃ (a)∥ < ∥∇aL̃ (a′)∥ , (B.25) 2 2 as for any x,x′ ∈ R≤0 it holds that x > x′ =⇒ ∥x∥2 < ∥x′∥2. Case #2: Even Formally, a function L̃ is even if ∀a L̃ (a) = L̃ (−a). (B.26) For even L̃ , it holds that ∇aL̃ (0) = 0 provided twice differentiability. Given ∀ 2a∈R ∇aL̃ (a) > 0, then ∀a<0 ∇aL̃ (a) < 0. Hence over restricted domain R≤0, L̃ is monotonically decreasing. Above it was shown that Eq. (6.8) holds for monotonically decreasing functions so ∥ −| | −| ′| ⇒ ∥∥∇ −| | ∥∥∥ ∥∥∥ ∥∥L̃ ( a ) < L̃ ( a ) = aL̃ ( a ) < ∇aL̃ (−|a′|)∥ . (B.27) 2 2 Evenness induces function∣symmetr∣y ab∣out the origin so ∀ ∣a ∣ ∣∇ ∣ ∣ ∣aL̃ (a)∣ = ∣∇aL̃ (−a)∣, (B.28) 175 and by extension ∥ ∀ ∥a ∥ ∥∥∥ ∥∥∥∇ ∇ ∥∥aL̃ (a) = aL̃ (−a)∥ . (B.29) 2 2 Eqs. (B.26) and (B.29) allow Eq. (B.27)’s absolute values and negations to be dropped completing the proof. 176 APPENDIX C DETAILED EMPIRICAL RESULTS This chapter contains previously published, coauthored material [HL21; HL22a; HL23c; HL23a]. Hammoudeh wrote this complete section, coded all experiments, designed the experiments, and analyzed all the results. Lowd provided supervision, editorial suggestions, and input on experiment design. 
Zayd Hammoudeh and Daniel Lowd. “Simple, Attack-Agnostic Defense Against Targeted Training Set Attacks Using Cosine Similarity”. In: Proceedings of the 3rd ICML Workshop on Uncertainty and Robustness in Deep Learning. UDL’21. 2021 Zayd Hammoudeh and Daniel Lowd. “Identifying a Training-Set Attack’s Target Using Renormalized Influence Estimation”. In: Proceedings of the 29th ACM SIGSAC Conference on Computer and Communications Security. CCS’22. Los Angeles, CA: Association for Computing Machinery, 2022. url: https://arxiv.org/abs/2201.10055 Zayd Hammoudeh and Daniel Lowd. “Reducing Certified Regression to Certified Classification for General Poisoning Attacks”. In: Proceedings of the 1st IEEE Conference on Secure and Trustworthy Machine Learning. SaTML’23. 2023. url: https://arxiv.org/abs/2208.13904 Zayd Hammoudeh and Daniel Lowd. “Feature Partition Aggregation: A Fast Certified Defense Against a Union of ℓ0 Attacks”. In: Proceedings of the 2nd ICML Workshop on New Frontiers in Adversarial Machine Learning. AdvML-Frontiers’23. 2023. url: https://arxiv.org/abs/2302.11628 177 This chapter provides detailed empirical results that were excluded from the main body of the dissertation to improve readability and brevity. We break these detailed results by the corresponding chapter in the main paper. C.1 Chapter 4 Detailed Results C.1.1 Baseline Accuracy. Table C.12 shows the baseline accuracy when a model is trained on all of training set D (i.e., q = 1). For each dataset, the model architecture (either ridge regression or XGBoost) aligns with those used for Sec. 4.8’s ensembles. See Table 1. Table C.12. Baseline Accuracy: Summary of the baseline (i.e., uncertified) accuracy mean and standard deviation for Sec. 4.8’s six datasets. Submodels were trained on all of training set D (i.e., q = 1). Beside each dataset’s name is the submodel architecture used by the ensemble. Threshold ξ matches values in Table 1. Dataset Submodel Base Acc. (%) Ames XGBoost 90.4± 2.4 Austin XGBoost 71.3± 4.1 Diamonds Ridge 73.6± 4.0 Weather Ridge 85.9± 3.4 Life XGBoost 92.7± 3.1 Spambase Ridge 87.5± 2.9 C.1.2 Numerical Results. Fig. 7 visualizes our certified regressors’ certified robustness on six datasets – five regression and one binary classification. This section provides the certified accuracy in numerical form, including the associated variance. 178 Table C.13. Ames Housing Full Results: Certified accuracy mean and standard deviation for the Ames Housing [Coc11] dataset. Each ensemble submodel was trained on 1 -th of the training set with three q values tested per dataset, while kNN-CR was q always trained on the whole training set (i.e., q = 1). The certified accuracy results of five robustness values (R) are reported per q value. Also reported as a baseline is the uncertified accuracy (R = 0) when training a single model on all of training set D (q = 1). Results are averaged across 10 trials per method, with each R’s best mean certified accuracy in bold. 
q R PCR OCR W-PCR W-OCR kNN-CR 1 0 90.4 ± 2.4 90.4 ± 2.4 90.4 ± 2.4 90.4 ± 2.4 54.3 ± 3.8 1 82.3 ± 3.1 86.8 ± 2.8 82.3 ± 3.1 85.2 ± 2.9 54.3 ± 3.8 4 76.3 ± 3.7 81.2 ± 2.9 76.3 ± 3.7 80.0 ± 2.7 54.0 ± 3.6 25 8 57.4 ± 3.8 69.6 ± 2.7 57.5 ± 3.7 70.1 ± 3.2 53.1 ± 3.5 12 11.0 ± 3.7 28.2 ± 3.8 23.7 ± 5.1 48.8 ± 4.4 52.2 ± 3.9 16 0.0 ± 0.0 0.0 ± 0.0 11.3 ± 3.8 15.3 ± 3.2 51.5 ± 3.8 1 70.4 ± 3.7 73.9 ± 1.7 70.4 ± 3.7 73.1 ± 1.6 54.3 ± 3.8 10 62.8 ± 3.4 66.5 ± 2.0 62.8 ± 3.4 65.9 ± 1.9 53.1 ± 3.5 125 20 44.9 ± 4.2 52.2 ± 2.9 44.9 ± 4.2 52.2 ± 3.0 51.2 ± 3.6 30 21.8 ± 4.2 28.7 ± 3.8 21.8 ± 4.2 31.1 ± 3.3 49.1 ± 3.7 40 2.5 ± 1.3 3.9 ± 1.7 2.5 ± 1.3 5.9 ± 2.3 48.5 ± 4.0 1 63.1 ± 3.4 66.3 ± 3.8 63.1 ± 3.4 66.0 ± 3.7 54.3 ± 3.8 20 51.8 ± 3.2 56.4 ± 2.9 51.8 ± 3.2 57.9 ± 2.9 51.2 ± 3.6 251 40 37.1 ± 3.2 42.5 ± 3.8 37.1 ± 3.2 45.3 ± 2.8 48.5 ± 4.0 60 15.3 ± 3.8 22.5 ± 3.4 15.3 ± 3.8 32.8 ± 3.7 44.1 ± 4.3 80 0.2 ± 0.4 0.6 ± 0.5 0.2 ± 0.4 10.9 ± 2.5 40.1 ± 3.9 179 Table C.14. Austin Housing Full Results: Certified accuracy mean and standard deviation for the Austin Housing [Pie21] dataset. Each ensemble submodel was trained on 1 -th of the training set with three q values tested per dataset, while kNN-CR was q always trained on the whole training set (i.e., q = 1). The certified accuracy results of five robustness values (R) are reported per q value. Also reported as a baseline is the uncertified accuracy (R = 0) when training a single model on all of training set D (q = 1). Results are averaged across 10 trials per method, with each R’s best mean certified accuracy in bold. q R PCR OCR W-PCR W-OCR kNN-CR 1 0 71.3 ± 4.1 71.3 ± 4.1 71.3 ± 4.1 71.3 ± 4.1 35.3 ± 4.8 1 59.9 ± 4.5 63.7 ± 4.6 59.9 ± 4.5 61.7 ± 4.6 35.3 ± 4.8 5 49.6 ± 5.1 52.9 ± 3.4 49.6 ± 5.1 50.8 ± 4.2 35.2 ± 4.8 51 10 29.9 ± 3.7 35.3 ± 2.8 29.9 ± 3.7 31.8 ± 3.2 34.9 ± 4.9 15 9.2 ± 2.0 12.6 ± 2.8 9.2 ± 2.0 10.1 ± 2.7 34.7 ± 4.9 20 0.5 ± 0.5 0.3 ± 0.7 0.5 ± 0.5 0.0 ± 0.0 34.6 ± 4.6 1 51.0 ± 3.9 52.0 ± 4.1 51.0 ± 3.9 51.1 ± 4.1 35.3 ± 4.8 20 41.4 ± 3.6 43.3 ± 5.9 41.4 ± 3.6 43.3 ± 5.9 34.6 ± 4.6 301 40 29.7 ± 4.1 32.2 ± 5.2 29.7 ± 4.1 33.7 ± 5.7 34.3 ± 4.7 60 15.4 ± 3.0 19.1 ± 4.1 15.4 ± 3.0 22.7 ± 4.9 34.0 ± 4.5 80 3.2 ± 1.8 3.1 ± 1.2 3.2 ± 1.8 7.7 ± 3.4 32.9 ± 4.5 1 43.9 ± 5.0 42.7 ± 5.5 43.9 ± 5.0 43.6 ± 5.7 35.3 ± 4.8 40 34.5 ± 6.0 35.0 ± 6.2 34.5 ± 6.0 36.9 ± 5.9 34.3 ± 4.7 701 80 25.3 ± 4.8 24.7 ± 6.1 25.3 ± 4.8 27.0 ± 6.1 32.9 ± 4.5 120 13.1 ± 2.6 14.6 ± 5.0 13.1 ± 2.6 18.9 ± 4.4 31.6 ± 4.9 160 2.7 ± 0.9 4.8 ± 3.3 2.7 ± 0.9 9.1 ± 2.9 30.0 ± 4.7 180 Table C.15. Diamonds Full Results: Certified accuracy mean and standard deviation for the Diamonds [Wic16] dataset. Each ensemble submodel was trained on 1 -th of q the training set with three q values tested per dataset, while kNN-CR was always trained on the whole training set (i.e., q = 1). The certified accuracy results of five robustness values (R) are reported per q value. Also reported as a baseline is the uncertified accuracy (R = 0) when training a single model on all of training set D (q = 1). Results are averaged across 10 trials per method, with each R’s best mean certified accuracy in bold. 
q R PCR OCR W-PCR W-OCR kNN-CR 1 0 73.6 ± 4.0 73.6 ± 4.0 73.6 ± 4.0 73.6 ± 4.0 15.7 ± 3.5 1 74.6 ± 3.8 74.8 ± 4.5 74.6 ± 3.8 74.7 ± 4.4 15.7 ± 3.5 35 64.4 ± 4.6 67.1 ± 4.8 67.2 ± 4.6 69.7 ± 4.1 15.6 ± 3.4 151 70 38.6 ± 5.7 42.2 ± 4.0 62.4 ± 4.3 64.7 ± 5.2 15.5 ± 3.4 105 0.0 ± 0.0 0.0 ± 0.0 54.5 ± 5.9 57.4 ± 5.2 15.2 ± 3.3 140 0.0 ± 0.0 0.0 ± 0.0 35.4 ± 5.8 34.3 ± 7.1 14.8 ± 3.4 1 77.3 ± 4.2 75.8 ± 3.9 77.3 ± 4.2 75.7 ± 4.1 15.7 ± 3.5 75 66.2 ± 4.0 65.0 ± 4.4 66.3 ± 4.0 68.7 ± 4.4 15.5 ± 3.4 501 150 50.7 ± 4.9 48.2 ± 4.5 57.8 ± 4.7 59.6 ± 4.9 14.8 ± 3.4 300 0.0 ± 0.0 0.0 ± 0.0 38.0 ± 6.1 36.2 ± 3.2 12.3 ± 3.4 450 0.0 ± 0.0 0.0 ± 0.0 8.8 ± 3.3 9.0 ± 2.1 10.7 ± 2.9 1 75.2 ± 4.1 74.9 ± 5.5 75.2 ± 4.1 74.9 ± 5.5 15.7 ± 3.5 150 56.0 ± 4.9 56.3 ± 5.8 56.0 ± 4.9 62.8 ± 6.1 14.8 ± 3.4 1001 300 24.7 ± 4.4 25.3 ± 4.8 29.5 ± 4.1 42.3 ± 6.4 12.3 ± 3.4 450 0.0 ± 0.0 0.0 ± 0.0 16.9 ± 4.0 17.9 ± 5.1 10.7 ± 2.9 600 0.0 ± 0.0 0.0 ± 0.0 4.2 ± 1.5 3.3 ± 3.1 9.6 ± 3.1 181 Table C.16. Weather Full Results: Certified accuracy mean and standard deviation for the Weather [Mal+21] dataset. Each ensemble submodel was trained on 1 -th of q the training set with three q values tested per dataset, while kNN-CR was always trained on the whole training set (i.e., q = 1). The certified accuracy results of five robustness values (R) are reported per q value. Also reported as a baseline is the uncertified accuracy (R = 0) when training a single model on all of training set D (q = 1). Results are averaged across 10 trials per method, with each R’s best mean certified accuracy in bold. q R PCR OCR W-PCR W-OCR kNN-CR 1 0 85.9 ± 3.4 85.9 ± 3.4 85.9 ± 3.4 85.9 ± 3.4 23.8 ± 4.5 1 86.0 ± 3.1 86.5 ± 3.8 86.0 ± 3.1 86.5 ± 3.8 23.8 ± 4.5 10 83.9 ± 3.5 84.6 ± 3.8 83.9 ± 3.5 85.0 ± 3.8 23.8 ± 4.5 51 20 82.0 ± 3.5 82.3 ± 4.2 83.1 ± 3.4 84.0 ± 3.8 23.8 ± 4.5 35 0.0 ± 0.0 0.0 ± 0.0 81.8 ± 3.7 82.0 ± 4.4 23.8 ± 4.5 50 0.0 ± 0.0 0.0 ± 0.0 76.8 ± 5.1 75.8 ± 4.9 23.8 ± 4.5 1 85.2 ± 3.9 85.2 ± 4.2 85.2 ± 3.9 85.2 ± 4.2 23.8 ± 4.5 300 75.8 ± 4.3 77.6 ± 4.2 76.8 ± 4.2 79.4 ± 4.2 23.4 ± 4.6 1501 600 54.3 ± 4.7 55.1 ± 5.4 71.4 ± 3.7 72.2 ± 5.1 22.9 ± 4.3 1000 0.0 ± 0.0 0.0 ± 0.0 56.5 ± 5.1 57.4 ± 4.6 22.0 ± 4.6 1400 0.0 ± 0.0 0.0 ± 0.0 22.9 ± 2.6 22.5 ± 3.2 21.8 ± 4.8 1 86.7 ± 2.7 84.6 ± 2.9 86.7 ± 2.7 84.6 ± 2.9 23.8 ± 4.5 600 67.7 ± 2.7 66.9 ± 4.0 68.1 ± 2.9 71.5 ± 4.0 22.9 ± 4.3 3001 1200 25.7 ± 5.8 25.8 ± 4.9 55.0 ± 4.2 56.2 ± 3.8 21.9 ± 4.7 1800 0.0 ± 0.0 0.0 ± 0.0 35.8 ± 4.8 34.7 ± 4.0 21.5 ± 4.7 2400 0.0 ± 0.0 0.0 ± 0.0 9.3 ± 3.0 9.9 ± 2.5 20.5 ± 4.9 182 Table C.17. Life Full Results: Certified accuracy mean and standard deviation for the Life [Raj21] dataset. Each ensemble submodel was trained on 1 -th of the training q set with three q values tested per dataset, while kNN-CR was always trained on the whole training set (i.e., q = 1). The certified accuracy results of five robustness values (R) are reported per q value. Also reported as a baseline is the uncertified accuracy (R = 0) when training a single model on all of training set D (q = 1). Results are averaged across 10 trials per method, with each R’s best mean certified accuracy in bold. 
q R PCR OCR W-PCR W-OCR kNN-CR 1 0 92.7 ± 3.1 92.7 ± 3.1 92.7 ± 3.1 92.7 ± 3.1 34.6 ± 3.1 1 77.7 ± 4.4 80.2 ± 4.0 77.7 ± 4.4 78.3 ± 5.2 34.6 ± 3.1 5 69.3 ± 4.9 71.5 ± 4.7 69.3 ± 4.9 71.4 ± 4.7 33.8 ± 2.8 25 10 43.3 ± 5.6 54.2 ± 6.0 47.3 ± 5.9 60.8 ± 4.9 32.9 ± 2.8 15 0.0 ± 0.0 0.0 ± 0.0 23.8 ± 4.0 33.1 ± 3.2 32.1 ± 2.9 20 0.0 ± 0.0 0.0 ± 0.0 9.5 ± 2.9 11.4 ± 3.2 31.1 ± 2.3 1 71.1 ± 3.8 71.5 ± 4.3 71.1 ± 3.8 70.8 ± 4.1 34.6 ± 3.1 10 58.9 ± 4.3 61.8 ± 5.1 58.9 ± 4.3 61.8 ± 5.1 32.9 ± 2.8 101 20 40.5 ± 5.8 43.9 ± 4.3 40.5 ± 5.8 45.4 ± 4.7 31.1 ± 2.3 30 20.7 ± 3.8 22.8 ± 4.4 20.7 ± 3.8 26.4 ± 3.9 28.5 ± 2.5 40 4.4 ± 2.4 3.5 ± 1.4 4.6 ± 2.5 10.1 ± 2.7 26.9 ± 2.4 1 62.9 ± 4.1 66.3 ± 3.0 62.9 ± 4.1 65.7 ± 2.4 34.6 ± 3.1 30 46.6 ± 3.8 49.0 ± 2.8 46.6 ± 3.8 52.9 ± 3.0 28.5 ± 2.5 201 60 23.3 ± 2.7 24.6 ± 4.1 24.4 ± 2.6 34.4 ± 3.9 23.4 ± 2.3 90 0.1 ± 0.3 0.6 ± 0.5 12.8 ± 2.9 18.0 ± 3.6 18.1 ± 2.1 120 0.0 ± 0.0 0.0 ± 0.0 4.1 ± 1.6 4.4 ± 2.2 8.5 ± 1.4 183 Table C.18. Spambase Full Results: Certified accuracy mean and standard deviation for the Spambase [Hop+17] dataset. Each ensemble submodel was trained on 1 -th of q the training set with three q values tested per dataset, while kNN-CR was always trained on the whole training set (i.e., q = 1). The certified accuracy results of five robustness values (R) are reported per q value. Also reported as a baseline is the uncertified accuracy (R = 0) when training a single model on all of training set D (q = 1). Results are averaged across 10 trials per method, with each R’s best mean certified accuracy in bold. q R PCR OCR W-PCR W-OCR kNN-CR 1 0 87.5 ± 2.9 87.5 ± 2.9 87.5 ± 2.9 87.5 ± 2.9 64.0 ± 4.3 1 87.6 ± 3.5 87.1 ± 3.8 87.6 ± 3.5 85.8 ± 3.6 64.0 ± 4.3 5 80.3 ± 3.8 81.0 ± 3.4 81.1 ± 3.6 83.5 ± 3.7 63.6 ± 4.1 25 10 57.3 ± 4.5 65.0 ± 3.1 73.2 ± 3.8 76.4 ± 3.8 63.4 ± 4.3 15 0.0 ± 0.0 0.0 ± 0.0 61.5 ± 4.8 63.1 ± 2.4 63.2 ± 4.4 20 0.0 ± 0.0 0.0 ± 0.0 42.5 ± 4.4 38.7 ± 4.2 63.0 ± 4.4 1 87.4 ± 2.9 87.2 ± 2.2 87.4 ± 2.9 86.7 ± 2.6 64.0 ± 4.3 25 69.1 ± 4.3 70.2 ± 5.5 69.1 ± 4.3 75.7 ± 4.9 63.0 ± 4.4 151 50 22.8 ± 5.8 24.9 ± 4.0 35.4 ± 6.3 52.8 ± 4.0 62.0 ± 4.8 75 0.0 ± 0.0 0.0 ± 0.0 14.8 ± 3.0 23.0 ± 3.9 61.8 ± 4.7 100 0.0 ± 0.0 0.0 ± 0.0 3.4 ± 2.2 5.2 ± 2.4 61.3 ± 4.2 1 83.1 ± 2.8 86.2 ± 3.3 83.1 ± 2.8 86.0 ± 3.2 64.0 ± 4.3 45 65.1 ± 4.7 68.6 ± 3.9 65.1 ± 4.7 72.1 ± 3.9 62.3 ± 4.3 301 90 30.4 ± 3.5 34.6 ± 4.9 33.6 ± 3.2 53.7 ± 2.7 61.7 ± 4.6 135 0.5 ± 0.7 0.1 ± 0.3 23.7 ± 3.5 31.1 ± 4.6 60.2 ± 4.2 180 0.0 ± 0.0 0.0 ± 0.0 7.2 ± 2.5 11.3 ± 2.5 58.3 ± 4.3 184 C.1.3 kNN-CR Full Certified Accuracy Plots. To improve readability, Fig. 7 does not show kNN-CR’s full certified accuracy trend. Instead, Fig. C.19 below plots kNN-CR’s full mean certified accuracy against that of W-OCR (using each dataset’s maximum q value) for each of Sec. 4.8’s six datasets. Fig. C.19 also visualizes the variance of each method by showing one standard deviation of the certified accuracy as a shaded region around the mean line. In summary, while W-OCR certifies more instances (i.e., has larger peak certified accuracy), its maximum certified robustness R is (significantly) smaller than that of kNN-CR. Table C.19. W-OCR q Values: As detailed in Sec. 4.8.1, ensemble submodels were trained on 1 -th of the training data where q varies by dataset. Below are the W-OCR q q values used in Fig. C.19. 
Dataset | Ames | Austin | Diamonds | Weather | Life  | Spambase
q       | 251  | 701    | 1,001    | 3,001   | 201   | 301

[Figure C.19 comprises six panels – (a) Ames Housing, (b) Austin Housing, (c) Diamonds, (d) Weather, (e) Life, and (f) Spambase – each plotting Certified Acc. (%) against Certified Robustness (R) for kNN-CR (q = 1) and W-OCR (q varies).]

Figure C.19. kNN-CR vs. W-OCR Certified Accuracy: Full plots of the mean certified accuracy for Sec. 4.8’s six datasets. The shaded regions visualize one standard deviation of the certified accuracy for each R value. W-OCR’s q value for each dataset is in Table C.19.

C.2 Chapter 5 Detailed Results

Limited space prevents us from including all experimental results in the main paper. We provide additional results below.

C.2.1 Non-Robust Accuracy. Table C.20 provides the non-robust (i.e., uncertified) accuracy when training a single model (L = 1) on each of Sec. 5.5’s four datasets. The non-robust accuracy provides an upper-bound reference for the maximum achievable accuracy given the training set and the model architectures we used. For regression, the “non-robust accuracy” denotes that the single model’s prediction satisfies the error bounds, i.e., ξl ≤ f(x) ≤ ξu. Given an arbitrary instance (x, y), we follow Hammoudeh and Lowd [HL23c] and use ξl = y − 3°C and ξu = y + 3°C for Weather, as well as ξl = y − 15%·y and ξu = y + 15%·y for Ames.

Table C.20. Non-Robust Accuracy: Prediction accuracy when training a single model on all model features, i.e., L = 1. These values represent an upper bound on the potential accuracy of our method given the training set, model architecture, and hyperparameters.

Dataset  | Accuracy
CIFAR10  | 95.40%
MNIST    | 99.57%
Weather  | 92.61%
Ames     | 88.05%

C.2.2 Detailed Median Certified Robustness Results. In Section 5.5.2 of the main paper, Tables 2 and 3 summarize the median certified robustness and classification accuracies of feature partition aggregation (FPA) and baseline randomized ablation [LF20b; Jia+22b]. In the tables, “[LF20b]” denotes Levine and Feizi’s [LF20b] original version of RA, and “[Jia+22b]” denotes Jia et al.’s [Jia+22b] improved RA; “Plural” denotes FPA using plurality voting as the decision function (Sec. 5.3.1) while “Run-Off” denotes FPA with Sec. 5.3.2’s run-off elections.

Recall that FPA’s primary hyperparameter is L – the number of ensemble submodels. RA’s primary hyperparameter is e – the number of kept (unchanged) pixels in each ablated input. L and e control the corresponding method’s accuracy-robustness trade-off, where smaller L and larger e entail better accuracy. As a rule of thumb, the fairest comparison across methods sets L ≈ d/e, since this relationship entails that each FPA and RA prediction uses approximately the same number of features from instance x.

This section explores the relationship between each method’s hyperparameter settings and the corresponding median robustness and classification accuracy. Each dataset’s results are split into separate tables similar to Levine and Feizi’s [LF20b, Tables 1 and 2] presentation in the original RA paper. For CIFAR10 and MNIST, FPA uses deterministic partitioning. Specifically, we use a striding strategy as Section 5.4.1 details.
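As a rough sketch of such a strided partition (Section 5.4.1 gives the exact scheme used in the experiments; the round-robin assignment below is only an illustrative assumption), feature i is assigned to submodel i mod L:

```python
import numpy as np

def strided_partition(d, L):
    """Illustrative strided feature partition: submodel j receives features
    {j, j + L, j + 2L, ...}, i.e., feature i is assigned to submodel i mod L."""
    return [np.arange(j, d, L) for j in range(L)]

# Example: MNIST flattens to d = 784 pixels; with L = 25 submodels,
# each submodel sees roughly 784 / 25 ≈ 31 of the pixels.
blocks = strided_partition(784, 25)
assert sum(len(b) for b in blocks) == 784                 # every feature is used
assert len(set(np.concatenate(blocks).tolist())) == 784   # and used exactly once
```

Under this assignment, each submodel receives approximately d/L features spread evenly across the input.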
Depending on the image dimensions, some stride lengths are substantially worse than others, leading to non-monotonic changes in median robustness as a function of L. Tables C.21 and C.22 do not report the particularly poor choices of L that severely degrade median robustness, e.g., when L is evenly divisible by the image width. 188 Below, any misclassified prediction is assigned robustness of −∞, meaning the median certified robustness can in some cases be negative. Table C.21. CIFAR10 Detailed Results: Classification accuracy (%) and median certified robustness (larger is better) for the CIFAR10 [KNH14] dataset (d = 1024) for our certified sparse defense, feature partition aggregation (FPA), and baseline randomized ablation (RA) across various hyperparameter settings. Each certification method’s hyperparameter setting with the best median robustness is shown in bold. The best overall median robustness is shown in blue. (a) Feature Partition Aggregation (Ours) (b) Randomized Ablation (RA – Baseline) Plural Run-Off [LF20b] [Jia+22b] L e Acc. (%) Rmed Acc. (%) Rmed Acc. (%) ρmed Acc. (%) ρmed 5 91.46 2 91.77 2 250 88.77 2 88.56 2 10 86.09 4 86.20 4 225 88.05 2 87.90 2 20 81.38 7 81.40 7 200 86.76 3 86.54 3 25 78.65 8 78.58 8 175 86.16 3 85.94 3 40 74.74 9 74.95 10 150 84.23 4 84.08 4 55 70.44 10 70.34 11 125 82.66 5 82.49 5 70 67.46 9 67.47 11 100 80.43 6 80.05 6 85 66.24 10 66.61 12 75 78.48 7 78.11 7 105 63.55 10 63.61 12 50 73.26 7 72.79 8 115 62.39 11 62.35 13 35 70.34 7 69.72 9 140 60.35 10 60.57 12 30 69.62 7 69.01 9 165 57.91 8 58.48 10 25 68.81 6 68.08 9 185 56.08 7 56.39 9 20 67.01 5 66.15 9 200 55.80 7 56.43 9 15 65.68 3 64.74 10 225 56.27 6 56.56 8 12 63.93 0 62.91 10 250 53.30 4 53.46 5 10 62.73 0 61.71 10 8 60.24 0 59.12 9 7 59.08 0 57.83 8 5 53.20 0 51.84 3 189 Table C.22. MNIST Detailed Results: Classification accuracy (%) and median certified robustness (larger is better) for the MNIST [LeC+98] dataset (d = 784) for our certified sparse defense, feature partition aggregation (FPA), and baseline randomized ablation (RA) across various hyperparameter settings. Each certification method’s hyperparameter setting with the best median robustness is shown in bold. The best overall median robustness is shown in blue. (a) Feature Partition Aggregation (Ours) (b) Randomized Ablation (RA – Baseline) Plural Run-Off [LF20b] [Jia+22b] L e Acc. (%) Rmed Acc. (%) Rmed Acc. (%) ρmed Acc. (%) ρmed 5 99.50 2 99.51 2 100 98.78 4 98.75 4 10 98.64 4 98.67 4 95 98.75 5 98.72 5 15 96.82 7 97.02 7 90 98.62 5 98.56 5 20 96.36 8 96.53 8 85 98.60 5 98.52 5 25 95.77 9 96.06 10 80 98.46 6 98.40 6 35 91.70 9 93.05 11 75 98.35 6 98.27 6 40 89.37 9 91.32 11 70 98.14 6 98.07 6 50 84.54 8 88.46 11 65 98.04 7 97.98 7 60 83.54 9 87.22 12 60 97.85 7 97.78 7 70 79.71 8 85.87 11 55 97.58 7 97.39 8 80 71.29 6 79.05 9 50 97.26 7 97.07 8 90 69.94 6 79.25 9 45 96.88 8 96.68 8 105 62.53 4 74.45 8 40 96.42 8 96.13 9 120 63.03 3 74.09 7 35 95.69 8 95.32 9 130 57.48 2 69.93 7 30 94.87 7 94.47 9 150 52.51 0 67.30 5 25 93.55 6 93.09 10 20 90.99 3 90.07 9 15 86.71 0 85.24 8 10 76.78 0 74.69 6 5 35.54 −∞ 32.89 −∞ 190 Table C.23. Weather Detailed Results: Classification accuracy (%) and median certified robustness (larger is better) for the Weather [Mal+21] dataset (d = 128) for our certified sparse defense, feature partition aggregation (FPA), and baseline randomized ablation (RA) across various hyperparameter settings. FPA considers only plurality voting-based certification (Sec. 
5.3.1) since the reduction is from certified regression to certified binary classification. FPA results are reported using both GBDTs [Ke+17] and linear submodels. Median robustness “−∞” denotes that the classification accuracy was less than 50%. Each approach’s hyperparameter setting with the best median robustness is shown in bold. The best overall median robustness is shown in blue. (a) Feature Partition Aggregation (Ours) (b) Randomized Ablation (RA – Baseline) LightGBM Linear [LF20b] [Jia+22b] L e Acc. (%) Rmed Acc. (%) Rmed Acc. (%) ρmed Acc. (%) ρmed 1 92.70 0 86.05 0 65 80.70 0 78.63 0 5 85.29 2 83.34 2 60 80.33 0 78.01 0 11 82.48 3 79.55 2 55 79.52 0 77.05 0 15 81.09 3 76.15 3 50 78.62 0 76.59 0 21 76.10 4 67.09 2 45 77.20 0 75.19 1 25 71.40 3 64.77 2 40 76.56 0 74.82 1 31 67.06 3 58.71 2 35 74.76 0 73.22 1 35 62.56 3 55.95 1 30 72.04 0 70.74 1 41 60.19 2 51.57 0 25 69.77 0 68.72 1 51 55.34 1 45.84 −∞ 20 66.94 0 65.87 1 75 42.20 −∞ 26.93 −∞ 16 63.89 0 63.10 1 101 28.67 −∞ 21.26 −∞ 12 58.59 0 57.74 1 8 53.44 0 52.82 0 6 47.94 −∞ 47.25 −∞ 4 40.70 −∞ 39.91 −∞ 191 Table C.24. Ames Detailed Results: Classification accuracy (%) and median certified robustness (larger is better) for the Ames [Coc11] dataset (d = 352) for our certified sparse defense, feature partition aggregation (FPA), and baseline randomized ablation (RA) across various hyperparameter settings. FPA considers only plurality voting-based certification (Sec. 5.3.1) since the reduction is from certified regression to certified binary classification. FPA results are reported using both GBDTs [Ke+17] and linear submodels. Median robustness “−∞” denotes that the classification accuracy was less than 50%. Each approach’s hyperparameter setting with the best median robustness is shown in bold. The best overall median robustness is shown in blue. (a) Feature Partition Aggregation (Ours) (b) Randomized Ablation (RA – Baseline) LightGBM Linear [LF20b] [Jia+22b] L e Acc. (%) Rmed Acc. (%) Rmed Acc. (%) ρmed Acc. (%) ρmed 1 88.05 0 89.25 0 70 68.60 0 66.89 0 5 84.64 1 82.08 1 60 68.94 0 67.24 1 11 78.50 2 74.40 1 50 67.58 1 66.89 1 15 73.04 2 66.55 2 40 61.77 1 61.77 1 21 65.53 3 61.60 2 35 61.09 0 60.07 1 25 61.77 2 57.34 1 30 57.68 0 57.00 1 31 57.68 2 53.58 0 25 53.58 0 52.56 1 35 55.97 1 50.34 0 20 51.54 0 49.49 −∞ 41 52.90 1 46.42 −∞ 15 45.05 −∞ 44.37 −∞ 51 47.10 −∞ 40.10 −∞ 10 37.20 −∞ 37.54 −∞ 75 36.86 −∞ 35.15 −∞ 5 33.79 −∞ 33.79 −∞ 192 C.2.3 Feature Partition Aggregation vs. Randomized Ablation Certified Accuracy Detailed Comparison. Levine and Feizi [LF20b] use median certified robustness and classification accuracy as the two primary metrics by which they compare RA against previous work. In this section, we present an alternative evaluation strategy comparing the methods’ certified accuracy across a range of robustness levels. Specifically, we consider the same four datasets from Section 5.5, namely classification datasets CIFAR10 [KNH14] and MNIST [LeC+98] as well as regression datasets Weather [Mal+21] and Ames [Coc11]. Like in Section 5.5, we report FPA’s performance using both the plurality-voting and run-off decision functions for classification and only plurality voting for regression. For baseline randomized ablation (RA), we again report the performance of Levine and Feizi’s [LF20b] original version of RA as well as the improved version by Jia et al. [Jia+22b]. This section also compares FPA and RA against a naive baseline that is generally low accuracy but maximally robust. 
For classification, the naive baseline always predicts f(x) = 1; for regression, the naive baseline always predicts the training set’s median target value.

Recall that hyperparameters L for FPA and e for baseline randomized ablation control the corresponding method’s accuracy versus robustness trade-off. Specifically, a smaller value of L and a larger value of e entail better accuracy. As a rule of thumb, the fairest comparison between FPA and RA is when L ≈ d/e, as each FPA and RA prediction, in expectation, then uses a comparable amount of information (i.e., number of features). For each dataset, we report each method’s certified accuracy across 10 hyperparameter settings, roughly following the rule of thumb above. Section C.2.3.1 presents the experimental results in tabular form, and Section C.2.3.2 visualizes the methods’ certified accuracy graphically.

C.2.3.1 Numerical Comparison of Feature Partition Aggregation and Randomized Ablation. Certified accuracy w.r.t. ψ ∈ N quantifies the fraction of correctly-classified test instances with certified robustness at least ψ. Tables C.25, C.26, C.27, and C.28 numerically display the certified accuracies for our certified feature defense, feature partition aggregation (FPA), and baseline randomized ablation (RA) for CIFAR10, MNIST, Weather, and Ames, respectively. For each dataset, the corresponding table lists the certified accuracy at 11 equally spaced certified robustness levels. Recall that RA’s ℓ0-norm robustness (Def. 5.2) is a strictly weaker guarantee than FPA’s certified feature robustness (Def. 5.1). Put simply, a true direct comparison is not possible here since FPA provides stronger certified guarantees than the baseline. Despite that, FPA can achieve larger certified accuracies than the baseline while simultaneously providing stronger guarantees.

Table C.25. CIFAR10 (d = 1024) certified accuracy for feature partition aggregation (FPA) and baseline randomized ablation (RA). “Plurality” denotes FPA with plurality voting as the decision function while “Run-Off” denotes FPA using run-off elections as the decision function. “[LF20b]” denotes Levine and Feizi’s [LF20b] original version of RA while “[Jia+22b]” denotes Jia et al.’s [Jia+22b] improved version of RA. We also consider an additional naive baseline that always predicts f(x) = 1. For each certified robustness level, each method’s best-performing hyperparameter setting is shown in bold with the overall best-performing method shown in blue. Cert. Hyper. Certified Robustness Method Alg.
Setting 0 13 26 39 52 65 78 91 104 117 130 Always f(x) = 1 N/A 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 5 91.46 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 25 78.65 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 35 69.62 36.35 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 55 70.44 44.06 10.06 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 85 66.24 46.67 26.87 7.77 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Plural 115 62.39 47.74 33.48 19.67 6.97 0.00 0.00 0.00 0.00 0.00 0.00 160 60.94 42.27 27.77 16.95 9.00 3.89 0.52 0.00 0.00 0.00 0.00 250 53.30 43.98 35.63 28.37 21.54 15.57 10.91 7.04 4.02 1.62 0.00 500 43.79 38.75 33.63 28.86 24.65 20.86 17.56 14.32 11.56 9.38 7.66 FPA 1024 33.01 29.70 26.95 24.14 21.68 19.33 17.24 15.41 13.92 12.29 11.05 (ours) 5 91.77 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 25 78.58 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 35 69.92 37.45 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 55 70.34 46.71 11.18 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 85 66.61 49.26 30.25 8.91 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Run-Off 115 62.35 50.04 36.76 22.64 8.21 0.00 0.00 0.00 0.00 0.00 0.00 160 61.34 45.54 32.71 21.16 11.96 5.06 0.56 0.00 0.00 0.00 0.00 250 53.46 45.48 38.40 31.70 25.24 19.02 13.48 8.94 4.99 1.88 0.00 500 44.58 39.58 35.25 31.17 27.60 24.21 20.57 17.62 14.74 12.33 10.25 1024 35.50 32.01 28.80 25.89 23.22 20.74 18.63 16.85 15.20 13.80 12.57 250 88.77 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 75 78.48 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 50 73.26 25.80 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 25 68.81 38.82 11.25 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 15 65.68 38.81 23.59 9.42 0.00 0.00 0.00 0.00 0.00 0.00 0.00 [LF20b] 10 62.73 37.60 27.46 17.72 9.74 1.89 0.00 0.00 0.00 0.00 0.00 7 59.08 33.44 25.65 18.58 12.56 7.77 3.71 1.09 0.00 0.00 0.00 5 53.20 28.47 22.80 17.85 14.04 10.10 6.87 4.20 2.31 0.94 0.05 2 40.44 14.03 12.37 10.62 9.12 7.91 6.96 5.95 5.16 4.51 3.98 1 21.16 4.37 3.87 3.37 2.91 2.58 2.35 1.90 1.68 1.42 1.21 RA 250 88.56 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 75 78.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 50 72.79 26.98 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 25 68.08 43.10 12.28 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 15 64.74 46.17 28.17 11.25 0.00 0.00 0.00 0.00 0.00 0.00 0.00 [Jia+22b] 10 61.71 47.54 34.36 22.44 11.99 2.31 0.00 0.00 0.00 0.00 0.00 7 57.83 46.43 35.75 26.23 17.70 10.79 4.96 1.33 0.00 0.00 0.00 5 51.84 43.08 34.70 27.14 20.77 15.27 10.36 6.32 3.34 1.21 0.06 2 38.70 33.84 29.15 12955.01 21.22 17.95 14.90 12.49 10.33 8.54 7.03 1 19.64 17.96 15.83 14.06 12.48 11.18 10.17 9.06 8.24 7.35 6.48 Table C.26. MNIST (d = 784) certified accuracy for feature partition aggregation (FPA) and baseline randomized ablation (RA). “Plurality” denotes FPA with plurality voting as the decision function while “Run-Off” denotes FPA using run-off elections as the decision function. “[LF20b]” denotes Levine and Feizi’s [LF20b] original version of RA while “[Jia+22b]” denotes Jia et al.’s [Jia+22b] improved version of RA. We also consider an additional naive baseline that always predicts f(x) = 1. For each certified robustness level, each method’s best performing hyperparameter setting is shown in bold with the overall best performing method shown in blue. Cert. Hyper. Certified Robustness Method Alg. 
Setting 0 4 8 12 16 20 24 28 32 36 40 Always f(x) = 1 N/A 11.35 11.35 11.35 11.35 11.35 11.35 11.35 11.35 11.35 11.35 11.35 5 99.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 10 98.64 87.16 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 25 95.77 86.48 66.42 20.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 35 91.70 79.49 59.53 35.95 13.18 0.00 0.00 0.00 0.00 0.00 0.00 60 83.54 70.30 54.72 39.10 26.26 16.08 7.95 1.78 0.00 0.00 0.00 Plural 75 74.99 61.44 47.75 34.97 25.34 17.90 12.43 8.11 3.89 0.42 0.00 90 69.94 57.11 43.89 33.01 24.52 17.89 12.99 9.16 6.24 3.22 0.71 105 62.53 50.33 39.10 29.27 22.13 16.52 13.04 10.51 8.42 6.61 4.63 130 57.48 46.68 36.45 28.38 22.70 18.52 15.23 12.54 10.45 8.38 6.30 FPA 240 28.13 24.67 21.81 19.57 17.63 16.33 15.16 14.40 13.79 13.00 12.30 (ours) 5 99.51 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 10 98.67 87.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 25 96.06 88.72 71.52 20.28 0.00 0.00 0.00 0.00 0.00 0.00 0.00 35 93.05 83.56 67.58 44.72 14.36 0.00 0.00 0.00 0.00 0.00 0.00 60 87.22 76.59 63.67 50.52 37.10 23.91 12.14 2.97 0.00 0.00 0.00 Run-Off 75 81.74 68.54 56.44 44.65 34.68 25.48 17.82 11.09 5.28 0.45 0.00 90 79.25 66.38 53.93 43.35 33.92 26.20 20.14 14.71 9.98 6.02 2.34 105 74.45 61.76 50.73 40.32 31.38 24.57 19.00 14.85 11.80 9.05 6.46 130 69.93 58.88 48.44 38.73 31.04 25.06 20.82 17.47 14.69 12.00 9.85 240 48.33 40.31 33.37 28.30 24.57 21.29 18.71 17.17 15.82 14.82 13.81 100 98.78 84.16 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 85 98.60 86.08 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 60 97.85 84.30 35.32 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 50 97.26 81.56 49.77 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 40 96.42 76.53 51.99 16.85 0.00 0.00 0.00 0.00 0.00 0.00 0.00 [LF20b] 30 94.87 66.97 46.33 26.88 7.30 0.00 0.00 0.00 0.00 0.00 0.00 20 90.99 48.11 34.38 23.77 15.23 7.50 0.96 0.00 0.00 0.00 0.00 10 76.78 20.36 16.22 13.08 10.62 8.40 5.99 3.72 1.54 0.16 0.00 5 35.54 10.85 10.31 9.75 9.17 8.69 7.86 6.90 5.73 4.42 3.23 3 16.91 11.13 10.96 10.70 10.51 10.19 9.84 9.41 8.87 8.21 7.04 RA 100 98.75 86.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 85 98.52 88.21 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 60 97.78 88.45 39.75 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 50 97.07 87.28 57.28 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 40 96.13 85.69 62.37 21.15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 [Jia+22b] 30 94.47 82.47 62.32 36.45 11.20 0.00 0.00 0.00 0.00 0.00 0.00 20 90.07 76.29 58.26 39.39 24.36 12.98 2.70 0.00 0.00 0.00 0.00 10 74.69 59.11 44.55 32.87 23.94 17.91 13.49 10.38 7.33 3.73 0.80 5 32.89 26.17 21.19 11976.56 15.76 14.46 13.43 12.52 11.51 10.77 10.05 3 15.91 14.97 13.90 13.10 12.46 12.01 11.71 11.50 11.40 11.30 11.30 Table C.27. Weather [Mal+21] dataset (d = 128) certified accuracy for feature partition aggregation (FPA) and baseline randomized ablation (RA). “[LF20b]” denotes Levine and Feizi’s [LF20b] original version of RA while “[Jia+22b]” denotes Jia et al.’s [Jia+22b] improved version of RA. Hammoudeh and Lowd’s [HL23c] reduction is from certified regression to certified binary classification. Run-off is identical to plurality voting under binary classification, so we report only the plurality voting results below. We also consider an additional naive baseline that always predicts the median training set target value (i.e., f(x) = med{y ni}i=1). For each certified robustness level, each method’s best performing hyperparameter setting is shown in bold with the overall best performing method shown in blue. 
These numerical results are visualized graphically as envelope plots in Figure C.21. Cert. Hyper. Certified Robustness Method Alg. Setting 0 1 2 3 4 5 6 7 8 9 10 Always f(x) = med{y ni}i=1 N/A 21.90 21.90 21.90 21.90 21.90 21.90 21.90 21.90 21.90 21.90 21.90 5 85.29 77.38 62.69 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 11 82.48 76.34 67.59 55.50 39.02 18.42 0.00 0.00 0.00 0.00 0.00 15 81.09 75.23 68.16 58.98 48.08 35.81 19.92 7.77 0.00 0.00 0.00 21 76.10 70.78 64.73 57.69 50.01 41.48 33.04 23.78 14.30 6.47 0.91 FPA 25 71.40 66.29 60.70 55.03 49.17 42.93 35.88 28.92 21.58 14.29 7.12 Plural (ours) 31 67.06 62.80 58.18 53.39 48.76 43.85 38.49 32.77 27.12 21.51 15.81 35 62.56 58.84 54.93 50.72 46.54 42.03 37.62 33.08 28.10 22.76 17.18 41 60.19 56.83 53.34 49.72 45.99 42.34 38.55 34.60 30.44 26.09 21.47 45 57.96 54.99 51.94 48.81 45.57 42.26 38.78 35.11 31.29 27.23 22.91 127 23.43 22.95 22.49 22.04 21.61 21.19 20.77 20.38 20.00 19.61 19.23 50 78.62 22.32 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 40 76.56 31.26 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 30 72.04 39.64 9.53 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 20 66.94 45.11 20.61 6.82 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16 63.89 45.77 26.67 11.64 3.83 0.04 0.00 0.00 0.00 0.00 0.00 [LF20b] 12 58.59 45.19 31.87 18.36 9.67 4.37 1.06 0.00 0.00 0.00 0.00 9 54.68 44.55 35.11 25.05 15.88 9.48 5.26 2.26 0.61 0.01 0.00 6 47.94 41.22 34.84 28.60 22.32 16.45 11.82 8.60 6.00 3.90 2.37 3 36.88 33.32 30.57 27.90 25.63 23.08 20.58 18.16 15.97 13.91 11.87 1 21.00 20.68 20.61 20.48 20.35 20.19 20.05 19.93 19.77 19.67 19.43 RA 50 76.59 47.32 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 40 74.82 53.84 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 30 70.74 56.18 31.24 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 20 65.87 56.66 44.24 26.06 3.94 0.00 0.00 0.00 0.00 0.00 0.00 16 63.10 55.29 46.24 34.49 19.75 5.20 0.00 0.00 0.00 0.00 0.00 [Jia+22b] 12 57.74 51.96 45.73 38.47 29.53 19.26 10.88 0.00 0.00 0.00 0.00 9 53.97 49.95 45.97 41.18 35.62 29.11 21.44 14.51 9.10 2.63 0.00 6 47.25 44.86 41.94 39.16 36.21 33.00 29.54 25.82 21.18 16.82 13.31 3 36.01 34.97 33.59 32.19 31.02 29.72 28.46 27.33 26.28 25.21 23.99 1 20.84 20.76 20.72 20.63 20.58 20.50 20.41 20.31 20.25 20.14 20.03 197 Table C.28. Ames [Coc11] dataset (d = 352) certified accuracy for feature partition aggregation (FPA) and baseline randomized ablation (RA). “[LF20b]” denotes Levine and Feizi’s [LF20b] original version of RA while “[Jia+22b]” denotes Jia et al.’s [Jia+22b] improved version of RA. Hammoudeh and Lowd’s [HL23c] reduction is from certified regression to certified binary classification. Run-off is identical to plurality voting under binary classification, so we report only the plurality voting results below. We also consider an additional naive baseline that always predicts the median training set target value (i.e., f(x) = med{y ni}i=1). For each certified robustness level, each method’s best performing hyperparameter setting is shown in bold with the overall best performing method shown in blue. These numerical results are visualized graphically as envelope plots in Figure C.21. Cert. Hyper. Certified Robustness Method Alg. 
Setting 0 1 2 3 4 5 6 7 8 9 10 Always f(x) = med{y }ni i=1 N/A 31.40 31.40 31.40 31.40 31.40 31.40 31.40 31.40 31.40 31.40 31.40 5 84.64 72.01 39.93 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 11 78.50 70.99 58.70 40.96 22.53 5.12 0.00 0.00 0.00 0.00 0.00 21 65.53 60.41 54.95 50.17 41.64 32.42 22.87 12.63 5.46 1.37 0.00 25 61.77 58.36 54.27 49.83 43.69 35.84 28.67 20.82 12.63 6.14 2.39 FPA 31 57.68 54.95 51.54 48.12 42.66 37.20 32.08 26.28 20.82 15.02 10.24 Plural (ours) 35 55.97 52.56 48.81 45.73 42.32 38.23 33.79 29.01 24.57 19.45 14.68 41 52.90 50.51 47.10 43.34 40.96 37.20 34.47 31.06 27.65 24.23 20.82 51 47.10 44.37 41.98 39.25 37.88 35.49 34.13 32.08 30.03 28.33 26.28 65 41.64 39.25 37.88 37.20 36.01 34.47 33.45 32.42 31.40 30.38 29.69 101 33.45 33.11 32.76 32.76 32.42 32.08 32.08 31.74 31.74 31.74 31.40 60 68.94 43.34 11.95 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 50 67.58 52.56 32.08 7.85 0.00 0.00 0.00 0.00 0.00 0.00 0.00 40 61.77 50.17 38.23 18.09 4.10 0.00 0.00 0.00 0.00 0.00 0.00 35 61.09 49.49 39.93 20.48 10.24 1.71 0.00 0.00 0.00 0.00 0.00 30 57.68 48.46 39.59 26.96 16.38 5.46 0.00 0.00 0.00 0.00 0.00 [LF20b] 25 53.58 47.78 38.91 27.65 20.82 15.02 4.10 0.34 0.00 0.00 0.00 20 51.54 43.34 38.23 32.76 26.28 20.48 15.02 7.85 2.39 0.00 0.00 15 45.05 39.25 36.18 34.81 29.69 27.99 23.21 19.45 13.99 9.90 5.80 10 37.20 36.18 35.15 33.11 32.76 31.40 28.67 26.62 25.26 24.57 22.87 5 33.79 33.11 32.76 32.08 32.08 32.08 31.74 31.40 31.06 30.38 30.38 RA 60 67.24 59.73 46.76 13.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00 50 66.89 59.73 48.81 31.40 7.17 0.00 0.00 0.00 0.00 0.00 0.00 40 61.77 55.63 49.49 38.57 25.60 6.48 0.00 0.00 0.00 0.00 0.00 35 60.07 52.90 48.12 38.91 31.06 16.38 2.39 0.00 0.00 0.00 0.00 30 57.00 51.88 47.10 41.30 34.81 26.96 15.36 2.39 0.00 0.00 0.00 [Jia+22b] 25 52.56 50.17 45.39 40.27 35.84 31.06 24.91 17.06 6.48 0.34 0.00 20 49.49 45.73 44.03 41.30 37.54 33.79 30.38 25.94 22.53 13.99 6.83 15 44.37 42.32 40.96 39.93 35.84 35.49 32.76 30.72 27.65 24.91 22.18 10 37.54 36.52 35.84 33.79 33.79 33.45 32.42 31.06 30.38 29.35 29.01 5 33.79 33.45 33.45 33.11 33.11 33.11 32.76 32.76 32.42 32.08 32.08 198 C.2.3.2 Graphical Comparison of Feature Partition Aggregation and Randomized Ablation. Recall that hyperparameters L for FPA and e for baseline randomized ablation control the corresponding method’s accuracy-robustness trade-off. Specifically, a smaller value of L and a larger value of e entails better accuracy. This section emulates a defender that tunes FPA’s and randomized ablation’s hyperparameters to maximize the certified accuracy at each individual robustness level individually. Tables C.25 through C.28 above report each method’s certified accuracy across 10 comparable hyperparameter settings. For a given method, each hyperparameter setting provides a certified accuracy versus certified robustness curve. This section considers each defense’s certified accuracy envelope. Specifically, an envelope in mathematics represents the supremum of a set of curves. Intuitively, taking the certified accuracy envelope emulates maximizing a method’s performance at each certified robustness level individually across the 10 hyperparameter settings. Figures C.20 and C.21 visualize the certified accuracy envelopes in two ways. First, Figures C.20a, C.20b, C.21a, and C.21b visualize the envelope curves themselves. These figures also visualize the same naive baselines considered in Sec. C.2.3.1 above (e.g., always predict label 1 for classification and median med{y }ni i=1 for regression). 
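The envelope curves in Subfigures C.20a, C.20b, C.21a, and C.21b can be recovered directly from the per-hyperparameter certified accuracy curves in Tables C.25–C.28 by taking their pointwise maximum. A minimal sketch with made-up curves (the actual robustness grids and accuracy values come from those tables):

```python
import numpy as np

# Toy example: three hypothetical certified accuracy curves, one per
# hyperparameter setting, evaluated on a shared certified robustness grid.
curves = np.array([
    [90.0, 60.0, 10.0,  0.0,  0.0],   # accurate but fragile (small L / large e)
    [70.0, 55.0, 40.0, 20.0,  0.0],   # middle of the accuracy-robustness trade-off
    [40.0, 35.0, 30.0, 25.0, 20.0],   # robust but less accurate (large L / small e)
])

# The envelope is the pointwise supremum over the hyperparameter settings,
# i.e., the best certified accuracy any setting attains at each robustness level.
envelope = curves.max(axis=0)   # -> [90., 60., 40., 25., 20.]
```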
Second, Figures C.20c, C.20d, C.21c, and C.21d visualize the improvement in certified accuracy between FPA and the two versions of randomized ablation across the range of certified robustness levels. A positive value in these four subfigures entails that FPA outperformed the corresponding baseline (i.e., FPA had a larger certified accuracy), while a negative value entails the baseline outperformed FPA.

For CIFAR10 and MNIST, FPA with run-off’s envelope had larger certified accuracy than the envelope of both versions of baseline RA across the entire certified robustness range (x-axis). Specifically, for Levine and Feizi’s [LF20b] version of RA, FPA with run-off’s certified accuracy advantage was as large as 14.17 and 24.28 percentage points (pp) for CIFAR10 and MNIST, respectively. For Jia et al.’s [Jia+22b] version of RA, FPA with run-off’s certified accuracy advantage was as large as 6.54pp and 12.74pp for CIFAR10 and MNIST, respectively.

For regression datasets Weather and Ames, FPA’s envelope had larger certified accuracy than the envelope of both versions of baseline RA across most of the certified accuracy range. At the largest robustness values, [Jia+22b] marginally outperformed both FPA and the naive baseline by <2pp. At smaller certified robustness values, FPA outperformed Jia et al.’s [Jia+22b] version of RA by up to 21.9pp and 17.4pp for Weather and Ames, respectively.

[Figure C.20 comprises four panels – (a) CIFAR10: Certified Accuracy Envelope, (b) MNIST: Certified Accuracy Envelope, (c) CIFAR10: FPA’s Certified Accuracy Improvement over RA, and (d) MNIST: FPA’s Certified Accuracy Improvement over RA – plotting Certified Accuracy (%) and Improvement in Certified Acc. (%) against Certified Robustness for Always f(x) = 1, FPA Plural (ours), FPA Run-Off (ours), RA [LF20b], and RA [Jia+22b].]

Figure C.20. Classification certified accuracy envelope for datasets CIFAR10 (d = 1024) and MNIST (d = 784) for feature partition aggregation (FPA) and baseline randomized ablation (RA). Each method’s envelope considers the corresponding hyperparameters in Tables C.25 and C.26, emulating a certified defense where the hyperparameters are roughly tuned to maximize the certified accuracy at each robustness level. Subfigures C.20a and C.20b visualize each method’s certified accuracy envelope (larger is better); also shown in these subfigures is a naive baseline where the decision function always predicts label f(x) = 1. Subfigures C.20c and C.20d visualize the improvement in certified accuracy when using FPA with the run-off decision function over the two randomized ablation baselines from Levine and Feizi [LF20b] and Jia et al. [Jia+22b]. The envelope plots’ underlying numerical values are provided in Table C.25 for CIFAR10 and Table C.26 for MNIST.

[Figure C.21 comprises four panels – (a) Weather: Certified Accuracy Envelope, (b) Ames: Certified Accuracy Envelope, (c) Weather: FPA’s Certified Accuracy Improvement over RA, and (d) Ames: FPA’s Certified Accuracy Improvement over RA – plotting Certified Accuracy (%) and Improvement in Certified Acc. (%) against Certified Robustness for Always f(x) = med{y_i}^n_{i=1}, FPA (Plural) (ours), RA [LF20b], and RA [Jia+22b].]

Figure C.21.
Regression certified accuracy envelope for the Weather [Mal+21] (d = 128) and Ames [Coc11] (d = 352) datasets for feature partition aggregation (FPA) and baseline randomized ablation (RA). Each method’s envelope considers the corresponding hyperparameters in Tables C.27 and C.28, emulating a certified defense where the hyperparameters are tuned to maximize each robustness level’s certified accuracy. Subfigures C.21a and C.21b visualize each method’s certified accuracy envelope (larger is better); also shown in these subfigures is a naive baseline that always predicts the median training data target value. Subfigures C.21c and C.21d visualize the improvement in certified accuracy when using FPA (with plurality voting) as the decision function over the two randomized ablation baselines from Levine and Feizi [LF20b] and Jia et al. [Jia+22b]. FPA outperforms randomized ablation for smaller certified robustness values, while Jia et al.’s [Jia+22b] version of RA marginally outperformed both FPA and t2h0e2naive baseline at larger robustness values. The envelope plots’ underlying numerical values are provided in Table C.27 for Weather and Table C.28 for Ames. Improvement in Certified Acc. (%) Certified Accuracy (%) Improvement in Certified Acc. (%) Certified Accuracy (%) C.3 Chapter 6 Detailed Results Section 6.5 provided averaged results for each related experimental setup. This section provides detailed results for each attack setup individually (including variance). C.3.1 Speech Recognition Backdoor Full Results. GAS (ours) GAS-L (ours) TracInCP TracIn Influence Functions Representer Point 1 0.8 0.6 0.4 0.2 0 0 → 1 1 → 2 2 → 3 3 → 4 4 → 5 5 → 6 6 → 7 7 → 8 8 → 9 9 → 0 Figure C.22. Speech Backdoor Adversarial Set Identification: Mean backdoor set (Dadv) identification AUPRC across 30 trials for all 10 class pairs with 21 ≤ |Dadv| ≤ 28 (varies by class pair, see Tab. D.57). GAS and GAS-L outperformed all baselines in all experiments, with GAS-L the overall top performer on 6/10 class pairs. See Table C.29 for the numerical results. Table C.29. Speech Backdoor Adversarial Set Identification: Mean AUPRC across 30 trials for speech backdoor dataset [Liu+18] with 21 ≤ |Dadv| ≤ 28. GAS(-L) always outperformed the baselines. Bold denotes the best mean performance. Mean results are shown graphically in Figs. 14 and C.22. Variance results appear in the original paper [HL22a, Sec. F.1.1]. Digits Ours Baselines ytarg → yadv GAS GAS-L TracInCP TracIn Inf. Func. Rep. Pt. 0 1 0.999 1.000 0.642 0.458 0.807 0.143 1 2 0.985 0.969 0.417 0.303 0.763 0.069 2 3 0.969 0.919 0.769 0.595 0.735 0.119 3 4 0.999 0.998 0.787 0.630 0.847 0.106 4 5 1.000 0.999 0.510 0.358 0.718 0.106 5 6 0.977 0.986 0.791 0.506 0.698 0.064 6 7 0.876 0.911 0.301 0.255 0.350 0.060 7 8 0.985 0.989 0.868 0.630 0.730 0.091 8 9 0.993 0.998 0.898 0.620 0.696 0.061 9 0 0.983 0.975 0.446 0.317 0.655 0.052 203 Dadv AUPRC FIT w/ GAS (ours) FIT w/ GAS-L (ours) Max. k-NN Distance Min. k-NN Distance Most Certain Least Certain Random 1 0.8 0.6 0.4 0.2 0 0 → 1 1 → 2 2 → 3 3 → 4 4 → 5 Figure C.23. Speech Backdoor Target Identification: See Table C.30 for numerical results. Table C.30. Speech Backdoor Target Identification: Bold denotes the best mean performance. Mean results are shown graphically in Figures 16 and C.23. Variance results appear in the original paper [HL22a, Sec. F.1.1]. 
Digits Ours Baselines ytarg → yadv GAS GAS-L Max k-NN Min k-NN Most Certain Least Certain Random 0 1 1 1 0.156 0.030 0.177 0.040 0.067 1 2 0.923 0.795 0.034 0.158 0.267 0.028 0.059 2 3 0.981 0.981 0.047 0.110 0.179 0.032 0.047 3 4 1 1 0.107 0.034 0.206 0.037 0.062 4 5 1 1 0.040 0.076 0.225 0.027 0.072 Table C.31. Speech Backdoor Attack Mitigation: Bold denotes the best mean performance with 10 trials per class pair. Aggregated results are shown in Table 6. Digits % Removed ASR % Test Acc. % Method ytarg yadv Dadv Dcl Orig. Ours Orig. Chg. GAS 100 0.06 0 0.0 0 1 100 97.7 GAS-L 100 0.03 0 0.0 GAS 100.0 0.02 0 0.0 1 2 100 97.7 GAS-L 99.8 0.09 0 0.0 GAS 93.7 0.08 0 –0.1 2 3 99.9 97.8 GAS-L 92.6 0.21 0 –0.1 GAS 98.7 0.10 0 –0.1 3 4 99.4 97.7 GAS-L 99.3 0.35 0 0.0 GAS 99.1 0.01 0 0.0 4 5 100 97.8 GAS-L 98.6 0.01 0 0.0 204 Target AUPRC C.3.2 Vision Backdoor Full Results. GAS (ours) GAS-L (ours) TracInCP TracIn Influence Functions Representer Point 1 0.8 0.6 0.4 0.2 0 1 Pixel 4 Pixel Blend 1 Pixel 4 Pixel Blend Auto → Dog Plane → Bird Figure C.24. Vision Backdoor Adversarial-Set Identification: Backdoor set, Dadv, identification mean AUPRC across >30 trials for Weber et al.’s [Web+23] three CIFAR10 backdoor attack patterns with a randomly selected reference ẑtarg. All experiments performed binary classification on randomly-initialized ResNet9. |Dadv| = 150. Notation ytarg → yadv. See Table C.32 for the numerical results. Table C.32. Vision Backdoor Adversarial-Set Identification: Backdoor set, Dadv, identification mean AUPRC across >30 trials for Weber et al.’s [Web+23] three CIFAR10 backdoor attack patterns with a randomly selected reference ẑtarg. All experiments performed binary classification on randomly-initialized ResNet9. |Dadv| = 150. Notation ytarg → yadv. Bold denotes the best mean performance. Mean results are shown graphically in Figures 14 and C.24. Variance results appear in the original paper [HL22a, Sec. F.1.2]. Classes Trigger Ours Baselines Pattern ytarg → yadv GAS GAS-L TracInCP TracIn Inf. Func. Rep. Pt. 1 Pixel 0.977 0.987 0.742 0.435 0.051 0.033 Auto → Dog 4 Pixel 0.992 0.996 0.552 0.255 0.088 0.022 Blend 0.999 1.000 0.809 0.426 0.062 0.030 1 Pixel 0.738 0.805 0.389 0.237 0.132 0.026 Plane → Bird 4 Pixel 0.951 0.975 0.264 0.130 0.170 0.021 Blend 0.832 0.916 0.359 0.207 0.042 0.028 205 Dadv AUPRC FIT w/ GAS (ours) FIT w/ GAS-L (ours) Max. k-NN Distance Min. k-NN Distance Most Certain Least Certain Random 1 0.8 0.6 0.4 0.2 0 1 Pixel 4 Pixel Blend 1 Pixel 4 Pixel Blend Auto → Dog Plane → Bird Figure C.25. Vision Backdoor Target Identification: Mean target identification AUPRC across 15 trials for Weber et al.’s [Web+23] three CIFAR10 backdoor attack patterns and randomly selected reference ẑtarg. All experiments performed binary classification on randomly-initialized ResNet9. |Dadv| = 150. Notation ytarg → yadv. See Table C.33 for the numerical results. 206 Target AUPRC Table C.33. Vision Backdoor Target Identification: Target identification mean AUPRC across 15 trials for Weber et al.’s [Web+23] three CIFAR10 backdoor attack patterns and randomly selected reference ẑtarg. All experiments performed binary classification on randomly-initialized ResNet9. Bold denotes the best mean performance. Mean results are shown graphically in Figures 16 and C.25. Variance results appear in the original paper [HL22a, Sec. F.1.2]. 
Classes Trigger Ours Baselines Pattern ytarg → yadv GAS GAS-L Max k-NN Min k-NN Most Certain Least Certain Random 1 Pixel 0.998 0.998 0.996 0.065 0.475 0.101 0.135 Auto → Dog 4 Pixel 0.999 0.998 0.987 0.062 0.263 0.105 0.116 Blend 0.987 1 0.429 0.275 0.092 0.192 0.142 1 Pixel 0.925 0.970 0.938 0.063 0.662 0.095 0.155 Plane → Bird 4 Pixel 0.987 0.992 0.849 0.070 0.631 0.099 0.135 Blend 0.782 0.979 0.300 0.097 0.119 0.153 0.141 Table C.34. Vision Backdoor Attack Mitigation: Bold denotes the best mean performance with 15 trials per setup. Aggregated results are shown in Table 6. Classes % Removed ASR % Test Acc. % Attack Method ytarg → yadv Dadv Dcl Orig. Ours Orig. Chg. GAS 92.6 0.28 0 0.0 1 Pixel 87.7 98.8 GAS-L 94.4 0.08 0 0.0 → GAS 92.8 0.26 0 0.0Auto Dog 4 Pixel 95.0 98.9 GAS-L 92.6 0.05 0 0.0 GAS 99.9 0.67 0 –0.1 Blend 98.6 99.0 GAS-L 100 0.41 0 –0.1 GAS 65.8 0.66 0 –0.1 1 Pixel 80.8 93.5 GAS-L 75.5 0.84 0 0.0 → GAS 92.7 0.39 0 0.0Plane Bird 4 Pixel 89.0 93.5 GAS-L 95.2 0.80 0 0.0 GAS 81.9 1.39 0 –0.4 Blend 92.0 93.7 GAS-L 95.9 1.55 0 –0.5 207 C.3.3 Natural Language Poisoning Full Results. GAS (ours) GAS-L (ours) TracInCP TracIn Influence Functions Representer Point 1 0.8 0.6 0.4 0.2 0 1 2 3 4 1 2 3 4 Positive Negative Figure C.26. Natural Language Poisoning Adversarial-Set Identification: See Table C.35 for the numerical results. Table C.35. Natural Language Poisoning Adversarial-Set Identification: Poison identification mean AUPRC across 10 trials for 4 positive and 4 negative sentiment SST-2 movie reviews [Soc+13] with |Dadv| = 50. GAS-L perfectly identified all poison in all but one trial. Bold denotes the best mean performance. Mean results are shown graphically in Figures 14 and C.26. Variance results appear in the original paper [HL22a, Sec. F.1.3]. Review Ours Baselines Sentiment No. GAS GAS-L TracInCP TracIn Inf. Func. Rep. Pt. ↑ 1 1 1 0.245 0.113 0.005 0.002 2 1 1 0.382 0.117 0.007 0.001 Positive ↓ 3 1 1 0.072 0.043 0.003 0.0014 1 1 0.021 0.010 0.003 0.001 ↑ 1 0.985 0.996 0.009 0.006 0.002 0.001 2 1 1 0.628 0.245 0.006 0.001 Negative ↓ 3 0.998 1 0.224 0.109 0.004 0.0014 1 1 0.017 0.008 0.005 0.001 208 Dadv AUPRC FIT w/ GAS (ours) FIT w/ GAS-L (ours) Max. k-NN Distance Min. k-NN Distance Most Certain Least Certain Random 1 0.8 0.6 0.4 0.2 0 1 2 3 4 1 2 3 4 Positive Negative Figure C.27. Natural Language Poisoning Target Identification: See Table C.36 for the numerical results. 209 Target AUPRC Table C.36. Natural Language Poisoning Target Identification: Bold denotes the best mean performance with 10 trials per review. Mean results are shown graphically in Figures 16 and C.27. Variance results appear in the original paper [HL22a, Sec. F.1.3]. Review Ours Baselines Sentiment No. GAS GAS-L Max k-NN Min k-NN Most Certain Least Certain Random ↑ 1 1 1 0.017 0.043 0.010 0.078 0.044 2 1 1 0.009 0.698 0.015 0.021 0.048 Positive ↓ 3 1 1 0.012 0.038 0.014 0.022 0.0414 1 1 0.010 0.079 0.015 0.020 0.019 ↑ 1 1 1 0.009 0.687 0.034 0.011 0.020 2 1 1 0.009 0.193 0.022 0.014 0.068 Negative ↓ 3 0.867 0.909 0.009 0.754 0.049 0.020 0.0294 0.950 1 0.012 0.055 0.021 0.015 0.032 Table C.37. Natural Language Poisoning Attack Mitigation: Bold denotes the best mean performance with 10 trials per review. Aggregated results are shown in Table 6. Review % Removed ASR % Test Acc. % Method Sentiment No. Dadv Dcl Orig. Ours Orig. Chg. 
GAS 100 0.01 0 0.0 1 100 94.1 ↑ GAS-L 100 0.01 0 +0.3GAS 100 0.01 0 +0.1 2 100 94.2 GAS-L 100 0.02 0 +0.2 Positive GAS 100 0.01 0 +0.1 ↓ 3 100 94.2GAS-L 100 0.00 0 0.0 GAS 99.9 0 0 –0.1 4 100 94.3 GAS-L 100 0.02 0 0.0 GAS 97.1 0.01 0 +0.3 1 100 94.3 ↑ GAS-L 99.5 0.02 0 +0.1GAS 100 0.17 0 +0.1 2 100 94.3 GAS-L 100 0.04 0 +0.3 Negative GAS 99.5 0.05 0 +0.3 ↓ 3 90 94.1GAS-L 100 0.01 0 +0.4GAS 100 0 0 +0.1 4 100 94.2 GAS-L 100 0 0 +0.1 210 C.3.4 Vision Poisoning Full Results. Section 6.5.2 considers Peri et al.’s [Per+20] dedicated, clean-label poison defense Deep k-NN as an additional baseline. By default, nearest neighbor algorithms yield a label, not a score. To be compatible with AUPRC, we modified Deep k-NN to rank each training example by the difference between the size of the neighborhood’s plurality class and the number of neighborhood instances that share the corresponding example’s label. GAS0 (ours) GAS-L0 (ours) GAS (ours) GAS-L (ours) TracInCP TracIn Influence Func. Representer Pt. Deep k-NN 1 0.8 0.6 0.4 0.2 0 Bird → Dog Dog → Bird Frog → Deer Deer → Frog Figure C.28. Vision Poisoning Adversarial-Set Identification: Adversarial set (Dadv) identification mean AUPRC across >15 trials for four CIFAR10 class pairs with |Dadv| = 50. Our renormalized influence estimators, GAS and GAS-L, using just initial parameters θ0 and with 5 subepoch checkpointing outperformed all baselines for all class pairs. 211 Dadv AUPRC Table C.38. Vision Poisoning Adversarial-Set Identification: Adversarial set (Dadv) identification mean AUPRC across >15 trials for four CIFAR10 class pairs with |Dadv| = 50. Our renormalized influence estimators, GAS and GAS-L, using just initial parameters θ0 and with 5 subepoch checkpointing outperformed all baselines for all class pairs. Bold denotes the best mean performance. Mean results are shown graphically in Figure 14 and C.28. Variance results appear in the original paper [HL22a, Sec. F.1.4]. Classes Ours Baselines ytarg → yadv GAS0 GAS-L0 GAS GAS-L TracInCP TracIn Inf. Func. Rep. Pt. Deep k-NN Bird → Dog 0.773 0.628 0.892 0.825 0.493 0.194 0.146 0.028 0.078 Dog → Bird 0.847 0.685 0.848 0.769 0.464 0.171 0.066 0.017 0.036 Frog → Deer 0.912 0.842 0.962 0.942 0.602 0.265 0.150 0.026 0.208 Deer → Frog 0.803 0.673 0.888 0.855 0.534 0.210 0.085 0.028 0.027 212 FIT w/ GAS (ours) FIT w/ GAS-L (ours) Max. k-NN Distance Min. k-NN Distance Most Certain Least Certain Random 1 0.8 0.6 0.4 0.2 0 Bird → Dog Dog → Bird Frog → Deer Deer → Frog Figure C.29. Vision Poisoning Target Identification: See Table C.39 for the numerical results. Table C.39. Vision Poisoning Target Identification: Bold denotes the best mean performance with ≥15 trials per class pair. Mean results are shown graphically in Figures 16 and C.29. Variance results appear in the original paper [HL22a, Sec. F.1.4]. Classes Ours Baselines ytarg yadv GAS GAS-L Max k-NN Min k-NN Most Certain Least Certain Random Bird Dog 0.831 0.332 0.004 0.072 0.004 0.014 0.011 Dog Bird 0.896 0.601 0.004 0.068 0.003 0.015 0.008 Frog Deer 0.863 0.621 0.003 0.368 0.003 0.011 0.019 Deer Frog 0.723 0.382 0.005 0.122 0.004 0.010 0.010 Table C.40. Vision Poisoning Attack Mitigation: Bold denotes the best mean performance with ≥15 trials per class pair. Aggregated results are shown in Table 6. Classes % Removed ASR % Test Acc. % Method ytarg yadv Dadv Dcl Orig. Ours Orig. Chg. 
GAS 72.1 0.04 0 0.0 Bird Dog 91.4 87.0 GAS-L 65.2 0.04 0 +0.1 GAS 54.1 0.01 0 –0.1 Dog Bird 80.0 87.1 GAS-L 46.6 0.01 0 0.0 GAS 36.0 0.01 0 0.0 Frog Deer 60.0 87.1 GAS-L 30.2 0.03 0 0.0 GAS 88.9 0.03 0 +0.1 Deer Frog 80.0 87.0 GAS-L 83.5 0.03 0 +0.1

C.4 Convex Polytope Poisoning and GAS Joint Optimization

Zhu et al. [Zhu+19] prove that, under specific assumptions (e.g., a linear classifier), an adversarial set is guaranteed to alter a model’s prediction on a target if that target’s feature vector lies inside a convex polytope of the adversarial instances’ feature vectors.

Overview of Zhu et al.’s Attack   Intuitively, Zhu et al.’s attack attempts to construct a convex hull of poison instances around a target – all within feature space. By design, deep models are non-linear and non-convex, so Zhu et al.’s underlying assumption does not directly apply. However, the convex-polytope attack will succeed if the trained model’s penultimate feature representation (i.e., the input into the final, linear classification layer) forms a convex hull around the target’s penultimate representation. To that end, Zhu et al.’s iterative, bilevel poison optimization considers solely this feature representation.

In the attacker’s ideal case, the adversarial set’s feature-space representation would be optimized w.r.t. the final trained model. However, attackers do not know training’s random seed. Moreover, any change to a training instance necessarily affects the final model parameters (and thus the penultimate feature representation as well), inducing a cyclic dependency that makes poison crafting non-trivial. To increase the likelihood that the attack succeeds, Zhu et al. optimize the poison’s feature-space representation across a suite of m surrogate models. For each model f^(j) (j ∈ {1, . . . , m}), denote the model’s penultimate-feature extraction function as ϕ^(j)(·).

Zhu et al. specify a bilevel optimization to iteratively form these feature-space convex hulls, where adversarial set Dadv := {(x_l, y_adv)}_{l=1}^{K} is crafted from a set of K clean seed instances, denoted {x_l^{cl}}_{l=1}^{K}. Zhu et al. restrict the adversarial perturbations to an ℓ∞ ball of radius ε around those clean seed instances. The feature-space convex hull requirement is enforced via coefficients c_l^{(j)} ≥ 0. Zhu et al.’s bilevel optimization is reproduced in Eq. (C.1), with an additional term, β ĜAS, that is explained below. Note that in Zhu et al.’s formulation, β = 0.

\[
\min_{\{c^{(j)}\},\, \mathcal{D}_{\text{adv}}} \;\; \beta\, \widehat{\text{GAS}} \;+\; \frac{1}{2} \sum_{j=1}^{m} \frac{\bigl\| \phi^{(j)}(x_{\text{targ}}) - \sum_{l=1}^{K} c_{l}^{(j)}\, \phi^{(j)}(x_{l}) \bigr\|^{2}}{\bigl\| \phi^{(j)}(x_{\text{targ}}) \bigr\|^{2}}
\tag{C.1}
\]
\[
\text{s.t.} \quad \sum_{l=1}^{K} c_{l}^{(j)} = 1, \;\; \forall j; \qquad c_{l}^{(j)} \ge 0, \;\; \forall l, j; \qquad \bigl\| x_{l} - x_{l}^{\text{cl}} \bigr\|_{\infty} \le \varepsilon, \;\; \forall l.
\]

Joint Optimization Formulation   An attacker may attempt to evade our defense by optimizing adversarial set Dadv to appear uninfluential on target ẑtarg.¹ Eq. (C.1) formalizes this idea by simultaneously optimizing for both poison effectiveness and low GAS influence, where hyperparameter β > 0 trades off between these two sub-objectives. Following Zhu et al.’s [Zhu+19] paradigm as described above, the attacker uses surrogate models to estimate the GAS influence. Specifically, the adversary trains a gray-box model² using the same architecture, hyperparameters, clean training data (Dcl), and pre-trained parameters as the target model. The surrogate set is then formed from m model checkpoints evenly spaced across this gray-box training. This quantity then estimates the GAS influence in the final trained model. Formally,

\[
\widehat{\text{GAS}} = \sum_{j=1}^{m} \sum_{l=1}^{K} \left\langle \frac{g_{l}^{(j)}}{\bigl\| g_{l}^{(j)} \bigr\|},\; \frac{\hat{g}_{\text{targ}}^{(j)}}{\bigl\| \hat{g}_{\text{targ}}^{(j)} \bigr\|} \right\rangle. \tag{C.2}
\]
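As a rough illustration only – not the implementation used in Section 6.6’s experiments – the surrogate in Eq. (C.2) can be computed from flattened per-example gradients as in the sketch below, where the gradient shapes and the helper name are assumptions of the example.

```python
import torch

def surrogate_gas(poison_grads, target_grads):
    """Sketch of Eq. (C.2): accumulate, over the m surrogate checkpoints, the
    cosine similarity between each poison example's gradient and the target's.

    poison_grads: length-m list of tensors with shape (K, |theta|)
    target_grads: length-m list of tensors with shape (|theta|,)
    """
    gas = 0.0
    for g_poison, g_targ in zip(poison_grads, target_grads):
        g_poison = g_poison / g_poison.norm(dim=1, keepdim=True)  # renormalize each poison gradient
        g_targ = g_targ / g_targ.norm()                           # renormalize the target gradient
        gas = gas + (g_poison @ g_targ).sum()                     # sum of cosine similarities
    return gas
```

When the attacker optimizes Eq. (C.1), this quantity must itself be differentiated w.r.t. the poison instances; back-propagating through the per-gradient normalization is what induces the quadratic memory cost discussed under “Practical Challenges of Joint Optimization” below.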
¹ Dadv must remain actually influential. Otherwise, the model’s prediction on ẑtarg would not change.
² Gray-box attacks assume the attacker has access to detailed (but not complete) information about the target model, as in our case above.

It is important to note that optimizations like Eq. (C.1) create an implicit tension. Training-set attacks commonly attempt to make Dadv and ẑtarg have similar feature-space representations [Sha+18; Zhu+19; Hua+20; Wal+21]. Since each example’s features inform the model gradients (g), appearing less influential can affect the attack’s effectiveness.

Practical Challenges of Joint Optimization   In modern neural networks, each parameter only directly affects or is directly affected by a subset of the other parameters – specifically those in adjacent layers. This limited interdependency makes back-propagation more tractable and efficient. Recall that GAS normalizes by the gradient magnitude. When calculating influence in practice, this normalization does not change the memory or computational complexity. However, when trying to optimize surrogate ĜAS (Eq. (C.2)), normalizing by the gradient magnitude creates pairwise dependencies between all parameters, i.e., Θ(|θ|²) memory complexity for automatic differentiation systems. Therefore, renormalized estimators like GAS are significantly more memory intensive to optimize against in practice than the baseline influence estimators, where this quadratic memory complexity is not induced.

Section 6.6’s experiments were affected by joint optimization’s increased memory complexity, where the GPU VRAM requirements increased by ≥12×. This created significant issues even for the comparatively small ResNet9 neural network [Pag20]. For example, when the adversarial set size was larger than 40, joint optimization exceeded the GPU VRAM capacity.³ In contrast, our original paper includes an ablation study that tests more than 400 poison samples for Zhu et al.’s baseline attack using the same hardware [HL22a]. Furthermore, joint optimization’s larger memory footprint necessitated that only a small number of surrogate checkpoints could be used – specifically four checkpoints. This then increases the coarseness of ĜAS’s influence estimate.

³ Experiments were performed on Nvidia Tesla K80 GPUs with 11.5GB of VRAM.

Setting Joint Optimization Hyperparameter β   As detailed above, hyperparameter β induces a trade-off between Zhu et al.’s convex-polytope loss and the surrogate GAS estimate. Section 6.6’s “baseline” results used β = 0. To ensure a strong adversary, Section 6.6’s “Adaptive Joint Optimization Attack with GAS” results used β = 10⁻² since that was the largest value of β that did not result in a significant drop in attacker success rate, as detailed in Table C.41.

Table 6 in Section 6.5.4 reports that the vision poisoning’s attack success rate was 77.9%. Even when β = 0 (i.e., the surrogate GAS loss is ignored), there was still a substantial decrease in ASR to 64.3%. Recall that joint optimization’s memory complexity is Θ(|θ|²), which necessitated using fewer surrogate models (due to GPU VRAM capacity). This, in turn, degraded attack performance.⁴ Put simply, joint adversarial-set optimization is not necessarily a free lunch. It may come at the cost of a worse attacker success rate.

Table C.41. Effect of joint-optimization hyperparameter β on the attacker’s success rate (ASR).
Setting Joint Optimization Hyperparameter β   As detailed above, hyperparameter β induces a trade-off between Zhu et al.'s convex-polytope loss and the surrogate GAS estimate. Section 6.6's "baseline" results used β = 0. To ensure a strong adversary, Section 6.6's "Adaptive Joint Optimization Attack with GAS" results used β = 10−2 since that was the largest value of β that did not result in a significant drop in attacker success rate, as detailed in Table C.41. Table 6 in Section 6.5.4 reports that the vision poisoning's attack success rate was 77.9%. Even when β = 0 (i.e., the surrogate GAS loss is ignored), there was still a substantial decrease in ASR to 64.3%. Recall that joint optimization's memory complexity is Θ(|θ|²), which necessitated using fewer surrogate models (due to GPU VRAM capacity). This, in turn, degraded attack performance.⁴ Put simply, joint adversarial-set optimization is not necessarily a free lunch. It may come at the cost of a worse attacker success rate.

⁴ We separately verified that reducing the adversarial-set size from 50 to 40 did not meaningfully change the ASR.

Table C.41. Effect of joint-optimization hyperparameter β on the attacker's success rate (ASR). Observe that even at β = 0, the attack success rate is significantly lower than the 77.9% ASR in Table 6 due to the fewer surrogate models that could be used during jointly-optimized poison crafting, as explained above.

β          ASR (%)
0          64.3
10−2       63.1
2 · 10−2   50.0
10−1       4.8

Section 6.6 summarizes the adversarial-set and target identification results for this jointly-optimized attack. Sections C.4.1 and C.4.2 (resp.) provide more granular versions of those results. Section C.4.3 provides additional results on target-driven attack mitigation's effectiveness against this jointly optimized attack.

C.4.1 Adversarial-Set Identification of the Jointly Optimized Poisoning Attack.

Table C.42. Adversarial-Set Identification for the Adaptive Vision Poison Attack: Adversarial-set identification mean AUPRC with ≥10 trials per setup as described in Section C.4. Section 6.6's baseline results set trade-off hyperparameter β = 0, meaning the poison was not jointly optimized. The jointly optimized results used β = 10−2 as explained in suppl. Section C.4. Bold denotes the best mean performance. Mean results are shown graphically in Figures 17 and C.30. Variance results appear in the original paper [HL22a, Sec. F.2.1].

Param.   Classes         Ours                               Baselines
β        ytarg → yadv    GAS0    GAS-L0   GAS     GAS-L     TracInCP   TracIn   Inf. Func.   Rep. Pt.
0        Bird → Dog      0.567   0.418    0.766   0.690     0.275      0.085    0.081        0.032
         Dog → Bird      0.663   0.532    0.660   0.560     0.272      0.098    0.035        0.017
         Frog → Deer     0.755   0.680    0.827   0.787     0.393      0.135    0.079        0.020
         Deer → Frog     0.610   0.477    0.669   0.617     0.243      0.119    0.059        0.018
10−2     Bird → Dog      0.611   0.470    0.646   0.590     0.282      0.093    0.067        0.026
         Dog → Bird      0.708   0.553    0.558   0.479     0.180      0.072    0.030        0.014
         Frog → Deer     0.823   0.753    0.858   0.818     0.404      0.173    0.077        0.021
         Deer → Frog     0.790   0.625    0.660   0.640     0.189      0.106    0.063        0.022

[Figure C.30 comprises two bar charts of Dadv identification AUPRC for the class pairs Bird → Dog, Dog → Bird, Frog → Deer, and Deer → Frog, comparing GAS0 (ours), GAS-L0 (ours), GAS (ours), GAS-L (ours), TracInCP, TracIn, Influence Func., and Representer Pt.: (a) Baseline with β = 0; (b) Joint optimization with β = 10−2.]

Figure C.30. Adversarial-Set Identification for the Adaptive Vision Poison Attack: Mean AUPRC identifying the adversarial set where Zhu et al.'s vision poison attack is jointly optimized with minimizing GAS, with ≥10 trials per setup as described in Section C.4. Section 6.6's baseline results set trade-off hyperparameter β = 0, meaning the poison was not jointly optimized. The jointly optimized results used β = 10−2 as explained in suppl. Section C.4. This joint optimization reduces the GAS similarity by 7% at the cost of a 19% decrease in ASR w.r.t. Table 6. See Table C.42 for the numerical results.
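Both Table C.42 above and Table C.43 below report identification performance as AUPRC over a ranking of instances by their influence (or anomaly) score. For reference only, such a ranking's AUPRC can be computed with scikit-learn's average_precision_score; the scores and labels below are fabricated for illustration, and the dissertation's evaluation code may compute the precision-recall curve differently (cf. Davis and Goadrich [DG06]).

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Illustrative only: influence scores for five training instances and a binary
# label marking which of them actually belong to the adversarial set Dadv.
influence_scores = np.array([0.91, 0.10, 0.87, 0.05, 0.42])
is_adversarial = np.array([1, 0, 1, 0, 0])

# AUPRC (average precision) of the influence-based ranking, the metric
# reported in Tables C.42 and C.43.
auprc = average_precision_score(is_adversarial, influence_scores)
print(f"AUPRC = {auprc:.3f}")
```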
C.4.2 Target Identification of the Jointly Optimized Poisoning Attack.

Table C.43. Target Identification for the Adaptive Vision Poison Attack: Target identification mean AUPRC where Zhu et al.'s [Zhu+19] vision poison attack is jointly optimized with minimizing GAS. Section 6.6's baseline results set trade-off hyperparameter β = 0, meaning the poison was not jointly optimized. The jointly optimized results used β = 10−2 as explained in suppl. Section C.4. Bold denotes the best mean performance with ≥10 trials per class pair. Mean results are shown graphically in Figures 18 and C.31. Variance results appear in the original paper [HL22a, Sec. F.2.2].

Param.   Classes         Ours              Baselines
β        ytarg → yadv    GAS     GAS-L     Max k-NN   Min k-NN   Most Certain   Least Certain   Random
0        Bird → Dog      0.789   0.350     0.357      0.011      0.082          0.014           0.025
         Dog → Bird      0.944   0.481     0.299      0.011      0.050          0.012           0.019
         Frog → Deer     0.958   0.806     0.538      0.013      0.171          0.012           0.115
         Deer → Frog     0.750   0.393     0.339      0.013      0.154          0.012           0.027
10−2     Bird → Dog      0.775   0.204     0.422      0.010      0.046          0.012           0.088
         Dog → Bird      0.875   0.321     0.400      0.012      0.211          0.011           0.025
         Frog → Deer     0.784   0.586     0.387      0.010      0.108          0.012           0.076
         Deer → Frog     0.681   0.376     0.395      0.022      0.125          0.011           0.021

[Figure C.31 comprises two bar charts of target identification AUPRC for the class pairs Bird → Dog, Dog → Bird, Frog → Deer, and Deer → Frog, comparing FIT w/ GAS (ours), FIT w/ GAS-L (ours), Max. k-NN Distance, Min. k-NN Distance, Most Certain, Least Certain, and Random: (a) Baseline with β = 0; (b) Joint optimization with β = 10−2.]

Figure C.31. Target Identification for the Adaptive Vision Poison Attack: Mean target identification AUPRC where Zhu et al.'s [Zhu+19] vision poison attack is jointly optimized with minimizing GAS. Section 6.6's baseline results set trade-off hyperparameter β = 0, meaning the poison was not jointly optimized. The jointly optimized results used β = 10−2 as explained in suppl. Section C.4. See Table C.43 for the numerical results.

C.4.3 Target-Driven Attack Mitigation of the Jointly Optimized Poisoning Attack.

This section examines joint optimization's effect on target-driven mitigation. Averaging across all class pairs, target-driven mitigation using GAS and GAS-L removed 0.05% and 0.03% (resp.) of the clean training data (Dcl). For comparison, when mitigating Zhu et al.'s [Zhu+19] baseline attack, GAS and GAS-L removed on average 0.02% and 0.03% of the clean training data respectively (see Table 6). Moreover, after mitigating this jointly-optimized attack, average test accuracy either improved or stayed the same in all but one case.

Table C.44. Target-Driven Attack Mitigation for the Adaptive Vision Poison Attack: Algorithm 6's target-driven data sanitization where Zhu et al.'s [Zhu+19] vision poison attack is jointly optimized with minimizing the GAS influence. The results below consider exclusively the jointly-optimized attack with β = 10−2. Clean-data removal remains low, and test accuracy either improved or stayed the same in all but one setup. The performance is comparable to the results with Zhu et al.'s [Zhu+19] standard vision poisoning attack (see Table C.40). Bold denotes the best mean performance with ≥10 trials per class pair.

Classes          Method   % Removed         ASR %           Test Acc. %
ytarg → yadv              Dadv     Dcl      Orig.   Ours    Orig.   Chg.
Bird → Dog       GAS      36.0     0.02     76.2    0       87.0    +0.1
                 GAS-L    30.3     0.00     76.2    0       87.0    +0.1
Dog → Bird       GAS      21.6     0.00     57.1    0       87.1    +0.1
                 GAS-L    21.9     0.00     57.1    0       87.1    –0.1
Frog → Deer      GAS      17.5     0.00     38.1    0       87.1    0.0
                 GAS-L    19.4     0.00     38.1    0       87.1    0.0
Deer → Frog      GAS      85.0     0.18     81.0    0       87.1    0.0
                 GAS-L    82.3     0.13     81.0    0       87.1    +0.1

APPENDIX D

EVALUATION SETUPS

This chapter contains previously published, coauthored material [HL21; HL22a; HL23c; HL23a]. Hammoudeh wrote this complete section and designed the experiments. Lowd provided supervision, editorial suggestions, and input on experiment design.

Zayd Hammoudeh and Daniel Lowd. "Simple, Attack-Agnostic Defense Against Targeted Training Set Attacks Using Cosine Similarity". In: Proceedings of the 3rd ICML Workshop on Uncertainty and Robustness in Deep Learning. UDL'21. 2021

Zayd Hammoudeh and Daniel Lowd. "Identifying a Training-Set Attack's Target Using Renormalized Influence Estimation".
In: Proceedings of the 29th ACM SIGSAC Conference on Computer and Communications Security. CCS’22. Los Angeles, CA: Association for Computing Machinery, 2022. url: https://arxiv.org/abs/2201.10055 Zayd Hammoudeh and Daniel Lowd. “Reducing Certified Regression to Certified Classification for General Poisoning Attacks”. In: Proceedings of the 1st IEEE Conference on Secure and Trustworthy Machine Learning. SaTML’23. 2023. url: https://arxiv.org/abs/2208.13904 Zayd Hammoudeh and Daniel Lowd. “Feature Partition Aggregation: A Fast Certified Defense Against a Union of ℓ0 Attacks”. In: Proceedings of the 2nd ICML Workshop on New Frontiers in Adversarial Machine Learning. AdvML-Frontiers’23. 2023. url: https://arxiv.org/abs/2302.11628 This chapter details the evaluation setup for the experiments Chapters 4, 5, and 6. Since each chapter has a different evaluation setup, we separate the setup of each chapter in a different section below. 223 D.1 Evaluation Setup for the Experiments in Chapter 4 This section details the evaluation setup used in Section 4.8’s experiments, including implementation details, dataset configuration, and hyperparameter settings. Chapter 4’s source code can be downloaded from https://github.com/ ZaydH/certified-regression. All experiments were implemented and tested in Python 3.7.1. Experiments were performed using one core of a fourteen-core Intel E5-2690v4 CPU and 12GB of RAM. Ridge regression models were trained using Scikit-Learn [Ped+11], while the decision forests used the XGBoost library [CG16]. The overlapping regressor ILPs (Fig. 6) were optimized using Gurobi [Gur22] with a time limit of 1200s. D.1.1 Dataset Configuration. Chapter 4’s source code automatically downloads all necessary datasets. Regarding dataset preprocessing, categorical features were transformed into one-hot-encoded features in line with previous work [BHL23]. Standardizing features by dataset mean/variance breaks submodel independence and so was not performed. Minimal manual feature engineering was performed to improve the housing datasets’ results, e.g., adding a home’s age, total square feet, total number of bathrooms, etc.; this feature engineering was done based on existing features in the dataset (e.g., total square feet equals the sum of the first and second-floor square footage). None of the engineered features affect submodel independence. Most of the six datasets in Sec. 4.8.1 do not have a dedicated test set. In such cases, the data was split 90%/10% at random between training and test. When training kNN-CR models, each feature dimension was normalized to the range [0, 1]. Without feature normalization, kNN-CR generally prioritizes whichever feature has the largest magnitude. This transformation implicitly restricts arbitrary insertions to the feature range in the original dataset. Such normalization is implicitly 224 done in certified classifier evaluation on image datasets where each pixel has a consistent, fixed range. D.1.2 Dataset Target Value Statistics. Table D.45 summarizes the test set’s target (y) value distribution statistics for Sec. 4.8’s five regression datasets. Recall from Table 1 that the Ames, Austin, and Diamonds datasets set error threshold ξ as a fixed percentage of yte. This choice was made because these three datasets exhibit significant y variance. For example, for Diamonds, the largest y value ($18.8k) is about two orders of magnitude larger than the smallest y value ($339). 
Using a fixed ξ value on these three datasets would have made certifying instances with small y unrealistically easy while making certification of instances with large y unreasonably difficult. Making the error threshold a fraction of yte allows the certification difficulty to be more consistent across the range of y values. Datasets Weather and Life used fixed ξ values of 3 degrees (Celsius) and 3 years respectively. Both of these threshold values are less than one-third of each dataset’s y standard deviation. In the original paper [HL23c], we evaluate the performance of our certified regressors on additional ξ values – both larger and smaller than the ξ values used in Sec. 4.8. Table D.45. Target Value Test Distribution Statistics: Mean (ȳ), standard deviation (σy), minimum value (ymin) and maximum value (ymax) for the test instances’ target y value for Sec. 4.8’s five regression datasets. Dataset ȳ σy ymin ymax Ames $184k $83.4k $12.8k $585k Austin $466k $266k $81.0k $2.6M Diamonds $3.8k $3.9k $0.3k $18.8k Weather 14.9◦C 10.3◦C −44.0◦C 54.0◦C Life 69.3 years 9.6 years 36.3 years 89.0 years 225 D.1.3 Hyperparameters. Following Jia et al.’s [Jia+22a] certified kNN classifier evaluation, kNN-CR’s neighborhood size, k, was set to the (larger) odd integer nearest to n . We use the Minkowski distance as the neighborhood’s distance 2 metric. For our ensemble regressors, hyperparameters were tuned using Bayesian optimization as implemented in the scikit-optimize library [Hea+21]. The partitioned and overlapping certified regressors (unweighted and weighted) used the same hyperparameter settings. Ridge Regression Hyperparameters For three datasets – Diamonds [Wic16], Weather [Mal+21], and Spambase [Hop+17] – our four ensemble regressors used ridge regression as the submodel architecture. For each dataset and q value, we tuned three ridge regression hyperparameters. Below, we list those hyperparameters along with the set of values considered. – Weight Decay (λ): L2 regularization strength. We considered values between 10 −8 and 104. – Error Tolerance (ε): Minimum validation error that defines when a model is considered converged. The tested values were {10−8, 10−7, . . . , 10−3}. – Maximum Number of Iterations (# Itr.): Defines the maximum number of optimizer iterations. If the error tolerance is achieved before the iteration count is met, the model is treated as converged, and optimization stops. The tested values were {102, 103, . . . , 108}. Table D.46 lists the final hyperparameters for each experimental setup that used ridge regression as the submodel architecture. 226 XGBoost Hyperparameters For three datasets – Ames Housing [Coc11], Austin Housing [Pie21], and Life [Raj21] – our four ensemble regressors used XGBoost [CG16] as the submodel architecture. For each dataset and q value, we tuned seven XGBoost hyperparameters. Below, we list those hyperparameters along with the set of values considered. – Number of Trees (τ): Number of trees in the ensemble. The tested values were {50, 100, 250, 500, 1000}. – Maximum Tree Depth (h): Maximum depth of each tree in the ensemble. The tested values were {1, . . . , 4}. – Evaluation Metric (L): Applied to the validation set and is the metric being minimized. The tested values were root mean squared error (RMSE) and mean absolute error (MAE). – Weight Decay (λ): L2 regularization strength. We considered values between 10 −3 and 105. – Minimum Split Loss (γ): Minimum reduction in loss required to split a node instead of making it a leaf. 
The values considered were {0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 1}. – Learning Rate (η): Larger value makes the boosting more conservative. The tested values were {0.01, 0.1, 0.3, 1}. Table D.47 lists the final hyperparameters for each experimental setup that used XGBoost as the submodel architecture. Mixup [Zha+18] data augmentation was used to improve XGBoost’s performance.1 1Mixup does not apply to convex models like ridge regression. 227 Table D.46. Ridge Regression Hyperparameters: Hyperparameter settings for the three datasets that used ridge regression as the ensemble submodel architecture. Hyperparameters are reported for the three q values used in Fig. 7 and Sec. C.1. We also report the hyperparameters for uncertified accuracy when q = 1. Dataset q λ ε # Itr. 1 3.16E−3 1E−6 1E6 151 6.01E−2 1E−7 1E8 Diamonds 501 1.00E−8 1E−6 1E8 1001 1.38E−8 1E−6 1E2 1 3.16E−3 1E−8 1E7 51 1.00E+3 1E−5 1E6 Weather 1501 3.16E+2 1E−6 1E2 3001 3.16E+2 1E−6 1E3 1 3.16E+2 1E−6 1E5 25 3.16E−3 1E−6 1E6 Spambase 151 3.16E−6 1E−7 1E6 301 3.16E−3 1E−6 1E6 D.2 Evaluation Setup for the Experiments in Chapter 5 This section details the evaluation setup used in Section 5.5’s experiments, including implementation details, dataset configuration, and hyperparameter settings. Our source code can be downloaded from https://github.com/ZaydH/feature- partition. All experiments were implemented and tested in either Python 3.7.13 or 3.10.10. All neural networks were implemented in PyTorch version 1.12.0 [Pas+19]. LightGBM decision forests were trained using the official lightgbm Python module, version 3.3.3.99 [Ke+17]. D.2.1 Hardware Setup. Experiments were performed on a desktop system with a single AMD 5950X 16-core CPU, 64GB of 3200MHz DDR4 RAM, and a single NVIDIA 3090 GPU. D.2.2 Baselines. To the extent of our knowledge, no existing method considers certified feature robustness guarantees (Def. 5.1). Randomized ablation – our most closely related method – considers ℓ0-norm certified robustness (Def. 5.2) [LF20b]. RA is a specialized form of randomized smoothing [CRK19; LXL23] targeted 228 Table D.47. XGBoost Hyperparameters: Hyperparameter settings for the three datasets that used XGBoost as the ensemble submodel architecture. Hyperparameters are reported for the three q values used in Fig. 7 and Sec. C.1. We also report the hyperparameters for uncertified accuracy when q = 1. Dataset q τ h L λ γ η 1 250 2 RMSE 1E−1 5E−3 0.3 25 500 2 MAE 1E−3 5E−3 0.3 Ames Housing 125 500 3 RMSE 1E−2 5E−3 1.0 251 250 1 RMSE 1E−1 5E−3 1.0 1 500 4 MAE 1E+2 1E−2 0.3 151 1000 1 RMSE 1E−2 5E−3 1.0 Austin Housing 301 250 1 MAE 1E+0 1E−2 1.0 701 250 1 MAE 1E−2 1E−2 1.0 1 500 5 RMSE 1E+1 1E−2 0.1 25 250 4 RMSE 0E+0 5E−2 0.3 Life 101 250 3 MAE 1E+0 1E−2 1.0 201 250 4 RMSE 0E+0 5E−3 0.3 towards sparse evasion attacks. In terms of the state of the art, Jia et al. [Jia+22b] provide the tightest certification analysis for randomized ablation. Recall that feature partition aggregation (FPA) provides strictly stronger certified guarantees than baseline RA. Put simply, FPA is solving a harder task than baseline randomized ablation. Therefore, when FPA achieves the same certified accuracy as the baseline, FPA is performing provably better, given FPA’s stronger guarantees. We also compare FPA to three certified patch defenses, namely: (de)randomized smoothing (DRS) [LF20a], patch interval bound propagation (IBP) [Chi+20], and BagCert [MY21]. 
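To illustrate the kind of ensemble FPA builds (without reproducing Chapter 5's actual construction or its certification analysis), the following minimal sketch deterministically splits the feature indices into disjoint blocks, trains one submodel per block, and predicts by plurality vote. The submodel class, the seeded permutation, and all names are assumptions made purely for illustration; the released source code is the authoritative implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def train_feature_partition_ensemble(X, y, n_submodels, seed=0):
    # Deterministically partition the feature indices into disjoint blocks
    # and train one submodel per block using only that block's features.
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(X.shape[1]), n_submodels)
    models = [LogisticRegression(max_iter=1000).fit(X[:, b], y) for b in blocks]
    return models, blocks


def plurality_vote(models, blocks, x):
    # Each submodel votes using only its own feature block; ties are broken by label order.
    votes = [m.predict(x[:, b])[0] for m, b in zip(models, blocks)]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]
```

Chapter 5's certified guarantee then reasons about how many submodels a sparse attacker could possibly affect relative to the plurality vote's margin; that analysis is not reproduced here.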
Note that BagCert’s implementation is not open source, and Metzen and Yatsura [MY21] have indicated they do not plan to open source the code in the future.2 As such, BagCert’s results in the main paper were provided by Metzen and Yatsura via personal correspondence. BagCert’s closed source code prohibited the collection of its certification time. Nonetheless, comparing FPA’s certification 2The author’s comments regarding open-sourcing their code can be found on BagCert’s OpenReview page. 229 time to that of BagCert provides only limited insight since FPA and BagCert certify very different types of guarantees. D.2.3 Datasets. Our empirical evaluation considers four datasets. First, MNIST [LeC+98] and CIFAR10 [KNH14] are vision classification datasets with 10 classes each. Although all certified sparse defenses considered in this work are exclusively proposed in the context of classification, Hammoudeh and Lowd [HL23c] prove that certified regression reduces to voting-based certified classification. Hence, it is straightforward to transform FPA and randomized ablation into certified regression defenses. We reuse this reduction and evaluate two tabular regression datasets, Weather [Mal+21] and Ames [Coc11]. For Weather, we follow Hammoudeh and Lowd’s [HL23c] empirical evaluation, where the objective is to predict ground temperature within ±3◦C using features that include the date, time of day, longitude, and latitude. Similarly, we follow Hammoudeh and Lowd’s [HL23c]’s empirical evaluation for Ames, where the objective is to predict a property’s sale price within ±15% of the actual price. Since ablated training requires a custom feature encoding to differentiate ablated and non-ablated features, min-max scaling was applied to both datasets’ features for RA to normalize all feature values to the range [0, 1]. We chose these two regression datasets as a stand-in for vertically partitioned data, which are commonly tabular and particularly vulnerable to sparse backdoor and evasion attacks. Table D.48 provides basic information about the four datasets, including their sizes and feature dimension. Table D.49 provides summary statistics for the regression datasets’ test target-value (i.e., y) distribution. 230 Table D.48. Evaluation dataset information Dataset # Classes # Feats # Train # Test CIFAR10 10 1,024 50,000 10,000 MNIST 10 784 60,000 10,000 Weather N/A 128 3,012,917 531,720 Ames N/A 352 2,637 293 Table D.49. Target Value Test Distribution Statistics: Mean (ȳ), standard deviation (σy), minimum value (ymin) and maximum value (ymax) for the test instances’ target y value for regression datasets Weather and Ames. ȳ σy ymin ymax Weather 14.9◦C 10.3◦C −44.0◦C 54.0◦C Ames $184k $83.4k $12.8k $585k Our source code automatically downloads all necessary dataset files. D.2.4 Network Architectures. Table D.50 details the CIFAR10 neural network architecture. Specifically, we follow previous work on CIFAR10 data poisoning [HL22a] and use Page’s [Pag20] ResNet9 architecture. ResNet9 is ideal for our experiments since it is very fast to train, as ranked on DAWNBench [Col+17]. ResNet9’s fast training significantly reduces the overhead of training L submodels for FPA. We directly adapt Page’s [Pag20] published implementation3 including the use of ghost batch normalization [SD20] and the CELU activation function with α = 0.075 [Bar17]. Three forms of data augmentation were also used in line with Page’s [Pag20] implementation. First, a random crop with four pixels of padding was performed. 
Next, the image was flipped horizontally with a 50% probability. Finally, a random 8× 8 pixel portion of the image was randomly erased. Note that these transformations were 3Source code: https://github.com/davidcpage/cifar10-fast. 231 Table D.50. ResNet9 neural network architecture Conv1 In=3 Out=64 Kernel=3× 3 Pad=1 BatchNorm2D Out=64 CELU Conv2 In=64 Out=128 Kernel=3× 3 Pad=1 BatchNorm2D Out=128 CELU MaxPool2D 2× 2 ↑ ConvA In=128 Out=128 Kernel=3× 3 Pad=1BatchNorm2D Out=128 CELU ResNet1 ↓ ConvB In=128 Out=128 Kernel=3× 3 Pad=1BatchNorm2D Out=128 CELU Conv3 In=128 Out=256 Kernel=3× 3 Pad=1 BatchNorm2D Out=256 CELU MaxPool2D 2× 2 Conv4 In=256 Out=512 Kernel=3× 3 Pad=1 BatchNorm2D Out=512 CELU MaxPool2D 2× 2 ↑ ConvA In=512 Out=512 Kernel=3× 3 Pad=1BatchNorm2D Out=512 CELU ResNet2 ↓ ConvB In=512 Out=512 Kernel=3× 3 Pad=1BatchNorm2D Out=512 CELU MaxPool2D 4× 4 Linear Out=10 performed after the pixels were disabled in the image, meaning these transformations do not result in a network seeing additional pixel information. In a separate paper, Levine and Feizi [LF21] propose deep partition aggregation (DPA), a certified defense against poisoning attacks. Here, we follow Levine and Feizi’s [LF21] public implementation4 and use the Network-in-Network (NiN) 4Source code: https://github.com/alevine0/DPA. 232 Table D.51. Network-in-Network neural network architecture Conv1 In=3 Out=192 Kernel=5× 5 Pad=2 BatchNorm2D Out=192 ReLU Conv2 In=192 Out=160 Kernel=1× 1 Pad=1 Block 1 BatchNorm2D Out=160 ReLU Conv3 In=160 Out=96 Kernel=1× 1 Pad=1 BatchNorm2D Out=96 ReLU MaxPool2D 3× 3 Conv1 In=96 Out=192 Kernel=5× 5 Pad=2 BatchNorm2D Out=192 ReLU Conv2 In=192 Out=192 Kernel=1× 1 Pad=1 Block 2 BatchNorm2D Out=192 ReLU Conv3 In=192 Out=192 Kernel=1× 1 Pad=1 BatchNorm2D Out=192 ReLU AvgPool2D 3× 3 Conv1 In=192 Out=192 Kernel=3× 3 Pad=1 BatchNorm2D Out=192 ReLU Conv2 In=192 Out=192 Kernel=1× 1 Pad=1 Block 3 BatchNorm2D Out=192 ReLU Conv3 In=192 Out=192 Kernel=1× 1 Pad=1 BatchNorm2D Out=192 ReLU GlobalAvgPool2D Out=192 Linear Out=10 architecture [LCY14] when evaluating our method on MNIST. Table D.51 visualizes the MNIST NiN architecture. D.2.5 Hyperparameters. For simplicity, FPA used the same hyperparameter settings for a given dataset irrespective of L. Therefore, FPA’s results could be further improved in practice by tuning the hyperparameter settings to optimize the ensemble’s performance for a specific submodel count. 233 Table D.52 details the CIFAR10 and MNIST hyperparameter settings for feature partition aggregation. Table D.52. FPA’s neural network training hyperparameters CIFAR10 MNIST Data Augmentation? ✓ Validation Split N/A 5% Optimizer SGD AdamW Batch Size 512 128 # Epochs 80 25 Learning Rate (Peak) 1 · 10−3 3.16 · 10−4 Learning Rate Scheduler One cycle Cosine Weight Decay (L2) 1 · 10−1 1 · 10−3 For CIFAR10 and MNIST, we directly used Levine and Feizi’s [LF20b] published randomized ablation training source code, which includes pre-specified hyperparameter settings for the learning rate, weight decay, and optimizer hyperparameters. Recall from Sec. 5.5 that for the Weather and Ames datasets, FPA’s submodels are LightGBM [Ke+17] gradient-boosted decision tree (GBDT) regressors. Table D.53 details FPA’s LightGBM hyperparameter settings. For a more direct comparison with randomized ablation which cannot use a GBDT, we also evaluated FPA with linear submodels. FPA’s linear submodel hyperparameter settings for the regression datasets are in Table D.54. Table D.53. 
Regression datasets LightGBM submodel training hyperparameters Weather Ames Boosting Type GBDT GBDT # Estimators 500 1,000 Max. Depth 10 6 Min. Child Samples 20 5 Max. # Leaves 127 127 L1 Regularizer 0 1 · 10−3 L2 Regularizer 0 1 · 102 Objective Huber MAE Learning Rate 0.5 1 · 102 Subsampling 0.9 0.9 234 Table D.54. Regression datasets linear submodel training hyperparameters Weather Ames L1 Regularizer 3.16 · 10−3 4.15 · 10−5 Max. # Iterations 1 · 104 1 · 106 Tolerance 1 · 10−3 1 · 10−8 Levine and Feizi [LF20b] only evaluate classification datasets in their original paper. As such, there are no existing hyperparameter settings for randomized ablation on Weather and Ames. We manually tuned randomized ablation’s learning rate for the regression datasets considering all values in the set {10−2, 10−3, 10−4}. We also tested numerous different settings for the number of training epochs. To ensure a strong baseline, we report the best performing randomized ablation hyperparameter settings. Recall from Sec. 5.2 that randomized ablation only provides probabilistic guarantees. By contrast, feature partition aggregation provides deterministic guarantees. To facilitate a more direct comparison between certified feature and ℓ0-norm guarantees, α = 0.0001 in all experiments. D.3 Evaluation Setup for the Experiments in Chapter 6 This section details the evaluation setup used in Section 6.3 and 6.5’s experiments, including dataset specifics, hyperparameters, and the neural network architectures. Our source code can be downloaded from https://github.com/ZaydH/target_ identification. All experiments used the PyTorch automatic differentiation framework [Pas+19] and were tested with Python 3.6.5. Wallace et al.’s [Wal+21] sentiment analysis data poisoning source code will be published by its authors at https://github.com/Eric-Wallace/data-poisoning. D.3.1 Dataset Configurations. This subsection provides details related to dataset configurations. 235 Section 6.3.1 performs binary classification of frog vs. airplane from CIFAR10. Added as a small adversarial set (Dadv) is 150 MNIST 0 training instan(ce)s selected at random. We considered this class pair specifically since among the 10 possible 2 CIFAR10 class pairs, the MNIST test misclassification rate was closest to uniformly at random (u.a.r.) for frog vs. airplane (47.5% actual vs. 50% u.a.r. – uniformly at random). Hence, on average, neither frog nor airplane is overly influential on MNIST. Note that no external constraints induced this near u.a.r. misclassification rate. Section 6.3.5 compares the ability of influence estimators, with and without renormalization, to identify influential groups of training examples on non-adversarial, CIFAR10, binary classification with Figure 12’s results averaged across five class pairs. Two of the class pairs, airplane vs. bird and automobile vs. dog, were studied by Weber et al. [Web+23] in relation to certified defenses. The three other class pairs – cat vs. ship, frog vs. horse, and frog vs. truck – were selected at random. Wallace et al.’s [Wal+21] poisoning method attacks the SST-2 dataset [Soc+13]. We consider detection on 8 short movie reviews – four positive and four negative – all selected at random by Wallace et al.’s implementation. The specific reviews considered appear in Table D.55. Table D.55. SST-2 movie reviews selected by Wallace et al.’s [Wal+21] poisoning attack implementation. Sentiment No. Text ↑ 1 a delightful coming-of-age story . 2 a smart , witty follow-up . Positive ↓ 3 ahhhh ... 
revenge is sweet !4 a giggle a minute . ↑ 1 oh come on . 2 do not see this film . Negative ↓ 3 it ’s a buggy drag .4 or emptying rat traps . 236 The next section provides details regarding the adversarial datasets sizes. D.3.1.1 Training Set Sizes. Table D.56 details the dataset sizes used to train all evaluated models in Section 6.5. Table D.56. Chapter 6 target identification dataset sizes Dataset Attack # Classes # Train # Test CIFAR10 [KNH14] Poison 5 25,000 5,000 SST-25 [Soc+13] Poison 2 67,349 N/A Speech [Liu+18] Backdoor 10 3,0006 1,184 CIFAR10 [KNH14] Backdoor 2 10,000 2 Liu et al.’s [Liu+18] speech backdoor dataset includes training and test examples with their associated adversarial trigger already embedded. We used their adversarial dataset unchanged. Table D.57 details |Dadv| (i.e., adversarial training set size) for each speech digit pair after a fixed, random train-validation split. Table D.57. Number of backdoor training examples for each speech backdoor digit pair. As detailed above, Liu et al.’s [Liu+18] dataset provides 30 backdoored instances for each digit pair. The remainder of the 30 instances for each digit pair are part of the fixed, validation set. Digit Pair 0 → 1 1 → 2 2 → 3 3 → 4 4 → 5 5 → 6 6 → 7 7 → 8 8 → 9 8 → 9 |Dadv| 26 27 24 24 26 28 26 26 22 21 D.3.1.2 Target Set Sizes. Table D.58 details the sizes of the target and non-target sets considered in Section 6.5.3’s target identification experiments. Davis and Goadrich [DG06] explain that the class imbalance ratio between classes defines the unattainable regions in the precision-recall curve. By extension, this ratio also dictates the baseline AUPRC value if examples are labeled randomly. 5Stanford Sentiment Treebank dataset (SST-2) is used for sentiment analysis 6Clean only. Dataset also has 300 backdoored samples divided evenly among the 10 attack class pairs (e.g., 0 → 1, 1 → 2, etc.). 237 Table D.58. Target and non-target set sizes used in Section 6.5.3’s target identification experiments. Attack Type # Targets # Non-Targets Speech 10 220 Backdoor Vision 35 250 NLP 1 125 Poison Vision 1 450 D.3.2 Hyperparameters. This section details three primary hyperparameter types, namely: hyperparameters used to create adversarial set Dadv (if any), hyperparameter used when training model f , and influence estimator hyperparameters. D.3.2.1 Model Training. Table D.59 enumerates the hyperparameters used when training the models analyzed in Section 6.3. Table D.59. Renormalized influence model training hyperparameter settings Hyperparameter CIFAR10 & MNIST Filtering θ0 Pretrained? ✓* Data Augmentation? ✓ Validation Split 1 1 6 6 Optimizer Adam Adam |Dadv| 150 N/A Batch Size 64 64 # Epochs 10 10 # Subepochs (ω)7 5 3 η (Peak) 1 · 10−3 1 · 10−3 η Scheduler One cycle One cycle λ (Weight Decay) 1 · 10−3 1 · 10−3 Table D.60 enumerates the hyperparameters used when training the adversarially- attacked models analyzed in Sections 6.5. D.3.2.2 Upper-Tail Heaviness Hyperparameters. Section 6.4.2 defines the upper-tail heaviness of influence vector v as the κ-th largest anomaly score in 7We use the term “ω subepoch checkpointing” (ω ∈ Z+) to denote that iteration subset T is formed from ω evenly-spaced checkpoints within each epoch. ω was not tuned, and was selected based on overall execution time and compute availability. 8Varies by digit pair. See Table D.57. 238 Table D.60. Training-set attack model training hyperparameter settings Poison Backdoor Hyperparameter CIFAR10 SST-2 Speech CIFAR10 θ0 Pretrained? ✓ ✓ Existing Adv. 
Dataset                          ✓
Data Augmentation?               ✓
Validation Split                 1/6          Predefined    1/6          1/6
Optimizer                        SGD          Adam          SGD          Adam
|Dadv|                           50           50            21–28⁸       150
|Dcl|                            24,950       67,349        3,000        9,850
Poisoning Rate (|Dadv| / |D|)    0.20%        0.07%         0.99%        1.50%
Batch Size                       256          32            32           64
# Epochs                         30           4             30           10
# Subepochs (ω)                  5            3             3            5
η (Peak)                         1 · 10−3     1 · 10−5      1 · 10−3     1 · 10−3
η Scheduler                      One cycle    Poly. decay   One cycle    One cycle
λ (Weight Decay)                 1 · 10−1     1 · 10−1      1 · 10−3     1 · 10−3
Dropout Rate                     N/A          0.1           N/A          N/A

vector σ. Table D.61 defines the hyperparameter value κ used for each of Section 6.5.1's four attacks.

Table D.61. Upper-tail heaviness cutoff count (κ)

Attack     Type     Tail Count (κ)
Backdoor   Speech   10
           Vision   10
Poison     NLP      10
           Vision   2

D.3.2.3 Target-Driven Mitigation Hyperparameters. Algorithm 6 details our target-driven attack mitigation algorithm, which uses filtering cutoff hyperparameter ζ to tune how much data to filter in each filtering iteration. Table D.62 details the hyperparameter settings used in Section 6.5.4's attack mitigation experiments.

For each attack, multiple trials were performed with different target examples, class pairs, attack triggers, etc. For each such trial, we repeated the mitigation experiment multiple times to ensure representative numbers, with the number of repeats enumerated in Table D.62. In addition, cutoff threshold ζ was set to an initial value. After a specified number of iterations l, ζ was decreased by a specified step size. This process continued until the attack had been mitigated. To summarize, iteration l's mitigation cutoff value ζ_l is

\[
\zeta_l = \zeta_{\mathrm{initial}} - \psi \left\lfloor \frac{l}{\mathrm{StepCount}} \right\rfloor,
\tag{D.1}
\]

with the corresponding value of each parameter in Table D.62.

Table D.62. Target-driven attack mitigation hyperparameters

                             Poison                 Backdoor
Hyperparameter               CIFAR10    SST-2       Speech    CIFAR10
Repeats Per Trial            3          3           5         5
Initial Cutoff (ζinitial)    3          4           3         2
Anneal Step Size (ψ)         0.25       0.5         0.25      0.25
Anneal Step Count            1          1           4         4

D.3.2.4 Adversarial Set Dadv Crafting. Liu et al.'s [Liu+18] speech recognition dataset comes bundled with 300 backdoor training examples. The adversarial trigger takes the form of white noise inserted at the beginning of the speech recording. We used the dataset unchanged except for a fixed training/validation split used in all experiments. Only one backdoor digit pair (e.g., 0 → 1, 1 → 2, etc.) is considered at a time.

Weber et al. [Web+23] consider three different backdoor adversarial trigger types on CIFAR10 binary classification. The three attack patterns are:

1. 1 Pixel: The image's center pixel is perturbed to the maximum value.
2. 4 Pixel: Four specific pixels near the image's center had their pixel values increased by a fixed amount.
3. Blend: A fixed isotropic Gaussian-noise pattern (N(0, I)) across the entire image.

Table D.63 defines each attack pattern's maximum ℓ2 perturbation distance. Any perturbation that exceeded the pixel minimum/maximum values was clipped to the valid range.

Table D.63. CIFAR10 vision backdoor adversarial trigger maximum ℓ2-norm perturbation distance

Pattern   Max. ℓ2
1 Pixel   √3
4 Pixel   2
Blend     4

Wallace et al. [Wal+21] construct single-target natural language poison using the traditional poisoning bilevel optimization,

\[
\operatorname*{argmin}_{\mathcal{D}_{\mathrm{adv}}} \; \mathcal{L}_{\mathrm{adv}}\!\left( \hat{z}_{\mathrm{targ}};\; \operatorname*{argmin}_{\theta} \sum_{z_i \in \mathcal{D}_{\mathrm{cl}} \cup \mathcal{D}_{\mathrm{adv}}} \mathcal{L}(z_i; \theta) \right),
\tag{D.2}
\]

where Ladv is the attacker's adversarial loss function, Ladv : A × Y → R≥0, used in place of training loss function L [BNL12; Muñ+17]. To make the computation tractable, Wallace et al.
approximate inner minimizer, argminθ z∈D ∪D L(zi; θ),cl adv using second-order gradients similar to [FAL17; Wan+18; Hua+20]. Wallace et al.’s method initializes each poison instance from a seed phrase, and tokens are iteratively replaced with alternates that align well with the poison example’s gradient. Like Wallace et al., our experiments attacked sentiment analysis on the Stanford Sentiment Treebank v2 (SST-2) dataset [Soc+13]. We targeted 8 (4 positive & 4 negative – see Table D.55) reviews selected by Wallace et al.’s implementation and generated |Dadv| = 50 new poison in each trial. Zhu et al.’s [Zhu+19] targeted, clean-label attack crafts a set of poisons by forming a convex polytope around the target’s feature representation. Our experiments used the author’s open-source implementation when crafting the poison. Their 241 implementation is gray-box and assumes access to a known pre-trained network (excluding the randomly-initialized, linear classification layer). Both Zhu et al.’s [Zhu+19] andWallace et al.’s [Wal+21] poison crafting algorithms have their own dedicated hyperparameters, which are detailed in Tables D.64 and D.65 respectively. Note that Table D.65’s hyperparameters are taken unchanged from the original source code provided by Wallace et al. Table D.64. Convex polytope poison crafting [Zhu+19] hyperparameter settings Hyperparameter Value # Iterations 1,000 Learning Rate 4 · 10−2 Weight Decay 0 Max. Perturb. (ϵ) 0.1 Table D.65. SST-2 sentiment analysis poison crafting hyperparameter settings. These are identical to Wallace et al.’s [Wal+21] hyperparameter settings. Hyperparameter Value Optimizer Adam Total Num. Updates 20,935 # Warmup Updates 1,256 Max. Sentence Len. 512 Max. Batch Size 7 Learning Rate 1 · 10−5 LR Scheduler Polynomial Decay D.3.2.5 Baselines. Baselines for Identifying Adversarial Set Dadv We exclusively considered influence-estimation methods applicable to neural models and excluded influence methods specific to alternate architectures [BHL23]. Koh and Liang’s [KL17] influence functions estimator uses Pearlmutter’s [Pea94] stochastic Hessian-vector product (HVP) estimation algorithm. Pearlmutter’s 242 algorithm requires 5 hyperparameters, and we follow Koh and Liang’s notation for these parameters below. Influence functions’ five hyperparameters are required to ensure estimator quality and to prevent numerical instability/divergence. Table D.66 details the influence functions hyperparameters used for each of Section 6.5’s datasets. t and r were selected to make a single pass through the training set in accordance with the procedure specified by Koh and Liang. As noted by Basu et al. [BPF21], influence functions can be fragile on deep networks. We tuned β and γ to prevent HVP divergence, which is common with influence functions. Our influence functions implementation was adapted from the versions published by [Guo+21] and in the Python package pytorch influence functions.9 Table D.66. Influence functions hyperparameter settings Renormalization Poison Backdoor Hyperparameter CIFAR10 & MNIST Non-adv. CIFAR10 SST-2 Speech CIFAR10 Batch Size 1 1 1 1 1 1 Damp (β) 1 · 10−2 5 · 10−3 1 · 10−2 1 · 10−2 5 · 10−3 1 · 10−2 Scale (γ) 3 · 107 1 · 104 3 · 107 1 · 106 1 · 104 3 · 107 Recursion Depth (t) 1,000 1,000 2,500 6,740 1,000 1,000 Repeats (r) 10 10 10 10 10 10 Second-order influence functions [BYF20] are more brittle and computationally expensive than the first-order version. 
Renormalization is intended as a first-order correction and addresses our two tasks without the costs/issues related to second-order methods. Chen et al.’s [Che+21] HyDRA is an additional dynamic influence estimator. However, HyDRA’s O(np) memory complexity makes it impractical in most modern 9Package source code: https://github.com/nimarb/pytorch_influence_functions. 243 applications with large models and datasets. We focus on TracIn as its memory complexity is only O(n). HyDRA and TracIn were published contemporaneously and share the same core idea.10 Peri et al.’s [Per+20] Deep k-NN defense labels a training example as poison if its label does not match the plurality of its neighbors. For Deep k-NN to accurately identify poison, it must generally hold that k > 2|Dadv|. Peri et al. propose selecting k using the normalized k-ratio, k/N , where N is the size of the largest class in D. Peri et al.’s ablation study showed that Deep k-NN generally performed best when the normalized k-ratio was in the range [0.2, 2]. To ensure a strong baseline, our experiments tested Deep k-NN with three normalized k-ratio values, {0.2, 1, 2},11 and we report the top-performing k’s result. Baselines Identifying the Attack Target(s) Target identification baselines maximum and minimum k-NN distance depend on k in order to generate target rankings. Given k’s similarity to our tail cutoff count κ, we use the same hyperparameter settings for both with the values in Table D.61. D.3.3 Network Architectures. Table D.67 details the CIFAR10 neural network architecture. Specifically, we used Page’s [Pag20] ResNet9 architecture, which is the state-of-the-art for fast, high-accuracy (>94%) CIFAR10 classification on DAWNBench [Col+17] at the time of writing. Following Wallace et al. [Wal+21], natural language poisoning attacked Liu et al.’s [Liu+20b] RoBERTaBASE pre-trained parameters. All language model training used Facebook AI Research’s fairseq sequence-to-sequence toolkit [Ott+19] as specified 10Our influence renormalization – proposed in Section 6.3 – also applies to HyDRA. 11This corresponds to k ∈ {833, 4167, 8333} for 25,000 CIFAR10 training examples and a 1 6 validation split ratio. 244 by Wallace et al. The text was encoded using Radford et al.’s [Rad+19] byte-pair encoding (BPE) scheme. The speech classification convolutional neural network is identical to that used by Liu et al. [Liu+18] except for two minor changes. First, batch normalization [IS15] was used instead of dropout to expedite training convergence. In addition, each convolutional layer’s kernel count was halved to allow the model to be trained on a single NVIDIA Tesla K80 GPU. Table D.67. Simplified ResNet9 neural network architecture used for Sec. 6.5’s CIFAR10 binary classification Conv1 In=3 Out=64 Kernel=3× 3 Pad=1 BatchNorm2D Out=64 ReLU Conv2 In=64 Out=128 Kernel=3× 3 Pad=1 BatchNorm2D Out=128 ReLU MaxPool2D 2× 2 ↑ ConvA In=128 Out=128 Kernel=3× 3 Pad=1BatchNorm2D Out=128 ReLU ResNet1 ↓ ConvB In=128 Out=128 Kernel=3× 3 Pad=1BatchNorm2D Out=128 ReLU Conv3 In=128 Out=256 Kernel=3× 3 Pad=1 BatchNorm2D Out=256 ReLU MaxPool2D 2× 2 Conv4 In=256 Out=512 Kernel=3× 3 Pad=1 BatchNorm2D Out=512 ReLU MaxPool2D 2× 2 ↑ ConvA In=512 Out=512 Kernel=3× 3 Pad=1BatchNorm2D Out=512 ReLU ResNet2 ↓ ConvB In=512 Out=512 Kernel=3× 3 Pad=1BatchNorm2D Out=512 ReLU MaxPool2D 2× 2 Linear Out=10 245 Table D.68. 
Speech recognition convolutional neural network Conv1 In=3 Out=48 Kernel=11× 11 Pad=1 MaxPool2D 3× 3 BatchNorm2D Out=48 Conv2 In=48 Out=128 Kernel=5× 5 Pad=2 MaxPool2D 3× 3 BatchNorm2D Out=128 Conv3 In=128 Out=192 Kernel=3× 3 Pad=1 ReLU BatchNorm2D Out=192 Conv4 In=192 Out=192 Kernel=3× 3 Pad=1 ReLU BatchNorm2D Out=192 Conv5 In=192 Out=128 Kernel=3× 3 Pad=1 ReLU MaxPool2D 3× 3 BatchNorm2D Out=128 Linear Out=10 246 REFERENCES CITED [Aga+19] Shruti Agarwal, Hany Farid, Yuming Gu, Mingming He, Koki Nagano, and Hao Li. “Protecting World Leaders Against Deep Fakes”. In: Proceedings of the CVPR Workshop on Media Forensics. Long Beach, California, 2019. [Agh+20] Hojjat Aghakhani, Thorsten Eisenhofer, Lea Schönherr, Dorothea Kolossa, Thorsten Holz, Christopher Kruegel, and Giovanni Vigna. “VENOMAVE: Clean-Label Poisoning Against Speech Recognition”. In: (2020). arXiv: 2010.10682 [cs.SD]. [AKA91] David W. Aha, Dennis Kibler, and Marc K. Albert. “Instance-Based Learning Algorithms”. In: Machine Learning 6.1 (1991), pp. 37–66. [ACW18] Anish Athalye, Nicholas Carlini, and David A. Wagner. “Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples”. In: ICML’18 (2018). url: https://arxiv. org/abs/1802.00420. [Awa+18] Edmond Awad, Sohan Dsouza, Richard Kim, Jonathan Schulz, Joseph Henrich, Azim Shariff, Jean-François Bonnefon, and Iyad Rahwan. “The Moral Machine Experiment”. In: Nature 563.7729 (2018), pp. 59–64. [Bai+21] Tao Bai, Jinqi Luo, Jun Zhao, Bihan Wen, and Qian Wang. “Recent Advances in Adversarial Training for Adversarial Robustness”. In: Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence. IJCAI’21. 2021. doi: 10.24963/ijcai.2021/591. url: https://arxiv.org/abs/2102.01356. [BL78] Vic Barnett and Toby Lewis. Outliers in Statistical Data. 2nd edition. Hoboken, New Jersey, USA: John Wiley & Sons Ltd., 1978. [Bar17] Jonathan T. Barron. Continuously Differentiable Exponential Linear Units. 2017. arXiv: 1704.07483 [cs.LG]. [BBD20] Elnaz Barshan, Marc-Etienne Brunet, and Gintare Karolina Dziugaite. “RelatIF: Identifying Explanatory Training Samples via Relative Influence”. In: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics. AISTATS’20. 2020. [BPF21] Samyadeep Basu, Phil Pope, and Soheil Feizi. “Influence Functions in Deep Learning Are Fragile”. In: Proceedings of the 9th International Conference on Learning Representations. ICLR’21. Virtual Only, 2021. 247 [BYF20] Samyadeep Basu, Xuchen You, and Soheil Feizi. “On Second-Order Group Influence Functions for Black-Box Predictions”. In: Proceedings of the 37th International Conference on Machine Learning. ICML’20. Virtual Only: PMLR, 2020. [BT74] Albert E. Beaton and John W. Tukey. “The Fitting of Power Series, Meaning Polynomials, Illustrated on Band-Spectroscopic Data”. In: Technometrics 16.2 (1974), pp. 147–185. [Ben75] Jon L. Bentley. A Survey of Techniques for Fixed Radius Near Neighbor Searching. Tech. rep. Stanford, CA, USA: Stanford University, 1975. [BNL12] Battista Biggio, Blaine Nelson, and Pavel Laskov. “Poisoning Attacks against Support Vector Machines”. In: Proceedings of the 29th International Conference on Machine Learning. ICML’12. Edinburgh, Great Britain: PMLR, 2012. url: https://arxiv.org/abs/1206. 6389. [BR92] Avrim L. Blum and Ronald L. Rivest. “Training a 3-Node Neural Network is NP-Complete”. In: Neural Networks 5.1 (1992), pp. 117–127. [Blu+73] Manuel Blum, Robert W. Floyd, Vaughan Pratt, Ronald L. 
Rivest, and Robert E. Tarjan. “Time Bounds for Selection”. In: Journal of Computer and System Sciences 7.4 (1973), pp. 448–461. issn: 0022- 0000. [BHL23] Jonathan Brophy, Zayd Hammoudeh, and Daniel Lowd. “Adapting and Evaluating Influence-Estimation Methods for Gradient-Boosted Decision Trees”. In: Journal of Machine Learning Research 24 (2023), pp. 1–48. url: http://jmlr.org/papers/v24/22-0449.html. [Buc+18] Jacob Buckman, Aurko Roy, Colin Raffel, and Ian Goodfellow. “Thermometer Encoding: One Hot Way To Resist Adversarial Examples”. In: Proceedings of the 6th International Conference on Learning Representations. ICLR’18. 2018. url: https://openreview. net/forum?id=S18Su--CW. [Cal+21] Stefano Calzavara, Claudio Lucchese, Federico Marcuzzi, and Salvatore Orlando. “Feature Partitioning for Robust Tree Ensembles and their Certification in Adversarial Scenarios”. In: EURASIP Journal on Information Security (Dec. 2021), pp. 245–317. [Car19] Nicholas Carlini. On Evaluating Adversarial Robustness. 2019. url: https://youtu.be/-p2il-V-0fk?t=1574. 248 [Car+23] Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette- Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramèr. Poisoning Web-Scale Training Datasets is Practical. arXiv. 2023. eprint: 2302.10149 (cs.CR). url: https://arxiv.org/abs/2302.10149. [CW17] Nicholas Carlini and David Wagner. “Towards Evaluating the Robustness of Neural Networks”. In: Proceedings of the 2017 IEEE Symposium on Security and Privacy. SP’17. IEEE Computer Society, 2017. doi: 10 . 1109 / SP . 2017 . 49. url: https : / / doi . ieeecomputersociety.org/10.1109/SP.2017.49. [Che+19] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian Molloy, and Biplav Srivastava. “Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering”. In: Proceedings of the AAAI Workshop on Artificial Intelligence Safety. SafeAI’19. Honolulu, Hawaii, USA: Association for the Advancement of Artificial Intelligence, 2019. [Che+22] Ruoxin Chen, Zenan Li, Jie Li, Chentao Wu, and Junchi Yan. “On Collective Robustness of Bagging Against Data Poisoning”. In: Proceedings of the 39th International Conference on Machine Learning. ICML’22. PMLR, 2022. [CG16] Tianqi Chen and Carlos Guestrin. “XGBoost: A Scalable Tree Boosting System”. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD’16. New York, NY, USA: Association for Computing Machinery, 2016. [Che+17] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. 2017. arXiv: 1712.05526 [cs.CR]. [Che+21] Yuanyuan Chen, Boyang Li, Han Yu, Pengcheng Wu, and Chunyan Miao. “HyDRA: Hypergradient Data Relevance Analysis for Interpreting Deep Neural Networks”. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence. AAAI’21. Virtual Only: Association for the Advancement of Artificial Intelligence, 2021. [CCM13] Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust High Dimensional Sparse Regression and Matching Pursuit. arXiv. 2013. eprint: 1301.2725 (stat.ML). [Chi+20] Ping-yeh Chiang, Renkun Ni, Ahmed Abdelkader, Chen Zhu, Christoph Studor, and Tom Goldstein. “Certified Defenses for 249 Adversarial Patches”. In: Proceedings of the 8th International Conference on Learning Representations. ICLR’20. Virtual Only, 2020. url: https://openreview.net/forum?id=HyeaSkrYPH. [Chv79] Vasek Chvatal. 
“A Greedy Heuristic for the Set-Covering Problem”. In: Mathematics of Operations Research 4.3 (1979), pp. 233–235. issn: 0364-765X. [Coc11] Dean De Cock. “Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project”. In: Journal of Statistics Education 19.3 (2011). [CRK19] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. “Certified Adversarial Robustness via Randomized Smoothing”. In: Proceedings of the 36th International Conference on Machine Learning. ICML’19. PMLR, 2019. url: https://arxiv.org/abs/1902.02918. [Col+17] Cody A. Coleman, Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle Olukotun, Chris Ré, and Matei Zaharia. “DAWNBench: An End-to-End Deep Learning Benchmark and Competition”. In: Proceedings of the 2017 NeurIPS Workshop on Machine Learning Systems. Long Beach, California, USA: Curran Associates, Inc., 2017. [Col22] Kevin Collier. Former Twitter Employee Sentenced to More than Three Years in Prison for spying for Saudi Arabia. Dec. 2022. url: https: //www.nbcnews.com/tech/security/former-twitter-employee- sentenced-three-years-prison-spying-saudi-arab-rcna61384. [Coo77] R. Dennis Cook. “Detection of Influential Observation in Linear Regression”. In: Technometrics 19.1 (1977), pp. 15–18. [CW82] R. Dennis Cook and Sanford Weisberg. Residuals and Influence in Regression. New York: Chapman and Hall, 1982. isbn: 041224280X. [CR92] Christophe Croux and Peter J. Rousseeuw. “A Class of High- Breakdown Scale Estimators Based on Subranges”. In: Communications in Statistics - Theory and Methods 21.7 (1992), pp. 1935–1951. [DAm+20] Alexander D’Amour, Katherine A. Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yi-An Ma, Cory Y. McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Natarajan, Christopher Nielson, 250 Thomas F. Osborne, Rajiv Raman, Kim Ramasamy, Rory Sayres, Jessica Schrouff, Martin Seneviratne, Shannon Sequeira, Harini Suresh, Victor Veitch, Max Vladymyrov, Xuezhi Wang, Kellie Webster, Steve Yadlowsky, Taedong Yun, Xiaohua Zhai, and D. Sculley. Underspecification Presents Challenges for Credibility in Modern Machine Learning. 2020. arXiv: 2011.03395 [cs.LG]. [DPV08] Sanjoy Dasgupta, Christos H. Papadimitriou, and Umesh V. Vazirani. Algorithms. McGraw-Hill, 2008. isbn: 978-0-07-352340-8. [DG06] Jesse Davis and Mark Goadrich. “The Relationship Between Precision- Recall and ROC Curves”. In: Proceedings of the 23rd International Conference on Machine Learning. ICML’06. Pittsburgh, Pennsylvania: PMLR, 2006. [Dhi+18] Guneet S. Dhillon, Kamyar Azizzadenesheli, Jeremy D. Bernstein, Jean Kossaifi, Aran Khanna, Zachary C. Lipton, and Animashree Anandkumar. “Stochastic Activation Pruning for Robust Adversarial Defense”. In: Proceedings of the 6th International Conference on Learning Representations. ICLR’18. 2018. url: https://openreview.net/ forum?id=H1uR4GZRZ. [Ebr+18] Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. “HotFlip: White-Box Adversarial Examples for Text Classification”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. ACL’18. 2018. [EK10] Tapio Elomaa and Jussi Kujala. “Covering Analysis of the Greedy Algorithm for Partial Cover”. In: Algorithms and Applications: Essays Dedicated to Esko Ukkonen on the Occasion of His 60th Birthday. 
Berlin, Heidelberg: Springer-Verlag, 2010, pp. 102–113. isbn: 3642124755. [Fel20] Vitaly Feldman. “Does Learning Require Memorization? A Short Tale about a Long Tail”. In: Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing. STOC’20. 2020. [FZ20] Vitaly Feldman and Chiyuan Zhang. “What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation”. In: Proceedings of the 34th Conference on Neural Information Processing Systems. NeurIPS’20. Virtual Only: Curran Associates, Inc., 2020. [FAL17] Chelsea Finn, Pieter Abbeel, and Sergey Levine. “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks”. In: Proceedings of the 34th International Conference on Machine Learning. ICML’17. Sydney, Australia: PMLR, 2017. 251 [FB81] Martin A. Fischler and Robert C. Bolles. “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography”. In: Communications of the ACM 24.6 (1981), pp. 381–395. issn: 0001-0782. [Fow+21] Liam Fowl, Micah Goldblum, Ping-yeh Chiang, Jonas Geiping, Wojtek Czaja, and Tom Goldstein. “Adversarial Examples Make Strong Poisons”. In: Proceedings of the 35th Conference on Neural Information Processing Systems. NeurIPS’21. Virtual Only: Curran Associates, Inc., 2021. [Gao+19] Yansong Gao, Change Xu, Derui Wang, Shiping Chen, Damith C. Ranasinghe, and Surya Nepal. “STRIP: A Defence against Trojan Attacks on Deep Neural Networks”. In: Proceedings of the 35th Annual Computer Security Applications Conference. ACSAC’19. San Juan, Puerto Rico, USA: Association for Computing Machinery, 2019. [Gei+21] Jonas Geiping, Liam Fowl, W. Ronny Huang, Wojciech Czaja, Gavin Taylor, Michael Moeller, and Tom Goldstein. “Witches’ Brew: Industrial Scale Data Poisoning via Gradient Matching”. In: Proceedings of the 9th International Conference on Learning Representations. ICLR’21. Virtual Only, 2021. [Gei+20] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. “Shortcut Learning in Deep Neural Networks”. In: Nature Machine Intelligence 2.11 (2020), pp. 665–673. [GEW06] Pierre Geurts, Damien Ernst, and Louis Wehenkel. “Extremely Randomized Trees”. In: Machine Learning 63.1 (2006), pp. 3–42. issn: 1573-0565. [GSS15] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. “Explaining and Harnessing Adversarial Examples”. In: Proceedings of the 3rd International Conference on Learning Representations. ICLR’15. 2015. url: https://arxiv.org/abs/1412.6572. [GF17] Bryce Goodman and Seth Flaxman. “European Union Regulations on Algorithmic Decision-Making and a ‘Right to Explanation’”. In: AI Magazine 38.3 (Oct. 2017), pp. 50–57. [Gow+19] Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Relja Arandjelovic, Timothy Arthur Mann, and Pushmeet Kohli. “Scalable Verified Training for Provably Robust Image Classification”. In: Proceedings of the 2019 IEEE/CVF 252 International Conference on Computer Vision. ICCV’19. 2019. doi: 10.1109/ICCV.2019.00494. [Gu+19] Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. “BadNets: Evaluating Backdooring Attacks on Deep Neural Networks”. In: IEEE Access 7 (2019), pp. 47230–47244. url: https : / / ieeexplore.ieee.org/document/8685687. [Guo+20] Chuan Guo, Tom Goldstein, Awni Y. Hannun, and Laurens van der Maaten. “Certified Data Removal from Machine Learning Models”. In: Proceedings of the 37th International Conference on Machine Learning. Vol. 
119. ICML’20. 2020, pp. 3832–3842. [Guo+18] Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. “Countering Adversarial Images using Input Transformations”. In: Proceedings of the 6th International Conference on Learning Representations. ICLR’18. 2018. url: https://openreview.net/ forum?id=SyJ7ClWCb. [Guo+21] Han Guo, Nazneen Rajani, Peter Hase, Mohit Bansal, and Caiming Xiong. “FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. EMNLP’21. 2021. [Gur22] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual. 2022. url: https://www.gurobi.com. [HL21] Zayd Hammoudeh and Daniel Lowd. “Simple, Attack-Agnostic Defense Against Targeted Training Set Attacks Using Cosine Similarity”. In: Proceedings of the 3rd ICML Workshop on Uncertainty and Robustness in Deep Learning. UDL’21. 2021. [HL22a] Zayd Hammoudeh and Daniel Lowd. “Identifying a Training-Set Attack’s Target Using Renormalized Influence Estimation”. In: Proceedings of the 29th ACM SIGSAC Conference on Computer and Communications Security. CCS’22. Los Angeles, CA: Association for Computing Machinery, 2022. url: https://arxiv.org/abs/2201. 10055. [HL22b] Zayd Hammoudeh and Daniel Lowd. “Training Data Influence Analysis and Estimation: A Survey”. In: arXiv (2022). arXiv: 2212.04612 [cs.LG]. 253 [HL23a] Zayd Hammoudeh and Daniel Lowd. “Feature Partition Aggregation: A Fast Certified Defense Against a Union of ℓ0 Attacks”. In: Proceedings of the 2nd ICML Workshop on New Frontiers in Adversarial Machine Learning. AdvML-Frontiers’23. 2023. url: https://arxiv.org/abs/ 2302.11628. [HL23b] Zayd Hammoudeh and Daniel Lowd. Feature Partition Aggregation: A Fast Certified Defense Against a Union of Sparse Adversarial Attacks. 2023. arXiv: 2302.11628 [cs.LG]. [HL23c] Zayd Hammoudeh and Daniel Lowd. “Reducing Certified Regression to Certified Classification for General Poisoning Attacks”. In: Proceedings of the 1st IEEE Conference on Secure and Trustworthy Machine Learning. SaTML’23. 2023. url: https://arxiv.org/abs/2208. 13904. [HNM19] Satoshi Hara, Atsushi Nitanda, and Takanori Maehara. “Data Cleansing for Models Trained with SGD”. In: Proceedings of the 33rd Conference on Neural Information Processing Systems. NeurIPS’19. Vancouver, Canada: Curran Associates, Inc., 2019. [Hea+21] Tim Head, Manoj Kumar, Holger Nahrstaedt, Gilles Louppe, and Iaroslav Shcherbatyi. scikit-optimize: Sequential model-based optimization in Python. Version v0.9.0. 2021. [HA04] Victoria J. Hodge and Jim Austin. “A Survey of Outlier Detection Methodologies”. In: Artificial Intelligence Review 22.2 (Oct. 2004), pp. 85–126. [Hop+17] Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt. UCI Machine Learning Repository: Spambase Dataset. 2017. url: https: //archive.ics.uci.edu/ml/datasets/spambase. [Hua+20] W. Ronny Huang, Jonas Geiping, Liam Fowl, Gavin Taylor, and Tom Goldstein. “MetaPoison: Practical General-purpose Clean-label Data Poisoning”. In: Proceedings of the 34th Conference on Neural Information Processing Systems. NeurIPS’20. Virtual Only: Curran Associates, Inc., 2020. url: https://arxiv.org/abs/2004.00225. [Hub64] Peter J. Huber. “Robust Estimation of a Location Parameter”. In: Annals of Mathematical Statistics 35.1 (1964), pp. 73–101. [Ily+19] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. “Adversarial Examples Are Not Bugs, They Are Features”. 
[IS15] Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”. In: Proceedings of the 32nd International Conference on Machine Learning. ICML’15. Lille, France: PMLR, 2015.
[Jag+18] Matthew Jagielski, Alina Oprea, Battista Biggio, Chang Liu, Cristina Nita-Rotaru, and Bo Li. “Manipulating Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning”. In: Proceedings of the 2018 IEEE Symposium on Security and Privacy. SP’18. 2018.
[Jag+21] Matthew Jagielski, Giorgio Severi, Niklas Pousette Harger, and Alina Oprea. “Subpopulation Data Poisoning Attacks”. In: Proceedings of the 28th ACM SIGSAC Conference on Computer and Communications Security. CCS’21. Virtual Only: Association for Computing Machinery, 2021.
[JCG21] Jinyuan Jia, Xiaoyu Cao, and Neil Zhenqiang Gong. “Intrinsic Certified Robustness of Bagging against Data Poisoning Attacks”. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence. AAAI’21. 2021.
[Jia+22a] Jinyuan Jia, Yupei Liu, Xiaoyu Cao, and Neil Zhenqiang Gong. “Certified Robustness of Nearest Neighbors against Data Poisoning and Backdoor Attacks”. In: Proceedings of the 36th AAAI Conference on Artificial Intelligence. AAAI’22. 2022. url: https://arxiv.org/abs/2012.03765.
[Jia+22b] Jinyuan Jia, Binghui Wang, Xiaoyu Cao, Hongbin Liu, and Neil Zhenqiang Gong. “Almost Tight ℓ0-norm Certified Robustness of Top-k Predictions against Adversarial Perturbations”. In: Proceedings of the 10th International Conference on Learning Representations. ICLR’22. 2022. url: https://openreview.net/forum?id=gJLEXy3ySpu.
[Jin+20] Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. “Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment”. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence. AAAI’20. 2020.
[Joh74] David S. Johnson. “Approximation Algorithms for Combinatorial Problems”. In: Journal of Computer and System Sciences 9.3 (1974), pp. 256–278.
[JW78] John E. Dennis Jr. and Roy E. Welsch. “Techniques for nonlinear least squares and robust regression”. In: Communications in Statistics - Simulation and Computation 7.4 (1978), pp. 345–359.
[Ke+17] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. “LightGBM: A Highly Efficient Gradient Boosting Decision Tree”. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NeurIPS’17. 2017.
[KB15] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: Proceedings of the 3rd International Conference on Learning Representations. ICLR’15. 2015.
[KKM18] Adam R. Klivans, Pravesh K. Kothari, and Raghu Meka. “Efficient Algorithms for Outlier-Robust Regression”. In: Proceedings of the 31st Conference on Learning Theory. COLT’18. PMLR, 2018.
[KL17] Pang Wei Koh and Percy Liang. “Understanding Black-box Predictions via Influence Functions”. In: Proceedings of the 34th International Conference on Machine Learning. ICML’17. Sydney, Australia: PMLR, 2017.
[Kol+19] Soheil Kolouri, Aniruddha Saha, Hamed Pirsiavash, and Heiko Hoffmann. “Universal Litmus Patterns: Revealing Backdoor Attacks in CNNs”. In: Proceedings of the 32nd Conference on Computer Vision and Pattern Recognition. CVPR’19. Long Beach, California, USA, 2019.
[KNH14] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 Dataset. 2014.
[Kum+20] Ram Shankar Siva Kumar, Magnus Nyström, John Lambert, Andrew Marshall, Mario Goertzel, Andi Comissoneru, Matt Swann, and Sharon Xia. “Adversarial Machine Learning – Industry Perspectives”. In: Proceedings of the 2020 IEEE Security and Privacy Workshops. SPW’20. 2020. url: https://arxiv.org/abs/2002.05646.
[KGB16] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. “Adversarial Examples in the Physical World”. In: (2016). arXiv: 1607.02533. url: http://arxiv.org/abs/1607.02533.
[LSF21] Cassidy Laidlaw, Sahil Singla, and Soheil Feizi. “Perceptual Adversarial Robustness: Defense Against Unseen Threat Models”. In: Proceedings of the 9th International Conference on Learning Representations. ICLR’21. Virtual Only, 2021. url: https://arxiv.org/abs/2006.12655.
[Lec89] Yvan G. Leclerc. “Constructing Simple Stable Descriptions for Image Partitioning”. In: International Journal of Computer Vision 3.1 (1989), pp. 73–102.
[LeC+98] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. “Gradient-Based Learning Applied to Document Recognition”. In: Proceedings of the IEEE. Vol. 86. 1998, pp. 2278–2324.
[Léc+19] Mathias Lécuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. “Certified Robustness to Adversarial Examples with Differential Privacy”. In: Proceedings of the 2019 IEEE Symposium on Security and Privacy. SP’19. IEEE, 2019. url: https://arxiv.org/abs/1802.03471.
[Lee+19] Guang-He Lee, Yang Yuan, Shiyu Chang, and Tommi Jaakkola. “Tight Certificates of Adversarial Robustness for Randomly Smoothed Classifiers”. In: Proceedings of the 33rd Conference on Neural Information Processing Systems. NeurIPS’19. 2019. url: https://arxiv.org/abs/1906.04948.
[Lee16] Peter Lee. Learning from Tay’s Introduction. Mar. 2016. url: https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/.
[LLP20] Sungyoon Lee, Jaewook Lee, and Saerom Park. “Lipschitz-Certifiable Training with a Tight Outer Bound”. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. NeurIPS’20. Red Hook, NY, USA: Curran Associates Inc., 2020.
[Lev+11] Kirill Levchenko, Andreas Pitsillidis, Neha Chachra, Brandon Enright, Márk Félegyházi, Chris Grier, Tristan Halvorson, Chris Kanich, Christian Kreibich, He Liu, Damon McCoy, Nicholas C. Weaver, Vern Paxson, Geoffrey M. Voelker, and Stefan Savage. “Click Trajectories: End-to-End Analysis of the Spam Value Chain”. In: 2011 IEEE Symposium on Security and Privacy. SP’11 (2011), pp. 431–446.
[LF20a] Alexander Levine and Soheil Feizi. “(De)Randomized Smoothing for Certifiable Defense against Patch Attacks”. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. NeurIPS’20. Red Hook, NY, USA: Curran Associates Inc., 2020. url: https://arxiv.org/abs/2002.10733.
[LF20b] Alexander Levine and Soheil Feizi. “Robustness Certificates for Sparse Adversarial Attacks by Randomized Ablation”. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence. AAAI Press, 2020. url: https://arxiv.org/abs/1911.09272.
[LF21] Alexander Levine and Soheil Feizi. “Deep Partition Aggregation: Provable Defenses against General Poisoning Attacks”. In: Proceedings of the 9th International Conference on Learning Representations. ICLR’21. Virtual Only, 2021. url: https://arxiv.org/abs/2006.14768.
[Li+19] Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. “Certified Adversarial Robustness with Additive Noise”. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. NeurIPS’19. Red Hook, NY, USA: Curran Associates Inc., 2019. url: https://arxiv.org/abs/1809.03113.
[LXL23] Linyi Li, Tao Xie, and Bo Li. “SoK: Certified Robustness for Deep Neural Networks”. In: Proceedings of the 44th IEEE Symposium on Security and Privacy. SP’23. IEEE, 2023. url: https://arxiv.org/abs/2009.04131.
[LDD21] Xiling Li, Rafael Dowsley, and Martine De Cock. “Privacy-Preserving Feature Selection with Secure Multiparty Computation”. In: Proceedings of the 38th International Conference on Machine Learning. ICML’21. 2021. url: https://arxiv.org/abs/2102.03517.
[Li+21] Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. “Anti-Backdoor Learning: Training Clean Models on Poisoned Data”. In: Proceedings of the 35th Conference on Neural Information Processing Systems. NeurIPS’21. Virtual Only: Curran Associates, Inc., 2021.
[Li+22] Yiming Li, Baoyuan Wu, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. “Backdoor Learning: A Survey”. In: IEEE Transactions on Neural Networks and Learning Systems (2022). doi: 10.1109/TNNLS.2022.3182979. url: https://arxiv.org/abs/2007.08745.
[Lin+20] Junyu Lin, Lei Xu, Yingqi Liu, and Xiangyu Zhang. “Composite Backdoor Attack for Deep Neural Network by Mixing Existing Benign Features”. In: Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. CCS’20. Virtual Only: Association for Computing Machinery, 2020.
[LCY14] Min Lin, Qiang Chen, and Shuicheng Yan. “Network in Network”. In: Proceedings of the 2nd International Conference on Learning Representations. ICLR’14. 2014. url: https://arxiv.org/abs/1312.4400.
[Liu+17] Chang Liu, Bo Li, Yevgeniy Vorobeychik, and Alina Oprea. “Robust Linear Regression Against Training Data Poisoning”. In: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. AISec’17. New York, NY, USA: Association for Computing Machinery, 2017.
[LDG18] Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. “Fine-Pruning: Defending Against Backdooring Attacks on Deep Neural Networks”. In: Proceedings of the International Symposium on Research in Attacks, Intrusions, and Defenses. RAID’18. Heraklion, Crete, Greece: Springer, 2018, pp. 273–294.
[Liu+20a] Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis. “High Dimensional Robust Sparse Regression”. In: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics. AISTATS’20. 2020.
[Liu+18] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. “Trojaning Attack on Neural Networks”. In: Proceedings of the 25th Annual Network and Distributed System Security Symposium. NDSS’18. San Diego, California, USA, 2018.
[Liu+20b] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. “RoBERTa: A Robustly Optimized BERT Pretraining Approach”. In: Proceedings of the 8th International Conference on Learning Representations. ICLR’20. Virtual Only, 2020.
[LXS17] Yuntao Liu, Yang Xie, and Ankur Srivastava. “Neural Trojans”. In: Proceedings of the 2017 IEEE International Conference on Computer Design. ICCD’17. 2017. doi: 10.1109/ICCD.2017.16. url: https://ieeexplore.ieee.org/document/8119189.
[Lov75] László Lovász. “On the Ratio of Optimal Integral and Fractional Covers”. In: Discrete Mathematics 13.4 (1975), pp. 383–390.
[Ma+18] Xingjun Ma, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi Wijewickrema, Grant Schoenebeck, Michael E. Houle, Dawn Song, and James Bailey. “Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality”. In: Proceedings of the 6th International Conference on Learning Representations. ICLR’18. 2018. url: https://arxiv.org/abs/1801.02613.
[Mad+18] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. “Towards Deep Learning Models Resistant to Adversarial Attacks”. In: Proceedings of the 6th International Conference on Learning Representations. ICLR’18. 2018. url: https://arxiv.org/abs/1706.06083.
[MWK20] Pratyush Maini, Eric Wong, and J. Zico Kolter. “Adversarial Robustness Against the Union of Multiple Perturbation Models”. In: Proceedings of the 37th International Conference on Machine Learning. ICML’20. 2020. url: https://arxiv.org/abs/1909.04068.
[Mal+21] Andrey Malinin, Neil Band, Yarin Gal, Mark Gales, Alexander Ganshin, German Chesnokov, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina, Vyas Raina, Denis Roginskiy, Mariya Shmatova, Panagiotis Tigas, and Boris Yangel. “Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks”. In: Proceedings of the 35th Conference on Neural Information Processing Systems. NeurIPS’21. Curran Associates, Inc., 2021.
[MRA22] Neil G. Marchant, Benjamin I. P. Rubinstein, and Scott Alfeld. “Hard to Forget: Poisoning Attacks on Certified Machine Unlearning”. In: Proceedings of the 36th AAAI Conference on Artificial Intelligence. AAAI’22. 2022.
[MY21] Jan Hendrik Metzen and Maksym Yatsura. “Efficient Certified Defenses Against Patch Attacks on Image Classifiers”. In: Proceedings of the 9th International Conference on Learning Representations. ICLR’21. 2021. url: https://openreview.net/forum?id=hr-3PMvDpil.
[Muñ+17] Luis Muñoz-González, Battista Biggio, Ambra Demontis, Andrea Paudice, Vasin Wongrassamee, Emil C Lupu, and Fabio Roli. “Towards Poisoning of Deep Learning Algorithms with Back-gradient Optimization”. In: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. AISec’17. Dallas, Texas, USA: Association for Computing Machinery, 2017.
[Mur16] Madhumita Murgia. “Microsoft’s Racist Bot Shows We Must Teach AI to Play Nice and Police Themselves”. In: The Telegraph (Mar. 2016). url: https://www.telegraph.co.uk/technology/2016/03/25/we-must-teach-ai-machines-to-play-nice-and-police-themselves/.
[NP33] Jerzy Neyman and Egon S. Pearson. “On the Problem of the Most Efficient Tests of Statistical Hypotheses”. In: Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 231 (1933), pp. 289–337. issn: 02643952. url: http://www.jstor.org/stable/91247.
[Ott+19] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. “fairseq: A Fast, Extensible Toolkit for Sequence Modeling”. In: Proceedings of NAACL-HLT 2019: Demonstrations. 2019.
[Pag20] David Page. How to Train Your ResNet. May 2020. url: https://myrtle.ai/learn/how-to-train-your-resnet/.
[Par+23] Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, and Aleksander Madry. “TRAK: Attributing Model Behavior at Scale”. In: Proceedings of the 40th International Conference on Machine Learning. ICML’23. 2023. url: https://arxiv.org/abs/2303.14186.
[Pas+19] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. “PyTorch: An Imperative Style, High-Performance Deep Learning Library”. In: Proceedings of the 33rd Conference on Neural Information Processing Systems. NeurIPS’19. Vancouver, Canada: Curran Associates, Inc., 2019. url: https://arxiv.org/abs/1912.01703.
[Pea+22] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. “Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions”. In: Proceedings of the 43rd IEEE Symposium on Security and Privacy. SP’22. IEEE, 2022. url: https://arxiv.org/abs/2108.09293.
[Pea+23] Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. “Examining Zero-Shot Vulnerability Repair with Large Language Models”. In: Proceedings of the 44th IEEE Symposium on Security and Privacy. SP’23. IEEE, 2023. url: https://arxiv.org/abs/2112.02125.
[Pea94] Barak A. Pearlmutter. “Fast Exact Multiplication by the Hessian”. In: Neural Computation 6 (1994), pp. 147–160.
[Ped+11] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
[Per+20] Neehar Peri, Neal Gupta, W. Ronny Huang, Liam Fowl, Chen Zhu, Soheil Feizi, Tom Goldstein, and John P. Dickerson. “Deep k-NN Defense Against Clean-label Data Poisoning Attacks”. In: Proceedings of the ECCV Workshop on Adversarial Robustness in the Real World. AROW’20. Virtual Only, 2020. url: https://arxiv.org/abs/1909.13374.
[Pie21] Eric Pierce. Austin, TX House Listings. 2021. url: https://www.kaggle.com/datasets/ericpierce/austinhousingprices.
[Pit+09] James Pita, Manish Jain, Fernando Ordóñez, Christopher Portway, Milind Tambe, Craig Western, Praveen Paruchuri, and Sarit Kraus. “Using Game Theory for Los Angeles Airport Security”. In: AI Magazine 30 (2009), pp. 43–57.
[Pru+20] Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. “Estimating Training Data Influence by Tracing Gradient Descent”. In: Proceedings of the 34th Conference on Neural Information Processing Systems. NeurIPS’20. Virtual Only: Curran Associates, Inc., 2020.
[Rad+19] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners. 2019.
[Raj21] Kumar Rajarshi. Life Expectancy (WHO). 2021. url: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who.
[Ran+21] Yingli Ran, Zhao Zhang, Shaojie Tang, and Ding-Zhu Du. “Breaking the rmax Barrier: Enhanced Approximation Algorithms for Partial Set Multicover Problem”. In: INFORMS Journal on Computing 33.2 (2021), pp. 774–784.
[Rez+23] Keivan Rezaei, Kiarash Banihashem, Atoosa Chegini, and Soheil Feizi. “Run-Off Election: Improved Provable Defense against Data Poisoning Attacks”. In: Proceedings of the 40th International Conference on Machine Learning. ICML’23. 2023. url: https://arxiv.org/abs/2302.02300.
[Ric22] Terry Richards. Ex-Hospital Worker Arrested in SGMC Data Breach. Jan. 2022. url: https://www.valdostadailytimes.com/news/local_news/ex-hospital-worker-arrested-in-sgmc-data-breach/article_7ca92b22-a2e5-5541-b3b3-38472d3706b1.html.
[Ros+20] Elan Rosenfeld, Ezra Winston, Pradeep Ravikumar, and J. Zico Kolter. “Certified Robustness to Label-Flipping Attacks via Randomized Smoothing”. In: Proceedings of the 37th International Conference on Machine Learning. Vol. 119. ICML’20. PMLR, 2020, pp. 8230–8241.
[RC93] Peter Rousseeuw and Christophe Croux. “Alternatives to the Median Absolute Deviation”. In: Journal of the American Statistical Association (1993).
[RH11] Peter J. Rousseeuw and Mia Hubert. “Robust statistics for outlier detection”. In: WIREs Data Mining and Knowledge Discovery 1.1 (2011), pp. 73–79.
[RH17] Peter J. Rousseeuw and Mia Hubert. “Anomaly Detection by Robust Statistics”. In: WIREs Data Mining and Knowledge Discovery 8.2 (Nov. 2017).
[RL87] Peter J. Rousseeuw and Annick M. Leroy. Robust Regression and Outlier Detection. USA: John Wiley & Sons, Inc., 1987. isbn: 0471852333.
[Sal+20] Ahmed Salem, Rui Wen, Michael Backes, Shiqing Ma, and Yang Zhang. “Dynamic Backdoor Attacks Against Machine Learning Models”. In: (2020). arXiv: 2003.03675 [cs.CR].
[Sch+19] Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel. “Towards the first adversarially robust neural network model on MNIST”. In: Proceedings of the 7th International Conference on Learning Representations. ICLR’19. 2019. url: https://openreview.net/forum?id=S1EHOsC9tX.
[Sha+18] Ali Shafahi, W. Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein. “Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks”. In: Proceedings of the 32nd Conference on Neural Information Processing Systems. NeurIPS’18. Montreal, Canada: Curran Associates, Inc., 2018. url: https://arxiv.org/abs/1804.00792.
[Sha+20] Harshay Shah, Kaustav Tamuly, Aditi Raghunathan, Prateek Jain, and Praneeth Netrapalli. “The Pitfalls of Simplicity Bias in Neural Networks”. In: Proceedings of the 34th Conference on Neural Information Processing Systems. NeurIPS’20. 2020.
[Shi+19] Yishuo Shi, Yingli Ran, Zhao Zhang, James Willson, Guangmo Tong, and Ding-Zhu Du. “Approximation algorithm for the partial set multi-cover problem”. In: Journal of Global Optimization 75.4 (2019), pp. 1133–1146.
[Sla97a] Petr Slavík. “A Tight Analysis of the Greedy Algorithm for Set Cover”. In: Journal of Algorithms (1997), pp. 237–254.
[Sla97b] Petr Slavík. “Improved Performance of the Greedy Algorithm for Partial Cover”. In: Information Processing Letters 64.5 (1997), pp. 251–254. issn: 0020-0190.
[Soc+13] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng, and Christopher Potts. “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank”. In: Proceedings of the 8th Conference on Empirical Methods in Natural Language Processing. EMNLP’13. 2013.
[Sor+20] Ezekiel O. Soremekun, Sakshi Udeshi, Sudipta Chattopadhyay, and Andreas Zeller. Exposing Backdoors in Robust Machine Learning Models. 2020. arXiv: 2003.00865 [cs.LG].
[SGF22] Gaurang Sriramanan, Maharshi Gor, and Soheil Feizi. “Toward Efficient Robust Training against Union of ℓp Threat Models”. In: Proceedings of the 36th Conference on Neural Information Processing Systems. NeurIPS’22. 2022. url: https://openreview.net/forum?id=6qdUJblMHqy.
[SKL17] Jacob Steinhardt, Pang Wei Koh, and Percy Liang. “Certified Defenses for Data Poisoning Attacks”. In: Proceedings of the 31st Conference on Neural Information Processing Systems. NeurIPS’17. Long Beach, California, USA: Curran Associates, Inc., 2017.
[SD20] Cecilia Summers and Michael J. Dinneen. “Four Things Everyone Should Know to Improve Batch Normalization”. In: Proceedings of the 8th International Conference on Learning Representations. ICLR’20. Virtual Only, 2020.
[TZ00] Philip H. S. Torr and Andrew Zisserman. “MLESAC: A New Robust Estimator with Application to Estimating Image Geometry”. In: Computer Vision and Image Understanding 78.1 (2000), pp. 138–156.
[TB19] Florian Tramer and Dan Boneh. “Adversarial Training and Robustness for Multiple Perturbations”. In: Proceedings of the 33rd Conference on Neural Information Processing Systems. NeurIPS’19. Curran Associates, Inc., 2019. url: https://arxiv.org/abs/1904.13000.
[Tra+20] Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. “On Adaptive Attacks to Adversarial Example Defenses”. In: Proceedings of the 34th Conference on Neural Information Processing Systems. NeurIPS’20. Curran Associates, Inc., 2020. url: https://arxiv.org/abs/2002.08347.
[TLM18] Brandon Tran, Jerry Li, and Aleksander Madry. “Spectral Signatures in Backdoor Attacks”. In: Proceedings of the 32nd Conference on Neural Information Processing Systems. NeurIPS’18. Montreal, Canada: Curran Associates, Inc., 2018.
[Tur20] Matt Turek. Artificial Intelligence Exploration Opportunity DARPA-PA-19-03-09 Reverse Engineering of Deceptions (RED) Amendment #1. United States Defense Advanced Research Projects Agency (DARPA). 2020.
[Ude+19] Sakshi Udeshi, Shanshan Peng, Gerald Woo, Lionell Loh, Louth Rawshan, and Sudipta Chattopadhyay. Model Agnostic Defence against Backdoor Attacks in Machine Learning. 2019. arXiv: 1908.02203 [cs.LG].
[Van14] Robert J. Vanderbei. Linear Programming: Foundations and Extensions. Boston, MA: Springer, 2014. isbn: 978-3-030-39414-1.
[VB20] Miguel Villarreal-Vasquez and Bharat K. Bhargava. ConFoc: Content-Focus Protection Against Trojan Attacks on Neural Networks. 2020. arXiv: 2007.00711 [cs.CV].
[Wal+21] Eric Wallace, Tony Z. Zhao, Shi Feng, and Sameer Singh. “Concealed Data Poisoning Attacks on NLP Models”. In: Proceedings of the North American Chapter of the Association for Computational Linguistics. NAACL’21. 2021.
[Wan+18] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A. Efros. Dataset Distillation. 2018. arXiv: 1811.10959 [cs].
[WF23] Wenxiao Wang and Soheil Feizi. Temporal Robustness Against Data Poisoning. 2023. arXiv: 2302.03684 [cs.LG]. url: https://arxiv.org/abs/2302.03684.
[WLF22a] Wenxiao Wang, Alexander Levine, and Soheil Feizi. “Improved Certified Defenses against Data Poisoning with (Deterministic) Finite Aggregation”. In: Proceedings of the 39th International Conference on Machine Learning. ICML’22. 2022. url: https://arxiv.org/abs/2202.02628.
[WLF22b] Wenxiao Wang, Alexander Levine, and Soheil Feizi. “Lethal Dose Conjecture on Data Poisoning”. In: Proceedings of the 36th Conference on Neural Information Processing Systems. NeurIPS’22. Curran Associates, Inc., 2022. url: https://arxiv.org/abs/2208.03309.
[Web+23] Maurice Weber, Xiaojun Xu, Bojan Karlaš, Ce Zhang, and Bo Li. “RAB: Provable Robustness Against Backdoor Attacks”. In: Proceedings of the 44th IEEE Symposium on Security and Privacy. SP’23. IEEE, 2023. url: https://arxiv.org/abs/2003.08904.
[Wei+22] Kang Wei, Jun Li, Chuan Ma, Ming Ding, Sha Wei, Fan Wu, Guihai Chen, and Thilina Ranbaduge. Vertical Federated Learning: Challenges, Methodologies and Experiments. 2022. arXiv: 2202.04309 [cs.LG]. url: https://arxiv.org/abs/2202.04309.
[Wen+18] Lily Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Luca Daniel, Duane Boning, and Inderjit Dhillon. “Towards Fast Computation of Certified Robustness for ReLU Networks”. In: Proceedings of the 35th International Conference on Machine Learning. ICML’18. PMLR, 2018. url: https://arxiv.org/abs/1804.09699.
[Wic16] Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. New York, NY: Springer-Verlag, 2016. isbn: 978-3-319-24277-4.
[XZZ20] Chang Xiao, Peilin Zhong, and Changxi Zheng. “Enhancing Adversarial Defense by k-Winners-Take-All”. In: Proceedings of the 8th International Conference on Learning Representations. ICLR’20. Virtual Only, 2020. url: https://arxiv.org/abs/1905.10510.
[Xia+15] Huang Xiao, Battista Biggio, Gavin Brown, Giorgio Fumera, Claudia Eckert, and Fabio Roli. “Is Feature Selection Secure against Training Data Poisoning?” In: Proceedings of the 32nd International Conference on Machine Learning. ICML’15. Lille, France: PMLR, 2015.
[Yeh+18] Chih-Kuan Yeh, Joon Sik Kim, Ian E.H. Yen, and Pradeep Ravikumar. “Representer Point Selection for Explaining Deep Neural Networks”. In: Proceedings of the 32nd Conference on Neural Information Processing Systems. NeurIPS’18. Montreal, Canada: Curran Associates, Inc., 2018.
[YHL23a] Wencong You, Zayd Hammoudeh, and Daniel Lowd. “Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Against Text Classifiers”. In: Proceedings of the 2nd ICML Workshop on New Frontiers in Adversarial Machine Learning. AdvML-Frontiers’23. 2023.
[YHL23b] Wencong You, Zayd Hammoudeh, and Daniel Lowd. “Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Against Text Classifiers”. In: Findings of the Association for Computational Linguistics. EMNLP’23. 2023.
[Yu+21] Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, and Tie-Yan Liu. Indiscriminate Poisoning Attacks are Shortcuts. 2021. arXiv: 2111.00898 [cs.LG].
[Zha+19] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. “Theoretically Principled Trade-off between Robustness and Accuracy”. In: Proceedings of the 36th International Conference on Machine Learning. ICML’19. PMLR, 2019. url: https://arxiv.org/abs/1901.08573.
[Zha+18] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. “mixup: Beyond Empirical Risk Minimization”. In: Proceedings of the 6th International Conference on Learning Representations. ICLR’18. 2018.
[ZZ22] Rui Zhang and Shihua Zhang. “Rethinking Influence Functions of Neural Networks in the Over-Parameterized Regime”. In: Proceedings of the 36th AAAI Conference on Artificial Intelligence. AAAI’22. Vancouver, Canada: Association for the Advancement of Artificial Intelligence, 2022.
[Zhu+19] Chen Zhu, W. Ronny Huang, Ali Shafahi, Hengduo Li, Gavin Taylor, Christoph Studer, and Tom Goldstein. “Transferable Clean-Label Poisoning Attacks on Deep Neural Nets”. In: Proceedings of the 36th International Conference on Machine Learning. ICML’19. Long Beach, CA: PMLR, 2019.
[Zhu+21] Liuwan Zhu, Rui Ning, Chunsheng Xin, Chonggang Wang, and Hongyi Wu. “CLEAR: Clean-up Sample-Targeted Backdoor in Neural Networks”. In: Proceedings of the 18th International Conference on Computer Vision. ICCV’21. 2021.