ROBUSTNESS OF NEURAL NETWORKS FOR DISCRETE INPUT: AN ADVERSARIAL PERSPECTIVE by JAVID EBRAHIMI A DISSERTATION Presented to the Department of Computer and Information Science and the Graduate School of the University of Oregon in partial fulfillment of the requirements for the degree of Doctor of Philosophy December 2018 DISSERTATION APPROVAL PAGE Student: Javid Ebrahimi Title: Robustness of Neural Networks for Discrete Input: An Adversarial Perspective This dissertation has been accepted and approved in partial fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Computer and Information Science by: Daniel Lowd Co-Chair Dejing Dou Co-Chair Reza Rejaie Core Member Yashar Ahmadian Institutional Representative and Janet Woodruff-Borden Vice Provost and Dean of the Graduate School Original approval signatures are on file with the University of Oregon Graduate School. Degree awarded December 2018 ii c© 2018 Javid Ebrahimi All rights reserved. iii DISSERTATION ABSTRACT Javid Ebrahimi Doctor of Philosophy Department of Computer and Information Science December 2018 Title: Robustness of Neural Networks for Discrete Input: An Adversarial Perspective In the past few years, evaluating on adversarial examples has become a standard procedure to measure robustness of deep learning models. Literature on adversarial examples for neural nets has largely focused on image data, which are represented as points in continuous space. However, a vast proportion of machine learning models operate on discrete input, and thus demand a similar rigor in understanding their vulnerabilities and robustness. We study robustness of neural network architectures for textual and graph inputs, through the lens of adversarial input perturbations. We will cover methods for both attacks and defense; we will focus on 1) addressing challenges in optimization for creating adversarial perturbations for discrete data; 2) evaluating and contrasting white-box and black- box adversarial examples; and 3) proposing efficient methods to make the models robust against adversarial attacks. This dissertation includes previously published co-authored material. iv CURRICULUM VITAE NAME OF AUTHOR: Javid Ebrahimi GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED: University of Oregon, Eugene, OR, USA Tarbiat Modares University, Tehran, Iran University of Tehran, Tehran, Iran DEGREES AWARDED: Doctor of Philosophy, Computer Science, 2018, University of Oregon Master of Science, Computer Science, 2018, University of Oregon Master of Science, Software Engineering, 2012, Tarbiat Modares University Bachelor of Science, Software Engineering, 2008, University of Tehran AREAS OF SPECIAL INTEREST: Deep Learning Adversarial Machine Learning Natural Language Processing Relational Learning PROFESSIONAL EXPERIENCE: Research Assistant, Fall 2014-Fall 2018 Teaching Assistant, Fall 2013-Spring 2014 GRANTS, AWARDS AND HONORS: SIGIR Travel Grant, 2016 Best Graduate Teaching Fellow, Department of Computer and Information Science, 2013 v PUBLICATIONS: Ebrahimi J., Lowd, D. & Dou, D. (2018). On Adversarial Examples for Character-Level Neural Machine Translation. In Proceedings of 27th International Conference on Computational Linguistics. Ebrahimi J., Rao, A., Lowd D. & Dou, D. (2018). HotFlip: White-Box Adversarial Examples for Text Classification. (short) In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Ebrahimi J., Dou, D.& Lowd D. (2016). 
Weakly Supervised Tweet Stance Classification by Relational Bootstrapping. (short) In Proceedings of Conference on Empirical Methods in Natural Language Processing. Ebrahimi J., Dou, D.& Lowd D. (2016). A Joint Sentiment-Target- Stance Model for Stance Classification in Tweets. In Proceedings of 26th International Conference on Computational Linguistics. Nhathai P., Ebrahimi J., Kil D., Piniewski B. & Dou D. (2016) Topic- Aware Physical Activity Propagation in a Health Social Network In IEEE Intelligent Systems. Ebrahimi J., Nhathai P., Kil D., Piniewski B. & Dou D. (2016) Characterizing Physical Activity in a Health Social Network. In Proceedings of the 6th International Conference on Digital Health Ebrahimi, J. & Dou, D. (2015) Chain-Based RNN for Relation Classification. (short) In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. vi ACKNOWLEDGEMENTS I would like to express my gratitude and appreciation to my advisors, Dejing Dou and Daniel Lowd, for their support and supervision throughout the development of this thesis. I would like to thank other members on my committee, Reza Rejaie and Yashar Ahmadian, for their perspective and our discussions. I am forever indebted to my parents for everything they have done for me, and I am grateful to my siblings for their unconditional love and never-ending support. Many thanks to my collaborators Hai Phan and Anyi Rao; my friends and lab members Sabin Kafle, Nisansa De Silva, and Amnay Amimeur; Hank Childs for his generosity with his lab’s computing resources; and my friend and editor, Ellen Klowden. This work was funded by ARO grant W911NF-15-1-0265. vii In memory of my grandfather viii TABLE OF CONTENTS Chapter Page I. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . 1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Adversarial Attacks . . . . . . . . . . . . . . . . . . . . . 3 Adversarial Defense . . . . . . . . . . . . . . . . . . . . . 4 Adversarial Examples for Discrete Input . . . . . . . . . . . . . . 6 Adversarial Examples for NLP . . . . . . . . . . . . . . . . . . 7 Black-Box vs. White-Box Methods for NLP . . . . . . . . . . . . 8 Adversarial Examples for Graphs . . . . . . . . . . . . . . . . . 9 Robust Graph Neural Nets . . . . . . . . . . . . . . . . . . . . 9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 II. RELATED WORK . . . . . . . . . . . . . . . . . . . . . . . . 12 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Adversarial Attacks . . . . . . . . . . . . . . . . . . . . . . . 13 Linear Attacks . . . . . . . . . . . . . . . . . . . . . . . 14 One-Shot Attack . . . . . . . . . . . . . . . . . . . . . 14 Iterative Attack . . . . . . . . . . . . . . . . . . . . . 15 Nonlinear Attacks . . . . . . . . . . . . . . . . . . . . . . 16 Adversarial Example Generation with GANs . . . . . . . . . 17 Black-Box Attacks . . . . . . . . . . . . . . . . . . . . . . 18 Our Approach to Create Adversarial Examples . . . . . . . . . 19 ix Chapter Page Defense . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 Gradient Masking . . . . . . . . . . . . . . . . . . . . . . 21 Adversarial Example Detection . . . . . . . . . . . . . . . . 22 Adversarial Training . . . . . . . . . . . . . . . . . . . . . 22 Robust Optimization . . . . . . . . . . . . . . . . . . . . . 23 Our Approach to Defend Against Adversarial Attacks . . . . . . 24 Summary . . . . . . . . . . . . . . . . . . . 
. . . . . . . . 25 Attacks on Continuous vs. Discrete Spaces . . . . . . . . . . . 25 Targeted vs. Untargeted Attacks for Machine Translation . . . . . 26 Black-Box Attacks vs. White-Box Attacks for NLP . . . . . . . 26 Defense for Graph Neural Nets . . . . . . . . . . . . . . . . 26 III.ADVERSARIAL EXAMPLES FOR TEXT . . . . . . . . . . . . . . 28 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 28 White-Box Adversarial Examples . . . . . . . . . . . . . . . . . 28 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Text Representation . . . . . . . . . . . . . . . . . . . . . 31 HotFlip: Differentiable String-Edit Operations . . . . . . . . . . 31 Multiple Changes . . . . . . . . . . . . . . . . . . . . . . 33 Beam Search . . . . . . . . . . . . . . . . . . . . . . . 33 Embeddings Under Adversarial Noise . . . . . . . . . . . . . . . 35 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Text Classification . . . . . . . . . . . . . . . . . . . . . . 37 Machine Translation . . . . . . . . . . . . . . . . . . . . . 40 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 41 x Chapter Page Transferability of Adversarial Examples . . . . . . . . . . . . . 43 Human Perception . . . . . . . . . . . . . . . . . . . . . . 44 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 IV.BLACK-BOX VS. WHITE-BOX ADVERSARIES FOR NLP . . . . . . 48 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Controlled and Targeted Attacks . . . . . . . . . . . . . . . . . 49 Multiple Changes . . . . . . . . . . . . . . . . . . . . . . 51 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Analysis of Adversaries . . . . . . . . . . . . . . . . . . . . 54 Untargeted Attack . . . . . . . . . . . . . . . . . . . . . . 54 Controlled Attack . . . . . . . . . . . . . . . . . . . . . . 55 Targeted Attack . . . . . . . . . . . . . . . . . . . . . . . 57 Some Adversarial Examples . . . . . . . . . . . . . . . . . . 58 Robustness to Adversarial Examples . . . . . . . . . . . . . . . 60 Adversarial Training . . . . . . . . . . . . . . . . . . . . . 60 Machine Translation . . . . . . . . . . . . . . . . . . . . . 60 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 63 Text Classification . . . . . . . . . . . . . . . . . . . . . . 64 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 V. ADVERSARIAL EXAMPLES FOR GRAPHS . . . . . . . . . . . . 68 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Transductive Case . . . . . . . . . . . . . . . . . . . . . . 69 xi Chapter Page Maximal Independent Adversarial Set (MIAS) . . . . . . . . 71 Inductive Case . . . . . . . . . . . . . . . . . . . . . . . 71 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Transductive Classification . . . . . . . . . . . . . . . . . . 72 Inductive Classification . . . . . . . . . . . . . . . . . . . . 74 Attacks on a Recommender System . . . . . . . . . . . . . . 76 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 VI.ROBUST GRAPH NEURAL NETWORKS . . . . . . . . . . . . . . 80 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 80 Edge-Gated Message Passing . . . . . . . . . . . . . . . . . . . 80 Defense . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 Link-Based Model . . . . . . . . . . . . . . . . . . . . . . 83 Feature-Based Model . . . . . . . . . . . . . . . . . . . . . 
83 Robustness to Norm-Bounded Perturbations on Neighbors . . . . 85 Jacobian Regularization . . . . . . . . . . . . . . . . . . . 88 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 90 Gated Message Passing . . . . . . . . . . . . . . . . . . . . 90 Transductive Experiments . . . . . . . . . . . . . . . . . 91 Inductive Experiments . . . . . . . . . . . . . . . . . . 93 Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 Hyper-parameters . . . . . . . . . . . . . . . . . . . . . . 94 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 Gradient Masking Is Not an Issue . . . . . . . . . . . . . . . 95 Inverted Word Attack . . . . . . . . . . . . . . . . . . . . 97 xii Chapter Page Models’ Behavior . . . . . . . . . . . . . . . . . . . . . . 99 Impact of Random Neighbors . . . . . . . . . . . . . . . . . 100 Impact of Covariance-Based Transformation . . . . . . . . . . . 102 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 VII.CONCLUSION AND FUTURE DIRECTIONS . . . . . . . . . . . . 104 Summary of Contributions . . . . . . . . . . . . . . . . . . . . 104 Attacking a Model with Discrete Input . . . . . . . . . . . . . 104 Black-Box vs. White-Box Adversary for NLP . . . . . . . . . . 104 Improved Message Passing in Graph Neural Nets . . . . . . . . 105 Robust GraphSage . . . . . . . . . . . . . . . . . . . . . . 105 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . 105 Theoretical Investigation of Black-Box Examples . . . . . . . . . 105 GANs for Defense . . . . . . . . . . . . . . . . . . . . . . 106 Robust Defense for NLP . . . . . . . . . . . . . . . . . . . 106 Data-Poisoning Attacks/Defense for Graphs . . . . . . . . . . . 107 APPENDICES A. DEEP LEARNING BACKGROUND . . . . . . . . . . . . . . . . 108 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . 108 Feed-Forward Neural Network . . . . . . . . . . . . . . . . . 108 Word Embeddings . . . . . . . . . . . . . . . . . . . . . . 109 Highway Network . . . . . . . . . . . . . . . . . . . . . . 110 Recurrent Neural Nets . . . . . . . . . . . . . . . . . . . . 110 Long Short-Term Memory (LSTM) . . . . . . . . . . . . . . 111 xiii Chapter Page Sequence-to-Sequence . . . . . . . . . . . . . . . . . . . . 112 Character-Level NLP Models . . . . . . . . . . . . . . . . . . . 113 CharCNN-LSTM Architecture . . . . . . . . . . . . . . . . . 114 Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . 115 Graph Convolutional Neural Networks . . . . . . . . . . . . . 118 Sample and Aggregate . . . . . . . . . . . . . . . . . . . . 119 B. ROBUST GRAPH NEURAL NETS . . . . . . . . . . . . . . . . 121 Loopy-BP-Inspired Message Passing . . . . . . . . . . . . . . . . 121 Robust Optimization for GraphSage . . . . . . . . . . . . . . . . 122 Jacobian Regularization with Layer Normalization . . . . . . . . . 125 C. NOTATION AND SYMBOLS . . . . . . . . . . . . . . . . . . . 127 REFERENCES CITED . . . . . . . . . . . . . . . . . . . . . . . . 128 xiv LIST OF FIGURES Figure Page 1. Comparing the HotFlip direction and a random direction based on the average squared distance between the embedding of the original word, and the embedding of the modified word, found from the outputs of the CNN and highway layers. . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2. Adversary’s success rate as a function of confidence for classification. . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3. 
Adversary’s success rate as a function of the decrease in BLEU score for EN→DE translation. Similar pattern exists for DE→EN translation. . . . . . . . . . . . . . . . . . . . . . 41 4. Proportion of types of changes that the adversary chooses. . . . . . . . 43 5. Performance of humans with Mechanical Turk experiments. . . . . . . 45 6. Illustration of a goal-based attack. The fifth word of the translation will be either removed or replaced. The loss over the rest of the words in the translation are not involved in creating adversarial examples. . . . . . . . . . . . . . . . . . . . 51 7. Comparing the distribution of the true increase in loss and its gradient-based estimate, and their correlation, using best flips for each word in a sentence of the German test set. . . . . . . . . 55 8. Success rate of white-box and black-box adversaries in a controlled setting as a function of α. . . . . . . . . . . . . . . . . 57 9. Success rate of white-box and black-box adversaries in the second-most-likely targeted attack as a function of α. . . . . . . . . . 58 10. Distribution of types of noise in the natural noise corpora. . . . . . . . 61 11. Training loss on adversarial examples for FIDS-W. . . . . . . . . . . . 63 12. BLEU score on Nat and Key datasets using methods which do not use either type of noise in training. . . . . . . . . . . . . . . 65 xv Figure Page 13. Average number of flips committed by the adversary to attack each model, as a function of document length. . . . . . . . . . 66 14. Decrease of accuracy under adversarial edge perturbation. . . . . . . . 73 15. Micro-F1 score under adversarial edge perturbation. . . . . . . . . . 75 16. Adversary’s success in increasing the RMSE error for three collaborative filtering models. The x axis shows the percentage of manipulated ratings. . . . . . . . . . . . . . . . . . 77 17. Jaccard index for two pairs of sets: top-k popular items vs. top-k affected items, and top-k active users vs. top-k affected users. . . . . . . . . . . . . . . . . . . . . . . . . . . 78 18. Comparing gated message passing, on the right, with the vanilla message passing on the left. Gated message passing will give more importance to some of the messages, as denoted by bolder edges. . . . . . . . . . . . . . . . . . . . . . 82 19. Instead of adding an adversarial edge from node 0 to node 1, we can we can add an edge to any data point within the convex hull. . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 20. Loss on validation set for PPI dataset using vanilla and gated message passing. . . . . . . . . . . . . . . . . . . . . . . 91 21. Binary classification accuracy using GCN; edges are gradually added to a random graph. . . . . . . . . . . . . . . . . 92 22. Increase/Decrease in error from white-box to black-box attack on the PPI dataset. . . . . . . . . . . . . . . . . . . . . . 96 23. Distribution of gradients for a non-robust model and Jacobian-based robust model. . . . . . . . . . . . . . . . . . . . 97 24. Comparison of defense against inverted word attack on Pubmed. . . . . 98 25. Variance of the weights of the first-layer aggregator on Pubmed. . . . . 100 26. Ranking-based loss for the link-based method on Pubmed. . . . . . . . 101 27. Alignment of edge-perturbation with class distribution for Pubmed. The x-axis denotes the label of the adversarial node, and the y-axis denotes the label of the original node. . . . . . . 101 xvi Figure Page 28. 
Robustness of adversarially trained models on the PPI dataset using PGD and PGD with random starting points for 20% of neighbors. . . . . . . . . . . . . . . . . . . . . . . . 102 29. Robustness of adversarially trained models on the Reddit dataset using feature-based perturbations and covariance- transformed feature-based perturbations. . . . . . . . . . . . . . . 103 A.30.Neural network with four input neurons, one hidden layer of six units, and 1 output neuron. . . . . . . . . . . . . . . . . . . . 109 A.31.A recurrent neural network . . . . . . . . . . . . . . . . . . . . 111 A.32.A Simple sequence-to-sequence model . . . . . . . . . . . . . . . . 113 A.33.The CharCNN-LSTM architecture used for studying two tasks of text classification and translation. . . . . . . . . . . . . . . 115 A.34.Illustration of transductive node classification (left) and inductive node classification (right). The white nodes are the test nodes, which are to be classified to one of the two classes colored green and gray. . . . . . . . . . . . . . . . . . . . 118 A.35.Message Passing in GCN or GraphSage, using two-hop neighbors (blue) and one-hop neighbors (red) to the target node (green). In GraphSage not all neighbors contribute to the representation of the target node. For instance, the white node does not contribute to learning the representation of the target node. . . . . . . . . . . . . . . . . . . 120 B.36.A graphical representation contrasting standard message passing among immediate neighbors and loopy-bp-inspired message passing. Nodes from lower layers send messages to the higher ones. . . . . . . . . . . . . . . . . . . . . . . . . . 121 xvii LIST OF TABLES Table Page 1. Successful and unsuccessful adversarial examples. . . . . . . . . . . . 31 2. Comparing beam-search strategies, based on the number of changes (i.e., 1,2, 3 or more). We report the proportion of successfully created adversarial examples and the average time, measured in seconds, spent for each number of change. . . . . . . 35 3. Nearest neighbor words (based on cosine similarity) of word representations from CharCNN-LSTM picked after the highway layers. A single adversarial change in the word often results in a big change in the embedding, which would make the word more similar to other words, rather than to the original word. . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4. Adversarial examples which are mislcassified by the model. The correct label is shown on top of each block. . . . . . . . . . . . . 39 5. Evaluation of different adversarial changes on AG’s news dataset for sentences with up to 25 words. . . . . . . . . . . . . . . 40 6. Adversarial examples for DE→EN translations. . . . . . . . . . . . . 42 7. Misclassification error of Char-CNN model on adversarial examples generated by CharCNN-LSTM and random changes. . . . . . 44 8. Adversarial examples imported from CharCNN-LSTM, which trick CharCNN too. . . . . . . . . . . . . . . . . . . . . . 44 9. Error analysis of human annotation. . . . . . . . . . . . . . . . . . 46 10. Controlled and Targeted Attack on DE→EN NMT. In the first example, the adversary wants to suppress a person’s name, and in the second example, the adversary wants to replace occurrences of therapist with psychopath . . . . . . . . . . . 49 11. Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . 54 12. BLEU score after greedy decoding in the existence of different types of untargeted attacks. . . . . . . . . . . . . . . . . 56 xviii Table Page 13. Efficiency of attacks. . . . . 
. . . . . . . . . . . . . . . . . . . 57 14. A controlled attack and seven targeted attacks on our DE- EN NMT. The first example shows a controlled attack; the rest show a second-most-likely, except for the last one which shows 100th-most likely targeted attack. . . . . . . . . . . . . . . . 59 15. BLEU score of models on clean and adversarial examples, using a decoder with beam size of 4. The best result on each test set is shown in bold. FIDS-W performs best on all noisy test sets, compared with models which have not been trained on that particular noise (shown in red). FIDS-B performs best on white-box adversarial examples compared with other black-box trained models (shown in blue). . . . . . . . . . . . . . . 62 16. Dataset statistics . . . . . . . . . . . . . . . . . . . . . . . . . 74 17. Datasets statistics . . . . . . . . . . . . . . . . . . . . . . . . . 74 18. Accuracy of the models with one noisy feature per test instance. . . . . 74 19. Dataset statistics . . . . . . . . . . . . . . . . . . . . . . . . . 75 20. Movies that appear in the top-20 popular items and top-20 items which are affected by the adversary. . . . . . . . . . . . . . . 78 21. Dataset statistics . . . . . . . . . . . . . . . . . . . . . . . . . 90 22. Accuracy on Citation Networks . . . . . . . . . . . . . . . . . . . 93 23. Micro-F1 on three datasets. . . . . . . . . . . . . . . . . . . . . 94 24. Hyper-parameters for each of the robust baselines. . . . . . . . . . . 95 25. Micro-F1 score under white-box attacks on vanilla models (mean, attention, and gated) and our robust baselines (gatedLink, gatedFea, gatedJac, PGD, and convex), using 0.01, 0.05, 0.10, and 0.15 edge replacement strategies. . . . . . . . . . 96 26. Micro-F1 score under black-box attacks using 0.01, 0.05, 0.10, and 0.15 edge replacements from the mean model. . . . . . . . . 97 27. Models’ sensitivity to attacks. . . . . . . . . . . . . . . . . . . . 106 xix CHAPTER I INTRODUCTION Affordable technologies an ever-increasing number of apps have made machine learning algorithms an indispensable part of our life. While since the early days of AI there have been logical and philosophical discussions of dangers of machine intelligence, the dangers of machine stupidity are often disregarded. While auto- correct errors are innocuous, a mistake by an autonomous car can be deadly. While a spam in our inbox is annoying, a fraud detection failure could cost a business millions of dollars. Robustness of machine learning algorithms has been a subject of interest for more than a decade. It is common to devise models, which are robust to worst- case perturbation within some defined bounds. Examples which are created using such perturbations are called adversarial examples. Adversarial examples are powerful tools to investigate the vulnerabilities of a machine learning model (Szegedy et al., 2014). These examples are inputs to a predictive machine learning model that are maliciously designed to cause poor performance (I. J. Goodfellow, Shlens, & Szegedy, 2015). Robust machine learning has a long history in machine learning, going back to adversarial attacks on linear spam classifiers (Dalvi, Domingos, Sanghai, Verma, et al., 2004; Lowd & Meek, 2005). This line of research has recently received a lot of attention in the deep learning community, partly because adversarial examples undermine our confidence in the promise of deep learning. 
Specifically, adversarial examples expose weaknesses in models' generalizability: the ability of the model to generalize to unseen data. Let us proceed with an example: Last year, a mistranslation by Facebook's machine translator (MT) led to someone's wrongful arrest (Berger, 2017). Instead of translating an Arabic phrase to "Good Morning", Facebook's MT translated it to "Murder Them". Arabic is a morphologically-rich language, and the MT mistook the input word for another which differs from the input by only one character. As machine translation is used more and more, it is increasingly important to understand its worst-case failures to prevent incidents like this. Furthermore, by studying the robustness of MT models, we can create models which are robust to more benign but still noisy data (e.g., typos).

The security threats of adversarial examples for textual and graph-based data cannot be underestimated; for instance, spammers have been modifying their emails to evade spam filters since the late '90s, and they engage in link farming to get higher ranks in social media or search engines (Ghosh et al., 2012). Nevertheless, literature on adversarial examples for neural nets has largely focused on image data, which are represented as points in continuous space. We will focus on the much-needed robustness of neural networks on discrete data, in particular text and graphs. In this work, we first develop methods to attack neural models that operate on discrete input, and then devise methods to defend against those attacks. We study text classification and machine translation in NLP, and transductive and inductive node classification in graphs.

Background

It was found in (Szegedy et al., 2014) that state-of-the-art image recognition systems can easily get tricked by adding small human-imperceptible noise to the image. This can have severe consequences not only from a business perspective, but also for the safety threats it can pose if used in applications such as autonomous driving. Adversarial examples, which are created by maliciously manipulating an instance, are designed to attack a machine learning model at test time. The majority of the previous work on adversarial examples focuses on image classification (I. J. Goodfellow et al., 2015; Nguyen, Yosinski, & Clune, 2015; Papernot, McDaniel, Jha, et al., 2016; Szegedy et al., 2014), and more recently on malware classification (Grosse, Papernot, Manoharan, Backes, & McDaniel, 2017), NLP (Belinkov & Bisk, 2018; Jia & Liang, 2017; Zhao, Dua, & Singh, 2018), and image segmentation (Xie et al., 2017).

Adversarial Attacks. Adversarial examples are created in black-box and white-box settings. In the former, the adversary does not have access to model parameters, and can only query the system; these queries can be used to create a substitute model of the attacked model. In the latter, the adversary has access to model parameters, and in the case of differentiable models can use the gradients to attack the model. Due to stronger assumptions in the white-box setting, the error incurred by such adversaries can be regarded as an upper bound for black-box ones. Due to the difficulty of discrete optimization, most works on adversarial NLP have focused on black-box attacks. Our work focuses on creating white-box adversarial examples for NLP, which are adaptable to other discrete inputs, such as graphs.
While there are some recent developments in creating unrestricted adversarial examples (Song, Shu, Kushman, & Ermon, 2018), most of the literature focuses on creating adversarial examples within a norm-bounded ball around inputs, and on making models robust against such attacks. Creating adversarial examples requires solving a non-convex optimization problem. Concretely, given the classifier g and the input x in the constrained set C, the goal is to find the optimal perturbation δ̂ such that the classifier makes a different prediction than on the original input:

$$\min_{\delta} \; \|\delta\|_2 \quad \text{subject to} \quad g(x+\delta) \neq g(x), \quad x+\delta \in C \qquad (1.1)$$

This problem was first tackled by a nonlinear optimization method (Szegedy et al., 2014). Other methods include one-shot gradient descent (I. J. Goodfellow et al., 2015) and iterative projected gradient descent (Madry, Makelov, Schmidt, Tsipras, & Vladu, 2018), which use a linear approximation of the loss function (i.e., a first-order Taylor expansion). Our work also follows this approach of linearly approximating the loss function.

Adversarial examples can be untargeted or targeted; in the former, the adversary simply aims to inflict loss and force the model to produce an output different from the correct output, while in the latter, the adversary wants to force the model to produce a desired output. Most of the literature on image classification reports results on both types of attacks. On the other hand, none of the NLP works have put forth a quantitative analysis of targeted and untargeted adversarial examples. We investigate two new attack scenarios for machine translation, and propose a new framework for creating and evaluating targeted adversarial examples for machine translation.

Adversarial Defense. One method for defense is adversarial training, which interleaves training with the generation of adversarial examples (I. J. Goodfellow et al., 2015). Concretely, after every iteration of training, adversarial examples are created and added to the mini-batches. Virtual adversarial training (Miyato, Maeda, Koyama, Nakae, & Ishii, 2016) is another regularization method, which aims to minimize the KL divergence between predictions on the examples and their adversarial counterparts, for semi-supervised tasks wherein we have a limited number of labeled examples and many unlabeled examples.

A robust model aims to fit model parameters given such adversarial examples. Concretely, the optimization problem we will need to solve is the following saddle-point problem:

$$\min_{\theta} \max_{\delta} \; J_{\theta}(x+\delta, y) \quad \text{s.t.} \quad \|\delta\|_p \leq \epsilon \qquad (1.2)$$

where J is the loss function, x is the input, y is the label, δ is the perturbation vector, p is the type of norm-bounded perturbation, ε is the magnitude of noise, and θ denotes the model parameters. A projected gradient-based approach for adversarial training by Madry et al. (2018) has proved to be one of the most effective defense mechanisms against adversarial attacks for image classification. The difference between this approach and the one proposed by I. J. Goodfellow et al. (2015) is the lack of clean examples in training and its closer connection with robust optimization for a non-convex problem.
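To make the saddle-point objective in Eq. 1.2 and the projected-gradient approach concrete, the following is a minimal sketch of adversarial training with a PGD-style inner maximization under an ℓ∞ budget. It assumes a generic differentiable PyTorch classifier `model` and illustrative hyper-parameters; it is a sketch of the general recipe, not the exact setup used in this dissertation.

```python
import torch
import torch.nn.functional as F

def pgd_perturb(model, x, y, eps=0.03, alpha=0.01, steps=10):
    """Approximate the inner maximization of Eq. 1.2 under an l-infinity budget."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()  # also populates parameter grads; they are cleared below
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascend along the gradient sign
            delta.clamp_(-eps, eps)              # project back onto the eps-ball
        delta.grad.zero_()
    return delta.detach()

def adversarial_training_step(model, optimizer, x, y, eps=0.03):
    """One outer minimization step: train only on the perturbed batch (no clean examples)."""
    delta = pgd_perturb(model, x, y, eps=eps)
    optimizer.zero_grad()                        # discard grads accumulated by the attack
    loss = F.cross_entropy(model(x + delta), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Running `adversarial_training_step` on every mini-batch instantiates the min-max objective directly: the inner loop approximates the maximization over δ, and the optimizer step performs the minimization over θ.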
Some other methods to defend against adversarial examples use regularization techniques, such as distillation (Papernot, McDaniel, Wu, Jha, & Swami, 2016) and saturating networks (Nayebi & Ganguli, 2017), which often give a misleading sense of security. Concretely, these regularization techniques will often lead to gradient obfuscation, which produces unstable gradients for a white-box attacker (Athalye, Carlini, & Wagner, 2018). Various tricks can avoid this instability and render these ostensibly secured models vulnerable (Carlini & Wagner, 2017b). Another approach to defend against adversarial examples is to detect them at test time, by augmenting the data or model to have a built-in capability to detect adversarial examples (Grosse, Manoharan, Papernot, Backes, & McDaniel, 2017; Metzen, Genewein, Fischer, & Bischoff, 2017).

Recent research in defending against adversarial examples focuses on convex relaxations of neural networks (Kolter & Wong, 2017; Raghunathan, Steinhardt, & Liang, 2018), and on providing provable guarantees for the robustness of neural models. The state-of-the-art provable defense (Kolter & Wong, 2017) uses relaxations of ReLU activation functions (Nair & Hinton, 2010), which were first used for neural net verification (Ehlers, 2017). They pose the maximization problem in Eq. 1.2 as a linear program. Interestingly, they show that the dual of the linear program can also be represented by a neural network. We mainly use adversarial training for defense, but we will also propose methods for the robustness of graph neural nets, which detect adversarial edges or perform robust optimization through a convex neural net.

Adversarial Examples for Discrete Input

While gradient ascent on the input surface can approximate the increase in loss for continuous data, such an approximation is not straightforward for discrete input. Concretely, a first-order approximation of an adversarial point in continuous space is the following:

$$J_{\theta}(x+\delta) \approx J_{\theta}(x) + \langle \delta, \nabla_x J_{\theta}(x) \rangle \qquad (1.3)$$

where ⟨·,·⟩ is the dot-product operator. An optimal unconstrained perturbation vector for continuous data would be

$$\delta = \epsilon \, \nabla_x J_{\theta}(x) \qquad (1.4)$$

where ε is a tuning parameter, similar to the learning rate used in parameter estimation in machine learning. However, such a δ cannot be used for discrete input, as x + δ is almost surely an invalid point in discrete space. Thus, we have to define methods which use gradients to create valid adversarial examples that increase the loss. In the first part of our research, we will focus on creating adversarial examples for discrete input. For both graphs and text, we manipulate discrete input to attack a model, in a white-box setting, using differentiable discrete operations.

Adversarial Examples for NLP

The need to understand vulnerabilities of NLP systems is only growing. Companies such as Google are using text classifiers to detect abusive language1, and concerns are increasing over deception (Zubiaga, Liakata, Procter, Hoi, & Tolmie, 2016) and safety (Chancellor, Pater, Clear, Gilbert, & De Choudhury, 2016) in social media. In all of these cases, we need to better understand the dynamics of how NLP models make mistakes on unusual inputs, in order to improve accuracy, increase robustness, and maintain security or privacy. So far, existing works on adversarial examples for NLP have mostly avoided discrete-level adversarial examples, due to the difficulty of optimization (Miyato, Dai, & Goodfellow, 2017). The first work on crafting adversarial examples for NLP was proposed in (Papernot, McDaniel, Swami, & Harang, 2016). They added noise to the word embeddings and searched the neighborhood of the new embedding in order to replace the original word.
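As a rough illustration of this style of embedding-space attack (a hypothetical sketch, not the exact procedure of Papernot et al.): perturb a word's embedding along the loss gradient and snap the result to the nearest other word in the vocabulary. The embedding matrix `E`, the gradient vector, and the step size are placeholders.

```python
import numpy as np

def nearest_word_substitute(E, word_id, grad_wrt_embedding, step=1.0):
    """Move the embedding of `word_id` along the loss gradient, then return the
    vocabulary word closest (in Euclidean distance) to the perturbed embedding."""
    perturbed = E[word_id] + step * grad_wrt_embedding
    dists = np.linalg.norm(E - perturbed, axis=1)
    dists[word_id] = np.inf          # disallow mapping back to the original word
    return int(np.argmin(dists))

# Toy usage with a random 1000-word, 50-dimensional embedding table.
rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 50))
fake_grad = rng.normal(size=50)      # stands in for dJ/d(embedding of word 17)
print(nearest_word_substitute(E, 17, fake_grad))
```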
While their adversary is able to trick the classifier, the word-level changes tend to produce nonsensical sentences (e.g., “I wouldn’t rent this...” → “Excellent wouldn’t rent this...”). Jia and Liang (2017) add distracting sentences to the end of paragraphs to fool deep reading comprehension systems, which are polished by crowdsourcing. They show how state-of-the-art text comprehension systems are matching patterns rather than understanding text. Hosseini, Kannan, Zhang, and Poovendran (2017) showed that simple modifications, such as adding spaces or dots between characters, can drastically change the toxicity score of the API. Adding noise to the continuous word embeddings (Miyato et al., 2017; Papernot, McDaniel, Swami, & Harang, 2016), random character changes (Belinkov & Bisk, 2018), rule-based semantic preserving adversaries (Iyyer, Wieting, Gimpel, & 1https://www.perspectiveapi.com 7 Zettlemoyer, 2018; Ribeiro, Singh, & Guestrin, 2018), and task-based heuristics (Hosseini et al., 2017; Jia & Liang, 2017) comprise the set of approaches to create adversarial examples for NLP. None of the existing works use gradients of the model to create adversarial examples. We will use gradient-based optimization to create white-box adversarial examples to investigate worst-case failures of deep NLP models. Black-Box vs. White-Box Methods for NLP Black-box attacks (Belinkov & Bisk, 2018; Jia & Liang, 2017; Ribeiro et al., 2018) often rely on task-related knowledge to create adversarial examples. In contrast, white-box attacks approximate the worst-case attack on text, within some allowed set of perturbations, in a general setting, regardless of the task at hand. In addition, white-box attacks can demonstrate and defend against a model’s most serious vulnerabilities, which might not be discovered by black-box methods. We will demonstrate that a white-box-trained model is stronger than a black-box-trained model in defending against a variety of types of noise. We focus on the task of machine translation, and we propose controlled and targeted adversaries which create adversarial examples with other goals, instead of merely decreasing the BLEU score (Papineni, Roukos, Ward, & Zhu, 2002). A controlled adversary aims to mute a word in the original translation, while a targeted adversary aims to push a word into it. We will penalize an adversary for ad- hoc manipulations, which could cause the MT to generate a radically different and possibly gibberish translation. Using these new scenarios, we will contrast black-box and white-box attacks, and demonstrate that a white-box adversary can achieve more difficult goals. 8 Adversarial Examples for Graphs Security of relational models has been an important subject of study for many years. Most notably, spammers engage in link farming to get higher ranks in social media or search engines (Ghosh et al., 2012). Another notable example of attacks on graph-based models is Sybil attacks (Douceur, 2002), in which a malicious user creates fake accounts to increase the power of a single user (Yang et al., 2014). We will devise relational adversarial perturbations to attack graph convolutional network (GCN) (Kipf & Welling, 2017) and GraphSage (Hamilton, Ying, & Leskovec, 2017) recently proposed neural models for graph-based learning. We focus on the task of node classification, wherein given labels for some of the nodes in the graph, the model will learn to predict labels for the unlabeled nodes in the graph. 
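For reference, a single propagation layer of a GCN (Kipf & Welling, 2017) computes H′ = σ(D̃^(−1/2) Ã D̃^(−1/2) H W), where Ã is the adjacency matrix with added self-loops and D̃ its degree matrix. A minimal NumPy sketch with a toy graph follows; the weight matrix, features, and activation are placeholders.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: symmetrically normalized adjacency times features."""
    A_tilde = A + np.eye(A.shape[0])              # add self-loops
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)         # ReLU non-linearity

# Toy graph: 4 nodes, 3 input features, 2 hidden units.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
H = np.random.default_rng(0).normal(size=(4, 3))
W = np.random.default_rng(1).normal(size=(3, 2))
print(gcn_layer(A, H, W).shape)   # (4, 2)
```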
Node classification can be studied in transductive and inductive settings; in the former the model has access to the whole graph, including the test nodes, and uses that as extra knowledge. In the latter, the model does not see test nodes in training, and will only add the test nodes to the graph at test time. The latter setting has more applications and is more realistic; nodes are being added to real-world graphs constantly, and having access to all the nodes for training is a big limitation. A concurrent work (Dai et al., 2018) has studied attack on GCNs for transductive settings. We will study adversarial perturbations in both settings. Robust Graph Neural Nets While we can perform adversarial training for text, i.e., creating adversarial examples during training, we cannot use the same procedure for graph neural networks. Specifically, for each node in training, estimating the best edge perturbation would require computing gradients and searching among all estimates which is in the order of number of nodes in the graph; this would effectively lead to a polynomial time 9 search to find worst-case edge manipulation in a graph. Thus, we need to develop methods to create novel defense mechanisms which would not require finding these worst case perturbations. We first explore some architectural improvements to regular message passing schemes in graph neural nets using gating mechanisms, which would lead to a multi-view message passing scheme, wherein nodes receive different messages from a common neighbor. Apart from achieving state-of-the art results on some node classification benchmarks, we are able to use this gating mechanism toward a more robust node classifier. Concretely, we will incorporate a regularization term that would help the model produce higher gate values for edges present in the graph, compared with random edges not present in the graph. Further, we will show that there is a connection between perturbation of the input features and perturbation of edges in graph neural nets. Intuitively, edge perturbation can be regarded as constrained feature perturbation in the convex hull of the dataset. This will enable us to perform a variant of adversarial training on the features of nodes, without performing edge-level perturbations. We will extend this idea to improve robustness to norm-bounded feature perturbations on neighboring nodes, which would also help the model be robust to edge perturbations in the graph. Summary There has been a surging interest in studying adversarial examples in the past few years. The overwhelming majority of the literature has focused on image classification, with little attention paid to models with discrete input. In this dissertation, we focus on two canonical example of discrete inputs in machine learning problems, namely text and graphs. We will investigate the robustness of two neural 10 models: 1) a character-based neural model for NLP; and 2) graph convolutional neural nets and their inductive analogue for relational data. We will create adversarial perturbations and devise methods to defend against such attacks. Specifically, we will study the following research questions: – Perturbing text to trick a differentiable NLP model. – Perturbing graphs to trick a graph convolutional neural net in transductive and inductive settings. – Performing targeted attacks on a machine translation model. – Contrasting black-box and white-box attacks for NLP in various settings. 
– Devising methods to approximate edge-based perturbations for efficient robust training of graph neural nets.

The rest of the dissertation will cover each section of this introduction in depth, and will discuss future directions in the last chapter. In Appendix A, we give a short background on deep learning and introduce the models we have used in our work. Chapter III of this dissertation includes published co-authored material (Ebrahimi, Rao, Lowd, & Dou, 2018). Chapter IV of this dissertation includes published material (Ebrahimi, Lowd, & Dou, 2018).

CHAPTER II
RELATED WORK

Introduction

Deep learning has had tremendous success in several areas of artificial intelligence, including computer vision, natural language processing, and speech recognition. As deep learning researchers become more confident in solving important problems, the need to understand our models' vulnerabilities and failures becomes more crucial. This is even more important in critical applications, such as autonomous car driving or fraud detection. In some cases, real-world adversaries are actively working to trick our systems; spammers have been trying to come up with creative methods to scramble text such that it bypasses spam filter systems.

The earliest work on adversarial examples focused on linear spam classifiers. Dalvi et al. (2004) pose classification as a game between the classifier and the adversary, and create a classifier that can defend against an adversary's optimal strategy. Lowd and Meek (2005) attack a linear model by reverse engineering its weights for different types of features, without requiring full access to the model.

Building models which are resilient to noisy inputs has been studied in the machine learning community in the robust optimization literature (Xu, Caramanis, & Mannor, 2009). Concretely, a robust model aims to find the solution to the following minimax optimization problem, where δ is the perturbation vector, p determines the type of norm-bounded perturbation, ε is the magnitude of noise, and θ denotes the model parameters which we fit to minimize the loss:

$$\min_{\theta} \max_{\delta} \; J_{\theta}(x+\delta, y) \quad \text{s.t.} \quad \|\delta\|_p \leq \epsilon \qquad (2.1)$$

Robust convex optimization can provably give performance guarantees for the model to defend against noisy/adversarial inputs. In the case of neural nets, the minimax problem is non-convex, and the inner maximization is solved by gradient-based perturbations which can be found by backpropagation. Consequently, such performance guarantees as in the case of convex models cannot be granted.

A variant of robust support vector machines (SVMs) was proposed by Globerson and Roweis (2006), wherein the learned model is robust to feature deletions at test time. Teo, Globerson, Roweis, and Smola (2008) follow a similar approach, but include a larger variety of attacks for different tasks. Xu et al. (2009) show that an SVM robust to a given norm-bounded noise can be created by training a regularized version of the SVM, which includes a dual norm term on the weights of the SVM. Similarly, Torkamani and Lowd (2013) study the robustness of structural SVMs; in all of these works, the minimax optimization problem in 2.1 is solved by replacing the inner maximization with the minimization of its dual, which leads to a single quadratic minimization problem.
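As a worked illustration of why the inner maximization collapses for linear models (a sketch of the general argument, not a derivation taken from the cited papers): for a linear scorer with weights w under an ℓp-bounded perturbation,

$$\max_{\|\delta\|_p \le \epsilon} w^{\top}(x+\delta) = w^{\top}x + \epsilon \max_{\|u\|_p \le 1} w^{\top}u = w^{\top}x + \epsilon\,\|w\|_q, \qquad \tfrac{1}{p}+\tfrac{1}{q}=1,$$

so the worst-case hinge loss becomes max(0, 1 − y w⊤x + ε‖w‖_q). In other words, robustness to ℓp-bounded noise for a linear SVM reduces to an ordinary training problem with an additional dual-norm penalty on w, which is the flavor of result established by Xu et al. (2009).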
In the coming sections of this chapter, we will cover related work in the literature on adversarial attack and defense mechanisms for deep learning.

Adversarial Attacks

In this section, we cover the literature on adversarial attacks. We will start with several methods for white-box adversarial attacks, and conclude with a review of black-box attack approaches. The main category of methods for creating adversarial examples is linear attacks, in which the loss function is approximated by the gradient. There are also nonlinear optimization methods for creating adversarial examples; these are slower than linear methods, and hence are only good for attacking models, and are not useful for adversarial training, a procedure to make models robust, which we will cover in the Defense section of this chapter.

Linear Attacks. We know from calculus that the difference between the value of a function J at point x+δ and its value at x can be approximated by a linear sum over δ, the difference between the two points in input space. Concretely,

$$J_{\theta}(x+\delta) - J_{\theta}(x) \approx \langle \delta, \nabla_x J_{\theta}(x) \rangle \qquad (2.2)$$

Now if the goal is to increase the loss with respect to δ, we only need to maximize the right-hand side. The attacks which we study in this section either closely follow the formulation in 2.2, or create adversarial examples using gradients, ∇xJ(x). We call all of these approaches linear attacks, as they approximate the loss by gradients. Note that all of the following methods include nonlinear terms in their constraints, which renders the overall optimization procedure nonlinear.

One-Shot Attack. I. J. Goodfellow et al. (2015) propose an efficient one-shot method which, given constraints on the norm of the perturbation, leads to efficient creation of adversarial examples. In particular, given a max-norm constraint on the perturbation vector, ‖δ‖∞ ≤ ε, one can find the best perturbation vector as in 2.3:

$$\delta = \epsilon \, \text{sign}(\nabla_x J_{\theta}(x, y)) \qquad (2.3)$$

Given an ℓ2 constraint (i.e., ‖δ‖2 ≤ ε), an optimal one-shot attack would be:

$$\delta = \epsilon \, \frac{\nabla_x J_{\theta}(x, y)}{\| \nabla_x J_{\theta}(x, y) \|_2} \qquad (2.4)$$

where the numerator is the gradient of the loss function with respect to its input, and the denominator is the magnitude of the gradient (measured with the ℓ2 norm). One-shot attacks are generally considered to be weak, but the advantage of this type of adversarial example is that its fast generation enables us to use them in training, and hence increase models' robustness. For instance, I. J. Goodfellow et al. (2015) report some improvement over dropout in increasing a model's accuracy when these adversarial examples are used in training.

Iterative Attack. A reasonable extension to the previous method is projected gradient descent, which finds an adversarial example in several steps. This was first proposed by Kurakin, Goodfellow, and Bengio (2017a), in which they start with a bigger perturbation magnitude and project the adversarial example onto the desired ball. This is done for several iterations, which leads to adversarial examples that are harder to defend against. This projected gradient method was later improved by Madry et al. (2018) to create even stronger adversarial examples. Dong et al. (2018) applied the idea of momentum, i.e., keeping a memory of gradients for adaptive updates of the input gradients, and created even harder adversarial examples to defend against.

DeepFool (Moosavi-Dezfooli, Fawzi, & Frossard, 2016) attacks a model by moving an adversarial example towards, and finally to the other side of, a decision boundary. For a linear classifier, the best perturbation vector has a closed-form solution; however, for a general classifier they have to perform a gradient-descent approach to reach their goal. The method can be regarded as an iterative method in which the direction of the perturbation vector is not necessarily aligned with the gradient, but instead points toward the decision boundary.
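A minimal sketch of the one-shot perturbations in Eqs. 2.3 and 2.4, assuming a differentiable PyTorch classifier `model` and a single input example; the iterative attacks described above simply repeat such a step with a smaller step size and project the result back onto the ε-ball.

```python
import torch
import torch.nn.functional as F

def one_shot_perturbation(model, x, y, eps, norm="linf"):
    """Compute the one-shot perturbation of Eq. 2.3 (l-infinity) or Eq. 2.4 (l2)
    for a single example x with label y."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    g = x.grad
    if norm == "linf":
        return eps * g.sign()                    # Eq. 2.3
    return eps * g / (g.norm(p=2) + 1e-12)       # Eq. 2.4 (small constant for stability)
```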
Saliency maps are used for visualizing and interpreting predictive models (Simonyan, Vedaldi, & Zisserman, 2013). The map rates each input feature x^(i) (e.g., each pixel) on its influence over the model's prediction of a particular class c. One can represent saliency maps as:

$$S(x^{(i)}, c) =
\begin{cases}
0 & \text{if } \dfrac{\partial g(x)_c}{\partial x^{(i)}} < 0 \;\text{ or }\; \displaystyle\sum_{c' \neq c} \dfrac{\partial g(x)_{c'}}{\partial x^{(i)}} > 0 \\[2ex]
-\dfrac{\partial g(x)_c}{\partial x^{(i)}} \cdot \displaystyle\sum_{c' \neq c} \dfrac{\partial g(x)_{c'}}{\partial x^{(i)}} & \text{otherwise}
\end{cases}$$

S(·) measures the positive correlation of x^(i) with c, and its negative correlation with the other classes c′ ≠ c. An attacker can exploit this saliency map by targeting an adversarial class t that does not match the true class label y of a given sample x (Papernot, McDaniel, Jha, et al., 2016). This adversary inverts promising pixels iteratively, with the goal of minimizing the ℓ0 difference between the adversarial and original image, i.e., the count of pixels that differ in corresponding locations. The use of saliency maps, and the choice of the non-convex ℓ0 norm, leads to a search-based optimization method.

Carlini and Wagner (2017b) propose an attack which improves on previous approaches by incorporating a new nonlinear constraint. This constraint helps find an adversarial example which the model is most resilient to. In addition, their model is easily adaptable to ℓ0, ℓ2, and ℓ∞ constraints. For instance, their ℓ2 perturbation attack can be found through this optimization problem:

$$\min_{w} \; \left\| \tfrac{1}{2}(\tanh(w) + 1) - x \right\|_2^2 + c \cdot f\!\left(\tfrac{1}{2}(\tanh(w) + 1)\right) \qquad (2.5)$$

where f is defined as max(max_{i≠t} Z(x′)_i − Z(x′)_t, 0), and Z is the network's logits before the softmax. After a thorough investigation, Carlini and Wagner (2017a) show this attack framework can be adapted to attack several types of defense mechanisms, which will be discussed in the Defense section of this chapter.

Nonlinear Attacks. One extension to linear adversarial attacks is to approximate the loss function with a second-order Taylor expansion, which includes the Hessian matrix. This has been reported to have little impact on the power of the adversary (Tramèr, Papernot, Goodfellow, Boneh, & McDaniel, 2017).

Szegedy et al. (2014) solve the following optimization problem to find an adversarial example around x ∈ R^m whose label l differs from that of the original example:

$$\min_{\delta} \; c\|\delta\|_2 + J(x+\delta, l) \quad \text{subject to} \quad g(x+\delta) = l, \quad x+\delta \in [0, 1]^m \qquad (2.6)$$

They start from a large c and perform a line search to find the smallest c which can minimize the objective function. They use box-constrained L-BFGS after every step of the line search, which is a quasi-Newton method that approximates the Hessian matrix using gradients. Tabacof and Valle (2016) use a similar formulation to (Szegedy et al., 2014), but perform a bisection search to find the optimal c.

Adversarial Example Generation with GANs. In a radically different approach, a few recent works have investigated the problem of creating adversarial examples with generative adversarial networks (GANs) (I. Goodfellow et al., 2014), a successful architecture for generative modeling. Xiao et al. (2018) create perturbation vectors with the help of a GAN whose generator receives the input image and outputs the perturbation vector. The sum of the input and the perturbation vector is treated as a fake example, which is fed to a discriminator. An additional term in the objective function increases the loss on the attacked classifier.
Another work by Song, Shu, et al. (2018) removes the requirement of having original images, and generates adversarial examples for different classes, which are validated by crowdsourcing to be adversarial examples, i.e., belonging to the class they claim to be. Zhao et al. (2018) perturb the latent representation in GAN to produce examples which would trick the model. Unlike previous methods, GAN-based optimization methods do not require norm- bounded perturbations. Note that GAN is a nonlinear model which incorporates 17 neural architectures in its generator and discriminator. In addition, these attacks do not approximate the loss function with gradients. Thus, we can call these methods as the most nonlinear adversarial attacks. Black-Box Attacks. The aforementioned attacks are all white-box, and rely on having access to the model to inflict loss. But several works demonstrate that simply cutting this access would not secure a machine learning model. Szegedy et al. (2014) found that adversarial examples can fool the same neural networks trained by different training instances, thereby pointing to the weaknesses in models’ defending against black-box attacks. Papernot, McDaniel, and Goodfellow (2016) showed that adversarial examples can not only fool other neural networks with different architectures, they can also trick other classifiers trained by different machine learning algorithms, such as SVMs. Highly transferable adversarial examples can be used to perform black-box attacks on models. A typical approach to launch black-box attacks is to train a substitute neural network model and then generate adversarial examples against the substitute model (Papernot et al., 2017). Y. Liu, Chen, Liu, and Song (2017) investigate transferability of targeted and non-targeted adversarial examples, and found that non-targeted adversarial examples are much more transferable than targeted ones. And interestingly, they observe that in some cases decision boundaries of different models aligned with each other. Trame`r et al. (2017) find that adversarial examples span a subspace of large (∼25) dimensionality. They show that adversarial subspaces with higher dimensionality are more likely to intersect, and a significant fraction of their subspaces is shared, thus leading to higher transferability. Due to difficulty of creating white-box adversarial examples for NLP, most previous work in adversarial examples for NLP (Belinkov & Bisk, 2018; Iyyer et al., 18 2018; Jia & Liang, 2017; Ribeiro et al., 2018; Zhao et al., 2018) has focused on creating adversarial examples in a black-box setting. Jia and Liang (2017) add distracting sentences, which are polished by crowdsourcing, to the end of paragraphs to fool deep reading comprehension systems. These confusing statements incorporate elements of the question and correct answer, and are untrue but grammatical statements. Hosseini et al. (2017) manipulate toxic words in a sentence to reduce their toxicity score. In both of these works, the adversary relies on knowledge about what the model does, in order to trick them. Zhao et al. (2018) search for black-box adversarial examples in the space of encoded sentences and generate adversarial examples by perturbing the latent representation until the model is tricked. However, it is not clear how many queries are sent to the model or what the success rate of the adversary is. Iyyer et al. (2018) rely on backtranslation to create paraphrases which follow certain syntactic templates. 
These paraphrases are used to attack models by simply querying the models to see if these examples can trick them. Ribeiro et al. (2018) use linguistic rules to perturb an input text, and perform a greedy search to apply rules in an iterative search-based procedure.

We now briefly introduce our work on creating adversarial examples, to place it in the context of related work. We will explore these ideas in further detail in Chapter III.

Our Approach to Create Adversarial Examples. We first point out that the first work on crafting white-box adversarial examples for NLP was proposed by Papernot, McDaniel, Swami, and Harang (2016). They add noise to the word embeddings and search the neighborhood of the new embedding in order to replace the original word. While their adversary is able to trick the classifier, the word-level changes tend to produce nonsensical sentences (e.g., "I wouldn't rent this..." → "Excellent wouldn't rent this..."). Other methods (Miyato et al., 2017; Wu, Bamman, & Russell, 2017) simply use noise on the word embeddings to create pseudo-adversarial examples, which are only useful for adversarial training, i.e., they do not lead to actual textual adversarial examples.

How can we go from white-box noise on word embeddings to noise on the discrete input? The answer is in the one-hot vector representation of text or other types of discrete inputs, such as graphs. Word embeddings are vectors in a low-dimensional space, which capture syntactic and semantic properties of words (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013). We often use indexes to these vectors in the vocabulary when training NLP models. This is simply an O(1) lookup operation in a dictionary. This lookup operation can also be represented by a vector-matrix multiplication, where the vector is a one-hot vector and the matrix is the vocabulary matrix. Concretely, the following holds:

$$w_i = x_i W \qquad (2.7)$$

where w_i is the word embedding of the i-th word, W is the vocabulary matrix, and x_i is a one-hot vector with 0 in all dimensions except for the i-th. A similar method can be used for character-level models in NLP, and for any other model in which a lookup operation is used to select representations. For instance, in a character-level model, instead of word embeddings, we have character embeddings. For a graph in which nodes have features, the vocabulary simply contains a dictionary to access nodes' features. It is straightforward to use the chain rule to backpropagate the error with respect to w_i, the embedding, to the one-hot vector:

$$\frac{\partial J}{\partial x_i} = W^{T} \frac{\partial J}{\partial w_i} \qquad (2.8)$$

where J is the loss of the model. These gradients are still not quite useful on their own; for instance, what does ∂J/∂x_i^(k), the k-th dimension of this gradient vector, mean? In the next chapter, we will demonstrate how we can use these gradients to score manipulations of text, such as replacing one character with another, as sketched below. Further, we can use a greedy or beam search strategy to apply manipulations iteratively.

None of the previous NLP works have put forth a quantitative analysis of targeted and untargeted adversarial examples. We will investigate two new attack scenarios for machine translation, and propose a new framework for creating and evaluating targeted adversarial examples for machine translation. Our work on adversarial perturbations for graphs is limited to untargeted attacks.
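As referenced above, the gradient in Eq. 2.8 can score candidate edits directly: the first-order estimate of the loss change for replacing the symbol a at position i with symbol b is ∂J/∂x_i^(b) − ∂J/∂x_i^(a). The following is a simplified NumPy sketch of this scoring step (the full method, including beam search over multiple edits, is presented in Chapter III):

```python
import numpy as np

def score_flips(grad_onehot, seq_onehot):
    """First-order estimate of the loss increase for replacing the symbol at each
    position with every other vocabulary symbol:
        score[i, b] = dJ/dx_i^(b) - dJ/dx_i^(a_i),  where a_i is the current symbol.
    Both arrays have shape (sequence_length, vocab_size)."""
    current = (grad_onehot * seq_onehot).sum(axis=1, keepdims=True)  # dJ/dx_i^(a_i)
    scores = grad_onehot - current
    scores[seq_onehot.astype(bool)] = -np.inf   # never "flip" to the same symbol
    return scores

def best_flip(grad_onehot, seq_onehot):
    """Greedy choice: the (position, new symbol) pair with the largest estimated increase."""
    scores = score_flips(grad_onehot, seq_onehot)
    pos, sym = np.unravel_index(np.argmax(scores), scores.shape)
    return int(pos), int(sym), float(scores[pos, sym])
```

A greedy or beam search then applies the highest-scoring edits one at a time, re-computing gradients after each edit.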
Defense

In this section, we focus on different defense mechanisms for deep learning models.

Gradient Masking. Gradient masking is a method to obfuscate gradients by means of numerical instability, to ensure that a white-box adversary fails in attacking the system. This was first proposed by Papernot, McDaniel, Wu, et al. (2016): by setting a large T in the computation of the softmax function, shown in 2.9, the model makes highly confident predictions for the true label, producing unstable gradients for attacking the model:

exp(z_i / T) / Σ_j exp(z_j / T)    (2.9)

where z_j is the final output of the network for class j, and T is the temperature hyper-parameter, which is normally set to 1. However, Carlini and Wagner (2017b) show that, using the logits before the softmax, we can still attack the model, rendering the defense useless. A similar work (Brendel & Bethge, 2017) shows the weakness of another gradient masking defense mechanism proposed by Nayebi and Ganguli (2017). An extensive study conducted by Athalye et al. (2018) shows that gradient obfuscation can give a misleading sense of security.

Adversarial Example Detection. Some defense mechanisms empower the models with a built-in capability to detect an adversarial example, either through auxiliary components in the model for that goal, or simply by augmenting the training data with adversarial samples. Metzen et al. (2017) create a binary classifier to detect adversarial examples as an auxiliary network added to the original neural network. Grosse, Manoharan, et al. (2017) add an outlier class to the original dataset, composed of adversarial examples, which the model learns to detect by classifying them as the outlier class. Feinman, Curtin, Shintre, and Gardner (2017) hypothesize that adversarial examples lie off the data manifold, and provide a Bayesian view of detecting adversarial examples. They show that the uncertainty of adversarial examples is higher than that of clean data, by measuring the randomness in predictions after dropout (N. Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014) is added to layers of the model. Song, Kim, Nowozin, Ermon, and Kushman (2018) show that the distribution of adversarial examples is different from that of clean data. They calculate p-values based on the rank given by PixelCNN (Oord, Kalchbrenner, & Kavukcuoglu, 2016), and reject adversarial examples using these statistics. Carlini and Wagner (2017a) found that almost all of the adversarial example detection algorithms are defeated in a variety of white-box and black-box settings, using appropriate objective functions for the adversary.

Adversarial Training. Adversarial training interleaves training with the generation of adversarial examples (I. J. Goodfellow et al., 2015). Concretely, after every iteration of training, adversarial examples are created and added to the mini-batches. Virtual adversarial training (Miyato et al., 2016) is another regularization method, which aims to minimize the KL divergence between predictions on the examples and their adversarial counterparts, for semi-supervised tasks wherein we have a limited number of labeled examples. To deal with transferred black-box attacks, Tramèr, Kurakin, Papernot, Boneh, and McDaniel (2018) propose the Ensemble Adversarial Training method, which trains the model with both white-box and imported black-box adversarial examples. According to Kurakin, Goodfellow, and Bengio (2017b), one-shot attacks, e.g., FGSM, are easy to defend against for a model which is trained on these white-box adversarial examples, but these attacks have a high transferability rate and can be used to attack other models in a black-box setting.
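For reference, the one-shot FGSM attack mentioned above can be sketched in a few lines; this is a generic illustration that assumes a differentiable PyTorch classifier with inputs in [0, 1], with the model, data, and ε as placeholders rather than the settings used in the cited works.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon=0.03):
    """One-shot FGSM: x_adv = x + epsilon * sign(grad_x J(theta, x, y))."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()
        x_adv = x_adv.clamp(0.0, 1.0)   # keep inputs in a valid range
    return x_adv.detach()

# Usage (illustrative): build an adversarial mini-batch for adversarial training.
# x_adv = fgsm(model, x_batch, y_batch); loss = F.cross_entropy(model(x_adv), y_batch)
```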
In addition, Tramèr et al. (2018) show that an FGSM-trained model is more robust to white-box attacks than to black-box attacks due to gradient masking. They propose Rand-FGSM, which adds Gaussian noise to the example before computing the gradients. A projected gradient descent (PGD) approach to creating adversarial examples, proposed by Madry et al. (2018), has proved to be one of the most effective defense mechanisms against adversarial attacks for image classification. The PGD algorithm, which creates adversarial examples within ‖δ‖∞ ≤ ε, is given in Algorithm 1. The difference between this approach and the one proposed by I. J. Goodfellow et al. (2015) is the lack of clean examples in training and its closer connection to robust optimization for a non-convex problem. Adversarial training has been used to make text classification (Miyato et al., 2017) and information extraction (Wu et al., 2017) more robust. However, instead of creating real-world textual adversarial examples, these works add noise to word embeddings and create adversarial examples in continuous space.

Algorithm 1 Adversarial training
  Input: model parameters θ, stepsize sequence {r_t > 0}, t = 0, ..., T−1
  for t = 0, ..., T−1 do
      x_a ← x
      for i = 0, ..., iter do
          x_a ← Π( x_a + sign(∇_{x_a} J(θ_t; x_a)) )    (Π projects onto ‖x_a − x‖∞ ≤ ε)
      θ_{t+1} ← θ_t − r_t ∇_θ J(θ_t; x_a)

Robust Optimization. Recall that when the problem posed in 2.1 is convex, robust convex optimization can provably give performance guarantees for the model to defend against noisy/adversarial inputs. In the case of neural nets, the minimax problem is non-convex; until recently it was not known how to train robust neural classifiers. Kolter and Wong (2017) build on linear ReLU relaxations, first proposed by Ehlers (2017) for neural network verification. This defense finds a convex outer bound for the adversarial polytope, i.e., the set of all final-layer activations attainable by perturbing the input. Sinha, Namkoong, and Duchi (2018) provide a method for achieving certified robustness for perturbations defined by a certain distributional Wasserstein distance, which is not translatable to norm-bounded perturbations. Raghunathan et al. (2018) develop a semidefinite programming-based relaxation of the adversarial polytope and employ it for training a robust classifier. Both Raghunathan et al. (2018) and Kolter and Wong (2017) create worst-case adversarial examples in the outer bound via the dual; for the latter, this results in another neural network which looks like the backward pass of the original network.
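Algorithm 1 above approximates the inner maximization of this minimax problem with projected gradient steps. A minimal PyTorch sketch of one such training step follows; the model, data, ε, step size, and iteration count are illustrative placeholders, not the settings used in the cited work.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.03, step=0.007, iters=10):
    """Inner loop of Algorithm 1: maximize J within the L-infinity ball of radius eps."""
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step * grad.sign()
            x_adv = x.clone() + (x_adv - x).clamp(-eps, eps)  # projection onto the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    """Outer loop of Algorithm 1: update theta on the adversarial batch only."""
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```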
We now briefly introduce our work on defending against adversarial examples. We will explore these ideas in further detail in Chapter III.

Our Approach to Defend Against Adversarial Attacks. The main method we use to defend models is adversarial training. In particular, our discrete adversarial example generation method is efficient for NLP models, especially for character-level models with a few hundred tokens in the vocabulary. Our adversarial training algorithm for the machine translation and text classification experiments was only 3-4 times slower than vanilla training. As we will elaborate later, this procedure is not efficient for large graphs, in which the size of the vocabulary (i.e., W in Equation 2.7) could be hundreds of thousands of entries, or even more. So we devise efficient methods to bypass this issue. One of our methods is similar to the defense strategy of detecting adversarial examples: we develop a gating mechanism which helps us measure the trustworthiness of edges. When the output of the gate is small, the impact of the edge is minimized. Later, we will show that there is a connection between relaxed edge perturbations and unrestricted feature-based perturbations for the nodes of the graph, and we develop a gradient-based method which perturbs features and projects them onto the outer cube of the convex hull. Building on this observation, we will develop three more baselines which make the model robust to perturbations on the features of neighboring nodes. In one of these methods, we look at robust training of graph neural nets through convex relaxation of neural nets. All of these models significantly improve the robustness of vanilla models against adversarial edge perturbations. In short, we provide a set of novel algorithms for the robustness of graph neural nets, which belong to different classes of defense mechanisms.

Summary

In this section, we provide a brief summary of related work, and how our contributions fit within the literature. We will compare our methods with previous work along the main axes of adversarial machine learning.

Attacks on Continuous vs. Discrete Spaces. The overwhelming majority of the literature uses image classification as the task for studying adversarial examples (Carlini & Wagner, 2017a; I. J. Goodfellow et al., 2015; Kolter & Wong, 2017; Madry et al., 2018; Szegedy et al., 2014). For the first time, we propose a framework to attack neural network models with discrete input, by defining differentiable discrete operations that estimate the loss with a limited number of queries to the model. We will focus on character-level NLP models for translation and classification, as well as graph-based models for classification.

Targeted vs. Untargeted Attacks for Machine Translation. Many works on image classification focus on targeted attacks (Kurakin et al., 2017a; Moosavi-Dezfooli et al., 2016; Papernot, McDaniel, Jha, et al., 2016). In classification domains with few classes, targeted attacks are relatively simple, since an adversary can perturb the input to move it to the other side of a decision boundary; whereas in machine translation, we deal with vocabulary sizes on the order of at least tens of thousands, and it is less likely for an adversary to succeed in targeted attacks for most possible target words. Zhao et al. (2018) provide a few hand-picked adversarial examples for machine translation; we investigate two new attack scenarios for machine translation, and propose a new framework for creating and quantitatively evaluating targeted adversarial examples for machine translation.

Black-Box Attacks vs. White-Box Attacks for NLP. Due to the difficulty of creating white-box adversarial examples for NLP, most previous work on adversarial examples for NLP (Belinkov & Bisk, 2018; Iyyer et al., 2018; Jia & Liang, 2017; Ribeiro et al., 2018; Zhao et al., 2018) has focused on creating adversarial examples in a black-box setting. We compare our white-box adversarial examples with the black-box baseline of Belinkov and Bisk (2018), show that white-box adversaries are much stronger, and demonstrate that adversarial training on white-box examples makes the model much more resilient to adversarial attacks.

Defense for Graph Neural Nets. Defense for graph neural nets has not been studied in any prior work. Defending against adversarial edge perturbations requires changes to the regular adversarial training procedure.
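To give a flavor of the edge-gating defense mentioned above, the sketch below shows a generic gated mean-aggregation layer for a graph neural network. It is a toy illustration of the general mechanism (a learned per-edge trust score that damps suspicious edges), not the exact gate developed in later chapters; all module and tensor names are hypothetical.

```python
import torch
import torch.nn as nn

class GatedMeanAggregation(nn.Module):
    """Mean aggregation over neighbors, with a learned per-edge gate in [0, 1]."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, h, edges):
        # h: (num_nodes, dim) node features; edges: (num_edges, 2) [src, dst] index pairs.
        src, dst = edges[:, 0], edges[:, 1]
        g = self.gate(torch.cat([h[dst], h[src]], dim=-1))   # (num_edges, 1) edge trust score
        msg = g * h[src]                                     # low-trust edges are damped
        agg = torch.zeros_like(h).index_add_(0, dst, msg)
        deg = torch.zeros(h.size(0), 1).index_add_(0, dst, torch.ones_like(g)).clamp(min=1)
        return torch.relu(self.proj(torch.cat([h, agg / deg], dim=-1)))
```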
We will create novel defense algorithms to make GraphSage (Hamilton et al., 2017), a recent neural model for inductive node classification, robust. Our defense algorithms can be categorized into different classes of defense mechanisms which have already been explored for image classification. In particular, our edge-gating-based defense mechanism is related to the adversarial example detection work of Metzen et al. (2017), our convex defense model is inspired by Kolter and Wong (2017), and our Jacobian-based defense is related to the layer-wise regularization of Gu and Rigazio (2015).

CHAPTER III
ADVERSARIAL EXAMPLES FOR TEXT

Ebrahimi, J., Rao, A., Lowd, D., & Dou, D. (2018). HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 31-36).

Introduction

In this chapter, we describe an efficient method to generate white-box adversarial examples for text. The method estimates the increase in loss via a first-order approximation of the loss under a manipulated input, e.g., the removal or addition of a character. We provide adversarial examples for a character-level neural text classifier and a character-level machine translator. We also discuss the transferability of adversarial examples (i.e., their ability to trick models to whose parameters the adversary has no access), and the human perception of textual adversarial examples.

White-Box Adversarial Examples

The human language processing faculty can cope with typos, misspellings, and even the complete omission of letters when reading (Rawlinson, 1976). The following amusing text circulated on the internet in September 2003, and has been passed on many times, including through internet memes.

Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae.

As an experiment, Belinkov and Bisk (2018) gave a noisy German version of the sentence above to Google Translate, and got the following gibberish translation.

After being stubbornly defiant, it is clear to kenie Rlloe in which Reiehnfogle is advancing the boulders in a Wrot that is integral to Sahce, that the utterance and the lukewarm boorstbaen stmimt.

Belinkov and Bisk (2018) show that Neural Machine Translation (NMT) systems are brittle and will produce gibberish translations under typos, random noise, and natural noise. A similar work by Hosseini et al. (2017) shows that Google's Perspective API, which is used to detect toxic language in text, can produce very different toxicity scores when noisy input is given to the model. For example, the score for the following sentence changed from 84% to 20% after the word idiot was changed to idiiot.

Climate change is happening and it's not changing in our favor. If you think differently you're an idiot.

The adversary has simply added a typo to the toxic word in the sentence, and successfully tricked the model. In a similar approach, Jia and Liang (2017) devise adversarial examples for a text comprehension system by adding a distracting sentence based on the question and the true answer. For instance, if the question to the system is "What city did Tesla move to in 1880?", and the true answer is "Prague", a possible distracting sentence would be "Tadakatsu moved the city of Chicago to in 1881.", which contains some linguistic features of the question and answer, e.g., a city, a date, a name.
The central question we answer is whether we can find worst-case failures of an NLP model through the lens of white-box adversarial examples. Concretely, the aforementioned examples were black-box attacks created by random or heuristic-based changes. In such a scenario, the adversary has no preference over possible manipulations, other than querying the model to see if its output has changed, or relies on heuristics which are limited to that task. Our work, on the other hand, focuses on creating adversarial examples solely based on the generic optimization problem of increasing the loss of an NLP model. Since the space of possible discrete changes is combinatorial, querying the model to rank adversarial manipulations is intractable. However, in a white-box setting, the adversary can use the model's gradients as a surrogate loss function to estimate the effectiveness of different manipulations, and manipulate the text accordingly. In other words, a white-box adversary can use gradients to estimate the loss without exhaustively querying the model for every manipulation. Table 1 shows a correctly-classified sentence on top; the next two examples show a successful and an unsuccessful manipulation, respectively: in the first one, the label of the text has changed from World to Sci/Tech, while the label of the second example has not changed. A naive adversary would query the model for its loss and pick the adversarial example which increased the loss more; a gradient-based adversary can estimate the loss and avoid excessive querying of the model. As we will see, the gradient-based adversary is orders of magnitude faster than the naive adversary.

We develop a method, HotFlip, to create white-box adversarial examples, which can be applied to character-level models, word-level models, and more generally any model with a one-hot representation scheme for its input data. At the core of our method lies an atomic character flip operation which changes one token to another; this can be extended to insertion and deletion operations to constitute a set of plausible adversarial attacks. While a few character changes can trick a character-level model, the meaning of the text is very likely to be preserved or inferred by the reader. By contrast, word-level adversarial manipulations are much more likely to change the meaning of the text.

Table 1. Successful and unsuccessful adversarial examples.
South Africa's historic Soweto township marks its 100th birthday on Tuesday in a mood of optimisim. 57% World
South Africa's historic Soweto township marks its 100th birthday on Tuesday in a mooP of optimisim. 95% Sci/Tech
South Africa's historic Soweto townshiq marks its 100th birthday on Tuesday in a mood of optimisim. 52% World

Method

The architecture we study for our character-level experiments is based on the one proposed by Kim, Jernite, Sontag, and Rush (2016) for character-level language modeling. Feature extraction is performed by convolutional neural networks (CNNs) over characters, which are passed to layers of highway networks, and finally given to stacks of long short-term memory networks (LSTMs) for modeling a sequence of words. This architecture has been used to perform sequence labeling (Kim et al., 2016) and machine translation (Costa-Jussa & Fonollosa, 2016). We adapt it for text classification as well, wherein the output of the last recurrent unit is passed to a softmax to predict the label of the text.
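The following is a schematic PyTorch sketch of such a CharCNN-LSTM classifier; the dimensions, filter sizes, and module names are illustrative placeholders rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class CharCNNLSTMClassifier(nn.Module):
    """Characters -> CNN per word -> highway layer -> word-level LSTM -> softmax over labels."""

    def __init__(self, alphabet_size, num_classes, char_dim=25, num_filters=100, hidden=500):
        super().__init__()
        self.char_emb = nn.Embedding(alphabet_size, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size=5, padding=2)
        # One highway layer: y = t * relu(W1 x) + (1 - t) * x, with t = sigmoid(W2 x).
        self.hw_transform = nn.Linear(num_filters, num_filters)
        self.hw_gate = nn.Linear(num_filters, num_filters)
        self.lstm = nn.LSTM(num_filters, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, chars):
        # chars: (batch, num_words, num_chars) integer character indices.
        b, m, n = chars.shape
        e = self.char_emb(chars.view(b * m, n)).transpose(1, 2)   # (b*m, char_dim, n)
        w = torch.relu(self.conv(e)).max(dim=2).values            # per-word feature vector
        t = torch.sigmoid(self.hw_gate(w))
        w = t * torch.relu(self.hw_transform(w)) + (1 - t) * w    # highway layer
        seq = w.view(b, m, -1)                                    # sequence of word vectors
        _, (h, _) = self.lstm(seq)
        return self.out(h[-1])                                    # logits over classes

# Usage (illustrative): logits = CharCNNLSTMClassifier(80, 4)(torch.randint(1, 80, (2, 12, 16)))
```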
Text Representation. Let us use character-level text classification as our running example. Let V be the alphabet, x be a text of L characters, and x_{ij} ∈ {0, 1}^{|V|} denote the one-hot vector representing the j-th character of the i-th word. The character sequence can be represented by

x = [(x_{11}, .., x_{1n}); ..; (x_{m1}, .., x_{mn})]

wherein a semicolon denotes explicit segmentation between words. The number of words is denoted by m, and n is the maximum number of characters allowed for a word; padding is applied if the number of characters is fewer than the maximum.

HotFlip: Differentiable String-Edit Operations. Let J(x, y) denote the log-loss over the output of the softmax unit. Imagine the adversary is allowed to change r characters in the input text to fool the model, to which it has access. Using a brute-force search, it would need \binom{L}{r} |V|^r forward passes to exhaust the search space; that is, it would query the classifier, by calling J, for all combinations of character flips within the allowed budget r. This can decrease to O(brL|V|) if beam search is used: after trying all possible single character flips, keep the top b flips which had the highest loss increase, and continue for r steps. Our algorithm requires a single forward and a single backward pass to estimate the best possible flip.

We represent string operations as vectors in the input space and estimate the change in loss by directional derivatives with respect to these operations. Based on these derivatives, the adversary can choose the best loss-increasing direction. A character flip in the j-th character of the i-th word (a → b) can be represented by the vector

\vec{v}_{ijb} = (\vec{0}, ..; (\vec{0}, .., (0, .., -1, 0, .., 1, 0)_j, .., \vec{0})_i; \vec{0}, ..)

where -1 and 1 are in the positions corresponding to the a-th and b-th characters of the alphabet, respectively, and x_{ij}^{(a)} = 1. By the definition of the directional derivative, we have ∇_{\vec{v}_{ijb}} J(x, y) = ∇_x J(x, y)^T · \vec{v}_{ijb}; a first-order approximation of the change in loss is therefore obtained from the derivative along this vector:

∇_{\vec{v}_{ijb}} J(x, y) = ∇_x J(x, y)^T · \vec{v}_{ijb} = ∂J/∂x_{ij}^{(b)} − ∂J/∂x_{ij}^{(a)}    (3.1)

An immediate benefit of using derivatives with respect to the one-hot vectors is that they can be used to select the best character change (a → b). Concretely, the derivative vector contains the information about the loss increase obtained due to a character flip. This makes the number of queries (forward passes) independent of the alphabet size. We simply need to find the best change by maximizing eq. 3.1 to estimate the best character change (a → b). This requires searching over |V|mn values for a given text of m words with n characters each, in an alphabet of size |V|.

Character insertion at the j-th position of the i-th word can also be treated as a character flip, followed by more flips as characters are shifted to the right until the end of the word. Hence, the estimated change in loss corresponding to a character insertion is given by:

∂J/∂x_{ij}^{(b)} − ∂J/∂x_{ij}^{(a)} + \sum_{j'=j+1}^{n} ( ∂J/∂x_{ij'}^{(b')} − ∂J/∂x_{ij'}^{(a')} )    (3.2)

where x_{ij'}^{(a')} = 1 and x_{ij'−1}^{(b')} = 1. Similarly, character deletion can be written as a number of character flips as characters are shifted to the left. Since the magnitudes of the operation vectors are different, we normalized them by both the L1 and L2 norms, but found that this had little impact.
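To illustrate how Equation 3.1 is used in practice, the sketch below scores every possible single character flip with one forward and one backward pass. It assumes a character-level model that accepts a one-hot tensor, as in the lookup-as-matrix-multiplication view of Chapter II; the model interface and shapes are placeholders, and padding positions are not handled.

```python
import torch
import torch.nn.functional as F

def best_flip(model, x_onehot, y):
    """Estimate the best single character flip via Equation 3.1.

    x_onehot: (num_words, num_chars, alphabet_size) one-hot input; y: gold label (long scalar).
    Returns (word index, char index, new char id, estimated loss increase).
    """
    x = x_onehot.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
    grad = torch.autograd.grad(loss, x)[0]                       # dJ/dx, same shape as x

    # Eq. 3.1: score(a -> b) = dJ/dx^(b) - dJ/dx^(a), where a is the character currently set.
    current = (grad * x_onehot).sum(dim=-1, keepdim=True)        # dJ/dx^(a)
    scores = grad - current                                      # all flips, all positions
    scores = scores.masked_fill(x_onehot.bool(), float("-inf"))  # disallow "flipping" a to a

    flat = int(scores.argmax())
    m, n, v = scores.shape
    i, j, b = flat // (n * v), (flat // v) % n, flat % v
    return i, j, b, float(scores[i, j, b])
```

The key property, as stated above, is that the number of forward passes does not depend on the alphabet size: all candidate flips are ranked from a single gradient tensor.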
Multiple Changes. We explained how to choose the best single change in text to get the maximum increase in loss. After applying a change, we query the classifier and check whether the predicted label also changes, upon which we stop the search. For additional changes, we can either perform manipulations greedily, i.e., applying the best manipulation at every step of the way, or keep the most promising manipulations and perform a beam search. The search continues until we reach the maximum number of allowed changes, r, or successfully trick the classifier into misclassifying the instance.

Beam Search. Our proposed beam search requires only O(br) forward passes and an equal number of backward passes. Concretely, there are two sources of efficiency in our algorithm: first, the breadth-search is done simply by sorting the derivatives, rather than querying the classifier for the actual loss; second, after the breadth-search, we keep the top b paths, which are sorted by the sum of the loss up to the current step and the gradient-based estimate of the current step, rather than querying to obtain their actual loss. We elaborate on this with an example; consider the loss function J(.), input x_0, and an individual change c_j. We estimate the score for the change as ∂J(x_0)/∂c_j. For a sequence of 3 changes [c_1, c_2, c_3], the following holds:

score([c_1, c_2, c_3]) = J(x_2) + ∂J(x_2)/∂c_3

where x_2 is the modified input after applying [c_1, c_2]. With a beam width of b, we need b forward and b backward passes to compute derivatives at each step of the path, leading to O(br) queries. In contrast, a loss-based approach requires computing the actual loss for every possible change at every stage of the beam search, leading to O(brL|V|) queries. Thus, our algorithm limits the number of queries by using derivatives as surrogates for the change in loss.

To showcase the efficiency of our method, we compare the amount of time it takes to carry out a beam-search method based on querying for the actual loss with our derivative-based beam-search approach. Because the derivative approach is less exhaustive, it requires a larger beam size, so we set the beam size for the derivative approach to 10, and for the loss-querying approach to 5. The sample that we used for this experiment had an 80-token alphabet and an average of 44 characters per sentence. Both approaches had a 100% success rate (perfect misclassification) on this sample. It can be seen in Table 2 that the derivative-based approach needs an average of one more character flip to trick the classifier; e.g., 34% of examples are misclassified after one change using the derivative-based approach, while this proportion increases to 59% for the loss-based approach. However, the derivative-based approach is 90 times faster on average; if we decrease the beam size to 5 for our method, the adversary will fail to misclassify a few of the examples, but the speedup will increase to 400 times.
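A minimal sketch of this derivative-guided beam search follows. It assumes a `score_all_edits` helper that returns candidate edits with their gradient-based estimates, sorted in decreasing order of estimate (in the spirit of the flip-scoring sketch above), and an `apply_edit` helper; both helpers, and the model interface, are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def beam_search_attack(model, x, y, score_all_edits, apply_edit, budget=3, beam=5, topk=20):
    """Derivative-guided beam search: O(beam * budget) forward/backward passes for ranking."""
    def true_loss(inp):
        with torch.no_grad():
            return float(F.cross_entropy(model(inp.unsqueeze(0)), y.unsqueeze(0)))

    paths = [(true_loss(x), x, [])]                  # (loss so far, modified input, edits applied)
    for _ in range(budget):
        candidates = []
        for loss_so_far, inp, edits in paths:
            for edit, est_gain in score_all_edits(model, inp, y)[:topk]:
                # Rank by the actual loss up to this step plus the first-order estimate of
                # the next edit; no extra forward pass is needed for ranking.
                candidates.append((loss_so_far + est_gain, inp, edits + [edit]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        paths = []
        for _, inp, edits in candidates[:beam]:
            new_inp = apply_edit(inp, edits[-1])
            with torch.no_grad():
                pred = model(new_inp.unsqueeze(0)).argmax(dim=-1).item()
            if pred != int(y):
                return new_inp, edits                # classifier fooled
            paths.append((true_loss(new_inp), new_inp, edits))
    return None, None                                # attack failed within the budget
```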
Table 2. Comparing beam-search strategies, based on the number of changes (i.e., 1, 2, or 3+). We report the proportion of successfully created adversarial examples and the average time, in seconds, spent for each number of changes.
No. change(s)                 1       2       3+
Loss-based       Time         10.3    70.2    705
                 Proportion   59%     29%     12%
Gradient-based   Time         0.11    0.93    2.7
                 Proportion   34%     29%     37%

Embeddings Under Adversarial Noise

To compare changes in the HotFlip direction with a random direction, we measure the average squared distance between the original word representation in each layer (i.e., using the outputs of the CNN and the highway layers) and the representation after applying changes in each direction. Figure 1 shows a much larger difference between the embedding of the original word and the one modified in the HotFlip direction.

Figure 1. Comparing the HotFlip direction and a random direction based on the average squared distance between the embedding of the original word and the embedding of the modified word, computed from the outputs of the CNN and highway layers.

In Table 3, we study the change in the word embedding as a change occurs. We use the output of the highway layer as the word representation. We report the embeddings for a few adversarial words, for which the original word is not among their top 5 nearest neighbors.

Table 3. Nearest-neighbor words (based on cosine similarity) of word representations from CharCNN-LSTM, taken after the highway layers. A single adversarial change in the word often results in a big change in the embedding, which makes the word more similar to other words than to the original word.
Alps → llps: lips, laps, legs, slips
talk → taln: tall, tale, tales, talent
local → loral: moral, Moral, coral, morals
you → yoTu: Tutu, Hutu, Turku, Futurum
ships → hips: dips, hops, lips, hits
actor → actr: act, acting, actress, acts
lowered → owered: powered, empowered, owed, overpowered

Deletion examples (i.e., ships, actor, lowered) often maintain some information about the word, such as its prefix and tense, which makes them more benign changes. In a character-level model, the lookup operation to pick a word from the vocabulary is replaced by a character-sequence feature extractor which gives an embedding for any input, including OOV words, which would be mapped to an UNK token in a word-level model. This makes the embedding space induced by the character-level representation denser, which makes character-level models more likely to misbehave under small adversarial perturbations.

Experiments

We analyze the effect of character-level adversarial examples on two different tasks. Our seq2seq implementation relies largely on OpenNMT (https://github.com/opennmt/opennmt), which mostly follows the guidelines of Luong, Pham, and Manning (2015) for attentional translation. Specifically, we used a CharCNN-LSTM encoder and a word-based attentional decoder (Bahdanau, Cho, & Bengio, 2014). We used a beam size of 5 for decoding at test time. We used 800k sentence pairs from the Europarl portion of the German-English WMT data (http://www.statmt.org/wmt15/translation-task.html), and reproduced the results of Costa-Jussa and Fonollosa (2016). Preprocessing consisted of tokenizing, normalizing punctuation, and filtering sentences with more than 50 words. For text classification, we use the AG's news dataset (https://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html), which consists of 120,000 training and 7,600 test instances from four equal-sized classes: World, Sports, Business, and Science/Technology. The architecture consists of a 2-layer stacked LSTM with 500 hidden units and a character embedding size of 25. On AG's news, this classifier was able to outperform the model of Conneau, Schwenk, Barrault, and Lecun (2017), which has achieved state-of-the-art results on some benchmarks. Both models were trained with stochastic gradient descent and gradient clipping, and the batch size was set to 64. For classification, we used 10% of the training data as the development set, and trained for a maximum of 25 epochs.
For translation, we use OpenNMT's suggested hyper-parameters. Throughout our experiments, we only allow character changes if the new word does not exist in the vocabulary, to avoid changes that are more likely to alter the meaning of the text. The adversary uses a beam size of 5 for translation and a beam size of 10 for classification. We limited per-word character changes to one, which was found to facilitate faster beam search. For both tasks, we use a maximum of 10% of the characters in the document as the budget for the adversary. For experiments with random changes, we use the same number of character changes as the one committed by our adversary in that experiment.

Text Classification. In Figure 2, we plot the success rate of the adversary against an acceptable confidence score for the misclassification. That is, we consider the adversary successful only if the classifier misclassifies the instance with a given confidence score. As can be seen, the beam-search strategy is very effective in fooling the classifier, even with a 90% confidence constraint, tricking the classifier for more than 90% of the instances. A greedy search is less effective, especially in producing high-confidence scores. In this task, 30% of adversarial examples have either one or two changes. The adversary manipulates 4.1% and 5.28% of the characters in a document on average, for the 0.5 and 0.9 confidence thresholds respectively. For this task, random changes barely trick the model (2.7% success rate at 0.5 confidence); we performed a beam search with random gradients, which increased the success rate to only 6.2% at 0.5 confidence.

Figure 2. Adversary's success rate as a function of confidence for classification.

Table 4 shows some of our adversarial examples which tricked the model into changing its correctly predicted label, as denoted on top of each set of adversarial examples. In Table 5, we report the correct-to-incorrect rate (i.e., the number of instances that are originally correctly classified but will be misclassified after the adversarial manipulations), and the average number of changes made by the adversary, for sentences of up to 25 words. We also report the percentage of the time that originally misclassified instances will be correctly classified after the adversarial manipulation (i.e., incorrect-to-correct).

Table 4. Adversarial examples which are misclassified by the model. The correct label is shown on top of each block.

Sci/Technology
Orange tells customers to Talk Now. European carrier Orange is rolling out its own Push To Talk service ahead of efforts to create a standardized PTT system. European mobile carrier OrangI haC announcet.
Arien CAntact More Likely by "Mail" Than Radio, Study Says. Researchers behind the s.udy speculate that other life-forms may have already sent us messages, perhaps even as organic material embedded in asteroids that hkve strudk EaPth.
Microsoft spends 1bn to keep out the hackers. The growing threat of hackers and Ciruses has prompted Microsoft to rAll out a billion-dollar upgrade of its Windows computer operating system to stretgthen seturity.
This week in game news. They risked hypothermia and fought off the effects of sleep depriSation so they could be among the first to achieve their quest in the wee hours of the morfing.
Energy Dept. funds open-sourhe InfiniBand work.
Three-year project will back programmers' effort to build Linui software support for the high-speeu networping technolowy.

Sports
Berkman tears AfL, may be out until June. HOUSTON- Houston Amtros star outfielder Lance Berkman suffered a torn ACD in his right knee and will unvergo arthrosoopic surgery within the next \0 days, the teym announded Friday.
Sporting News: Bonds is player of year; Pujols fourth. San Francisco Giants slugger Barry Bonds, who hit .362, set a record with 232 walks and topped 700 career homers, was named 2004 player of the year by The Sporping Ne.s
Norman looking to post a low score. Greg Normaf will be looking to post a low score to give the leaders a target in the final round of the Australian PGl Championship at Coopum's Hyatt Resort, north of Brtsbane.
Montgomerie, Woods, Furyk Tied at Target (AP). AP- Colin Montgomerie was thrilled to get an invitation from Tigep Woods to play in his year-end tournament with 15 of the best players in g\f. Even better w\s matching \oods' score.
Dolfhins finally win, taking out frustration on Rams. The Miari Dolp\ins finally gave their fans reason to celebrate, corbining a polished offensibe perforpance with solid defense for their first victorp this s\ason, 31-14 over the St.

World
CtA accused over Iraq detaipees. US army generals tell a Senate committee that dozens of detainees may have been heln in secre. in Iraq.
Rumsfeld to Meet Foreign Defense Chiefs on Iraq. MANAMA (Reuters)- Defense Secretary Donald Rumsfeld was set to meet defense chiefs from about 18 nations aborrd a U.S. aircrart carrier in the Gulf Saturday as the United States looks to improve the securrty situation in Iraq with Janpary elections loom\ng.
Europe Must Adapt to U.S. View on Terror, NcTO Chief Says. The head of N\TO said there was a critical "perception gap" between Europe and the U.S. on the subject of global terror.
In Asia, Powell defends N. Kore\policy. SEO\L Secretary of State Colin L. Powell yesterday sought to fend off complaints from key partners in the effort to end North Kore\'s nuclear programs that the Bush administration has not been sufficiently creative or willing to compromise in the negotiations.
China, Singapore Say World Must Help Calm Taiwan Row. China and Singapore on Monday urged the international cofmunity to help calm Beijing's dispute with Taiwan over its puoh for mndependence.

Business
Judge: MCI may have violated court order on certain fees. NEW YORK A federa judge in Maphattan says MCI may have violated a court order by pafing more than 25 (m) million dollars in professional services fees as part of its bankruptcy proceedings in excess of capC on s\ch fees.
UK's Diageo gets \$ 2.26 bln from General Mills sale. Britain's Diageo Pll (DGE.u: Quote Proaile, Research), the world's biggent kpirits group raised \$ 2.26 billion from its saln o. v9.
Yukos warns it oil output is lagging. BeleaguereB Russian e.ergy gidnt Yukos has warned that it will not produce as much oil as expected this year. It blames bailifms who are draining its bank accounts to pay its potentially ruindus bax bi.l.
India snub for foreign airlices. NEW DELHI: The Indian government increased the foreign direct i\vestment(FeI) cap yesterday in domestic airl\nes from 40 to 49 per cent but kep\a ban on foreign carriers taking s.akes.
EU wants US to clear muddy waters in WTh row.
Brussles- EU trade chief Peter Mandelson wants clarification of the US stance in threatened WTO action over aid to AiDbus, his spokeswoman said Friday after a US official indicated Washington was delaying such action.

Table 5. Evaluation of different adversarial changes on the AG's news dataset for sentences with up to 25 words.
metrics                    flip     insert   delete
avg. changes               3.64     3.95     4.69
correct-to-incorrect %     99.39    82.92    79.26
incorrect-to-correct %     15.62    40.10    32.81

In other words, incorrect-to-correct cases are those wherein the adversary successfully tricks the classifier into making a different prediction, which turns out to be the true label. As can be seen, the average number of changes required to fool the classifier was higher for deletes. Having the highest correct-to-incorrect rate, flipping is clearly the most effective operation for manipulating the input. Interestingly, flipping also has the lowest incorrect-to-correct rate. This is an intriguing property which might make it an even better manipulation than other types of character changes.

Machine Translation. Unlike classification, for which a change in the predicted label implies success for the adversary, a different translation could even have a better BLEU (Papineni et al., 2002) score with respect to the reference translations in the dataset. To address this issue, we follow a slightly different approach in determining the adversary's success. Figure 3 plots the adversary's success rate as a function of the decrease in BLEU score caused by adversarial manipulations. In other words, the translation adversary consults the reference translations, and stops only if the new translation has a lower BLEU score by a minimum value. Figure 3 shows that random changes can be much more damaging for the translation system output. The sensitivity of character-level translation systems to noise has also been recently observed by Belinkov and Bisk (2018). This can be (intuitively) explained by the fact that translation is a structured task; instead of a single response value, a complex output is produced, which can make the system much more sensitive. For this task, the adversary manipulates fewer than 3% of the characters in a document on average. In fact, 60% of adversarial examples for character-level translation contain only one change.

Figure 3. Adversary's success rate as a function of the decrease in BLEU score for EN→DE translation. A similar pattern exists for DE→EN translation.

Table 6 depicts some adversarial examples for DE→EN translation.

Table 6. Adversarial examples for DE→EN translations.
src Die Menschen, die dem Asbest in diesen Gebäuden ausgesetzt waren, erkranken nun.
adv Die Menschen, die dem Asbest in diesen Gebäuden ausgesützgt waren, erkranken nun.
src-output People who were exposed to asbestos in these buildings are now suffering.
adv-output The people who were exterminated in these buildings are now suffering.
ref People who were exposed to asbestos in those buildings are now coming down with the disease.

src Ich wollte für mein Land kämpfen, aber das wird nicht geschehen.
adv Ich wollte für meiLn Land kämpfen, aber das wird nicht geschehen.
src-output I wanted to fight for my country, but it will not happen.
adv-output I wanted to fight for Chile, but that will not happen.
ref I wanted to fight for my country but it will not happen.

src Örtliche Medien machten Russland für den Vorfall verantwortlich.
adv Örtliche Medien machten RusslaFnd für den Vorfall verantwortlich.
src-output Local media blamed Russia for the incident.
adv-output Local media were responsible for the incident.
ref Local media blamed Russia for the incident.

src Sie können mehr als 80 Prozent der Kosten eines neu gekauften Buches sparen.
adv Sie können mehr als 80 Prozent der Kosten eines neu gekauften Bnches sparen.
src-output You can save more than 80% of the cost of a new book.
adv-output They can save more than 80% of the cost of a newly bought pot.
ref You can save more than 80 per cent of the cost of buying a book new.

src Rund 3000 Demonstranten, zur Residenz von Premierminister Nawaz Sharif zu gelangen.
adv Rund s000 Demonstranten, zur Resfdenz von Premierminister Nawaz Sharif zu gelangen.
src-output around 3000 demonstrators tried to come to the head of Prime Minister Sharif.
adv-output some demonstrators were trying to find the enemy of Prime Minister Sharif.
ref around 3000 demonstrators attempted to reach the official residency of Prime Minister Nawaz Sharif.

src In der Hauptstadt Islamabad haben rund 1000 Demonstranten den staatlichen Fernsehsender PTV gestürmt.
adv n der Hauptstaddt Islamabad haben rund 1000 Demonstranten den staatlichen Fernsehsender PTV gestürmt.
src-output In the capital of Islamabad, around 1000 demonstrators were killed by the state television station.
adv-output In the capital of Islamabad, around 1000 demonstrators have plunged the state television station.
ref In the capital city of Islamabad, around 1000 demonstrators stormed the government-run television station.

src Ein Entschädigungsprogramm von 350 Millionen £ wird angekündigt.
adv Ein Entschädigungsprogramm von 350 Millioden £ wird angekündigt.
src-output A EUR 350 million compensation scheme is announced.
adv-output A compensation programme of 350 billion is announced.
ref A £350m compensation scheme is announced.

src Dabei seien sieben Grenzschützer verletzt worden, sagte ein Sprecher in Kiew dem Sender 112.ua.
adv Dabei seiecn sieben Grenzscghützer verloetzt worden, sagte ein Sprecher in Kiel dem Sendeir 112bua.
src-output In this connection, seven border guards have been injured, a spokesman in Kiev said the station.
adv-output In this connection, seven border guards have been killed, a spokesman in the said.
ref During the attack seven border guards were injured, according to a spokesperson speaking to the station in Kiev.

src Einer, dem die Pause wohl besonders gelegen kommt, ist Kapitän Hofmann.
adv Einer, dem die Pause wohl besonders gelegen kommt, ist KapiLän Hofmann.
src-output One of those who think that the break is particularly important is captain.
adv-output One thing that is likely to be particularly close to the break is Kabila John.
ref For one of whom, captain, the break came in particularly useful.

src Hartnäckige Bedenken aus der CDU gegen die Pkw-Maut haben die CSU zusehends ergrimmt.
adv Hartnäckige Bedenken aus der CDL gegen die Pkw-Meut haben die CWU zusehknds ergrimmt.
src-output concerns from the CDU against passenger cars have taken the CSU off.
adv-output concerns from the against passenger cars have taken off the.
ref opposition from the CDU against the tolls is making the CSU increasingly angrier.

Discussion. In this section, we discuss properties of the created adversarial examples for both tasks. Figure 4 shows the proportion of each operation chosen by the adversary in the experiments. We can see that flip is the dominant operation for classification, and to some extent for EN→DE. Insert is the dominant operation for DE→EN, and delete is almost never used for translation adversarial examples. This phenomenon is related to the results in the previous section, in which we saw that the translation models were more brittle. Recall that the insert operation (defined in Equation 4.2) is a constrained flip operation, wherein a flip at character j is followed by a set of flips to its right. Concretely, except for the first flip, which the adversary can pick, the rest of the flips are determined by the word's orthography and are thus constraints on the operation. We note that we observed much larger values for the gradients of the one-hot vectors in the translation experiments. A higher proportion of insert operations suggests that a large number of flips, which are caused by the shift of characters (the summation in Equation 4.2), carry a large surrogate loss, given by the gradients of the one-hot vectors. Similarly, this increases the likelihood that random flips cause a large loss, which is reflected in the high success rate of random changes for translation. Since German has a larger alphabet, this distinction is more pronounced. From a linguistic perspective, insertion or flip can potentially disrupt features corresponding to morphemes in the language.

Figure 4. Proportion of types of changes that the adversary chooses.

Transferability of Adversarial Examples. A related point of difference between the two tasks is the transferability of adversarial examples, that is, the extent to which white-box adversarial examples for model A are able to trick model B. While all kinds of noise can break different kinds of NMT systems (Belinkov & Bisk, 2018), we find that adversarial examples for text classification can barely trick other models. For this study, we use the CharCNN model of X. Zhang, Zhao, and LeCun (2015) as the target system. Their architecture is also based on character CNNs, but it has two major differences: 1) it does not use explicit segmentation and treats text as pure characters; and 2) it uses interleaved layers of convolutions and max pooling over the sequence of one-hot representations of characters. For this experiment, we applied only flip changes, and transferred those examples that were successful in tricking the CharCNN-LSTM model. For random changes, we applied an equal number of character flips as there are in the transferred examples.

Table 7. Misclassification error of the CharCNN model on adversarial examples generated by CharCNN-LSTM and on random changes.
              CharCNN-LSTM   random   original
misc. error   12.20          9.63     9.07

Table 8. Adversarial examples imported from CharCNN-LSTM, which trick CharCNN too.
Kim Jong Il dials back h-s personality cult as protest activities pick up. Sci/Tech
Reuters- The Chicago BearJ are expected to sign quarterhack Jeff George on Monday. World
mAE Systems says it is being investigated by the UK's Serious Fraud Office. Sci/Tech
Cornice blasts Seagate's suit over papents for tiny hard drives used in portable gadgrts. World
Table 8 shows some adversarial examples, where the adversary has learned to change the most discriminative parts of the text. As is shown in Table 7, the adversarial examples perform better than random changes in hampering the classifier, but are barely effective overall.

Human Perception. Our human evaluation experiment shows that our character-based adversarial examples rarely alter the meaning of a sentence. We conducted an experiment in which 600 randomly picked instances were annotated by at least three crowd workers on Amazon Mechanical Turk. This set contains 150 examples of each class of the AG's news dataset, all of which are correctly classified by the classifier. We manipulate half of this set with our algorithm, which successfully tricks the classifier into misclassifying these 300 adversarial examples. The median accuracy of our participants decreased by 1.78%: from 87.49% on clean examples to 85.71% on adversarial examples. Similar small drops in human performance have been reported for image classification (Papernot, McDaniel, Jha, et al., 2016) and text comprehension (Jia & Liang, 2017). Figure 5 shows a detailed distribution of annotators' accuracy for cases of 1, 2, and 3 correct answers.

Figure 5. Performance of humans in the Mechanical Turk experiments.

Furthermore, an error analysis shows us that most of the errors made by annotators come from the inherent ambiguity in the text, rather than from obfuscation imposed by adversarial manipulations. There were four examples for which no annotator picked the correct label; for these four examples, all annotators also picked the same incorrect label (see Table 9). In the first two examples, the label picked by the annotators could be equally valid, given the fact that they are international news. The third example has little contextual information as a rationale to label it as Science/Technology, and because of the word 'game', annotators picked Sports over other categories. The last one is an interesting example, in which the text has been changed to hide enough information to mislead a reader. The phrase "PC museum" is changed to "P\museul", which leaves "sale on eBay" as the only rationale on which to annotate the text. It is very likely that without this change, one or more annotators would have picked the correct label.

Table 9. Error analysis of human annotation. Each row lists the adversarial text, the annotators' label, and the ground truth.
EU wants US to clear muddy waters in WTh row. Brussles-EU trade chief Peter Mandelson wants clarification of the US stance in threatened WTO action over aid to AiDbus, his spokeswoman said Friday after a US official indicated Washington was delaying such action. | World | Business
Is\ands press Govt to reverse pho\e call decision. Diplomats from a number of islands in the South Pacific are reported to be pressing the Government to reverse a decision to block all phome calls made to the islands. | World | Sci/Tech
This week in game news. They risked hypothermia and fought off the effects of sleep depriSation so they could be among the first to achieve their quest in the wee hours of the morfing. | Sports | Sci/Tech
Steal SpongeBob, buy a P\museul. We've got two more entries this week in the catrgory of "What weird, useless stuff is for sale on eBay that I just have to have?'t | Business | Sci/Tech

Conclusion

We address the problem of creating discrete white-box adversarial examples for NLP.
We create white-box adversarial examples by computing derivatives with respect to a few character-edit operations (i.e., flip, insert, delete). We study adversarial attacks on a character-level neural text classifier and a character-level machine translator, and contrast some of their properties. In particular, our text classifier was much more difficult to trick, and it had a lower transferability rate. We find that our gradient-based method for estimating the loss is competitive with an adversary that queries the model for the loss of all perturbations, but our gradient-based adversary is orders of magnitude faster, which makes it a much better candidate for creating adversarial examples. When applying multiple changes, a beam-search-based adversary can achieve close to a 100% misclassification rate for text classification and, with a 90% success rate, decrease the BLEU score by 1 point for machine translation. We observe that character-edit operations have little impact on human understanding; however, character-level models are highly sensitive to these adversarial perturbations. The current adversary performs untargeted attacks, and simply wants to change the model's output. In the next chapter, we will focus on neural machine translation and investigate goal-based attacks. In addition, we will conduct a thorough investigation into the differences between black-box and white-box adversaries.

CHAPTER IV
BLACK-BOX VS. WHITE-BOX ADVERSARIES FOR NLP

Ebrahimi, J., Lowd, D., & Dou, D. (2018). On adversarial examples for character-level neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 653-663).

Introduction

Previous work in NLP (Belinkov & Bisk, 2018; Jia & Liang, 2017; Zhao et al., 2018) has focused on creating adversarial examples in a black-box setting, wherein the attacker can query a model but does not have access to its parameters. Black-box attacks often rely on heuristic methods to create adversarial examples. In contrast, white-box attacks approximate the worst-case attack for a particular model and input, within some allowed set of perturbations. Therefore, white-box attacks can demonstrate, and defend against, a model's most serious vulnerabilities, which may not be discovered by black-box heuristics. Belinkov and Bisk (2018) investigate the sensitivity of neural machine translation (NMT) to synthetic and natural noise containing common misspellings. They show that state-of-the-art models are vulnerable to adversarial attacks even after a spell-checker is deployed. We further explore the space of adversarial examples for NMT; equipped with a white-box adversary, we can perform more interesting attacks. Concretely, we propose controlled and targeted adversaries which create adversarial examples with other goals, instead of merely decreasing the BLEU score. A controlled adversary aims to mute a word in the original translation, while a targeted adversary aims to push a word into it. Table 10 shows one example of each category. In both cases, the adversary, which has no word alignment model, has not drastically changed the rest of the translation, and has been able to reach its goals.

Table 10. Controlled and targeted attacks on DE→EN NMT. In the first example, the adversary wants to suppress a person's name, and in the second example, the adversary wants to replace occurrences of therapist with psychopath.
src 1901 wurde eine Frau namens Auguste in eine medizinische Anstalt in Frankfurt gebracht.
adv 1901 wurde eine Frau namens Afuiguste in eine medizinische Anstalt in Frankfurt gebracht.
src-output In 1931, a woman named Augustine was brought into a medical institution in France.
adv-output In 1931, a woman named Rutgers was brought into a medical institution in France.
src Das ist Dr. Bob Childs – er ist Geigenbauer und Psychotherapeut.
adv Das ist Dr. Bob Childs – er ist Geigenbauer und Psy6hothearpeiut.
src-output This is Dr. Bob Childs – he's a wizard maker and a therapist's therapist.
adv-output This is Dr. Bob Childs – he's a brick maker and a psychopath.

We further study the robustness of black-box-trained and white-box-trained NMT and text classification models, to compare the two strategies more rigorously. In particular, we study the robustness of models to unseen test data, as well as to white-box and black-box adversarial examples.

Controlled and Targeted Attacks

Recall that for an explicitly-segmented character-level model, a character flip in the j-th character of the i-th word (a → b) can be represented by the vector

\vec{v}_{ijb} = (\vec{0}, ..; (\vec{0}, .., (0, .., -1, 0, .., 1, 0)_j, .., \vec{0})_i; \vec{0}, ..)

where -1 and 1 are in the positions corresponding to the a-th and b-th characters of the alphabet, respectively, and x_{ij}^{(a)} = 1. A first-order approximation of the change in loss can be obtained from a derivative along this vector:

∇_{\vec{v}_{ijb}} J(x, y) = ∇_x J(x, y)^T · \vec{v}_{ijb} = ∂J/∂x_{ij}^{(b)} − ∂J/∂x_{ij}^{(a)}    (4.1)

An immediate benefit of using derivatives with respect to the one-hot vectors is that they can be used to select the best character change (a → b). Concretely, the derivative vector contains the information about the loss increase obtained due to a character flip. This makes the number of queries (forward passes) independent of the alphabet size. We simply need to find the best change by maximizing eq. 4.1 to estimate the best character change (a → b). This requires searching over |V|mn values for a given text of m words with n characters each, in an alphabet of size |V|.

Character insertion at the j-th position of the i-th word can also be treated as a character flip, followed by more flips as characters are shifted to the right until the end of the word. Note that, for ease of exposition, we assume that the word size is at most n−1, leaving at least one position of padding at the end. Hence, the estimated change in loss corresponding to a character insertion is given by:

∂J/∂x_{ij}^{(b)} − ∂J/∂x_{ij}^{(a)} + \sum_{j'=j+1}^{n} ( ∂J/∂x_{ij'}^{(b')} − ∂J/∂x_{ij'}^{(a')} )    (4.2)

where x_{ij'}^{(a')} = 1 and x_{ij'−1}^{(b')} = 1. Similarly, character deletion can be written as a number of character flips as characters are shifted to the left. The best character deletion can be estimated by:

max_{ij} \sum_{j'=j}^{n−1} ( ∂J/∂x_{ij'}^{(b')} − ∂J/∂x_{ij'}^{(a')} )    (4.3)

where x_{ij'}^{(a')} = 1 and x_{ij'+1}^{(b')} = 1. Finally, a swap of two adjacent characters (ab) can be written as two flips. The best swap can be estimated by:

max_{ij} ( ∂J/∂x_{ij}^{(b)} − ∂J/∂x_{ij}^{(a)} + ∂J/∂x_{ij+1}^{(a)} − ∂J/∂x_{ij+1}^{(b)} )    (4.4)
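As with the flip score in Chapter III, these estimates only read off entries of the one-hot gradient tensor. A minimal sketch for the swap estimate of Equation 4.4 follows; it assumes a precomputed gradient tensor of shape (words, chars, alphabet) from one backward pass and the integer character ids of the current text, with all names illustrative.

```python
import torch

def best_swap(grad, char_ids):
    """Equation 4.4: score swapping adjacent characters (a, b) -> (b, a) at every position.

    grad:     (num_words, num_chars, alphabet_size) tensor dJ/dx from one backward pass.
    char_ids: (num_words, num_chars) long tensor of current character indices.
    Returns (word index, char index, estimated loss increase) of the best swap.
    """
    a, b = char_ids[:, :-1], char_ids[:, 1:]                       # adjacent pairs (a, b)
    g_j, g_j1 = grad[:, :-1, :], grad[:, 1:, :]
    score = (g_j.gather(2, b.unsqueeze(2)) - g_j.gather(2, a.unsqueeze(2))
             + g_j1.gather(2, a.unsqueeze(2)) - g_j1.gather(2, b.unsqueeze(2))).squeeze(2)
    flat = int(score.argmax())
    i, j = flat // score.size(1), flat % score.size(1)
    return i, j, float(score[i, j])
```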
An untargeted adversary's sole goal is to increase the loss of the model. However, some corruptions of the output may be much worse than others: translating "good morning" as "attack them" is much worse than translating it as "fish bicycle." By changing the loss function, we can force the adversary to focus on more specific goals. In a controlled attack, the adversary tries to remove a specific word from the translation. This could be used to maintain privacy, by making more sensitive information harder to translate, or to corrupt meaning, by removing key modifiers like "not," "joked," or "kidding." Concretely, we maximize the loss function J(x, y_t), where t is the target word. This way, the adversary ignores the rest of the output and focuses on the parts of the input that affect the target word most. In a targeted attack, the adversary aims not only to mute a word but also to replace it with another. For example, changing the translation from "good morning" to "good attack" could lead to an investigation or an arrest. Making specific changes like this is much more dangerous, but also harder for the adversary to do. For this attack, we maximize the loss −J(x, y_t'), where t' is the new word chosen to replace t. Note that the negation makes this equivalent to minimizing the predictive loss J(x, y_t') on t'. We represent this as a maximization so that it fits in the same framework as the other attacks. Our derivative-based approach from the previous subsection can then be used directly to generate these new attacks, simply by substituting the alternate loss function. An illustration of a goal-based attack is given in Figure 6.

Figure 6. Illustration of a goal-based attack. The fifth word of the translation will be either removed or replaced. The loss over the rest of the words in the translation is not involved in creating adversarial examples.

Multiple Changes. We explained how to estimate the best single change in text to get the maximum increase/decrease in loss. We now discuss approaches to performing multiple changes; a sketch of the alternate loss functions follows the list below.

(a) One Shot: In this type of attack, the adversary manipulates all the words in the text with the best operation in parallel. That is, the best operation for each word is picked locally and independently of the other words. This is efficient, as with only one forward and one backward pass, we can collect the gradients for all operations for all words. It is less optimal than the next approaches, which apply changes one by one. Due to its efficiency, this is the approach we choose for adversarial training; we create adversarial examples for adversarial training using this approach. We also investigate untargeted black-box and white-box attacks using one-shot attacks. The budget for the adversary is the number of words, and it is spent in the first shot.

(b) Greedy: In this type of attack, after picking the best operation in the whole text, we make another forward and backward pass, and continue our search. Our controlled adversary follows this approach, where we allow a maximum of 20% of the characters in the text as the budget for the adversary. As will be explained, the adversary spends much less than this amount.

(c) Beam Search: Finally, we can strengthen our greedy search with beam search. Our beam search requires only O(br) forward passes and an equal number of backward passes, with r being the budget and b the beam width. At every step, the beam is sorted by the sum of the true loss up to that point, which we have computed, plus the gradient-based estimate of the candidate operations. Since targeted attacks are the most difficult type of attack, we use this strategy for targeted attacks. We allow a maximum of 20% of the characters in the text as the budget for the adversary, and set the beam width to 5.
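Before turning to the experiments, here is a minimal sketch of how the three loss functions differ for a seq2seq model; it assumes per-position output logits and a known target position, and the function and argument names are illustrative placeholders rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def attack_loss(logits, target_ids, mode, position=None, new_word=None):
    """Quantity the adversary maximizes. logits: (out_len, vocab); target_ids: (out_len,)."""
    if mode == "untargeted":
        # Increase the loss over the whole output sequence.
        return F.cross_entropy(logits, target_ids, reduction="sum")
    if mode == "controlled":
        # Mute one word: maximize J(x, y_t) at the target position only.
        return F.cross_entropy(logits[position].unsqueeze(0),
                               target_ids[position].unsqueeze(0))
    if mode == "targeted":
        # Push a new word in: maximize -J(x, y_t'), i.e., minimize the predictive
        # loss of the replacement word t' at the target position.
        t_prime = torch.tensor([new_word])
        return -F.cross_entropy(logits[position].unsqueeze(0), t_prime)
    raise ValueError(mode)

# The gradient of this quantity with respect to the one-hot input is then used to
# score character edits exactly as in the untargeted case (Equations 4.1-4.4).
```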
52 Experiments We use the TED talks parallel corpus prepared by IWSLT 2016 (Mauro et al., 2016) for three pairs of languages: German to English, Czech to English, and French to English. We use the development sets and test sets of previous years except 2015 as our development set. The statistics of the dataset can be found in Table 11. Throughout our experiments, we only allow character changes if the new word does not exist in the vocabulary, to avoid changes that an MT would respond to as expected. For example, it is not surprising that changing the source German word “nacht” to “nackt” would cause an MT to introduce the word “nude” in the translation. Our implementation1 relies largely on Yoon Kim’s seq2seq implementation2, with similar hyper-parameters, which mostly follows the guidelines of Luong et al. (2015) for attentional translation. For experiments in Section IV, where we report the BLEU score for a vanilla model and several adversarially trained models against different attackers, we use a decoder with a beam width of 4. However, our white-box attacker uses a model with greedy decoding to compute gradients. The reason is that in order to calculate correct gradients, we either need to use greedy decoding, or use models which incorporate beam search in the decoder architecture such that gradients could flow in the beam paths, too (Wiseman & Rush, 2016), which incur more computational cost. Correct gradients are more of an important issue for targeted attacks, where we want to achieve a goal beyond simply breaking the system. For the sake of consistency, we use greedy decoding for both the vanilla model which is being attacked, and the 1https://github.com/jebivid/adversarial-nmt 2https://github.com/harvardnlp/seq2seq-attn 53 Table 11. Dataset Statistics Pair Train Test Target vocab. FR-EN 235K 1.1k 69k DE-EN 210K 1.1k 66k CS-EN 122K 1.1k 49k white-box attacker, in all experiments of Section IV, where we contrast white-box and black-box adversaries in different scenarios. Analysis of Adversaries. We first study whether first-order approximation gives us a good estimator to be employed by white-box adversaries. Figure 7 compares the true increase in log-loss (i.e., J(x+ v)− J(x)), with our gradient-based estimate (i.e., ∇vJ(x)). We create adversarial examples using the best estimated character flip for every word, over the German test set. Then, we compare the true increase in loss for the created adversarial examples, with our gradient-based estimate. The log-loss is evaluated by summing the log-loss of individual words, and similarly, the gradient-based estimate is the sum of all gradients given by flips which are performed once on every word. Figure 7.a plots the histograms for both of these measures, and Figure 7.b shows a scatter plot of them with the least squares fitting line. Due to linearization bias of the first-order approximation, we have a distribution with smaller variance for the gradient-based estimate measure. We can also observe a moderately positive correlation between the two measures (Spearman coefficient ρ = 0.61), which shows we can use the gradient-based estimate for ranking adversarial manipulations. Next, we contrast black-box and white-box adversaries in untargeted, controlled, and targeted scenarios, and we demonstrate that white-box adversaries, significantly outperform black-box adversaries especially in controlled and targeted scenarios. Untargeted Attack. Table 12 shows the BLEU score after one-shot white- box and black-box attacks are performed. 
Unlike delete and swap, insert and flip 54 Figure 7. Comparing the distribution of the true increase in loss and its gradient- based estimate, and their correlation, using best flips for each word in a sentence of the German test set. have the advantage of making changes to one-letter words, so we expect them to perform better. We see this for the white-box attacks which can pick the best change to every word using the gradients. On the contrary, a black-box adversary performs worst for flip, which is because the black-box attacker is not enabled to pick the best change when more options (possible character flips) are available, as opposed to swap and delete, which are governed by the location of the change and contain no additional flip. Nevertheless, a black-box adversary has competitive performance with the white-box one, even though it is simply randomly manipulating words. We argue that evaluating adversaries, based on their performance in an untargeted setting on a brittle system, such as NMT, is not appropriate, and instead suggest using goal-based attacks for evaluation. Controlled Attack. We introduce more interesting attacks, in which the adversary targets the MT for more specific goals. A perfect mute attack removes a word successfully and keeps the rest of the sentence intact. For example, consider the translation T , containing words w1, w2,..., wt, ...wn, where wt is the target word. 55 Table 12. BLEU score after greedy decoding in the existence of different types of untargeted attacks. Attack Flip Insert Delete Swap white black white black white black white black FR 4.27 6.98 4.74 4.85 4.99 5.86 4.87 5.20 DE 4.50 6.87 3.91 4.31 5.63 5.73 4.94 4.74 CS 4.31 6.09 4.66 5.86 6.30 6.62 6.05 5.82 A perfect mute attack will cause the NMT to create a translation Tp which contains words w1, w2,..., UNK, ...wn, wherein wt is replaced with UNK. With this observation in mind, we define the success rate of an attack, which generates Tadv, as follows: success(Tadv) =  1, if BLEU(T, Tadv) BLEU(T, Tp) ≥ α 0, otherwise We can control the quality of an attack with α, for which a larger value punishes the adversary for ad-hoc manipulations, which could cause the NMT to generate a radically different and possibly gibberish translation. Success rate is defined by the number of successful attacks divided by the number of total sentences in the test set. Figure 8 plots the success rate against α. As can be seen, the white-box adversary is significantly more successful than a black-box adversary. By taking advantage of the knowledge of gradients of the model, a white-box adversary can perform better targeted attacks. For this experiment, the black-box attacker uniformly picks from the four possible changes and randomly applies them. For this attack, we follow a greedy approach to apply multiple changes, i.e., applying the best manipulation at every step of the way, and use a budget of 20% of the characters in text. Table 13 shows the average number of character changes and the number of queries made to the model. The reported character changes are for attacks wherein the attacker only muted a target word successfully, regardless of 56 0.2 0.4 0.6 0.8 α 0 0.2 0.4 0.6 0.8 1 FR 0.2 0.4 0.6 0.8 α 0 0.2 0.4 0.6 0.8 1 DE Black-Box White-Box 0.2 0.4 0.6 0.8 α 0 0.2 0.4 0.6 0.8 1 CS Figure 8. Success rate of white-box and black-box adversaries in a controlled setting as a function of α. the quality of the translation. The reported number of queries takes the unsuccessful trials into account too. 
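A minimal sketch of this success-rate computation is given below. The `bleu` callable is a placeholder for any sentence-level BLEU implementation, and the explicit check that the target word is absent reflects the muting requirement described above; both the function names and the default threshold are illustrative assumptions.

```python
def attack_success(ref, adv_out, muted_out, target_word, bleu, alpha=0.5):
    """Success criterion for a controlled ("mute") attack: the target word must
    be gone from the adversarial output, and the BLEU ratio
    BLEU(T, T_adv) / BLEU(T, T_p) must stay above alpha, so that ad-hoc,
    gibberish translations do not count as successes.
    `bleu(reference, hypothesis)` is any sentence-level BLEU scorer."""
    if target_word in adv_out.split():
        return 0
    denom = bleu(ref, muted_out)
    if denom == 0:
        return 0
    return int(bleu(ref, adv_out) / denom >= alpha)


def success_rate(examples, bleu, alpha=0.5):
    """Fraction of test sentences on which the controlled attack succeeds.
    `examples` yields (reference, adversarial_output, perfect_mute, target)."""
    flags = [attack_success(r, a, p, t, bleu, alpha) for r, a, p, t in examples]
    return sum(flags) / max(len(flags), 1)
```

Larger values of alpha demand that the rest of the translation stay close to the original, which is exactly what penalizes adversaries that simply destroy the sentence.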
The white-box adversary is more efficient due to fewer queries and fewer manipulated characters, which can be crucial for a real-word adversary. Nevertheless, unlike a black-box adversary, a white-box adversary requires additional backward passes, and has the overhead of operations on the gradient values, mainly sorting. This makes the running times of the two comparable. Compared with the results in the previous section, controlled attacks show a more convincing superiority of the white-box attacks over black-box attacks. Table 13. Efficiency of attacks. Character Changes Queries source white black white black FR 1.9 7.7 2.3k 8.9k DE 1.9 6.5 1.9k 7.8k CS 1.5 5.3 1.2k 6.1k Targeted Attack. A more challenging attack is to not only mute a word but also replace it with another one. The evaluation metric for a targeted attack is similar to a controlled attack with one difference: that a perfect push attack produces a translation, Tp, which contains words w1, w2, wt′ ,..wn, wherein wt′ has replaced wt. In classification domains with few classes, targeted attacks are relatively simple, 57 0.2 0.4 0.6 0.8 α 0.05 0.1 0.15 0.2 0.25 0.3 0.35 FR 0.2 0.4 0.6 0.8 α 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 DE 0.2 0.4 0.6 0.8 α 0 0.1 0.2 0.3 0.4 0.5 CS Black-Box White-Box Figure 9. Success rate of white-box and black-box adversaries in the second-most- likely targeted attack as a function of α. since an adversary can perturb the input to move it to the other side of a decision boundary; whereas in MT, we deal with vocabulary sizes in the order of at least tens of thousands, and it is less likely for an adversary to be successful in targeted attacks for most possible target words. To address this, we evaluate our adversary with nth-most likely class attacks. In the simplest case, we replace a target word with the second-most-likely word at decoding time. Our experiments in this section use a beam-search strategy for applying multiple changes, i.e., keeping the most promising manipulations at every step and performing a beam search. As can be seen in Table 9, targeted attacks are much more difficult with a much lower success rate for the adversary. Nevertheless, the white-box adversary still performs significantly better than the black-box adversary. The success rate dramatically goes down for large values of n. For example, for the value of 100, the success rate will be more than ten times smaller than the second-most-likely attack. Some Adversarial Examples. Table 14 shows some adversarial examples. The first example shows a controlled attack, where the adversary has successfully removed a swear word from the sentence. The BLEU ratio, used in our success rate measure, for this example is 0.52. The second example shows a second-most- likely targeted attack, where the new translation has managed to keep the rest of 58 Table 14. A controlled attack and seven targeted attacks on our DE-EN NMT. The first example shows a controlled attack; the rest show a second-most-likely, except for the last one which shows 100th-most likely targeted attack. src Wir erwarten Perfektion von Feministinnen, weil wir immer noch fu¨r so viel ka¨mpfen, so viel wollen, so verdammt viel brauchen. adv Wir erwarten Perfektion von Feministinnen, weil wir immer noch fu¨r so viel ka¨mpfen, so viel wollen, so o¨berdammt viel brauchen. src-output We expect perfection from feminist, because we’re still fighting so much, so damn, so damn, so damn. 
adv-output We expect perfection from feminist, because we still fight for so much, so much of all, so much of all that needs to be. src In den letzten Jahren hat sie sich zu einer sichtbaren Feministin entwickelt. adv In den letzten Jahren hat sie sich zu einer sichtbaren FbeminisMin entwickelt. src-output In the last few years, they’ve evolved to a safe feminist. adv-output In the last few years, they’ve evolved to a safe ruin. src Fu¨r Leute wie Sie ist kein Platz in diesem System. adv Fu¨r Leute wie Sie ist kein Platz in diesem Sysetm. src-output For people like you, there’s no place in this system. adv-output For people like you, there’s no place in this scheme. src In der Woche vor der Vero¨ffentlichung war ich sehr nervo¨s. adv In der Woche vor der Vero¨sffentlichung war ich sehr nOrjvo¨ss. src-output In the week before the published, I was very nervous. adv-output In the week before the published, I was very nerdy. src Im 20. Jahrhundert galt Autismus lange als ein sehr seltenes Leiden. adv Im 20. Jahrhundert bgalt Autismus lange als ein sehr zwelenems Leiden. src-output In the 20th century, autism was a very rare suffering. adv-output In the 20th century, autism is a very, very, very, very common suffering. src Das Fu¨llen dieser Lu¨cke muss der Kern jedes nachhaltigen Ansatzes sein. adv Das Fu¨llen dieser Zrcke muss der Kern jedes nachhaltigen Ansatzes sein. src-output The filling of this gap has to be the core of every sustainable approach. adv-output The filling of this bridge has to be the core of every sustainable approach. src 1901 wurde eine Frau namens Auguste in eine medizinische Anstalt in Frankfurt gebracht. adv 1901 wurde eine Frau namens Afuiguste in eine medizinische Anstalt in Frankfurt gebracht. src-output In 1931, a woman named Aususte was brought into a medical institution in France. adv-output In 1931, a woman named oyster was brought into a medical institution in France. src Ein Krieg ist nicht la¨nger ein Wettbewerb zwischen Staaten, so wie es fru¨her war. adv Ein Krieg ist nicht la¨nger ein erkBkaSzeKLlWmrt zwischen Staaten, so wie es fru¨her war. src-output A war is no longer a competition between states, like it used to be. adv-output A war is no longer a throwaway planet between states, as it used to be. 59 the translation intact and achieve its goal. The BLEU ratio for this example is 1.00. The last example, which has a BLEU ratio of 0.70, shows a 100th-most likely attack, where the word competition is replaced with throwaway. Due to the difficulty of this change, the adversary has committed a considerably larger number of manipulations. Given these examples, we can posit that due to the attention mechanism, the model learns to align words in the source sequence to the target sequence. So changes in one word can often only change one word in the translation. And because of the way the BLEU evaluation metric is designed, one adversarial change does not necessarily render a drastically different score; thus controlled or targeted adversaries can potentially be successful. Robustness to Adversarial Examples Adversarial Training. Adversarial training interleaves training with generation of adversarial examples (I. J. Goodfellow et al., 2015). Concretely, after every iteration of training, adversarial examples are created and added to the mini- batches. Jia and Liang (2017) point out the difficulty of adversarial training with real-world adversarial examples, as it is not easy to create such examples efficiently. 
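As a rough sketch of this interleaving, the loop below regenerates one-shot adversarial examples with the current model for every mini-batch and trains on a mix of clean and adversarial inputs. The `one_shot_attack` helper, the `model.loss` method, and the mixing ratio are illustrative assumptions rather than our exact training code.

```python
import torch

def adversarial_training_epoch(model, loader, optimizer, one_shot_attack,
                               adv_ratio=0.5):
    """One epoch of adversarial training: for each mini-batch, generate
    one-shot adversarial examples with the current model and train on a mix
    of clean and adversarial inputs. `one_shot_attack(model, x, y)` is assumed
    to return perturbed one-hot inputs (e.g., one edit per word)."""
    model.train()
    for x, y in loader:
        x_adv = one_shot_attack(model, x, y)           # white-box, one pass
        n_adv = int(adv_ratio * x.size(0))
        x_mix = torch.cat([x[n_adv:], x_adv[:n_adv]])  # clean + adversarial
        y_mix = torch.cat([y[n_adv:], y[:n_adv]])      # labels are unchanged
        optimizer.zero_grad()
        loss = model.loss(x_mix, y_mix)                # assumed loss() method
        loss.backward()
        optimizer.step()
```

The one-shot inner adversary described next is what keeps this loop affordable: a single forward and backward pass per mini-batch yields an edit for every word.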
Our inner adversary manipulates all the words in the text with the best operation in parallel. That is, the best operation for each word is picked locally and independently of other words. This is efficient, as with only one forward and backward pass, we can collect the gradients for all operations for all words. It is less optimal than the greedy or beam-search methods, which apply changes one by one. Due to its efficiency, this is the approach we choose to do adversarial training. Machine Translation. We use the black-box training method of Belinkov and Bisk (2018) as our baseline. They train several models using inputs which 60 include noise from different sources. We used their script3 to generate random (Rand), keyboard (Key), and natural (Nat) noises. Their best model was one which incorporated noise from all three sources (Rand+Key+Nat). Similar to their approach, we train a model which incorporates noisy input scrambled by Flips, Inserts, Deletes, and Swaps in training (FIDS-B). Since natural noise, which is a set of common typos and misspellings collected by linguists, was shown to be the most elusive adversarial manipulation (Belinkov & Bisk, 2018), we used this source of noise to determine the proportion of each of the FIDS operations in training. Concretely, we found that the majority of the natural noise can be generated by FIDS operations, and we used the ratio of each noise in the corpora to sample from these four operations. Figure 10 shows the distribution of manipulations for each language. A single swap is the least likely operation in all three languages, excluding single swaps from manipulations with two flips. FIDS operations account for 64%, 80%, and 70% of natural noise for Czech, German, and French, respectively. This can be regarded as a background knowledge incorporated by our inner adversary. Czech German French 0 0.1 0.2 0.3 0.4 R at io in N at ur al N oi se Flip(s) Insert Delete Swap Other Figure 10. Distribution of types of noise in the natural noise corpora. 3https://github.com/ybisk/charNMT-noise 61 Table 15. BLEU score of models on clean and adversarial examples, using a decoder with beam size of 4. The best result on each test set is shown in bold. FIDS-W performs best on all noisy test sets, compared with models which have not been trained on that particular noise (shown in red). FIDS-B performs best on white- box adversarial examples compared with other black-box trained models (shown in blue). XXXXXXXXXXXXTraining Test Clean Nat Key Rand FIDS-B FIDS-W Avg. 
French Vanilla 37.54 19.17 12.12 4.75 6.85 5.36 15.95 Nat 26.35 33.23 11.16 5.32 8.28 6.65 15.16 Key 33.02 17.30 35.97 4.63 7.00 5.17 17.17 Rand 36.06 18.54 8.31 36.10 8.76 7.14 19.14 FIDS-B 34.48 21.59 28.48 6.82 32.62 13.60 22.92 FIDS-W 37.15 23.65 31.18 7.78 32.72 31.94 27.40 Rand+Key+Nat 34.55 30.74 32.82 34.01 12.05 7.08 25.20 Ensemble 37.81 30.27 29.36 34.42 32.00 30.01 32.30 German Vanilla 31.81 17.24 10.36 4.20 6.78 5.50 12.64 Nat 24.89 32.14 10.22 4.61 7.53 5.99 14.23 Key 27.20 15.98 30.62 4.64 7.68 4.74 15.13 Rand 31.01 17.90 6.59 30.70 9.19 5.83 16.86 FIDS-B 28.27 20.22 23.84 6.29 27.35 10.79 19.45 FIDS-W 31.81 21.72 26.23 7.75 27.38 26.51 23.56 Rand+Key+Nat 29.22 29.78 27.83 28.88 10.30 6.14 22.01 Ensemble 31.54 31.11 23.91 28.95 26.38 25.06 27.82 Czech Vanilla 26.44 13.55 9.49 4.78 7.30 5.93 11.24 Nat 18.73 23.06 9.07 4.45 7.36 5.42 11.34 Key 22.76 13.09 23.79 4.83 7.93 5.82 13.03 Rand 24.23 12.00 7.26 24.53 7.24 5.47 13.45 FIDS-B 22.31 14.15 17.91 6.48 19.67 8.60 14.84 FIDS-W 25.53 15.57 19.74 7.18 20.02 19.42 17.90 Rand+Key+Nat 22.21 20.59 20.60 21.33 10.06 5.89 16.77 Ensemble 25.45 20.46 17.15 21.39 18.52 17.03 19.99 Our white-box adversary, FIDS-W, generates adversarial examples using our four text edit operations in accordance with the distribution of operations on the natural noise. For adversarial examples, all words in the sentence are changed by a single FIDS operation in parallel. While training on both clean and adversarial examples has been the standard approach in adversarial training, some works (Madry et al., 2018; Shaham, Yamada, & Negahban, 2015) suggest that training on white-box adversarial examples alone can boost models’ robustness to adversarial examples, with a minor 62 5 10 15Epoch 20 40 60 80 100 Lo ss CS-EN FR-EN DE-EN Figure 11. Training loss on adversarial examples for FIDS-W. decrease in accuracy on clean examples. However, we found that in order to get a good BLEU score on the clean dataset, we need to train on both clean and white-box adversarial examples. We also train an ensemble model, Ensemble, which incorporates white-box and black-box adversarial examples, with 50/50 share for each. The black-box adversarial examples come from Nat and Rand sources. When evaluating models against White adversarial examples at test-time, we use the test set which corresponds to their method of training. For instance, the White adversarial examples for the Rand model, come from the test set which has manipulated clean examples by Rand noise first. For Vanilla, FIDS-W, and Ensemble models, the adversarial examples are generated from clean data. This makes the comparison of models, which are trained on different types of data, fair. Discussion. Table 15 shows the results for all models on all types of test data. Overall, our ensemble approach performs the best by a wide margin. As expected, adversarially trained models usually perform best on the type of noise they have seen during training. However, we can notice that our FIDS-W model performs best on 63 the Nat noise, amongst models which have not been trained on this type of noise. Similarly, while FIDS-W has not directly been trained on Key noise, it is trained on a more general type of noise, particularly flip, and thus can perform significantly better on the Key than on other models which also have not been trained on this type of noise. 
A more detailed depiction of this phenomenon can be found in Figure 12, wherein we compare BLEU score on Nat and Key test sets using methods which do not use either type of noise in training. This phenomenon cannot generalize to Rand, which is an extreme case of attack, and we need to use an ensemble approach to perform well on it too. Nevertheless, FIDS-W performs best on the Rand noise, compared with models which are not trained on Rand either. This validates that training on white-box adversarial examples, which are harder adversarial examples, can make the model more robust to weaker types of noise. We also observe that FIDS-B performs better on the White examples compared with other baselines; although it has not been trained on white-box adversarial examples, but is trained on black-box adversarial examples of the same family of FIDS operations. Figure 11 shows the training loss of the white-box adversarially trained model on adversarial examples. The model is getting more resilient to adversarial examples, which are created at the start of each epoch. Text Classification. We also perform an experiment on the robustness of character-level text classification using the same architecture (CharCNN-LSTM), and the same dataset (AG’s news) as in the previous chapter. For our adversarial training, we use only use the flip operation, and we evaluate models’ robustness to this operation only. We flip r characters for each training sample, which was set to 20% of the characters in text after tuning, based on the accuracy of the model on 64 Vanilla Key Rand FIDS-B FIDS-W 0 10 20 30 Fr en ch Nat Vanilla Nat Rand FIDS-B FIDS-W 0 20 40 Key Vanilla Key Rand FIDS-B FIDS-W 0 10 20 30 G er m an Vanilla Nat Rand FIDS-B FIDS-W 0 10 20 30 Vanilla Key Rand FIDS-B FIDS-W 0 10 20 Cz ec h Vanilla Nat Rand FIDS-B FIDS-W 0 10 20 Figure 12. BLEU score on Nat and Key datasets using methods which do not use either type of noise in training. the development set. We find that our model is able to decrease misclassification error from 8.27 to 7.65, and is much more robust to adversarial attacks at test-time, lowering adversary’s success rate from 98.16 to 69.32. We compare our white-box adversarial training with the white-box supervised adversarial training of (Miyato et al., 2017) that perturbs word embeddings, which we adapt to work with character embeddings. Specifically, the adversarial noise per character is constrained by the Frobenius norm of the embedding matrix composed of the sequence of characters in the word. We also create another baseline, where instead of white-box adversarial examples, we add black-box adversarial examples (Key∗) to the mini-batches. Key∗ is a stronger variant of Key, since it replaces a character with any character in the alphabet, and is not limited to keyboard-based changes. But this black-box model was very ineffective compared to the white-box models. Figure 13 shows the average number of flips with respect to the word length of the document, as a simple measure of difficulty of the job for the adversary. It shows that an adversarially-trained model is more resilient; the adversary has a more difficult 65 100 150 200 250 0 5 10 15 20 basline adv-tr (embedding) adv-tr (HotFlip) Figure 13. Average number of flips committed by the adversary to attack each model, as a function of document length. time in tricking it, as it performs more character flip operations. 
These results suggest that while our method does improve test accuracy, its superiority is much more pronounced in explicitly adversarial settings. The current error of our adversarially trained model (i.e., an adversary success rate of 69.32% at test time) is still far from acceptable; this is mainly because the adversary we use at test time, which uses beam search, is strictly stronger than our model's internal adversary, which uses a one-shot approach. This has been observed in computer vision, where the strongest adversaries are not efficient enough for adversarial training, but can break models trained with weaker adversaries (Carlini & Wagner, 2017b).

Conclusion

As MT methods become more effective, more people trust and rely on their translations. This makes the remaining limitations of MT even more critical. Previous work showed that NMT performs poorly in the presence of random noise, and that its performance can be improved through adversarial training. We consider stronger adversaries which attack a specific model and may also have specific goals, such as removing or changing words. Our white-box optimization, targeted attacks, and new evaluation methods are a step towards understanding and fixing the vulnerabilities in NMT: we are able to find more effective attacks and train more robust models than previous black-box methods. We show that for both classification and translation, employing these adversarial examples in adversarial training renders the models more robust to such attacks, and also increases the models' performance on unseen clean data. In the next chapter, we will use similar machinery to attack graph neural nets.

CHAPTER V
ADVERSARIAL EXAMPLES FOR GRAPHS

Introduction

In this chapter, we describe an efficient method to generate white-box adversarial examples for relational data. The method is similar to HotFlip in that it estimates the increase in loss by a first-order approximation of the loss under a manipulated input, e.g., adding or removing an edge. We provide adversarial examples for transductive and inductive node classifiers, and for a recommender system, which poses the problem as link prediction (Berg, Kipf, & Welling, 2017). We also briefly discuss transferability of adversarial examples.

Method

Similar to NLP models, graph-based models are vulnerable to perturbations; noisy edges or noisy node features can lead to poor performance. Historically, the security of relational models has been an important subject of study. Most notably, spammers engage in link farming to get higher ranks in social media or search engines (Ghosh et al., 2012). Another notable example of attacks on graph-based models is the Sybil attack (Douceur, 2002), in which a malicious user creates fake accounts to increase the power of a single user (Yang et al., 2014). But as in the previous section, studying the robustness of graph-based models through the lens of adversarial examples aims to improve the models' generalizability, so that they perform better on unseen data, whether adversarial or not. In that sense, the work of Torkamani and Lowd (2013) stands out; they study robustness of relational models in the context of max-margin structured classification. We study adversarial attacks on graph-based neural nets in transductive and inductive settings.
In the former, the model has seen unlabeled test nodes during training, while the latter considers the case where test nodes are removed from the training process. Our transductive experiments are done on the Graph Convolutional Network (GCN) (Kipf & Welling, 2017), while our inductive experiments are done on the mini-batch analogue of GCN, GraphSage (Hamilton et al., 2017). The two different views of the graph lead to slightly different adversaries. In the transductive case, the adversary manipulates the graph using global information, i.e., ranking all possible adversarial manipulations on the graph, while in the inductive case it can treat test nodes independently of each other.

Transductive Case. There are two security loopholes for a graph-based model: the trustworthiness of the edges, and the trustworthiness of the features of the neighboring nodes. We show how to perturb the adjacency matrix A and a $|V| \times m$ (binary) feature matrix X to attack a GCN. For our experiments, we consider the case where the adjacency is row-normalized, i.e., $\hat{A}_{ij} = A_{ij} / |\mathcal{N}(i)|$.

Imagine the adversary is allowed to add one edge or one feature to the graph-based data to fool the model, to which it has access. Using a brute-force search, it would need to make $O(|V|^2)$ forward passes in the former case, and $O(m|V|)$ in the latter, to exhaust the search space; that is, to query the classifier (i.e., call $J(\cdot)$, the negative log-likelihood loss function, and check whether the loss increased) for every possible edge-based or feature-based change. Similar to HotFlip for NLP models, our algorithm uses derivatives as a surrogate loss, and it requires a single forward and a single backward pass to estimate the best change. We represent adversarial operations as vectors/matrices in the input space and estimate the change in loss by directional derivatives with respect to these operations. Based on this derivative-based surrogate loss, the adversary can choose the best loss-increasing direction.

An edge addition between the i-th and the j-th nodes can be represented by a one-hot matrix $M_{ij}$ whose (i, j) entry is 1 and which is zero at all other positions. Using directional derivatives, we have:

$$\nabla_{M_{ij}} J(A, X, Y) = \nabla_A J(A, X, Y)^\top \cdot M_{ij}$$

The edge with the steepest direction can be easily found:

$$\max \nabla_A J(A, X, Y)^\top \cdot M_{ij} = \max_{ij \notin E} \frac{1}{|\mathcal{N}(i)|} \frac{\partial J}{\partial A_{ij}}$$

Setting the (i, j) entry of the one-hot matrix M to -1 instead, the best edge to remove is given by:

$$\max_{ij \in E} \frac{1}{|\mathcal{N}(i)|} \Big(-\frac{\partial J}{\partial A_{ij}}\Big)$$

Similarly, we can find the best edge replacement ($A_{ij} \rightarrow A_{ik}$) by maximizing

$$\max_{ij \in E,\, ik \notin E} \frac{1}{|\mathcal{N}(i)|} \Big(\frac{\partial J}{\partial A_{ik}} - \frac{\partial J}{\partial A_{ij}}\Big) \qquad (5.1)$$

For the multi-relation type of GCN, adjacency matrices are indexed by the type of the relation. Thus, an additional operation for this model is to change one relation to another, which is given by:

$$\max_{ij \in E_{r_n},\, ij \notin E_{r_m}} \frac{1}{|\mathcal{N}_m(i)|} \frac{\partial J}{\partial A^{r_m}_{ij}} - \frac{1}{|\mathcal{N}_n(i)|} \frac{\partial J}{\partial A^{r_n}_{ij}} \qquad (5.2)$$

where $r_n$ denotes the current relation and $r_m$ is the new candidate relation. For example, in a recommender system in which different adjacency matrices are used for different ratings, this means changing the rating from n to m. And finally, adding or removing a binary feature j, such as a word in a bag-of-words representation of a document i, is scored by

$$\max_{X_{ij}=0} \frac{\partial J}{\partial X_{ij}} \quad \text{and} \quad \max_{X_{ij}=1} -\frac{\partial J}{\partial X_{ij}}$$

Maximal Independent Adversarial Set (MIAS). The adversary can simply start from the best operation and continue greedily until it runs out of budget. That is, one forward and one backward pass suffice for the adversary to make decisions.
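As a rough illustration of this scoring, the sketch below ranks single-edge additions and removals from the gradient of the loss with respect to a dense adjacency matrix. The toy differentiable loss and the row normalization by node degree are assumptions for the example; this is not the exact attack implementation.

```python
import torch

def best_edge_edits(grad_A, A, deg):
    """Score single-edge edits by the directional-derivative surrogate:
    the best edge to add maximizes  (1/|N(i)|) * dJ/dA_ij  over non-edges,
    and the best edge to remove maximizes the negated term over edges.
    grad_A : dJ/dA for a differentiable loss, shape (n, n)
    A      : binary adjacency, deg : node degrees |N(i)|."""
    score = grad_A / deg.clamp_min(1).unsqueeze(1)          # row-normalized model
    add_scores = score.masked_fill(A.bool(), float("-inf")).fill_diagonal_(float("-inf"))
    rem_scores = (-score).masked_fill(~A.bool(), float("-inf"))
    best_add = divmod(int(add_scores.argmax()), A.size(1))  # (i, j) to add
    best_rem = divmod(int(rem_scores.argmax()), A.size(1))  # (i, j) to remove
    return best_add, best_rem


# Toy usage with a made-up differentiable "loss" over a dense adjacency.
n = 5
A = (torch.rand(n, n) < 0.3).float().fill_diagonal_(0.0)
A = ((A + A.t()) > 0).float().requires_grad_(True)
loss = (A @ torch.randn(n, 3)).pow(2).sum()
loss.backward()
print(best_edge_edits(A.grad, A.detach(), A.detach().sum(1)))
```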
However, we find that, when performing edge-based perturbations, the adversary might get stuck manipulating one particular node because it has large gradients, without actually increasing the total loss. This can happen for highly sensitive low-degree nodes, for which any edge addition can have a large gradient. To avoid these degenerate cases, we propose a graph-based approach to maximize the total loss: we find the maximal independent edge set of the matrix derivative. We can view this approach as a mechanism to collectively, rather than individually, generate adversarial edges. There are polynomial-time algorithms for finding the exact edge cover (Garey & Johnson, 2002), but we simply use its greedy version: find the maximum-weight edge, remove it and its endpoint nodes from the candidate nodes, and repeat this process. After the graph is covered, we query the classifier again to get a new gradient matrix, on which we perform MIAS for a few more iterations.

Inductive Case. Attacking the GraphSage model in the inductive case is similar to the GCN attacks. However, we have to use the one-hot encoding to compute gradients. In addition, each test instance is considered independently of the other nodes in the test set. Using the one-hot encoding, the first layer in GraphSage with a mean aggregator can be written as:

$$H_i = W_2 \Big[ \underbrace{\mathrm{ReLU}\big(W_1 [x_i;\ \alpha \sum_{j \sim \mathcal{N}(i)} a_j X]\big)}_{h^1_i};\ \alpha \sum_j \underbrace{\mathrm{ReLU}\big(W_1 [a_j X;\ \beta \sum_{k \sim \mathcal{N}(j)} a_k X]\big)}_{h^1_j} \Big]$$

$$\frac{\partial J}{\partial a_j} = \frac{\partial J}{\partial h^1_i} \times \frac{\partial h^1_i}{\partial a_j} + \frac{\partial J}{\partial h^1_j} \times \frac{\partial h^1_j}{\partial a_j} \qquad (5.3)$$

where $a_i$ is an n-dimensional one-hot vector, n is the number of nodes in the graph, and $\alpha$ and $\beta$ are normalization constants for the mean aggregator. Since $a_j$ is a one-hot vector, we have the following:

$$\frac{\partial h^1_j}{\partial a_j} = a_j \frac{\partial h^1_j}{\partial X} X^\top \qquad (5.4)$$

As an example, for each edge $a_{ij}$, the worst-case perturbation (replacement of j with k) can be approximated by:

$$\max_{k,j} \frac{\partial J}{\partial a_{jk}} - \frac{\partial J}{\partial a_{jj}}$$

Note that compared with the transductive edge replacement in 5.1, which requires searching among $|V||E|$ possibilities, adversarial edge perturbation for all test nodes in the inductive case has $b\,|\mathcal{N}(i)|\,|V|$ possibilities, where b is the number of nodes in the test set. The feature-based perturbations are similar to the transductive case.

Experiments

In this section, we demonstrate our empirical results on two transductive tasks and one inductive task. For the transductive (i.e., GCN) experiments, we apply dropout to the inputs of all linear layers of the GCN. We use a grid search on the validation set over the dropout probability $p \in \{0.4, 0.5, 0.6\}$ and the $\ell_2$ regularization parameter $\lambda \in \{1e\text{-}3, 5e\text{-}4, 1e\text{-}4, 5e\text{-}5, 1e\text{-}6\}$ on the weights of the first layer of the GCN. We use a 2-layer GCN, and we set the number of hidden units to 16. For our inductive tasks, we use a hidden dimension of 128 units in both layers, and we have an additional fully-connected layer which outputs the class distribution. We use the Adam optimizer (Kingma & Ba, 2015).

Transductive Classification. To evaluate edge-level perturbations, we perform experiments on three two-class political blog datasets, wherein the nodes have no features and X is the identity matrix. The statistics of the datasets can be found in Table 16.
Table 16. Dataset statistics

Dataset   UMBC    AG      MSP
Nodes     404     1,222   4,324
Edges     3,382   16,714  18,627
Density   0.0292  0.0224  0.0019

The AGBlog dataset is the largest connected component of the political blog dataset (Adamic & Glance, 2005). The nodes in UMBCBlog are a subset of AGBlog; the links in the UMBCBlog dataset were gathered in May 2006, whereas in the AGBlog dataset the links are from two months before the 2004 presidential election (Adamic & Glance, 2005). The UMBCBlog links reflect the bloggers' interests at the time of the post, while links from the AGBlog dataset can be considered more descriptive of their long-term interests (Lin & Cohen, 2010). The MSPBlog dataset is provided by researchers at Microsoft Live Labs (http://www.cs.cmu.edu/~frank/). For these datasets, we perform 10-fold cross validation and report the average score.

Figure 14 shows the accuracy of the models as adversarial edges are gradually added. Recall that the UMBCBlog dataset is composed of a subset of nodes from the AGBlog dataset, but contains more transitory edges. Thus, the dataset is more sensitive, and the GCN is easier to trick, which is demonstrated by the steep decrease in the accuracy of the GCN. Note that random perturbations have almost no impact on the models' accuracy.

Figure 14. Decrease of accuracy under adversarial edge perturbation (MSPblog, UMBCblog, and AGblog; adversarial vs. random edges).

For feature-based perturbations, we use two citation network datasets: Citeseer and Cora (Sen et al., 2008). These datasets contain bag-of-words term vectors for each paper and a list of citation links between them. Following (Kipf & Welling, 2017), we treat citation links as (undirected) edges, and we construct a binary, symmetric adjacency matrix A. The statistics of the datasets can be found in Table 17.

Table 17. Dataset statistics

Dataset      Cora   Citeseer
Nodes        2,708  3,327
Edges        5,429  4,732
Classes      7      6
Vocab        1,433  3,703
Avg. words   18     31

As shown in Table 18, the adversary, which simply changes one feature of some of the test nodes (i.e., either removing or adding a word), is able to decrease the accuracy substantially.

Table 18. Accuracy of the models with one noisy feature per test instance.

Impacted nodes  100    200    300    400    500    600    700    800    900    1000
Citeseer        68.47  66.67  65.57  64.61  63.61  62.73  62.28  61.62  60.75  60.23
Cora            79.11  77.36  75.31  73.2   71.12  69.32  67.31  65.7   64.25  62.81

Inductive Classification. We use three medium-to-large datasets for our inductive experiments. We use the Protein-Protein Interaction (PPI) dataset (Zitnik & Leskovec, 2017) for classifying protein functions across various biological protein-protein interaction graphs. The dataset contains 20 graphs for training, 2 for validation, and 2 for testing, which are not connected to each other. Each node has 50 features covering positional gene sets, motif gene sets, and immunological signatures (Velickovic et al., 2018).

Figure 15. Micro-F1 score under adversarial edge perturbation (Pubmed, Reddit, and PPI; adversarial vs. random edges).

Table 19. Dataset statistics

Dataset   PPI     Reddit   Pubmed
Nodes     56.9K   233.0K   19.7K
Edges     806.2K  114.6M   44.3K
Features  50      602      500
Classes   121     41       3

In addition, this task is a multi-label classification task, where each node can belong to any number of the 121 existing labels.
Thus, while for other datasets, we use a negative log-likelihood loss, for this dataset we use a binary logistic loss on all of the outputs. Reddit is a website for discussions, in which users create “posts,” which are delegated into user-created channels, known as “subreddits”, which are organized by topic (e.g., news, food, books). Hamilton et al. (2017) sampled 50 large subjects and built a post-to-post graph, connecting posts if the same user comments on both. Features include average embedding of the post title and average embedding of all the post’s comments. The embeddings come from 600-dimensional GloVe word vectors (Pennington, Socher, & Manning, 2014). In addition, features include the post’s score and the number of comments made on the post, constituting a total of 602 features. We also use a standard citation network, Pubmed (Sen et al., 2008), which has been extensively used as a benchmark for node classification. There are 500 75 input features which are bag-of-words representations of scientific articles from three biomedical categories. Figure 15 demonstrates the micro-F1 score of models as certain proportion of edges are being adversarially replaced. As can be seen, similar to GCN, a vanilla GraphSage model is brittle to adversarial structural changes, while being robust to random changes. Attacks on a Recommender System. In this section, we give a motivating example of adversarial analysis of graph-based models, which may also be popular to use in the real world. We use a GCN for creating adversarial edges, and we demonstrate that these examples can be transferred to hamper other models. We will study adversarial links (ratings) between users and items, and we use the ML- 100k dataset of MovieLense (Harper & Konstan, 2016), which contains 100k ratings of 943 users over 1,682 movies. We create these adversarial ratings using the GCN- based collaborative filtering model (Berg et al., 2017), and transfer them to two other models, SVD++ (Koren, 2008) and item-based KNN (Koren, 2010). For the hyper-parameters of SVD++ and item-based KNN, we consulted (Gantner, Rendle, Freudenthaler, & Schmidt-Thieme, 2011), and used the implementation of (Hug, 2017) for these two models. SVD++ incorporates implicit feedback information, such as rating history, which can help understand users’ preferences, especially for new users. Intuitively, users implicitly tell their preference by giving ratings, regardless of the rating score they give. The item-based KNN approach simply uses the similarity of the items, based on the ratings they have received, to infer unknown ratings. Here, the adversary is composed of 20 spammers (around 2% of the users in the dataset), whose goal is to hamper the system in its prediction of a set of given pairs of user-movie ratings, i.e., the test set. The adversary computes the loss of 76 0 0.02 0.04 0.06 0.08 0.1 Adversarial Ratings 0.9 0.92 0.94 0.96 0.98 1 1.02 R M SE GCN SVD++ Item-based Item-based-rand Figure 16. Adversary’s success in increasing the RMSE error for three collaborative filtering models. The x axis shows the percentage of manipulated ratings. the model given test queries, and manipulates links not in the test set. The users engaged in the attack manipulate their ratings by adding new ratings, changing their ratings, and removing their ratings. The new updated set of ratings, which contains a fraction of maliciously created links, are used to train the collaborative filtering models. 
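A rough sketch of how such rating manipulations can be ranked is shown below, reusing the relation-change surrogate of Eq. 5.2 with one adjacency matrix per rating value and restricting edits to the attacking users' rows. The tensor layout, the per-relation degree input, and the variable names are illustrative assumptions, not the exact poisoning code.

```python
import torch

def best_rating_change(grad_stack, rating_stack, deg, spammer_rows):
    """Pick the most loss-increasing rating edit for the attacking users, using
    the relation-change surrogate of Eq. 5.2:
    (1/|N_m(i)|) dJ/dA^(m)_ij traded against (1/|N_n(i)|) dJ/dA^(n)_ij.
    grad_stack, rating_stack : (R, users, items) gradients / one-hot ratings
    deg                      : (R, users) per-relation neighbor counts |N_r(i)|
    spammer_rows             : indices of the users controlled by the adversary."""
    R = rating_stack.size(0)
    norm = deg.clamp_min(1).view(R, -1, 1)
    gain = grad_stack / norm                          # value of being in rating r
    cur = (gain * rating_stack).sum(0, keepdim=True)  # current rating n (0 if unrated,
                                                      # in which case this is a pure addition)
    delta = gain - cur                                # move n -> m
    delta = delta.masked_fill(rating_stack.bool(), float("-inf"))
    mask = torch.zeros_like(delta, dtype=torch.bool)
    mask[:, spammer_rows, :] = True                   # only spammers may edit
    delta = delta.masked_fill(~mask, float("-inf"))
    m, rest = divmod(int(delta.argmax()), delta.size(1) * delta.size(2))
    u, i = divmod(rest, delta.size(2))
    return u, i, m                                    # user, item, new rating index
```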
Note that this type of attack is different from the previous tasks, which performed manipulations at test time. This is because link prediction is simply a process of querying the trained model for its estimate of a link between two entities. Thus, for this task, we need to manipulate the training set, which is called a data poisoning attack. Figure 16 plots the root mean squared error (RMSE) error for all three models under varying adversarial ratings. As is demonstrated, 20 users, collectively attacking a system, could greatly increase the RMSE error for all models. To put the inflicted prediction error into perspective, the figure also shows how random rating changes can impact the item-based model. We omitted the random results for the other two models to avoid clutter. As can be seen, compared with adversarial manipulations, random changes have little impact on the model. 77 0 20 40 60 80 100 Top-k 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Ja cc ar d in de x users items Figure 17. Jaccard index for two pairs of sets: top-k popular items vs. top-k affected items, and top-k active users vs. top-k affected users. Table 20. Movies that appear in the top-20 popular items and top-20 items which are affected by the adversary. Movies Contact (1997) Toy Story (1995) Return of the Jedi (1983) Scream (1996) Air Force One (1997) Jerry Maguire (1996) The Rock (1996) Back to the Future (1985) We perform an error analysis to understand the behavior of the adversary in a qualitative manner. We look at the users whose predicted ratings from the GCN- based model are changed the most by adversarial attacks. We find that the greatest changes occur among active users (those who rate many items) and popular items (those rated by many users). Figure 17 plots the Jaccard index for two pairs of sets of items and sets of users. Concretely, we compute the Jaccard index for top-k active users, and top-k affected users, for k ∈ {1, 2, ..., 100}, removing the users who have been involved in the attack. Similarly, we compute the Jaccard index for top-k popular items, and top-k 78 affected items. The graph shows large values of the Jaccard index, which indicates the adversary is highly effective in manipulating the system for items or users for which it has a long history. Table 20 shows nine movies, which are in both the top-20 affected items and the top-20 popular items. Conclusion Relational models have the potential to make better predictions by incorporating context from related entities. However, this context can become a liability when adversaries actively manipulate attributes and relationships. We use GCN and its inductive analog, GraphSage, as our representative models to show that neural graph- based models are indeed vulnerable to adversarial attacks. The fact that our attacks can be transferred to other models, as demonstrated in the case of a recommender system, suggest that this risk should be taken seriously. In the next chapter, we will focus on inductive node classification, and propose a few baselines for making graph neural nets robust. 79 CHAPTER VI ROBUST GRAPH NEURAL NETWORKS Introduction In this chapter, we will discuss various methods to create a robust node classifier which can defend against adversarial edge-based perturbations. The focus of this chapter will be on inductive node classification. Inductive node classifiers have more applications in the real-world as nodes are being constantly added to real- world graphs. 
In addition, given the independent prediction of nodes at test time, edge perturbations do not have to be symmetric, and do not have to maintain global properties of the graph, such as the number of triangles. Our adversary can be a node which aims to confuse the model by replacing one of its neighbors with another. The adversary has no constraint on time, but can replace some of its edges within a budget, e.g., 20% of its edges. A main theme of this chapter is creating efficient defense mechanisms, which operate at the feature level (linear time), in contrast to an inefficient edge-level (polynomial-time) defense strategy.

Before diving into the details of the defense mechanisms, we discuss a new message passing scheme which improves accuracy across several datasets for both inductive and transductive node classifiers. The proposed scheme is based on the idea of gating, which has been found helpful in various contexts in deep learning. Moreover, the output of the gate, which is a sigmoid nonlinearity, will be used in one of our defense mechanisms.

Edge-Gated Message Passing

One of the weaknesses of message passing schemes in vanilla graph neural networks is that they ignore the multi-view nature of relations in graphs. We propose a new edge-based gating mechanism, so that each node sends unique messages to its neighbors, rather than sending one message to all of its neighbors. This scheme will later be exploited in one of our defense mechanisms. In addition, we will demonstrate that this new message passing scheme can yield superior results in node classification tasks.

In node-based message passing, one message is passed to all neighbors of a node y, i.e., $\vec{\mu}^t_{yx} = \vec{\mu}^t_{yz},\ \forall z \in \mathcal{N}(y)$, where $\vec{\mu}^t_{yx}$ is the message that node x receives from y. We refer to $\vec{\mu}^t_{yx}$ as the edge embedding of the edge from node x to node y at layer t. We propose a gating function g, which is a linear layer parameterized as follows:

$$g_\theta(h_x, h_y) = \mathrm{sigmoid}(W(h_x + h_y)) \qquad (6.1)$$

The inputs to the gate are the receiving node's representation and the sending node's representation, and its parameters are the weight matrix $W \in \mathbb{R}^{m \times m}$. The output of this gating network is element-wise multiplied with the received message, in order to select the best features from the incoming message:

$$\vec{\mu}^t_{yx} = \mathrm{sigmoid}(W(h_x + h_y)) \odot h_y \qquad (6.2)$$

As an example, for a GraphSage layer we have:

$$h^t_i = \mathrm{ReLU}\Big(W_1^\top \big[x_i;\ \alpha \sum_{j \sim \mathcal{N}(i)} \mathrm{sigmoid}(W(x_i + x_j)) \odot x_j\big]\Big) \qquad (6.3)$$

A well-known gating mechanism, based on the element-wise product with the output of sigmoid units, is used in long short-term memory (LSTM) units to avoid vanishing gradients (Hochreiter & Schmidhuber, 1997). Similar gating mechanisms have been extensively used as feature selection blocks in highway networks (R. K. Srivastava, Greff, & Schmidhuber, 2015), computer vision (van den Oord et al., 2016), natural language processing (Dauphin, Fan, Auli, & Grangier, 2017; Marcheggiani & Titov, 2017), and classification on graphs (Z. Liu et al., 2018; J. Zhang et al., 2018). To the best of our knowledge, our work is the first to propose this edge-based gating mechanism.

Figure 18. Comparing gated message passing (right) with vanilla message passing (left). Gated message passing gives more importance to some of the messages, as denoted by bolder edges.
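The sketch below is a minimal PyTorch rendering of Eqs. 6.1-6.3 for a mean-aggregated GraphSage layer with dense, pre-sampled neighbor features. The layer names, shapes, and absence of bias terms are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class GatedMeanAggregator(nn.Module):
    """Edge-gated message passing (Eqs. 6.1-6.3): each incoming message is
    element-wise scaled by sigmoid(W(h_x + h_y)) before mean aggregation,
    so every neighbor sends a message tailored to the receiving node."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.gate = nn.Linear(in_dim, in_dim, bias=False)   # W in Eq. 6.1
        self.lin = nn.Linear(2 * in_dim, out_dim)           # W_1 in Eq. 6.3

    def forward(self, x, neigh):
        # x:     (batch, in_dim)            receiving nodes
        # neigh: (batch, n_neigh, in_dim)   sampled neighbor features
        g = torch.sigmoid(self.gate(x.unsqueeze(1) + neigh))  # per-edge gate
        msg = (g * neigh).mean(dim=1)                         # gated mean
        return torch.relu(self.lin(torch.cat([x, msg], dim=-1)))


# Toy usage: 8 nodes, 5 sampled neighbors each, 16-dimensional features.
layer = GatedMeanAggregator(16, 32)
h = layer(torch.randn(8, 16), torch.randn(8, 5, 16))
print(h.shape)   # torch.Size([8, 32])
```

Using a mean over the sampled neighbors corresponds to the normalization constant alpha in Eq. 6.3; a sum-based or pooling aggregator could be substituted without changing the gating itself.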
The gating function requires one dot-product-style operation per edge: E such operations, where E is the number of edges in the whole graph in the case of full-batch optimization, or the number of sampled neighbors in the mini-batch for mini-batch optimization. Such computations are handled efficiently on a GPU. Apart from making graph neural nets edge-aware, Figure 18 demonstrates another effect of gated message passing: it weakens some of the messages when they are multiplied by a gate output with a small value (close to zero). This property will be exploited in one of our defense mechanisms.

Defense

Given GraphSage with any type of aggregator,

$$H_i = W_2 \Big[ \underbrace{\mathrm{ReLU}\big(W_1 [x_i;\ \mathrm{Agg}_{j \sim \mathcal{N}(i)} a_j X]\big)}_{h^1_i};\ \alpha \sum_j \underbrace{\mathrm{ReLU}\big(W_1 [a_j X;\ \mathrm{Agg}_{k \sim \mathcal{N}(j)} a_k X]\big)}_{h^1_j} \Big]$$

for each edge $a_{ij}$, the worst-case perturbation (replacement of j with k) can be approximated by:

$$\max_{k,j} \frac{\partial J}{\partial a_{jk}} - \frac{\partial J}{\partial a_{jj}}$$

Computing such adversarial perturbations as part of the training process is infeasible due to the O(V) time complexity of the above maximization, where V is the number of edges. This has to be computed for every instance in the training set, resulting in a polynomial-time search for worst-case adversarial perturbations. This section explores efficient methods to create robust graph neural nets.

Link-Based Model. One way to handle bad edges and their negative impact is to enable the model to rate the trustworthiness of edges. Toward this end, we exploit the gating mechanism to make random or adversarial edges have less impact on the model. Intuitively, we can have a joint link-prediction and node-classification model, wherein we jointly predict the goodness of a given edge, and scale down the outgoing message if the link predictor gives it a low score, i.e., marks it as a negative edge. The following objective contains the original log-loss objective function for classification, and a ranking-based term that assigns higher values to a sample of positive edges than to a sample of negative/adversarial edges:

$$\frac{1}{n} \sum_i^n J(f(x_i|\theta), y_i) + \frac{1}{E} \sum_l \max\big(0,\ 1 - s_l(h_p, h_x) + s_l(h_a, h_x)\big) \qquad (6.4)$$

where $s_l(h_p, h_x) = \frac{1}{1 + e^{-W_l^\top (h_p + h_x)}}$ is our linear link predictor, l denotes the layer, and E is the number of pairs of positive and negative edges. Here, $h_a$ denotes the representation of an adversarial node which creates an adversarial edge with node x, and $h_p$ is a sampled true neighbor of node x. Note that the linear link predictor has a similar parameterization to the linear gating model which we proposed previously. Algorithm 2 replaces a portion of the first-hop neighbors with random neighbors to create negative edges, pairs them with some of the positive edges, and has an objective similar to 6.4, but applies the penalty term to all dimensions of the gate output. This mechanism increases the model's power to detect adversarial edges and hence decreases their impact.

Algorithm 2: Link-based model
Input: number of layers L, model parameters θ, linear gating parameter W ∈ θ, proportion of adversarial neighbors p, step-size sequence {r_t > 0}, t = 0, ..., T-1
for t = 0, ..., T-1 do
    Sample x_a ~ X as random nodes for a proportion p of the neighbors.
    θ_{t+1} ← θ_t − r_t ∇_θ ( J(θ_t; x, x_a) + λ Σ_{l}^{L} max(0, 1 − s_l(h_p, h_x) + s_l(h_a, h_x)) )

Feature-Based Model. Another way to handle bad edges is to perturb the features of neighbors to approximate an edge perturbation. That is, instead of changing neighbors, as the adversary would do, our inner adversary could perturb the features of the neighbors of a given node.
Figure 19 shows the idea behind feature- based perturbation; instead of adding an adversarial edge between nodes 0 and 1, the adversary is adding a pseudo-node which is not among the nodes in the graph, but remains in the convex hull of the data set. We use the outer convex bound by projecting the values of the adversarial node to remain in the convex outer bound of the convex hull, i.e., by clipping the values of the adversarial node to remain within the outer hypercube, i.e., (−max(X),max(X)). Proposition 1 connects the feature-based perturbation with the edge-based perturbations. Recall that, ∂J ∂aj = aj ∂J ∂X XT (6.5) Relaxing the one-hot constraint on aj, and updating it based on the gradient (i.e., aj := aj + ∂J ∂aj ), would give the following new H ′ i : H ′ i = W T 2 [ h1i︷ ︸︸ ︷ ReLU(W T1 [xi; Aggj∼N (i)(ajX + aj ∂J ∂X XTX)]); α ∑ j h1j︷ ︸︸ ︷ ReLU(W T1 [ajX + aj ∂J ∂X XTX; Aggk∼N (j)akX]) ] (6.6) 84 Proposition 1: Gradient ascent on the one-hot vector of aj is equivalent to gradient ascent on xj transformed by X TX matrix. 0 1 2 34 5 6 7 8 9 0 1 2 34 5 6 7 8 9 0 1 2 34 5 6 7 8 9 Figure 19. Instead of adding an adversarial edge from node 0 to node 1, we can we can add an edge to any data point within the convex hull. This is equivalent to updating the input features by X = X + ∂J ∂X XTX. In other words, we can achieve adversarial perturbations on the edges by transforming input gradients with the matrix XTX, which is computed using the training set. The transformation matrix is a scaled covariance matrix when the dataset is transformed to have zero mean and unit variance; thus, we call this a covariance- based transformation. Algorithm 3 shows the algorithm to perform adversarial training using feature- based perturbations. As can be seen, some of the neighbors are replaced by random nodes, on which projected gradient ascent is performed. Our experiments showed that starting from random nodes in the graph, rather than the current neighbors performs better. This is in agreement with the findings of (Trame`r et al., 2018), which starts the search for adversarial examples after adding a Gaussian noise to the original image. Robustness to Norm-Bounded Perturbations on Neighbors. Another approach for defense is to allow norm-bounded perturbations on all neighbors, in which case we tune the degree of perturbations. For this type of defense, we use two baselines, one of which uses a convex relaxation of neural nets, while the other one uses 85 Algorithm 3 Feature-based model Input: Number of layers L, model parameters θ, proportion of adversarial neighbors p, stepsize sequence {rt > 0}T−1t=0 V = XTX for t = 0, . . . , T − 1 do Sample xa ∼ X as random nodes for p% proportion of neighbors. for i = 0, . . . , iter do xa ← xa + ΠC V∇xaJ(θt;x, xa) θt+1 ← θt − rt∇θJ(θt;x, xa) adversarial training on the standard non-convex neural network. For the former, we extend the robust optimization framework of (Kolter & Wong, 2017) to GraphSage. The goal is to create a model robust to norm-bounded perturbations on the neighbors of a given node. While this type of defense is useful for giving guarantees on models’ robustness, i.e., (robust to ‖ δ ‖∞≤  for training instances), we tune  to defend against our edge-based adversary, without actually having a guarantee requirement. For the latter, we use projected gradient descent approach of (Madry et al., 2018). 
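For the non-convex baseline, a rough sketch of the Madry-style inner maximization on neighbor features is given below (it mirrors Algorithm 4, which is shown later). The loss function, step size, and the use of an L-infinity ball on the raw neighbor features are illustrative assumptions rather than our exact settings.

```python
import torch

def pgd_on_neighbors(x, neigh, loss_fn, eps=0.1, step=0.02, iters=10):
    """Projected gradient ascent on neighbor features: perturb the sampled
    neighbors within an L-infinity ball of radius eps around their original
    values, maximizing the classification loss.
    loss_fn(x, neigh) -> scalar loss of a (hypothetical) GraphSage model."""
    orig = neigh.detach()
    adv = orig.clone()
    for _ in range(iters):
        adv.requires_grad_(True)
        loss = loss_fn(x, adv)
        grad, = torch.autograd.grad(loss, adv)
        adv = adv.detach() + step * grad.sign()        # ascent step
        adv = orig + (adv - orig).clamp(-eps, eps)     # project back into the ball
    return adv.detach()
```

The classification loss is then minimized on these worst-case neighbors; epsilon plays the role of the tunable perturbation magnitude discussed above for the first-hop and second-hop neighbors.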
The linear program 6.7 aims to find the worst adversarial example in the convex outer bound of the node’s adversarial polytope, i.e., the set of all final-layer activations attainable by perturbing one-hop or two-hop neighbors by some ∆ with `∞ norm bounded by . Concretely, the program aims to find the worst adversarial example given all possible adversarial labeling of the node, i.e., eytarget for all targets, for all norm-bounded perturbations of neighbors. Note that for this model, we use the mean-aggregated GraphSage without the use of gating. A feasible dual solution to this linear program can be found, which provides a guaranteed lower bound on the solution of the primal. Additionally, the feasible set of the dual problem can be expressed as a two-layer network, which is very similar to a standard backpropagation network. 86 minimize ri (ey∗ − eytarget)T ri subject to: ri = Wf hˆ i 2 + bf hˆi2 = W2[h i 1;α ∑ j∼N (i) hj1] hˆi1 = W1[xˆi;α ∑ j∼N (i) xˆj] hˆj1 = W1[xˆj; β ∑ k∼N (j) xˆk] hz1,d ≥ 0 hz1,d ≥ hˆz1,d (uzd − lzd)hz1,d − uzdhˆz1,d ≤ −uzdlzd d ∈ I,∀z ∈ {i, j, k} hz1,d = 0, d ∈ I− hz1,d = hˆ z 1,d, d ∈ I+ xˆj ≥ xj − 1 xˆj ≤ xj + 1 j ∼ N (i)  xˆk ≥ xk − 2 xˆk ≤ xk + 2 k ∼ N (j) xˆi = xi (6.7) The derivation of the dual and the final optimization problem can be found in Appendix B. Finally, the classification loss is minimized on these worst-case adversarial examples. 87 Algorithm 4 PGD Input: Number of layers L, model parameters θ, stepsize sequence {rt > 0}T−1t=0 for t = 0, . . . , T − 1 do xa ∈ N (x) for i = 0, . . . , iter do xa ← Π(xa + sign(∇xaJ(θt;x, xa))) θt+1 ← θt − rt∇θJ(θt;x, xa) The parameters of the model include {W1,W2,Wf , bf}, wherein the last two are the weight and bias of a fully-connected layer, and the first two parameters are the parameters of two graph convolutional layers. Following most of the literature, the second convolutional layer is not followed by a nonlinearity and no bias for convolutional layers is used. ReLU activation functions are replaced with linear constraints, and hˆz1 denotes pre-activation representation for node z. Based on the lower and upper bound values for a particular node, i.e., whether they are both positive I+, both negative I−, or span zero I, hz1 can be derived from hˆz1. Concretely, given known lower and upper bounds l, u for the pre-ReLU activations, we can replace the ReLU equalities (i.e., h = max(0, hˆ)) with their upper convex envelopes depending on the values of the lower and upper bound. The algorithm for projected gradient descent (PGD) using norm-bounded perturbations is given in Algorithm 4. Note that the difference between these two methods with the feature-based model in Algorithm 3 is that these two methods allow perturbations on all neighbors of a node, where 1 and 2 determine the magnitude of perturbations for first-hop and second-hop neighbors. Whereas the feature-based model performs perturbations within the convex hull for a fraction of neighbors. The former approach is a more pessimistic approach to robustness. Jacobian Regularization. We use another defense baseline which tries to penalize sensitivity of the learned representations with respect to changes in the 88 features of neighboring nodes. 1 n n∑ i ( J(f(xi|θ), yi) + λ ‖ JHi(xj) ‖2F ) (6.8) The regularizing term is the Jacobian of the logits inputs to the classifier with respect to one-hop neighbors’ features. 
This regularization term can be broken into two pieces:

J_{H_i}(x_j) = J_{H_i}(h_i) \times J_{h_i}(x_j) + J_{H_i}(h_j) \times J_{h_j}(x_j)    (6.9)

We break W into two submatrices: V, which operates on the node itself, and U, which operates on the neighbors. Concretely,

W_1\big[x_i;\ \mathrm{Agg}_{j\sim\mathcal{N}(i)} x_j\big] = \big[V_1 x_i;\ U_1\, \mathrm{Agg}_{j\sim\mathcal{N}(i)} x_j\big]    (6.10)

Using the chain rule and the triangle inequality for norms, we have:

\| J_{H_i}(x_j) \|_F^2 \le \| V_2^{T} J_{\sigma(U_1 \mathrm{Agg}_{j\sim\mathcal{N}(i)} x_j)}(x_j) \|_F^2 + \| U_2^{T} J_{\sigma(x_j)}(x_j) \|_F^2    (6.11)

Note that in the case of no nonlinearity, Jacobian regularization is equal to ℓ2 regularization; so for the second layer of GraphSage, which has no nonlinearity, we simply need to add an ℓ2 term. Finally, due to the submultiplicativity of the Frobenius norm, ‖J_{H_i}(x_j)‖²_F is upper bounded by the sum of the individual Jacobian and weight norms in Eq. 6.12.

\| J_{H_i}(x_j) \|_F^2 \le \| J_{\sigma(U_1 \mathrm{Agg}_{j\sim\mathcal{N}(i)} x_j)}(x_j) \|_F^2 + \| J_{\sigma(V_1 x_j)}(x_j) \|_F^2 + \| U_2 \|_F^2 + \| V_2 \|_F^2    (6.12)

This is similar to the layer-wise Jacobian regularization of Gu and Rigazio (2015), used for defense against adversarial attacks in convolutional neural networks for image recognition. As an example, the first term in Eq. 6.12 for a mean-aggregated GraphSage can be written as:

J_{\sigma(\alpha U_1 \sum_{j\sim\mathcal{N}(i)} x_j)}(x_j) = U_1^{T}\, \sigma'\Big(U_1\, \alpha \sum_{j\sim\mathcal{N}(i)} x_j\Big)    (6.13)

Table 21. Dataset statistics

Dataset    PPI      Reddit    Pubmed   Citeseer   Cora
Nodes      56.9K    233.0K    19.7K    3.3K       2.7K
Edges      806.2K   114.6M    44.3K    4.7K       5.4K
Features   50       602       500      3,703      1,433
Classes    121      41        3        6          7

For our experiments on the PPI dataset, we use layer normalization (Ba, Kiros, & Hinton, 2016), for which the Jacobian has a different structure; the details can be found in Appendix B. Finally, we found that applying this type of regularization to defend against second-hop neighbors' perturbations deteriorates performance, so we ignore the second-hop neighbors for this defense method.

Experiments

In this section, we evaluate our gated message passing model in inductive and transductive settings, and later focus on the robustness of GraphSage in inductive settings. The datasets are the same as in the previous chapter: the citation networks Cora, Citeseer, and Pubmed; the protein-protein interaction (PPI) graph; and the Reddit social media graph. The first two are used for our transductive experiments, while the last two are used for our inductive experiments. Pubmed is used for both settings; however, the transductive setting is semi-supervised, with only 60 labeled instances. The details of these datasets can be found in Table 21.

Gated Message Passing. In this section, we evaluate our message passing scheme in both inductive and transductive settings. Figure 20 shows the validation loss over 100 epochs on the PPI dataset in the inductive setting. Clearly, gated message passing achieves a lower loss at a much faster rate than vanilla message passing.

Figure 20. Loss on the validation set for the PPI dataset using vanilla and gated message passing.

A simple transductive experiment on a synthetic dataset can further highlight the superior performance of gated message passing. We create a random graph of 2,000 nodes, which are divided into two classes of 1,000 nodes each. We add 1,000 edges among the nodes within each class, and gradually add random edges between the nodes in the two classes, as demonstrated in Figure 21. As more edges are added, more useful patterns emerge, which the classifiers successfully learn, and on which they achieve close to 100% accuracy.
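As an illustration of this setup, the sketch below constructs such a two-class random graph. The exact edge-sampling procedure used for Figure 21 is not specified here, so the uniform sampling below is an assumption made purely for illustration.

import numpy as np

def make_synthetic_graph(n_per_class=1000, intra_edges=1000, inter_edges=9000, seed=0):
    # Two classes of n_per_class nodes each; intra_edges random edges inside
    # each class, plus inter_edges random edges between the two classes.
    rng = np.random.default_rng(seed)
    labels = np.concatenate([np.zeros(n_per_class, dtype=int),
                             np.ones(n_per_class, dtype=int)])
    edges = []
    for offset in (0, n_per_class):                    # intra-class edges
        src = rng.integers(offset, offset + n_per_class, intra_edges)
        dst = rng.integers(offset, offset + n_per_class, intra_edges)
        edges.append(np.stack([src, dst], axis=1))
    src = rng.integers(0, n_per_class, inter_edges)    # gradually added inter-class edges
    dst = rng.integers(n_per_class, 2 * n_per_class, inter_edges)
    edges.append(np.stack([src, dst], axis=1))
    return np.concatenate(edges, axis=0), labels

edge_list, y = make_synthetic_graph(inter_edges=9000)  # one point on the x-axis of Figure 21
print(edge_list.shape, y.shape)                        # (11000, 2) (2000,)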
The gated model is much faster at learning good patterns, as it achieves a much higher accuracy in the early stages. This success can be attributed to the model's ability to determine the usefulness of edges, and to the better representations that it consequently learns.

Figure 21. Binary classification accuracy using GCN; edges are gradually added to a random graph.

Transductive Experiments. This is a semi-supervised experiment, and we use ℓ2 regularization on the model parameters, as well as dropout after each layer and on the input. For hyper-parameter selection, we use {2e-3, 1e-3, 5e-4, 1e-4} for regularization and {0.3, 0.4, 0.5, 0.6, 0.7, 0.8} for dropout, and tune them on the validation set. We perform full-batch optimization for 800 epochs, and we compare our model with the following baselines.

Planetoid: Yang, Cohen, and Salakhutdinov (2016) propose an embedding learning method which injects label information into the learning process.

Chebyshev: Defferrard, Bresson, and Vandergheynst (2016) employ Chebyshev polynomials of the graph Laplacian, and avoid the expensive computation of the eigenvectors of the Laplacian to create spectral convolutions on graphs.

GAT: Velickovic et al. (2018) use four attention heads of 256 units and concatenate them for a dimension of 1,024 in the hidden layers.

MoNet: Monti et al. (2017) use a mixture model of convolutional neural networks (CNNs) which generalizes CNN architectures to graph structures.

AGNN: Thekumparampil, Wang, Oh, and Li (2018) use a simpler attention function than GAT.

N-GCN: Abu-El-Haija, Kapoor, Perozzi, and Lee (2018) train multiple instances of GCNs over node pairs discovered at different distances in random walks.

TAGCN: Du, Zhang, Wu, Moura, and Kar (2017) define the graph convolution operation on the vertex domain as multiplication by polynomials of the graph adjacency matrix. Filters are adaptive to the topology of the graph as they scan the graph to perform convolution.

Table 22. Accuracy on citation networks

Method             Citeseer       Cora           Pubmed
Planetoid          70.30          75.70          79.00
Chebyshev          69.80          81.20          79.00
GAT                72.50          83.00          79.00
GCN                70.30          81.50          77.20
MoNet              -              81.70          78.80
AGNN               71.70          83.10          79.90
N-GCN              72.20          83.00          79.50
TAGCN              71.40          83.30          81.10
gated-GCN (ours)   72.74 ± 0.66   83.16 ± 0.48   81.23 ± 0.45

For these experiments, we modified the structure of the GCN, inspired by loopy belief propagation in graphical models; the details can be found in Appendix B. As can be seen in Table 22, our model outperforms the state-of-the-art model TAGCN on average, and achieves state-of-the-art results on two of the three datasets.

Inductive Experiments. We now report results on inductive classification. Note that PPI is a multi-label classification dataset, where each node can belong to any number of the 121 existing labels. Thus, while for the other datasets we use a negative log-likelihood loss, for this dataset we use a logistic loss over all of the labels, and we report the micro-F1 score. We use a subsample of the graph, with 128 nodes for each node in the graph, and we sample 25 and 10 nodes from the one-hop and two-hop neighbors, respectively, for Reddit and Pubmed. We increase these numbers to 40 and 25 for the PPI dataset. In addition, while 128 hidden units are used for Reddit and Pubmed, we use 512 hidden units for PPI. We train for 10 epochs on Pubmed and Reddit, and 100 epochs on PPI. Finally, for the PPI dataset, we use layer normalization (Ba et al., 2016).
Table 23. Micro-F1 on three datasets

Method                   Reddit   PPI     Pubmed
GraphSage                95.81    90.13   89.60
FastGCN                  93.70    -       88.00
GCN-VR                   96.30    97.90   -
gated-GraphSage (ours)   95.85    98.54   89.50

We use the same hyper-parameters for our GraphSage with the mean aggregator. Apart from GraphSage, we use two other baselines, as follows.

FastGCN: Chen, Ma, and Xiao (2018) subsample the receptive field in each layer universally, using importance sampling.

GCN-VR: Chen and Zhu (2018) propose a control variate-based algorithm which allows an unbiased sampling strategy.

While our gated mechanism has little impact on Reddit and Pubmed, its performance on PPI, which is the densest graph among these datasets, is superior to the other baselines, outperforming regular GraphSage by 8% in micro-F1 score.

Robustness

Hyper-parameters. Our robust baselines have different hyper-parameters, which are tuned based on their performance on the validation sets. For the norm-bounded models, we tune the hyper-parameters ε1 and ε2. For the feature-based model, we tune the proportion of neighbors, p, that will be replaced by random nodes on which gradient ascent is performed, and the perturbation magnitude ε. For the Jacobian regularizer, we tune λ, which determines the weight of the regularizer relative to the negative log-likelihood loss function. We tune a similar weight, λ, in addition to the proportion of random nodes, p, for the link-based model. Table 24 shows the details of our tuning procedure.

Table 24. Hyper-parameters for each of the robust baselines

Method          Hyper-parameters
Jacobian        λ ∈ {1e-4, 1e-3, 1e-2, 1e-1, 1}
Link-Based      p ∈ {0.2, 0.25, 0.3}; λ ∈ {1e-4, 1e-3, 1e-2, 1e-1, 1}
Feature-Based   p ∈ {0.2, 0.25, 0.3}; ε ∈ {1e-4, 1e-3, 1e-2, 1e-1, 1}
Convex          ε1, ε2 ∈ {1e-4, 1e-3, 1e-2, 1e-1, 1}
PGD             ε1, ε2 ∈ {1e-4, 1e-3, 1e-2, 1e-1, 1}

Results. We now focus on defense against adversarial attacks on variants of GraphSage and on our robust baselines. Except for the convex method, all of our defense methods use gated message passing with a mean aggregator; the convex method uses a mean aggregator with no gating. Table 25 shows the micro-F1 score of all models on the clean graph, as well as the cases where 1%, 5%, 10%, and 15% of edges are replaced adversarially in a white-box setting. Note that the convex optimization method is not applicable to the multi-label case, and hence no result is reported for PPI. These results are similar to those in the black-box setting in Table 26, wherein we import adversarial edges from the vanilla GraphSage with mean aggregator and use them to attack the other models. We show the best result in each case in bold, and use underlining for cases where several models perform equally well under adversarial perturbations. The feature-based model is the most consistent in improving the robustness of the models. The link-based model has the worst performance on Pubmed and Reddit, but outperforms the PGD and Jacobian models on PPI. These results also demonstrate that structural perturbations on graphs, even when no knowledge of the model is granted, can inflict considerable damage. In fact, in most cases, the black-box adversary is more successful than the white-box adversary.

Gradient Masking Is Not an Issue. Results under the white-box and black-box attacks are very similar, which suggests that gradient masking (Tramèr et al., 2018) is not a problem for any of our defense mechanisms. Gradient masking refers to smoothing of the objective function in adversarial training, which leads to unstable gradients, because of which a white-box adversary cannot attack the model.
Table 25. Micro-F1 score under white-box attacks on the vanilla models (mean, attention, and gated) and our robust baselines (gatedLink, gatedFea, gatedJac, PGD, and convex), using 0.01, 0.05, 0.10, and 0.15 edge-replacement rates.

Dataset   Attack   mean    attention   gated   gatedLink   gatedFea   gatedJac   PGD     convex
Pubmed    clean    89.60   89.50       89.50   88.10       87.30      90.10      88.20   88.20
          0.01     89.20   81.90       88.60   87.80       87.30      90.10      88.20   88.20
          0.05     87.60   80.70       87.00   87.10       87.30      90.20      88.20   88.20
          0.10     85.50   80.10       84.40   85.60       87.10      90.60      88.20   88.10
          0.15     83.50   79.20       81.90   84.20       87.10      90.20      88.20   88.00
Reddit    clean    95.81   96.05       95.85   95.72       95.44      95.98      95.86   95.80
          0.01     95.22   95.46       95.12   95.82       95.58      96.20      95.84   95.66
          0.05     89.46   91.02       90.66   95.12       95.34      95.92      95.92   95.58
          0.10     75.64   82.06       82.66   93.58       95.06      95.28      95.76   95.48
          0.15     50.98   68.08       67.36   90.40       94.82      94.38      95.62   95.28
PPI       clean    90.13   90.18       98.54   97.65       97.42      94.14      98.52   -
          0.01     87.38   85.00       98.24   97.33       97.13      93.17      98.12   -
          0.05     70.83   73.58       90.89   95.40       95.86      87.59      92.56   -
          0.10     66.09   65.90       80.25   91.28       93.31      82.30      83.55   -
          0.15     63.18   60.96       71.43   84.84       88.76      77.15      75.11   -

Figure 22. Increase/decrease in error from white-box to black-box attack on the PPI dataset.

Similar performance against both white-box and black-box adversaries suggests that our defense mechanisms are not misleadingly robust to white-box attacks. Figure 22 shows the difference between the black-box and white-box adversary for different edge-replacement rates on the PPI dataset. In addition, in Figure 23, we plot the distribution of gradients for the Jacobian model and the vanilla model, which shows a similar pattern for both models, pointing to the fact that the robust model is not driving the weights of the network to simply diminish the gradients.

Table 26. Micro-F1 score under black-box attacks using 0.01, 0.05, 0.10, and 0.15 edge replacements imported from the mean model.

Dataset   Attack   attention   gated   gatedLink   gatedFea   gatedJac   PGD     convex
Pubmed    0.01     82.50       89.50   88.00       87.30      90.00      88.20   88.20
          0.05     81.90       82.50   87.50       87.30      88.90      88.20   88.20
          0.10     81.60       84.90   87.00       87.30      88.00      88.20   88.20
          0.15     80.20       82.60   85.80       87.30      87.00      88.20   88.20
Reddit    0.01     95.64       95.40   95.72       95.52      96.12      95.84   95.66
          0.05     93.10       92.58   94.02       95.28      95.92      95.70   95.36
          0.10     87.36       87.22   91.44       94.78      95.28      95.36   95.00
          0.15     74.28       74.56   86.14       94.26      94.38      94.80   94.26
PPI       0.01     85.75       98.24   97.47       97.27      93.66      98.26   -
          0.05     77.52       90.89   96.16       96.32      88.19      92.27   -
          0.10     71.33       80.25   91.48       92.85      81.89      82.31   -
          0.15     66.45       71.43   83.47       86.16      75.78      73.20   -

Figure 23. Distribution of gradients for a non-robust model and the Jacobian-based robust model.

Inverted Word Attack. We extend the idea of working with features rather than edges, and create a new attack, which we call the inverted word attack, for the Pubmed dataset. In this attack, the adversary inverts all words in the features of the first-hop and second-hop neighbors, which are binary bag-of-words representations of papers. That is, words that do not appear in the neighboring papers are added, and words that appear in those papers are removed. This is the worst-case attack an adversary can make, in which the links are not informative and will indeed decrease the accuracy if used.
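Because the node features here are binary bag-of-words vectors, the attack itself amounts to flipping bits; the sketch below illustrates it on a toy matrix (the array layout and argument names are assumptions for illustration, not our actual interface).

import numpy as np

def inverted_word_attack(features, neighbor_idx):
    # features: binary {0,1} node-by-vocabulary matrix; neighbor_idx: indices of
    # the first- and second-hop neighbors whose features the adversary inverts.
    attacked = features.copy()
    attacked[neighbor_idx] = 1 - attacked[neighbor_idx]   # absent words appear, present words disappear
    return attacked

X = np.array([[1, 0, 0, 1],
              [0, 1, 1, 0],
              [1, 1, 0, 0]])
print(inverted_word_attack(X, neighbor_idx=[1, 2]))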
Figure 24. Comparison of defenses against the inverted word attack on Pubmed.

For perspective, a two-layer multi-layer perceptron, which does not use links, achieves 87.4 accuracy on this dataset, while a two-layer gated-GraphSage achieves 89.50. An inverted word attack decreases the accuracy of the latter to 60.40, considerably worse than the case in which the model does not use links at all. Figure 24 shows that the inverted attack inflicts the biggest damage on the Jacobian-defended and link-based models, decreasing their accuracy by 4.80% and 4.40%, respectively. The accuracy of the feature-based and convex models decreases by 0.4% and 0.5%, respectively, while the PGD model remains completely robust to the inverted word attack. Note that for the convex and PGD methods, we set the hyper-parameters ε1 and ε2 to 1, while we tune the hyper-parameters of the other models based on their performance on the validation set of the inverted data. The convex and PGD models are the only models that achieve an accuracy (87.7 and 88.20, respectively) above that of the multi-layer perceptron on clean data (87.4), effectively guaranteeing that when the edges are not trustworthy, the accuracy of the model does not become worse than in a case where we do not take edges into account. These two most successful models incorporate norm-bounded perturbations on all neighbors, and are thus expected to perform better than the other methods in this attack scenario, which manipulates all first-hop and second-hop neighbors.

Models' Behavior. We can gain some insight by studying the inner workings of these defense mechanisms. Figure 25 plots the variance of the weights of the first-layer aggregator during training. The feature-based, convex, PGD, and Jacobian defenses all keep the variance very low, penalizing overfitting to some of the features. The vanilla model and the link-based model do not have this property. This should not be surprising, as these four models all try to reduce the impact of noise on the neighbors' features. The link-based defense has an additional term in the objective function, which penalizes assigning higher scores to adversarial edges than to regular edges. Figure 26 plots the decrease of this loss term in the link-based defense for the first and second layers of the network, where we see that the decrease is stronger for the second layer. This additional term helps the link-based model to be competitive with the other robust models, while being agnostic to noise on the features of the neighbors.

Furthermore, we study the alignment of the edge perturbations with the class distribution in Figure 27. Concretely, we collect the best edge replacement for each of the 1,000 test nodes in Pubmed, and plot the distribution of the labels of the adversarial neighbors against the labels of the original neighbors that they replace. Recall that Pubmed has 3 classes; in each panel, the color of the block at (3, 2) denotes the proportion of edge replacements that replaced a neighbor of class 2 with a node of class 3, relative to all 9 types of edge perturbations. Brighter blocks denote more perturbations of that type. Again, we see that the behavior of the adversary against the link-based model is the most similar to its behavior against the vanilla model. This suggests that the link-based model simply becomes able to detect adversarial edges, without changing the overall behavior of the model.
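The exact form of the link-based ranking term is not repeated here; a generic margin-based hinge term of the following kind captures the idea that regular edges should score higher than adversarial edges, and is shown purely as an illustration, not as the term we actually use.

import torch

def edge_ranking_loss(pos_scores, neg_scores, margin=1.0):
    # pos_scores: gating scores of edges that exist in the graph;
    # neg_scores: scores of sampled non-existent (adversarial) edges.
    # Hinge penalty whenever a negative edge comes within `margin` of a positive edge.
    pairwise = margin - pos_scores.unsqueeze(1) + neg_scores.unsqueeze(0)
    return torch.clamp(pairwise, min=0).mean()

pos = torch.tensor([2.0, 1.5, 1.8])    # toy scores
neg = torch.tensor([0.3, 1.7])
print(edge_ranking_loss(pos, neg))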
Similarly, the Jacobian model and the PGD model have the most similar behavior among the other robust models.

Figure 25. Variance of the weights of the first-layer aggregator on Pubmed.

A large portion of the edge perturbations against the Jacobian and PGD models, and to a lesser extent the feature-based model, fall in the (2, 2) block, suggesting that the adversary is picking nodes from the same class, which is a weak adversarial move; effectively, the robust models are making it harder for the adversary to attack them. The adversary is, almost always, replacing a node of the majority class, 2, in the feature-based model. On the other hand, the convex model yields changes of neighbors across different classes more often than the PGD, Jacobian, and feature-based models. This might be due to its optimizing a different objective function (the robust error in the linear program 6.7), and to the fact that it does not rely on computing gradients to increase robustness.

Impact of Random Neighbors. As was pointed out, the feature-based model replaces some of the neighbors with random nodes, on which it performs gradient ascent. This randomization was found to be beneficial in improving the robustness of the feature-based model. We perform a similar experiment for the PGD model; while PGD aims to make the model robust to perturbations on all neighboring nodes, we replace 20% of its neighbors with random nodes, and keep the rest of the training as is.

Figure 26. Ranking-based loss for the link-based method on Pubmed.

Figure 27. Alignment of the edge perturbations with the class distribution for Pubmed. The x-axis denotes the label of the adversarial node, and the y-axis denotes the label of the original node.

Figure 28. Robustness of adversarially trained models on the PPI dataset using PGD and PGD with random starting points for 20% of the neighbors.

Figure 28 shows that PGD also benefits from this randomization, and its performance on PPI is now competitive with the feature-based model and outperforms the link-based model, under both white-box and black-box attacks.

Impact of Covariance-Based Transformation. Recall that the feature-based model transforms the gradient vector with X^T X. We plot the accuracy on the Reddit dataset for two variants of the feature-based defense: one where we transform the gradient vector with X^T X, and one where we do not. We saw the strongest improvement from the transformation on the Reddit dataset, and very little difference on the other datasets. Since the Reddit dataset is normalized to have zero-mean and unit-variance features, we call this transformation a covariance-transformed feature-based perturbation. Note that for both cases, the hyper-parameter ε is tuned on the validation set, which for the latter method is picked from a pool of large values (i.e., {1e1, 1e2, 1e3, 1e4}).
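The PGD variant with random starting neighbors discussed above can be sketched as follows; as before, the model interface, tensor shapes, and step-size choice are assumptions for illustration rather than the exact implementation.

import torch

def pgd_perturb_neighbors(model, loss_fn, x, neighbors, y, X,
                          eps=0.1, steps=5, rand_frac=0.2):
    # neighbors: (batch, K, d) sampled neighbor features; X: (N, d) feature matrix.
    x_a = neighbors.clone()

    # Random start: replace a fraction of neighbors with random nodes from the graph.
    n_rand = int(rand_frac * x_a.shape[1])
    if n_rand > 0:
        idx = torch.randint(0, X.shape[0], (x_a.shape[0], n_rand))
        x_a[:, :n_rand] = X[idx]

    origin = x_a.clone()                 # center of the l_inf ball of radius eps
    x_a.requires_grad_(True)
    for _ in range(steps):
        loss = loss_fn(model(x, x_a), y)
        grad = torch.autograd.grad(loss, x_a)[0]
        with torch.no_grad():
            x_a = x_a + (eps / steps) * grad.sign()          # signed ascent step (assumed step size)
            x_a = origin + (x_a - origin).clamp(-eps, eps)   # project onto the l_inf ball
        x_a.requires_grad_(True)

    # The model is then trained on loss_fn(model(x, x_a.detach()), y), as in Algorithm 4.
    return x_a.detach()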
Figure 29. Robustness of adversarially trained models on the Reddit dataset using feature-based perturbations and covariance-transformed feature-based perturbations.

Conclusion

In this chapter, we first propose a gated message passing technique, which provides significant improvements over vanilla message passing for inductive and transductive node classification. Then, we propose five baselines for the robustness analysis of GraphSage. We evaluate these models using white-box and black-box attacks, and discuss their properties. Our best results were given by models that aim to be robust to feature perturbations. This can be done by performing bounded perturbations on all one-hop and two-hop neighbors, as in our convex and PGD methods, or by allowing unbounded perturbations on a fraction of the neighbors, as in the feature-based perturbation method. In either case, this type of defense makes the model robust to edge perturbations; this is mainly because the model learns to defend against a broader set of attacks (i.e., nodes similar to the actual neighbors but not necessarily in the graph), and thus can defend against adversarial edge perturbations.

CHAPTER VII

CONCLUSION AND FUTURE DIRECTIONS

As deep learning becomes more powerful at solving problems in AI, more attention needs to be given to the cases where these models fail. In this dissertation, we investigated the robustness of neural network models with discrete input through the lens of adversarial examples. We first developed methods to attack these models, and then created methods to defend against such attacks.

Summary of Contributions

Attacking a Model with Discrete Input. Prior to this work, it was not known how to create adversarial examples for discrete input using the gradients of the model. We created differentiable string-edit and edge-manipulation operations to attack deep NLP and graph-based neural models, respectively. We studied a variety of tasks: machine translation and text classification in NLP, and inductive and transductive node classification and a recommender system for graph-based models.

Black-Box vs. White-Box Adversary for NLP. After developing the white-box adversary, we investigated the problem of targeted attacks on character-level neural machine translation (NMT), in which the adversary aims to remove or replace certain words in the text. While simple black-box attacks can hamper an NMT system, a white-box adversary can attack a model more aggressively. We showed that our white-box adversarially trained model performs better than previous black-box trained models on different types of noise, including the most elusive type, natural noise, which consists of the most common typos and misspellings. We proposed a new methodology for creating and evaluating targeted adversarial examples for machine translation.

Improved Message Passing in Graph Neural Nets. We created a novel message passing scheme that achieves state-of-the-art results on some node classification benchmarks, and that gives us a foundation for creating a link-based defense mechanism, which would rank positive edges, i.e., edges existing in the graph, higher than negative edges, i.e., edges that do not exist in the graph.

Robust GraphSage. We focused on developing robust node classification models that can defend against adversarial edge perturbations.
These adversarial perturbations could be due to a real-world adversary, e.g., a link farmer, or due to inherent noise in our measurements. In either case, the goal of a robust model is to perform well under noisy conditions. We provided five efficient methods to create a robust GraphSage; most of these methods rely on feature-based perturbations, which are equivalent to a variant of relaxed edge-based perturbations. We empirically showed that our defense mechanisms do not mask gradients, and that they have similar performance under white-box and black-box attacks.

Future Directions

Theoretical Investigation of Black-Box Examples. Different problems show varying degrees of sensitivity to adversarial examples; in other words, some models are easier to trick than others. Table 27 shows that neural machine translation is very sensitive even to random changes. On the other hand, text classification is the most robust task, and only white-box attacks can break it. Finally, while graph-based models are not sensitive to random noise, they can be broken with black-box attacks imported from a different model. Investigating the differences between these models would give us insight into the security properties of machine learning models. While there is some empirical investigation of black-box examples (Y. Liu et al., 2017), studying the theoretical properties of black-box adversarial examples (Tramèr et al., 2017) can shed light on the differences between machine learning models.

Table 27. Models' sensitivity to attacks

Model            White-Box   Black-Box   Random
Classification   High        Low         Low
NMT              High        High        High
Graph            High        High        Low

GANs for Defense. A few recent works have investigated the problem of creating adversarial examples with generative adversarial networks (GANs). The works of Xiao et al. (2018) and Song, Shu, et al. (2018) have the advantage that the norm-bounded perturbation requirement is removed. This can be extremely helpful for attacks on graph-based models. Recall that we have shown that feature-based perturbation provides a good defense; in particular, one of our feature-based adversarial training schemes uses projection of the adversarial neighbors onto the convex hull of the dataset. One way to create such adversarial neighbors is to create adversarial examples in the feature space using a GAN, which does not have a perturbation-bound requirement and can guarantee that we produce samples from a distribution similar to that of the actual nodes in the graph.

Robust Defense for NLP. There is growing interest in understanding the vulnerability of NLP models and improving their performance in noisy conditions (Michel & Neubig, 2018). Adversarial training, as our experiments demonstrated, can improve the robustness of NLP models. What is missing is provable guarantees on the performance of NLP models through robust optimization, which will be a harder problem for discrete inputs. Further, the current robust optimization methods for neural nets are limited to linear and convolutional layers (without pooling). Attention (Bahdanau et al., 2014) is the main building block of most NLP architectures. In order to create robust NLP models, we need to derive a dual for the attention layer, as well as for other building blocks that are less straightforward to handle. Robust NLP models can have applications in problems where optimal precision is required and certified models are desired.

Data-Poisoning Attacks/Defense for Graphs.
One popular task for graph-based machine learning is link prediction, which as we mentioned in chapter V, can be targeted by data poisoning attacks, in which the training set is manipulated. Thus, a defense mechanism for this task should be able to have a built-in mechanism to differentiate between good and bad training instances. We demonstrated how to attack a GCN-based recommender system, which posed the recommendation task as link prediction, but did not explore defense mechanisms for it. Defending against such attacks is an important research problem, especially for link prediction, since it is a popular area of research, and we can find real-world use cases, for which this type of attack is plausible. 107 APPENDIX A DEEP LEARNING BACKGROUND In this chapter, we will give a brief overview of deep learning for NLP and graph neural networks. In particular, we will describe the models that have been used for our experiments. Deep Learning Deep learning refers to deep neural network models, which have found great appeal in the AI community. Advancements in hardware, in particular efficient implementations of matrix operations in GPU, led to the era of deep learning. Compared to the previous machine learning techniques, deep learning focuses on representation learning, and has very simple learning and inference procedures. In the next section, we start by discussing a simple neural network, and then discuss more complex models which are used for our work. Feed-Forward Neural Network. A feed-forward neural network is a type of neural network wherein the neurons are structured in layers, and only connections to subsequent layers are allowed. Inspired by the brain, neural nets have a neuron-like behavior at their primary computational unit. The behavior of a neuron is controlled by its weights W . Hence, the weights are where the information learned by the neuron is stored. More precisely, a neuron uses the weighted sum of its inputs, and squeezes them using a nonlinearity function, such as a rectified linear unit (ReLU) (Nair & Hinton, 2010). h(x) = max(0,W Tx) (A.1) Training is achieved by minimizing the network loss, J(.), on the training set, e.g. the collection of documents which are labeled as spam or not. In order to minimize this function, the gradient ∂Jθ(.) ∂θ needs to be calculated, which can be done using 108 Hidden layer Figure A.30. Neural network with four input neurons, one hidden layer of six units, and 1 output neuron. backpropagation. A typical loss function used for classification problems is negative log-likelihood, which penalizes low probability scores for the correct class. J(X, Y ) = 1 n n∑ i=1 −log g(yi|xi) (A.2) where g is given by the output of the network, and n is the number of training instances. Variants of gradient descent methods, such as Adam (Kingma & Ba, 2015) will update parameters based on these gradients in several epochs. The result of this processe is a set of weights that enables the network to do the desired input-output mapping, as defined by the training data. Word Embeddings. At the core of deep learning techniques for NLP lies the vector-based word or character representations, which map words or characters to an n-dimensional space, e.g. 300, 500. This approach enables us to have multiple dimensions of similarity, and to encode the syntactic and semantic features of the words in a compressed numerical format. 
Having word vectors as parameters, rather than fixed input, makes neural models flexible in finding different word embeddings for separate tasks (Collobert & Weston, 2008). Furthermore, through a composition model, one can map phrases and sentences, i.e., sequences of these vectors, to the 109 same n-dimensional space. These word vectors are stacked into a word embedding matrix M ∈ Rn×|V | where |V | is the size of the vocabulary. Each word is associated with a vocabulary index into the embedding matrix, from which we retrieve the word’s vector from. We often use pre-trained word embeddings, which are trained on millions of sentences, and aim to make words that appear in similar contexts have more similar representations (Mikolov, Sutskever, et al., 2013). Highway Network. Highway networks (R. K. Srivastava et al., 2015) facilitate training deep networks by carrying some of the information from the lower layers through gates. The output of a highway network is the following: z = t · sigm(WHy + bH) + (1− t) · y (A.3) where t = ReLU(WTy + bT ), and ReLU(.) and sigm(.) are the element-wise rectified linear unit and sigmoid activation functions, respectively. Recurrent Neural Nets. Recurrent neural nets (RNNs) can be regarded as a neural network with feedback, where the output of a layer is fed back to a network. This allows us to model temporal aspect of data and to extract features from a sequence. RNNs are called recurrent because they perform the same computation for every element of a sequence, with the output being dependent on the previous computations. Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far. In theory, RNNs can make use of information in arbitrarily long sequences; but in practice, they are limited to looking back only a few steps. Figure A.31 shows a RNN being unrolled (or unfolded) into a full network with 6 steps. An unrolled RNN is like a feed-forward network, with shared parameters at layers, where a new input is fed to the network at every layer of the network. The formula that governs the computation happening 110 h1 h2 h3 h4 h5 h6 x1 x2 x3 x4 x5 x6 Figure A.31. A recurrent neural network in a RNN is as follows : ht = σ(Wxhxt +Whhht−1 + bh) where σ is a nonlinearity function such as tanh, Wxh represents the weights between the input and the hidden unit, and Whh represents the weight between the two consecutive hidden units. RNNs suffer from the vanishing gradients problem, which is the effect of multiplying n small gradients to compute gradients of the front layers in an n−layer network, meaning that the gradient decreases exponentially with n and the front layers train very slowly. There are a few ways to combat the vanishing gradient problem: using ReLU instead of either tanh or sigmoid activation functions. The ReLU derivative is a constant of either 0 or 1, so it is not as likely to suffer from vanishing gradients. An even more popular solution is to use Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) architectures. LSTMs (Hochreiter & Schmidhuber, 1997) is the most widely used models in NLP today. GRUs, first proposed by Cho et al. (2014), are simplified versions of LSTMs. RNNs have been used for language modeling (Mikolov, Karafia´t, Burget, Cernocky`, & Khudanpur, 2010), sequence labeling (Kurata, Xiang, Zhou, & Yu, 2016; Sak, Senior, & Beaufays, 2014), and parsing (Dyer, Ballesteros, Ling, Matthews, & Smith, 2015). Long Short-Term Memory (LSTM). 
In an LSTM, one cell consists of three gates (input, forget, output), and a cell unit. Because of the gating mechanism 111 in LSTMs, the cell can maintain information, and it can prevent the gradients from undesirable changes during backpropagation. Despite this design, gradient exploding still remains a problem, which is mitigated by gradient clipping (Pascanu, Mikolov, & Bengio, 2013). Gates use a sigmoid activation, while input and cell state are transformed with tanh. LSTM cell gates can be defined by the following set of equations: it = sigm(Wxixt +Whiht−1 + bi) ft = sigm(Wxfxt +Whfht−1 + bf ) ot = sigm(Wxoxt +Whoht−1 + bo) State updates can be defined by: ct = ft · ct−1 + it · cint ht = ot · tanh(ct) where cin is the transformed input: cint = tanh(Wxcxt +Whcht−1 + bcin) Because of the gating mechanism, the cell can maintain information for longer periods of time and can prevent the gradient from vanishing during backpropagation. Sequence-to-Sequence. A widespread architecture based on RNNs is a sequence-to-sequence model (Seq2Seq) (Sutskever, Vinyals, & Le, 2014), which is constructed by having one RNN at the source side, and one RNN at the target side. This model can be used for machine translation (Cho et al., 2014; Sutskever et al., 2014), text summarization (Nallapati, Zhou, Gulcehre, Xiang, et al., 2016), 112 h1 h2 h3 h4 h5 h6 x1 x2 x3 x4 x5 x6 y1 y2 y3 y4 y5 y6 h1 h2 h3 h4 h5 h6 Figure A.32. A Simple sequence-to-sequence model paraphrase generation (Prakash et al., 2016), and dialogue generation (J. Li et al., 2017). Figure A.32 shows a Seq2Seq model. In sequence-to-sequence generation tasks, each input sequences, x, is coupled with an output sequence to predict y, both of which are modeled by an RNN. Specifically, for machine translation, the input is a sequence of words in the source language and the output is a sequence of words in the target language. The initial state of the decoding RNN comes from the representation learned by the encoder. But in attention-based translation (Bahdanau et al., 2014), the initial state at the decoder is given by a weighted combination of the entire source, which is known as the context vector. Concretely, given the target hidden state ht, and the source-side context vector ct, produce an attentional hidden state as follows: ht = tanh(Wc[ct;ht]) Character-Level NLP Models Character-level models and those based on sub-word units are able to capture morphological patterns, which leads to improving generalization to unseen words (Luong & Manning, 2016; Sennrich, Haddow, & Birch, 2016). Deep character models have been successfully applied in several NLP tasks (Ballesteros, Dyer, & Smith, 2015; Costa-Jussa & Fonollosa, 2016; Kim et al., 2016; Lample, Ballesteros, Subramanian, Kawakami, & Dyer, 2016; Ling, Lu´ıs, et al., 2015; Ling, Trancoso, Dyer, & Black, 113 2015; Ma & Hovy, 2016; X. Zhang et al., 2015). One can use characters to create word embeddings, or use them stand-alone by representing text as a stream of characters. Extracting features from character sequences can be done with convolutional or recurrent neural nets. Another distinction between character-aware models is whether the input is segmented at the word level. Fully character models in translation with no word-level segmentation have been found to work well (Chung, Cho, & Bengio, 2016; Lee, Cho, & Hofmann, 2017). CharCNN-LSTM Architecture. The architecture we study is based on the one proposed by Kim et al. (2016) for character-level language modeling. 
Feature extraction is performed by convolutions over characters, which are passed to layers of highway networks, and finally given to stacks of recurrent neural nets for modeling a sequence of words. This architecture is flexible to perform multiple tasks. For example, in sequence labeling (Kim et al., 2016) the output of every recurrent unit is passed to a softmax layer to predict the next word. In this paper, we use this architecture on the encoder side of a seq-to-seq model, as demonstrated by Costa-Jussa and Fonollosa (2016) for machine translation. We also use this architecture for our text classification experiments, wherein the output of the last recurrent unit is passed to a softmax to predict the label of text. We briefly explain this architecture and refer the reader to Kim et al. (2016) for more details. See Figure A.33. Let V be the alphabet, and let xij ∈ {0, 1}|V | denote a one-hot vector representing the j-th character of the i-th word. The character sequence can be represented by x = [(x11,.. x1n);..(xm1,.. xmn)] wherein a semicolon denotes explicit segmentation between words. The number of words is denoted by m, and n is the number of maximum characters allowed for a 114 LSTM LSTM seq-decoder y Character Embeddings Convolutional Network Highway Network This style of painting Figure A.33. The CharCNN-LSTM architecture used for studying two tasks of text classification and translation. word1. The features associated with character n-grams are extracted by convolutional neural networks (CNNs). The one-hot vectors are multiplied by the embedding matrix to get the individual character embeddings. For each word, narrow convolutions between the sequence of the character embeddings and followed by a nonlinearity, are applied to obtain the feature maps. Next, a temporal max-pooling is applied, the output of which is given to 2 layers of highway networks. Finally, the output of the highway networks is given to a 2-layer stacked LSTMs. Graph Neural Networks In this section, we cover related work in the area of graph neural networks. Graph convolutional network (GCN) is a recent development, which aims to model relational data using neural nets. It is a representation learning scheme for graph-structured data; this model has been applied to node classification (Kipf & Welling, 2017), link prediction (Berg et al., 2017; Schlichtkrull et al., 2017), and NLP (Marcheggiani & Titov, 2017) with success. It is a simplification of the model proposed by Defferrard et 1Padding is applied if the number of characters is fewer than the maximum. 115 al. (Defferrard et al., 2016), which approximates smooth filters in the spectral domain, using Chebyshev polynomials with learnable parameters in a neural model. Another difference between the two models is that GCN is used to learn node representations, while the convolutional network in (Defferrard et al., 2016) is used to generate a representation for the whole graph. Graph attention networks (GAT) (Velickovic et al., 2018) and attention-based graph neural network (AGNN) (Thekumparampil et al., 2018) provide a self-attention mechanism, where the weight of each neighbor’s contribution to the embedding comes from an attention function, which uses the nodes’ own representation and the neighbors’. J. Zhang et al. (2018) extend GAT by performing gating on the attention heads of GAT. GeniePath (Z. Liu et al., 2018) uses a combination of attention and memory architecture to adaptively pick features from neighbors in deeper layers. Abu-El-Haija et al. 
(2018) train multiple instances of GCNs over node pairs discovered at different distances in random walks. Du et al. (2017) define graph convolution operation on the vertex domain as multiplication by polynomials of the graph adjacency matrix. In this model, filters are adaptive to the topology of the graph as they scan the graph to perform convolution. Mean field networks (MFNs) (Y. Li & Zemel, 2014) are hierarchical probabilistic models, in which variables in the hierarchy are connected based on an underlying structure. These networks can be regarded as unrolled instantiations of mean field inference in graphical models, in which the iterative inference procedure is replaced with layers that resemble feed-forward networks. A few works have used probabilistic inference to study inference in graph neural networks. Most notably, Dai, Dai, and Song (2016) propose a latent variable model to learn node representations (embeddings), inspired by mean-field and loopy belief propagation inference 116 procedures. They use the assumption of an injective embedding (Sriperumbudur, Gretton, Fukumizu, Scho¨lkopf, & Lanckriet, 2010) on node embeddings, and they parameterize the embeddings with a neural net. Their model has only one layer which makes it less powerful than GCN. Using similar machinery, Graphite (Grover, Zweig, & Ermon, 2018) performs unsupervised learning of representations over nodes in a graph using deep latent variable generative models. Shallow embedding learning methods have been extensively studied in the past few years. DeepWalk (Perozzi, Al-Rfou, & Skiena, 2014) samples a set of paths from the input graph using truncated random walks, and uses SkipGram modeling (Mikolov, Chen, Corrado, & Dean, 2013) to maximize the co-occurrence probability among the nodes that appear in a walk. Line (Tang et al., 2015) learns half of the dimensions of the embedding by using immediate neighbors of a node, and the other half using the two-hop neighbors. Node2vec (Grover & Leskovec, 2016) performs a more guided neighborhood search which explores more diverse neighborhoods. These methods outperform GCN and its variants in learning representations for planar graphs, but are not designed to perform as well on graphs which have features on nodes. In addition, all these methods require a two-phase pipeline, including learning embeddings and training a model for a downstream task. Yang et al. (2016) address this problem by injecting label information in embedding learning; however, the method is still not competitive with GCN on feature-based graphs. Node classification can be studied in transductive and inductive settings; in the former the model has access to the whole graph, including the test nodes, and uses that as extra knowledge. In the latter, the model does not see test nodes in training, and will only add the test nodes to the graph at test time. See Figure A.34 for an illustration. For our transductive node classification experiments, we use graph 117 Figure A.34. Illustration of transductive node classification (left) and inductive node classification (right). The white nodes are the test nodes, which are to be classified to one of the two classes colored green and gray. convolutional networks (Kipf & Welling, 2017), and for our inductive experiments, we use GraphSage (Hamilton et al., 2017). Graph Convolutional Neural Networks. A Graph Convolutional Neural Network (Kipf & Welling, 2017) learns representations from structured data given by a graph G = (V,E). 
A GCN can be regarded as an encoder eθ(A,X), which takes the adjacency matrix, A, and a |V | × m feature matrix, X, and produces a |V | × n embedding matrix, also called hidden representations. For example, the representation of the graph at layer l + 1 can be given by: h(l+1) = ReLU(Ah(l)W l) (A.4) where W l is the weights for the lth layer. The multiplication with the adjacency matrix could bring about scaling problems, and hence A is normalized. Concretely, each outgoing edge, Aij, can be normalized by the number of neighbors of that node 1/|Ni|, or by a combination of both nodes involved in the edge, 1/ √|Ni||Nj|. One can view the local graph convolution in Equation A.4 as a differentiable message passing, where vector-valued messages are being passed across the graphs and transformed at nodes. The final hidden representation can be passed to other 118 layers for specific tasks, such as maximum entropy for classification (Kipf & Welling, 2017; Schlichtkrull et al., 2017), or a decoder for link prediction (Berg et al., 2017). If K different types of relations exist in the data, then there will be Ar, r ∈ 1, 2, ..K adjacency matrices, and K different sets of weights for different relations. Sample and Aggregate. One of the drawbacks of GCNs is that they require full-graph batch optimization with a sparse-dense matrix multiplication, which will cause memory problems in large graphs. Concretely, they were initially used for transductive settings, where all the nodes of the graph, including test nodes, are seen in training. To address this issue, several sampling-based approaches have been proposed, which would sample a small number of nodes at each layer during mini- batch training (Chen et al., 2018; Chen & Zhu, 2018; Hamilton et al., 2017), and aggregate them. GraphSAGE (Hamilton et al., 2017) uses a sampling algorithm to select a subset of the neighbors from one-hop and two-hop neighborhoods. Embedding for the node i in a two-layer GraphSage with a mean aggregator can be represented as the following. Hi = W2 [ h1i︷ ︸︸ ︷ ReLU(W1[xi;α ∑ j∼N (i) ajX]);α ∑ j h1j︷ ︸︸ ︷ ReLU(W1[ajX; β ∑ k∼N (j) akX]) ] where aj is the one hot-vector representing the jth neighbor, and α and β are normalization constants. Figure A.35 shows a graph where nodes from first-hop and second-hop neighbors are involved in learning embeddings for the green node. Node two is not sampled and will not contribute to the embedding produced for the green node. The main idea behind these sampling-based approaches is that we only need to create the embedding of the nodes in the mini-batch to be able to do stochastic gradient descent. FastGCN (Chen et al., 2018) subsamples the receptive field in each layer universally, using 119 0 1 2 34 5 6 7 8 9 Figure A.35. Message Passing in GCN or GraphSage, using two-hop neighbors (blue) and one-hop neighbors (red) to the target node (green). In GraphSage not all neighbors contribute to the representation of the target node. For instance, the white node does not contribute to learning the representation of the target node. importance sampling. Chen and Zhu (2018) develop a control variate-based stochastic approximation, which utilizes historical activations of nodes as a control variate. 120 APPENDIX B ROBUST GRAPH NEURAL NETS Loopy-BP-Inspired Message Passing In this section, we describe our modified message passing scheme for our transductive experiments in Chapter V. 
The datasets are small and are sparsely labeled, and we found that increasing the receptive field by incorporating longer range dependencies, e.g., two-hop neighbors, can be helpful. Note that we still need E computations, where E is the number of edges. Specifically, we combine messages coming from the two-hop neighbors at the corresponding neighbor, before feeding them to the gating function. That is, j will send its neighbor i, the message α(xj + ∑ k∈N (j),k 6=i xk), where α is a normalization constant. This means node j is sending a message to a particular neighbor by combining all of its neighbors’ embeddings and its own embedding, excluding the receiving neighbor’s. This message passing scheme is inspired by loopy belief propagation (BP) in graphical models; in loopy BP, a message from a node a to a neighbor v is the accumulated messages from all other nodes, marginalized over all variables except v. Figure B.36c depicts a setting where nodes at the first layer will send messages to their second-hop neighbors directly. Our best results used the scheme in B.36d, yx z (a) Simple 3-node chain H2 H1 X (b) Standard H2 H1 X (c) Loopy BP inspired H2 H1 X (d) Final model Figure B.36. A graphical representation contrasting standard message passing among immediate neighbors and loopy-bp-inspired message passing. Nodes from lower layers send messages to the higher ones. 121 wherein two-hop-based message passing scheme exists between layer 1 and layer 2, while using no message passing between nodes at layer 0 and layer 1. That is, the first layer of our GCN is simply a one-layer neural net, and a GCN with a bigger receptive field is at the next layer. Robust Optimization for GraphSage The linear program, described in 6.7, can be solved without the use of linear program solvers. Concretely, the dual of the program can be solved by another neural network which is very similar to the backprop pass of the original network. We associate the following dual variables with each of the constraints ri = Wf hˆ i 2 + bf → ν hˆi2 = W2[h i 1;α ∑ j∼N (i) hj1]→ η2 hˆi1 = W1[xˆi;α ∑ j∼N (i) xˆj]→ η1 hˆj1 = W1[xˆj; β ∑ k∼N (j) xˆk]→ γ − hz1,d ≤ 0→ µzd hˆz1,d − hz1,d ≤ 0→ τ zd (uzd − lzd)hz1,d − uzdhˆz1,d ≤ −uzdlzd → λzd −xˆj ≤ −xj + 1 → ξ−1 xˆj ≤ xj + 1 → ξ+1 j ∼ N (i)  −xˆk ≤ −xk + 2 → ξ−2 xˆk ≤ xk + 2 → ξ+2 k ∼ N (j) xˆi = xi → ξ3 (B.1) Now the dual problem becomes: 122 maximize (−νT bf + ∑ d ∑ z λzdu z dl z d + (xj − 1)T ξ−1 − (xj + 1)T ξ+1 + (xk − 2)T ξ−2 − (xk + 2)T ξ+2 − xTi ξ3) subject to: ν = eytarget − ey∗ η2 = W T f ν  ηd1 = 0, d ∈ I−i γd = 0, d ∈ I−j ηd1 = (V T 2 η2)d, d ∈ I+i γd = α(UT2 η2)d, d ∈ I+j d ∈ Ii  ( λid(u i d − lid)− µid − τ id ) = (V T2 η2)d ηd1 = λ i du i d − τ id d ∈ Ij  ( λjd(u j d − ljd)− µjd − τ jd ) = α(UT2 η2)d γd = λjdu j d − τ jd ξ+1 − ξ−1 = V T1 γ + αUT1 η1 ξ+2 − ξ−2 = βUT1 γ ξ3 = V 1 Tη1 λ, τ, µ, ξ+1 , ξ − 1 , ξ + 2 , ξ − 2 , ξ3 ≥ 0 (B.2) We break W into two submatrices V , i.e., operating on the node and U , i.e., operating on neighbors. Concretely, W1[xi;α ∑ j∼N (i) xj] = [V1xi;U1α ∑ j∼N (i) xj] (B.3) 123 We overload the notation and use η2, η1, and γ to refer to both halves of these variables. That is, for ease in exposition, we do not differentiate between the two parts that are being concatenated at each layer. 
Kolter and Wong (2017) observe that due to the complementary slackness property in convex optimization, either λ (i.e., Lagrange multiplier for the upper bound constraint) or µ+τ (i.e., Lagrange multiplier for either of the lower bound constraints) must be zero. After re-writing some of the constraints, they are able to derive dual variables as positive and negative portions of a given vector, and represent the dual problem as a backward pass with leaky ReLU activation functions. For our problem, I, and the dual variables η1, γ can be represented as following. ηd1 =  0 d ∈ I−i (V T2 η2)d d ∈ I+i uid uid−lid [(V T2 η2)d]+ − δid[(V T2 η2)d]− d ∈ Ii (B.4) γd =  0 d ∈ I−j α(UT2 η2)d d ∈ I+j ujd ujd−ljd [α(UT2 η2)d]+ − δjd[α(UT2 η2)d]− d ∈ Ij (B.5) where δid is defined as uid uid−lid , and []+ and []− return positive and negative values of an input, respectively. Note that we create representations of nodes in the minibatch, 124 and their first-hop neighbors in the first layer. This is the only part of the network that contains nonlinearity, and thus, we will have upper and lower bound for all activations for only these nodes. The upper and lower bound computation follows the procedure described by Kolter and Wong (2017). In addition, GraphSage uses a normalization on the logits before they are given to the classifier. For this normalization, we tried both layer normalization (Ba et al., 2016) and batch-normalization (Ioffe & Szegedy, 2015), the latter of which performed better. Batch normalization can be represented by a linear layer, and hence its dual layer can be easily incorporated (Wong, Schmidt, Metzen, & Kolter, 2018). We omitted that layer for the sake of clarity. The final objective function that we need to maximize is: J(η1, η2, ν, γ) = (− νT bf + − (xj + 1)T [V T1 γ + αUT1 η1]+ + (xj − 1)T [V T1 γ + αUT1 η1]− − (xk + 2)T [βUT1 γ]+ + (xk − 2)T [βUT1 γ]− − xTi V 1Tη1∑ d∈Ii lid[η d 1 ]+ + ∑ d∈Ij ljd[γ d]+ ) (B.6) Which can further be simplified to: J(η1, η2, ν, γ) = (− νT bf + − xTj (V T1 γ + αUT1 η1)− βxTkUT1 γ − xTi V 1Tη1 − 1 ‖ V T1 γ + αUT1 η1 ‖1 −2 ‖ βUT1 γ ‖1∑ d∈Ii lid[η d 1 ]+ + ∑ d∈Ij ljd[γ d]+ ) (B.7) Jacobian Regularization with Layer Normalization Layer Normalization performs instance-wise normalization over inputs to a neural net, as described in by Ba et al. (2016). Concretely, the input to a nonlinearity goes 125 through the following normalization layer: LNorm(x) = x− E(x)√ V ar(x) ∗ ζ + ψ where E(x) and V ar(x) denote the mean and variance over the features of a single instance. When we use layer normalization, the Jacobian will have the following form. Jσ(αU1∑j∼N (i) xj)(xj) = UT1 σ′(LNorm(U1α ∑ j∼N (i) xj)) ζ√ V ar(U1α ∑ j∼N (i) xj) (B.8) So here, both ζ and U1 are going to be regularized. 126 APPENDIX C NOTATION AND SYMBOLS θ Model parameters W Weight matrix X Dataset inputs Y Dataset outputs xi Input data point xij jth character of the ith word yi Output data point h Hidden layer A Adjacency matrix aj One-hot vector with one at the jth position g Neural net output J Loss function J Jacobian L Number of network layers  Perturbation magnitude δ Perturbation vector α Normalization constant β Normalization constant σ Nonlinearity V Graph nodes V Vocabulary E Graph edges Element-wise product operator < . > Dot product operator 127 REFERENCES CITED Abu-El-Haija, S., Kapoor, A., Perozzi, B., & Lee, J. (2018). N-GCN: Multi-scale graph convolution for semi-supervised node classification. arXiv preprint arXiv:1802.08888 . Adamic, L. 
A., & Glance, N. (2005). The political blogosphere and the 2004 us election: divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery (pp. 36–43). Athalye, A., Carlini, N., & Wagner, D. (2018). Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proceedings of International Conference on Machine Learning. Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450 . Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations. Ballesteros, M., Dyer, C., & Smith, N. A. (2015). Improved transition-based parsing by modeling characters instead of words with lstms. In Proceedings of Empirical Methods in Natural Language Processing. Belinkov, Y., & Bisk, Y. (2018). Synthetic and natural noise both break neural machine translation. In International Conference on Learning Representations. Berg, R. v. d., Kipf, T. N., & Welling, M. (2017). Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263 . Berger, Y. (2017). Israel arrests Palestinian because Facebook translated ’good morning’ to ’attack them’. Brendel, W., & Bethge, M. (2017). Comment on” biologically inspired protection of deep networks from adversarial attacks”. arXiv preprint arXiv:1704.01547 . Carlini, N., & Wagner, D. (2017a). Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th acm Workshop on Artificial Intelligence and Security (pp. 3–14). Carlini, N., & Wagner, D. (2017b). Towards evaluating the robustness of neural networks. In Security and Privacy, IEEE European Symposium on. 128 Chancellor, S., Pater, J. A., Clear, T., Gilbert, E., & De Choudhury, M. (2016). # thyghgapp: Instagram content moderation and lexical variation in pro-eating disorder communities. In Proceedings of Conference on Computer-Supported Cooperative Work and Social Computing (pp. 1201–1213). Chen, J., Ma, T., & Xiao, C. (2018). FastGCN: Fast learning with graph convolutional networks via importance sampling. In International Conference on Learning Representations. Chen, J., & Zhu, J. (2018). Stochastic training of graph convolutional networks. In International Conference on Learning Representations. Cho, K., Van Merrie¨nboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using rnn encoder-decoder for statistical machine translation. In Proceedings of Empirical Methods in Natural Language Processing. Chung, J., Cho, K., & Bengio, Y. (2016). A character-level decoder without explicit segmentation for neural machine translation. In Proceedings of Association of Computational Linguistics. Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of International Conference on Machine Learning (pp. 160–167). Conneau, A., Schwenk, H., Barrault, L., & Lecun, Y. (2017). Very deep convolutional networks for natural language processing. In Proceedings of European Chapter of Association of Computational Linguistics. Costa-Jussa, M. R., & Fonollosa, J. A. (2016). Character-based neural machine translation. In Proceedings of Association of Computational Linguistics. Dai, H., Dai, B., & Song, L. (2016). Discriminative embeddings of latent variable models for structured data. 
In Proceedings of International Conference on Machine Learning (pp. 2702–2711).
Dai, H., Li, H., Tian, T., Huang, X., Wang, L., Zhu, J., & Song, L. (2018). Adversarial attack on graph structured data. In Proceedings of International Conference on Machine Learning.
Dalvi, N., Domingos, P., Sanghai, S., Verma, D., et al. (2004). Adversarial classification. In Proceedings of Conference on Knowledge Discovery and Data Mining (pp. 99–108).
Dauphin, Y. N., Fan, A., Auli, M., & Grangier, D. (2017). Language modeling with gated convolutional networks. In Proceedings of International Conference on Machine Learning.
Defferrard, M., Bresson, X., & Vandergheynst, P. (2016). Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of Advances in Neural Information Processing Systems (pp. 3844–3852).
Dong, Y., Liao, F., Pang, T., Su, H., Hu, X., Li, J., & Zhu, J. (2018). Boosting adversarial attacks with momentum. In Proceedings of Conference on Computer Vision and Pattern Recognition.
Douceur, J. R. (2002). The Sybil attack. In Proceedings of International Workshop on Peer-to-Peer Systems (pp. 251–260).
Du, J., Zhang, S., Wu, G., Moura, J. M., & Kar, S. (2017). Topology adaptive graph convolutional networks. arXiv preprint arXiv:1710.10370.
Dyer, C., Ballesteros, M., Ling, W., Matthews, A., & Smith, N. A. (2015). Transition-based dependency parsing with stack long short-term memory. In Proceedings of Association of Computational Linguistics.
Ebrahimi, J., Lowd, D., & Dou, D. (2018). On adversarial examples for character-level neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 653–663).
Ebrahimi, J., Rao, A., Lowd, D., & Dou, D. (2018). HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 31–36).
Ehlers, R. (2017). Formal verification of piece-wise linear feed-forward neural networks. In Proceedings of International Symposium on Automated Technology for Verification and Analysis (pp. 269–286).
Feinman, R., Curtin, R. R., Shintre, S., & Gardner, A. B. (2017). Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410.
Gantner, Z., Rendle, S., Freudenthaler, C., & Schmidt-Thieme, L. (2011). MyMediaLite: A free recommender system library. In Proceedings of RecSys.
Garey, M. R., & Johnson, D. S. (2002). Computers and intractability (Vol. 29). W. H. Freeman.
Ghosh, S., Viswanath, B., Kooti, F., Sharma, N. K., Korlam, G., Benevenuto, F., ... Gummadi, K. P. (2012). Understanding and combating link farming in the Twitter social network. In Proceedings of World Wide Web (pp. 61–70).
Globerson, A., & Roweis, S. (2006). Nightmare at test time: Robust learning by feature deletion. In Proceedings of International Conference on Machine Learning (pp. 353–360).
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... Bengio, Y. (2014). Generative adversarial nets. In Proceedings of Advances in Neural Information Processing Systems (pp. 2672–2680).
Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
Grosse, K., Manoharan, P., Papernot, N., Backes, M., & McDaniel, P. (2017). On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280.
Grosse, K., Papernot, N., Manoharan, P., Backes, M., & McDaniel, P. (2017). Adversarial perturbations against deep neural networks for malware classification. In International Conference on Learning Representations.
Grover, A., & Leskovec, J. (2016). node2vec: Scalable feature learning for networks. In Proceedings of Conference on Knowledge Discovery and Data Mining (pp. 855–864).
Grover, A., Zweig, A., & Ermon, S. (2018). Graphite: Iterative generative modeling of graphs. arXiv preprint arXiv:1803.10459.
Gu, S., & Rigazio, L. (2015). Towards deep neural network architectures robust to adversarial examples. In International Conference on Learning Representations.
Hamilton, W., Ying, Z., & Leskovec, J. (2017). Inductive representation learning on large graphs. In Proceedings of Advances in Neural Information Processing Systems (pp. 1025–1035).
Harper, F. M., & Konstan, J. A. (2016). The MovieLens datasets: History and context. Transactions on Interactive Intelligent Systems, 5(4), 19.
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
Hosseini, H., Kannan, S., Zhang, B., & Poovendran, R. (2017). Deceiving Google's Perspective API built for detecting toxic comments. arXiv preprint arXiv:1702.08138.
Hug, N. (2017). Surprise, a Python library for recommender systems. http://surpriselib.com.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of International Conference on Machine Learning.
Iyyer, M., Wieting, J., Gimpel, K., & Zettlemoyer, L. (2018). Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of North American Chapter of Association of Computational Linguistics.
Jia, R., & Liang, P. (2017). Adversarial examples for evaluating reading comprehension systems. In Proceedings of Empirical Methods in Natural Language Processing.
Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of Empirical Methods in Natural Language Processing.
Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2016). Character-aware neural language models. In Proceedings of Association for the Advancement of Artificial Intelligence.
Kingma, D., & Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.
Kipf, T. N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.
Kolter, J. Z., & Wong, E. (2017). Provable defenses against adversarial examples via the convex outer adversarial polytope. In Proceedings of International Conference on Machine Learning.
Koren, Y. (2008). Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proceedings of Conference on Knowledge Discovery and Data Mining (pp. 426–434).
Koren, Y. (2010). Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data, 4(1).
Kurakin, A., Goodfellow, I., & Bengio, S. (2017a). Adversarial examples in the physical world. In International Conference on Learning Representations.
Kurakin, A., Goodfellow, I., & Bengio, S. (2017b). Adversarial machine learning at scale. In International Conference on Learning Representations.
Kurata, G., Xiang, B., Zhou, B., & Yu, M. (2016). Leveraging sentence-level information with encoder LSTM for natural language understanding.
In Proceedings of Empirical Methods in Natural Language Processing.
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. In Proceedings of North American Chapter of Association of Computational Linguistics.
Lee, J., Cho, K., & Hofmann, T. (2017). Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics, 5, 365–378.
Li, J., Monroe, W., Shi, T., Jean, S., Ritter, A., & Jurafsky, D. (2017). Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547.
Li, Y., & Zemel, R. (2014). Mean-field networks. ICML 2014 Workshop on Learning Tractable Probabilistic Models.
Lin, F., & Cohen, W. W. (2010). Semi-supervised classification of network data using very few labels. In Proceedings of International Conference on Advances in Social Networks Analysis and Mining (pp. 192–199).
Ling, W., Luís, T., Marujo, L., Astudillo, R. F., Amir, S., Dyer, C., ... Trancoso, I. (2015). Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of Empirical Methods in Natural Language Processing.
Ling, W., Trancoso, I., Dyer, C., & Black, A. W. (2015). Character-based neural machine translation. arXiv preprint arXiv:1511.04586.
Liu, Y., Chen, X., Liu, C., & Song, D. (2017). Delving into transferable adversarial examples and black-box attacks. In International Conference on Learning Representations.
Liu, Z., Chen, C., Li, L., Zhou, J., Li, X., & Song, L. (2018). GeniePath: Graph neural networks with adaptive receptive paths. arXiv preprint arXiv:1802.00910.
Lowd, D., & Meek, C. (2005). Adversarial learning. In Proceedings of Conference on Knowledge Discovery and Data Mining (pp. 641–647).
Luong, M.-T., & Manning, C. D. (2016). Achieving open vocabulary neural machine translation with hybrid word-character models. In Proceedings of Association of Computational Linguistics.
Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. In Proceedings of Empirical Methods in Natural Language Processing.
Ma, X., & Hovy, E. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of Association of Computational Linguistics.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2018). Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations.
Marcheggiani, D., & Titov, I. (2017). Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of Empirical Methods in Natural Language Processing.
Mauro, C., Jan, N., Sebastian, S., Luisa, B., Roldano, C., & Marcello, F. (2016). The IWSLT 2016 evaluation campaign. In International Workshop on Spoken Language Translation.
Metzen, J. H., Genewein, T., Fischer, V., & Bischoff, B. (2017). On detecting adversarial perturbations. In International Conference on Learning Representations.
Michel, P., & Neubig, G. (2018). MTNT: A testbed for machine translation of noisy text. In Proceedings of Empirical Methods in Natural Language Processing.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Mikolov, T., Karafiát, M., Burget, L., Cernocký, J., & Khudanpur, S. (2010). Recurrent neural network based language model. In Proceedings of Interspeech.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Proceedings of Advances in Neural Information Processing Systems (pp. 3111–3119).
Miyato, T., Dai, A. M., & Goodfellow, I. (2017). Adversarial training methods for semi-supervised text classification. In International Conference on Learning Representations.
Miyato, T., Maeda, S.-i., Koyama, M., Nakae, K., & Ishii, S. (2016). Distributional smoothing with virtual adversarial training. In International Conference on Learning Representations.
Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., & Bronstein, M. M. (2017). Geometric deep learning on graphs and manifolds using mixture model CNNs. In Conference on Computer Vision and Pattern Recognition.
Moosavi-Dezfooli, S.-M., Fawzi, A., & Frossard, P. (2016). DeepFool: A simple and accurate method to fool deep neural networks. In Proceedings of Computer Vision and Pattern Recognition (pp. 2574–2582).
Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of International Conference on Machine Learning (pp. 807–814).
Nallapati, R., Zhou, B., Gulcehre, C., Xiang, B., et al. (2016). Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of Conference on Computational Natural Language Learning.
Nayebi, A., & Ganguli, S. (2017). Biologically inspired protection of deep networks from adversarial attacks. arXiv preprint arXiv:1703.09202.
Nguyen, A., Yosinski, J., & Clune, J. (2015). Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of Conference on Computer Vision and Pattern Recognition (pp. 427–436).
Oord, A. v. d., Kalchbrenner, N., & Kavukcuoglu, K. (2016). Pixel recurrent neural networks. In Proceedings of International Conference on Machine Learning.
Papernot, N., McDaniel, P., & Goodfellow, I. (2016). Transferability in machine learning: From phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277.
Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., & Swami, A. (2017). Practical black-box attacks against machine learning. In Proceedings of Asia Conference on Computer and Communications Security (pp. 506–519).
Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z. B., & Swami, A. (2016). The limitations of deep learning in adversarial settings. In IEEE European Symposium on Security and Privacy (pp. 372–387).
Papernot, N., McDaniel, P., Swami, A., & Harang, R. (2016). Crafting adversarial input sequences for recurrent neural networks. In Proceedings of Military Communications Conference (pp. 49–54).
Papernot, N., McDaniel, P., Wu, X., Jha, S., & Swami, A. (2016). Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy (pp. 582–597).
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of Association of Computational Linguistics (pp. 311–318).
Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In International Conference on Machine Learning (pp. 1310–1318).
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing (pp. 1532–1543).
Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). DeepWalk: Online learning of social representations. In Proceedings of Conference on Knowledge Discovery and Data Mining (pp. 701–710).
Prakash, A., Hasan, S. A., Lee, K., Datla, V., Qadir, A., Liu, J., & Farri, O. (2016). Neural paraphrase generation with stacked residual LSTM networks. In Proceedings of Computational Linguistics.
Raghunathan, A., Steinhardt, J., & Liang, P. (2018). Certified defenses against adversarial examples. In International Conference on Learning Representations.
Rawlinson, G. E. (1976). The significance of letter position in word recognition (Unpublished doctoral dissertation). University of Nottingham.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2018). Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of Association for Computational Linguistics (pp. 856–865).
Sak, H., Senior, A. W., & Beaufays, F. (2014). Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of Interspeech (pp. 338–342).
Schlichtkrull, M., Kipf, T. N., Bloem, P., Berg, R. v. d., Titov, I., & Welling, M. (2017). Modeling relational data with graph convolutional networks. arXiv preprint arXiv:1703.06103.
Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., & Eliassi-Rad, T. (2008). Collective classification in network data. AI Magazine, 29(3), 93.
Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with subword units. In Proceedings of Association of Computational Linguistics.
Shaham, U., Yamada, Y., & Negahban, S. (2015). Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432.
Simonyan, K., Vedaldi, A., & Zisserman, A. (2013). Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
Sinha, A., Namkoong, H., & Duchi, J. (2018). Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations.
Song, Y., Kim, T., Nowozin, S., Ermon, S., & Kushman, N. (2018). PixelDefend: Leveraging generative models to understand and defend against adversarial examples. In International Conference on Learning Representations.
Song, Y., Shu, R., Kushman, N., & Ermon, S. (2018). Generative adversarial examples. In Proceedings of Advances in Neural Information Processing Systems.
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., & Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr), 1517–1561.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958.
Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway networks. arXiv preprint arXiv:1505.00387.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Proceedings of Advances in Neural Information Processing Systems (pp. 3104–3112).
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2014). Intriguing properties of neural networks. In International Conference on Learning Representations.
Tabacof, P., & Valle, E. (2016). Exploring the space of adversarial images.
In 2016 International Joint Conference on Neural Networks (pp. 426–433).
Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., & Mei, Q. (2015). LINE: Large-scale information network embedding. In Proceedings of World Wide Web (pp. 1067–1077).
Teo, C. H., Globerson, A., Roweis, S. T., & Smola, A. J. (2008). Convex learning with invariances. In Advances in Neural Information Processing Systems (pp. 1489–1496).
Thekumparampil, K. K., Wang, C., Oh, S., & Li, L.-J. (2018). Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv:1803.03735.
Torkamani, M., & Lowd, D. (2013). Convex adversarial collective classification. In Proceedings of International Conference on Machine Learning (pp. 642–650).
Tramèr, F., Kurakin, A., Papernot, N., Boneh, D., & McDaniel, P. (2018). Ensemble adversarial training: Attacks and defenses.
Tramèr, F., Papernot, N., Goodfellow, I., Boneh, D., & McDaniel, P. (2017). The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453.
van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. (2016). Conditional image generation with PixelCNN decoders. In Proceedings of Advances in Neural Information Processing Systems (pp. 4790–4798).
Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2018). Graph attention networks. In International Conference on Learning Representations.
Wiseman, S., & Rush, A. M. (2016). Sequence-to-sequence learning as beam-search optimization. In Proceedings of Empirical Methods in Natural Language Processing.
Wong, E., Schmidt, F., Metzen, J. H., & Kolter, J. Z. (2018). Scaling provable adversarial defenses. arXiv preprint arXiv:1805.12514.
Wu, Y., Bamman, D., & Russell, S. (2017). Adversarial training for relation extraction. In Proceedings of Empirical Methods in Natural Language Processing (pp. 1778–1783).
Xiao, C., Li, B., Zhu, J.-Y., He, W., Liu, M., & Song, D. (2018). Generating adversarial examples with adversarial networks. arXiv preprint arXiv:1801.02610.
Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., & Yuille, A. (2017). Adversarial examples for semantic segmentation and object detection. In Proceedings of International Conference on Computer Vision.
Xu, H., Caramanis, C., & Mannor, S. (2009). Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(Jul), 1485–1510.
Yang, Z., Cohen, W. W., & Salakhutdinov, R. (2016). Revisiting semi-supervised learning with graph embeddings. In Proceedings of International Conference on Machine Learning.
Yang, Z., Wilson, C., Wang, X., Gao, T., Zhao, B. Y., & Dai, Y. (2014). Uncovering social network Sybils in the wild. Transactions on Knowledge Discovery from Data, 8(1).
Zhang, J., Shi, X., Xie, J., Ma, H., King, I., & Yeung, D.-Y. (2018). GaAN: Gated attention networks for learning on large and spatiotemporal graphs. arXiv preprint arXiv:1803.07294.
Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level convolutional networks for text classification. In Proceedings of Advances in Neural Information Processing Systems (pp. 649–657).
Zhao, Z., Dua, D., & Singh, S. (2018). Generating natural adversarial examples. In International Conference on Learning Representations.
Zitnik, M., & Leskovec, J. (2017). Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 33(14), i190–i198.
Zubiaga, A., Liakata, M., Procter, R., Hoi, G. W. S., & Tolmie, P. (2016).
Analysing how people orient to and spread rumours in social media by looking at conversational threads. PLoS ONE, 11(3), e0150989.
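Referring back to the Jacobian in Eq. (B.8) of Appendix B, the following is a minimal PyTorch-style sketch of that term for a single node. It is illustrative only, not the implementation used in this work: it assumes a ReLU nonlinearity, treats the layer-normalization statistics as constants (as Eq. (B.8) does), and the function name and tensor shapes are hypothetical.

    import torch

    def layernorm_jacobian_penalty(U1, zeta, psi, neighbor_feats, alpha):
        """Squared Frobenius norm of the Jacobian in Eq. (B.8) for one node.

        U1: (d_out, d_in) first-layer weights; zeta, psi: (d_out,) layer-norm
        gain and bias; neighbor_feats: (num_neighbors, d_in); alpha: scalar.
        """
        z = alpha * (U1 @ neighbor_feats.sum(dim=0))           # pre-normalization activation
        var = z.var(unbiased=False)                            # instance-wise variance
        lnorm = (z - z.mean()) / torch.sqrt(var) * zeta + psi  # LNorm(z)
        relu_grad = (lnorm > 0).to(z.dtype)                    # sigma'(LNorm(z)) for ReLU
        # Eq. (B.8): J = U1^T diag(sigma'(LNorm(z)) * zeta / sqrt(Var(z)))
        jac = U1.t() * (relu_grad * zeta / torch.sqrt(var))    # shape (d_in, d_out)
        return (jac ** 2).sum()                                # squared Frobenius norm penalty

    # Hypothetical usage: add the penalty to the training loss for a node.
    U1 = torch.randn(16, 8, requires_grad=True)
    zeta = torch.ones(16, requires_grad=True)
    psi = torch.zeros(16, requires_grad=True)
    neighbors = torch.randn(5, 8)
    penalty = layernorm_jacobian_penalty(U1, zeta, psi, neighbors, alpha=0.2)
    penalty.backward()  # gradients reach both U1 and zeta, the regularized parameters

As noted in Appendix B, minimizing such a penalty regularizes both the first-layer weights U_1 and the layer-normalization gain ζ.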