AUTOMATED ATTACKS ON COMPRESSION-BASED CLASSIFIERS by IGOR BURAGO A THESIS Presented to the Department of Computer and Information Science and the Graduate School of the University of Oregon in partial fulfillment of the requirements for the degree of Master of Science June 2014 THESIS APPROVAL PAGE Student: Igor Burago Title: Automated Attacks on Compression-Based Classifiers This thesis has been accepted and approved in partial fulfillment of the requirements for the Master of Science degree in the Department of Computer and Information Science by: Daniel Lowd Chair Dejing Dou Core Member Christopher Wilson Core Member and Kimberly Andrews Espy Vice President for Research & Innovation/Dean of the Graduate School Original approval signatures are on file with the University of Oregon Graduate School. Degree awarded June 2014 ii c© 2014 Igor Burago iii THESIS ABSTRACT Igor Burago Master of Science Department of Computer and Information Science June 2014 Title: Automated Attacks on Compression-Based Classifiers Methods of compression-based text classification have proven their usefulness for various applications. However, in some classification problems, such as spam filtering, a classifier confronts one or many adversaries willing to induce errors in the classifier’s judgment on certain kinds of input. In this thesis, we consider the problem of finding thrifty strategies for character-based text modification that allow an adversary to revert classifier’s verdict on a given family of input texts. We propose three statistical statements of the problem that can be used by an attacker to obtain transformation models which are optimal in some sense. Evaluating these three techniques on a realistic spam corpus, we find that an adversary can transform a spam message (detectable as such by an entropy-based text classifier) into a legitimate one by generating and appending, in some cases, as few additional characters as 20% of the original length of the message. iv CURRICULUM VITAE NAME OF AUTHOR: Igor Burago GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED: University of Oregon, Eugene, Oregon Far Eastern Federal University, Vladivostok, Russia DEGREES AWARDED: Master of Science, Computer & Information Science, 2014, University of Oregon Bachelor of Science, Applied Mathematics and Informatics, 2011, Far Eastern Federal University AREAS OF SPECIAL INTEREST: Artificial Intelligence Machine Learning Adversarial Machine Learning PROFESSIONAL EXPERIENCE: Part-time software developer and system administrator, Pacific Fisheries Research Center (TINRO-Center), Vladivostok, Russia, 2007–2009 Volunteer jury member of the Far-Eastern regional stages of individual and team All- Russian high school programming competitions, Far Eastern Federal University, Vladivostok, Russia, 2006–2011 Volunteer software developer, Department of Computer Science, Institute of Mathematics and Computer Science, Far Eastern Federal University, Vladivostok, Russia, 2008 GRANTS, AWARDS AND HONORS: Graduate Research Fellowship, Information Services, 2012–2014 Summa cum Laude, Far Eastern Federal University, 2011 v PICES Ocean Monitoring Service Award, North Pacific Marine Science Organization (along with Igor I. Shevchenko and Olga N. 
Vasik, TINRO-Center), 2009 Scholarship of the President of the Far Eastern Federal University, 2009 Scholarship of the Administration of Vladivostok, 2009 Scholarship of the Governor of Primorsky Krai, 2009 Scholarship of the President of the Russian Federation, 2009 Scholarship of the President of the Far Eastern Federal University, 2008 Scholarship of the Governor of Primorsky Krai, 2008 Award of the President of the Russian Federation for Support of Talented Youth, 2006 PUBLICATIONS: Burago, I. V., Vasik, O. N., Moiseenko, G. S., & Shevchenko, I. I. (2013). Metadata Exchange as a Preliminary Step in Creating a Common Information Space. (In Russian). In Proceedings of the industry seminar “Mathematical Modeling and Information Technology in the Research of Biological Resources of the World’s Oceans,” September 25–27, 2013 (pp. 53–55). Vladivostok, Russia: TINRO- Center. Burago, I. V. & Shevchenko, I. I. (2010). Automatic Generation of Enumeration Problems. In The First Russia and Pacific Conference on Computer Technology and Applications (RPC 2010), September 6–9, 2010. Vladivostok, Russia: IACP FEB RAS. ISBN: 978-0-9803267-3-4 (CD). Burago, I. V. & Shevchenko, I. I. (2009). Automatic generation of problems using the method of constraint propagation. (In Russian). In The XXXIV Academician E. V. Zolotov Far Eastern Mathematical School-Seminar, “Fundamental Problems of Mathematics and Information Sciences,” June 25–30, 2009 (pp. 162–163). Khabarovsk, Russia: Pacific National University Press. Burago, I. V., Vasik, O. N., Moiseenko, G. S., & Shevchenko I. I. (2007). An infrastructure for the Metadata Exchange of Ecosystem Observations. (In Russian). In Proceedings of the industry seminar “Mathematical Modeling and Information Technology in the Research of Biological Resources of the World’s Oceans,” October 1–3, 2007 (pp. 22–24). Vladivostok, Russia: TINRO-Center. vi ACKNOWLEDGEMENTS I would like to thank my advisor, Daniel Lowd, for the amount of time he dedicated to helpful disscussions on the subject of this research. vii TABLE OF CONTENTS Chapter Page I. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 II. BACKGROUND AND RELATED WORK . . . . . . . . . . . . . . . . . . . . . . 4 III. METHOD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.1. Classifier Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.2. Adversary Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.3. Instrumental Sampling Approach . . . . . . . . . . . . . . . . . . . . 15 3.4. Importance Sampling Approach . . . . . . . . . . . . . . . . . . . . . 19 3.5. Likelihood-Based Criterion . . . . . . . . . . . . . . . . . . . . . . . . 26 3.6. Generalization for Multiple Base Messages . . . . . . . . . . . . . . 35 IV. EVALUATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.1. Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.2. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 V. CONCLUSIONS AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . 48 REFERENCES CITED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 viii LIST OF FIGURES Figure Page 4.1. Histogram of lengths of all messages in the full dataset. . . . . . . . . . . . 39 4.2. 
(a) Histogram of the length ratio |ψ(zl , xk)|/|zl | averaged over appendices xk generated using the parameters τ (E) optimized for the entropy-based criterion (3.48) on a 1% dataset; (b) its cumulant. . . . . . . . . . . . . . 41 4.3. (a) Histogram of the length ratio |ψ(zl , xk)|/|zl | averaged over appendices xk generated using the parameters τ (P) optimized for the probability- based criterion (3.60) on a 1% dataset; (b) its cumulant. . . . . . . . . . 41 4.4. (a) Histogram of the length ratio |ψ(zl , xk)|/|zl | averaged over appendices xk generated using the optimal parameters τ (L) (3.104) for the likelihood-based criterion (3.84) on a 1% dataset; (b) its cumulant. . . . 42 4.5. (a) Histogram of the length ratio |ψ(zl , xk)|/|zl | averaged over appendices xk generated using the baseline parameters τ (H) estimated from θ (H) on a 1% dataset; (b) its cumulant. . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.6. Examples of original spam messages zl (white background) and several appendices xk corresponding to each zl that are generated using parameters τ(E) optimized on a 1% dataset (gray background). . . . . . 44 4.7. (a) Histogram of the length ratio |ψ(zl , xk)|/|zl | averaged over appendices xk generated using the optimal parameters τ (L) (3.104) for the likelihood-based criterion (3.84) on the full dataset; (b) its cumulant. . 46 4.8. (a) Histogram of the length ratio |ψ(zl , xk)|/|zl | averaged over appendices xk generated using the baseline parameters τ (H) estimated from θ (H) on the full dataset; (b) its cumulant. . . . . . . . . . . . . . . . . . . . . . . . 46 4.9. Histograms of the distances between parameters obtained through different adversary algorithms as well as the baseline parameters. . . . . . . . . . 47 ix LIST OF TABLES Table Page 4.1. Performance of the classifier on the full dataset. . . . . . . . . . . . . . . . . 39 4.2. Summary statistics for the performance of different generation parameters on 1% datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.3. Summary statistics for the performance of the generation parameters optimal for the likelihood-based criterion on the full datasets. . . . . . 45 x CHAPTER I INTRODUCTION Automatic text classification is a widespread problem occurring in different applications. An important example of such application originates from the necessity of filtering unsolicited or undesired messages in email or other messaging systems. One of the effective methods for filtering messages is based on the observation that different kinds of texts often have distinctive compression characteristics. In this approach, messages are seen as finite chunks of characters originating from streams that have probabilistic nature. More strictly, each kind of text, or class, is assumed to be independently defined by a probability distribution that describes the chances of a character to appear at each of the states of the stream producing texts of this class. Then, if that class distributions are known or can be reconstructed with sufficient accuracy from an available sample, any text is classified by comparing its compression characteristics for each of the class distributions. One metric directly connected with compressibility is the entropy which is usually calculated per character to make strings of unequal lengths comparable. 
If, for the text in question, the entropy per character, estimated under the assumption that the text originated from one source, turns out to be lower than the same entropy estimated for another source, then it is assumed that the text is more likely to belong to the former class than to the latter. This way, the objective of a compression-based classifier resolves into the statistical problem of learning unknown class models by observing the messages arriving at the classifier's input.

In this thesis, we make an attempt to look at the classifier problem from the adversarial point of view. In the abstract, the goal of an adversary is to trick the classifier into making decisions advantageous for a certain adversarial objective, e.g., finding a way to modify a spam message so that it would not be classified as such and would instead be labeled as legitimate. Unsurprisingly, the adversarial problem is, in some sense, inverse to the classifier problem: While the goal of the classifier is to model class sources by analyzing the input stream of texts, the goal of an adversary is to affect the input stream of the classifier so as to skew the resulting class models. That is, what was output for the classifier becomes input for the adversary, and vice versa.

We approach the adversary problem as a statistical one. That is, for the example of message filtering, we assume that the adversary does not aim to obtain a beneficial verdict from the classifier once, for a single message, but rather wants to find a whole family of messages that would be classified in a certain way (while, optionally, satisfying some additional conditions aligned with the adversary's goals). In other words, the adversary is interested in methodically changing the statistical properties of the classifier's input stream.

To validate the methods discussed in this work, we concentrate on a typical application of entropy-based text classification, the problem of spam filtering. It allows us to evaluate our algorithms on a sufficient amount of real data, while remaining in the simplest case of binary decisions. Considering the potential interests of an adversary in this problem, we introduce three different adversary problem settings that meaningfully formalize the objectives of a spammer. In doing so, we have tried to keep a balance between the generality of a mathematically stated objective and the feasibility of its analysis. We evaluate the effectiveness of each strategy on the SpamAssassin public corpus of legitimate and spam email messages (Apache SpamAssassin Project, 2005), which is widely used for this purpose.

The remainder of the thesis is structured as follows. Chapter II provides background on the problem of entropy-based classification and related work on some of the corresponding adversarial problems for this type of classifier. Chapter III introduces the formal definition of the adversary problem and describes our approach to solving it. Chapter IV describes the experimental results obtained from the evaluation of our method on a public email corpus. Finally, Chapter V concludes.

CHAPTER II
BACKGROUND AND RELATED WORK

The starting point of our research is the problem of compression-based text classification. Fundamentally, it rests on the assumption that when a pair of texts compress well together and, consequently, share some structural homogeneity, it is more likely that they belong to the same category. This assumption forms the basis for a whole class of methods using measures of compressibility under various compression models as measures of text similarity.
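As a purely illustrative aside (not the construction used later in this thesis, which works with explicit character-level probability models), the intuition can be demonstrated with any off-the-shelf compressor: a pair of texts that compresses better jointly than separately is likely to share structure. A minimal sketch in Python, with made-up example strings:

```python
import zlib

def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 encoding of a text."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def joint_compression_gain(a: str, b: str) -> int:
    """Bytes saved by compressing the two texts together rather than separately;
    larger values suggest shared structure (and, heuristically, the same class)."""
    return compressed_size(a) + compressed_size(b) - compressed_size(a + b)

spam_a = "Buy cheap meds online now! Lowest prices, no prescription required."
spam_b = "Cheapest meds online! No prescription required, lowest prices ever."
ham = "The seminar moved to Thursday at noon; please update your calendars."
print(joint_compression_gain(spam_a, spam_b))  # typically larger: similar texts
print(joint_compression_gain(spam_a, ham))     # typically smaller: dissimilar texts
```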
This approach has been studied in a number of works targeting particular models, specific algorithms for building them, and their efficiency for estimating the similarity of different types of texts (Cormack & Horspool, 1987; Frank, Chui, & Witten, 2000; Goodman, Heckerman, & Rounthwaite, 2005; Bratko, Cormack, Filipič, Lynam, & Zupan, 2006; Bratko, Filipič, & Zupan, 2006). The focus of our research is specifically on entropy-based classifiers, which define the similarity measure to be the cross entropy. At the same time, our approach is mostly agnostic to the particular choice of algorithm used for learning the probability models of the classes. For the purposes of evaluation, to estimate class probability distributions, we use the algorithm of prediction by partial matching (PPM) (Cleary & Witten, 1984; Moffat, 1990; Cleary & Teahan, 1997; Teahan, 1995), which has been shown to suit well the special case of text classification considered here, spam filtering (Bratko, Cormack, et al., 2006; Bratko, Filipič, & Zupan, 2006).

While a significant amount of effort has been devoted to studying the effectiveness of compression models in selected applications of text classification, there is still no complete understanding of how robust such algorithms are to different kinds of adversarial noise.

In (Lowd & Meek, 2005b), it has been shown that a word-based attack is effective against a maximum entropy spam filter (and a naive Bayes one, as well). The authors propose the attack of inserting and appending freestanding words to spam messages that are detectable as such by the classifier. To guide the selection of words, they consider two sets of heuristics depending on whether the adversary is able to determine that a modified message has been filtered out (active attack) or not (passive attack). In the former case, the knowledge of the classifier's decisions is used as a supervisory signal. In essence, random words that, when included in a barely-spam message, make it legitimate are promoted for future additions; those that make a barely-legitimate message spam are suppressed. In the case of a passive attack, the authors use frequency-based heuristics requiring the availability of training samples. According to this strategy, words with a higher frequency of occurrence in legitimate messages, or, alternatively, with a higher ratio of frequencies in legitimate messages over spam messages, are assigned a higher probability of being used for an attack. Although these strategies prove the feasibility of attacking maximum entropy classifiers, they are too simplistic. In both the active and passive cases, information about the message being modified is unnecessary for approaching the optimal transformation procedure; that procedure is global due to the nature of the considered classifiers.

The work (Lowd & Meek, 2005a) expands the subject to devising an attack on binary linear classifiers in the case of incomplete or missing information about the parameters of the probabilistic models of those classifiers. This time, however, the focus is not on the ways of constructing an attack, but rather on methods of retrieving enough knowledge about the classifier to make such a construction possible. The authors consider the cases of Boolean and continuous features for a few adversarial cost functions estimating the cost, or penalty, of each instance sent by an adversary to the input of a classifier.
5 For both types of features, they propose algorithms that for a linear cost function is able to discover classifier’s weights in polynomial number of queries to the classifier. In (Biggio, Nelson, & Laskov, 2011, 2012), the question of robustness of another linear classifier, support vector machines, is researched. The authors propose a poisoning algorithm iteratively improving the attack point in attempt to optimize the classification error which, as they show in evaluation, can be made as large as about one third. The problem considered in these works, however, is less relevant to the one we are working with in this thesis. The key distinction consists in that the authors concentrate on a different kind of adversarial position, where an attacker is supposed to be able to inject prepared inputs into the classifier’s training data, thus poisoning them. In contrast, we focus on the adversary’s task of disguising their activity in the input of the classifier given a fixed training data which is assumed to be unchanged during the time of the interaction. Additionally, the obtained results significantly depend on properties of support vector machines, and cannot be directly applied to maximum entropy classifiers. 6 CHAPTER III METHOD 3.1. Classifier Problem 3.1.1. Preliminaries Let X ⊆ A∗ be a space of arbitrary text strings over some finite alphabet A. On this space, we consider sources or classes of strings that are defined by probability distributions over the set X . In particular, from now on, whenever we discuss a classification problem, we assume that there exists a single input source of strings from X that come on the input of the classifier. The input source is described by the probability g(x)≡ P(ξ = x) assigned to values x ∈ X of the discrete random variable ξ standing for the input strings. The classifier reconstructs the probability distributions f (κ)(x) corresponding to one or more classes κ. Formally, we define probability f (κ)(x) ≡ P(ξ = x | C (κ)) for C (κ) being the event {ξ belongs to class κ}. In this work we concentrate on the case of two classes of strings: legitimate Ham messages and unsolicited Spam messages that are designated with κ = H and κ= S, respectively. 3.1.2. Finite Memory Markov Model The probability of a string x ∈ X originating from the class κ is equal to f (κ)(x) = |x |∏ l=1 P(x l | x l−11 ,κ), (3.1) 7 where x l denotes the l-th character of the string x , and x l k the substring of x starting from the k-th and ending with the l-th character (if k > l, x lk is empty). For the sake of brevity, P(x l | x l−11 ,κ) stands for the probability P(ξl = x l | ξl−11 = x l−11 , C (κ)) of character x l following the context x l−1 1 . Naturally, we can parametrize distributions f (κ)(x) using these probabilities: f (κ)(x) = f (x ,θ (κ)) = |x |∏ l=1 θ (κ) i(x l−11 ), j(x l ) , (3.2) where i(x l−11 ) and j(x l) denote the ordinal numbers of the context x l−1 1 = ci ∈ A∗ and the character x l = a j ∈ A for some orderings on the sets A and A∗, and parameters θ (κ)i j are the probabilities P(ξl = a j | ξl−11 = ci, C (κ)). From this point on, we will also assume that each class κ can be modelled as a stationary and ergodic Markov chain which memory is bounded by certain order K ≥ 1. 
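To make the bounded-memory parametrization concrete, the following minimal sketch evaluates log f(x, θ) for an order-K model; the dictionary theta (an illustrative layout, not the thesis implementation) maps a context of at most K characters to a distribution over the next character, and unseen pairs are given a small floor probability so that the logarithm stays finite:

```python
import math

K = 3  # Markov order: each character is conditioned on at most K preceding characters

def log_prob(x: str, theta: dict, k: int = K, floor: float = 1e-9) -> float:
    """log f(x, theta) under a finite-memory Markov model.

    theta[context][char] approximates P(char | context), where the context
    is the string of up to k characters preceding the current position.
    """
    logp = 0.0
    for l, ch in enumerate(x):
        context = x[max(0, l - k):l]
        p = theta.get(context, {}).get(ch, floor)
        logp += math.log(p)
    return logp
```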
Under the assumption that limited memory K is sufficient for evaluating probability (3.2), we can rewrite it for our convenience as f (x ,θ ) = |x |∏ l=1 θi(x l−1l−K ), j(x l ) = ∏ ci∈AK a j∈A θ ni j(x) i j (3.3) for ni j(x) being the number of times character a j follows context ci in string x (or, alternatively, substring cia j occurs in x), where ∑ a j∈A θi j = 1, for all ci ∈ AK , (3.4)∑ ci∈AK a j∈A ni j(x) = |x |. (3.5) 8 This way, any string x is viewed as a set of overlapping (K + 1)-grams with frequencies ni j(x), and parameters θi j characterize a class of strings as a whole. Reasoning completely analogously for the probability g(x), we obtain the same parametrized form: g(x ,τ) = ∏ ci∈AK a j∈A τ ni j(x) i j . (3.6) To avoid confusion, we use the letter τ to denote the vector of parameters of the input source as distinguished from vectors of class parameters θ (κ). 3.1.3. Problem Statement The above parametrization following from finite memory Markov models allows us to view the mathematical problem of inferring a class model as an optimization problem in the space of parameters: R(θ ) = Eξ  r(ξ,θ ) →min θ (3.7) for some measure function r(ξ,θ) evaluating the “loss” or “penalty” of classifying message ξ as belonging to the class described by the probability distribution with parameters θ . In other words, the objective of the problem (3.7) for each class κ is to find parameters θ (κ) giving the least losses on average according to r(ξ,θ (κ)). The expectation is taken over the probability distribution g(x) of strings ξ from input source. Generally, probability g(x) is supposed to be unknown for all classes. For this reason, a version of the problem (3.7) for empirical averaging is considered: bR(θ ) = ∑ xk∈T r(xk,θ )→min θ , (3.8) 9 where T stands for a training sample of messages corresponding to the class in question. Hereinafter, for consistency, training samples of Ham and Spam classes are labeled as T (H) and T (S) accordingly. When the inference problem is solved and the vectors of parameters θ (H) and θ (S) are estimated for each class, they can be used to make classifying decision based on the same principle of least loss: q(x ,θ ) = r(x ,θ (H))− r(x ,θ (S)), (3.9) κ(x) =  H, if q(x ,θ )< α(H); S, if q(x ,θ )≥ α(S). (3.10) In case of α(H) ≤ q(x ,θ) < α(S), additional measures are needed to decide the class (for example, increasing the length of the message in question). Most commonly, both parameters are set the same value, α(H) = α(S) = α. The choice of parameter α is guided by its influence on the number of type I and type II errors. 3.1.4. Entropy Classification Let us consider the measure function r(ξ,θ) = − 1|ξ| log f (ξ,θ). As it is obvious from the above definitions, the general criterial function (3.7) specializes to the cross entropy R(θ ) = H(θ )≡ −Eξ • 1 |ξ| log f (ξ,θ ) ˜ = −∑ x∈X 1 |x | g(x) log f (x ,θ )→minθ . (3.11) We will refer to this specialization of the problem (3.7) as the classifier problem. 10 Similarly, the empiric version (3.8) becomes bR(θ ) = ÒH(θ )≡ −∑ xk∈T 1 |xk| log f (xk,θ )→minθ , (3.12) where T , of course, is assumed to be a sample of strings distributed according to g(x). Decision rule (3.10) can be rewritten as follows. q(x ,θ ) = 1 |x | log f (S)(x)− 1|x | log f (H)(x) = 1 |x | log f (S)(x) f (H)(x) , (3.13) κ(x) =  H, if q(x ,θ )< α; S, if q(x ,θ )≥ α. (3.14) In practice, parameter α is often set to zero. 
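A sketch of the resulting decision rule (3.13)-(3.14), reusing the illustrative log_prob helper from the previous listing:

```python
def classify(x: str, theta_ham: dict, theta_spam: dict, alpha: float = 0.0) -> str:
    """Entropy-based decision rule: q(x) = (1/|x|) * log(f_S(x) / f_H(x)).

    Returns "H" (Ham) if q(x) < alpha and "S" (Spam) otherwise.
    """
    q = (log_prob(x, theta_spam) - log_prob(x, theta_ham)) / max(len(x), 1)
    return "H" if q < alpha else "S"
```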
It is well known that if the function g(x) is given and f (x ,θ) > 0 for all x such that g(x)> 0, then f (x ,θ )∝ g(x)|x | (3.15) is an exact solution of the problem (3.11). Because, as we have seen above, both f (x) and g(x) can be parametrized identically, at least in the case when all texts x have the same length (or the variation in lengths can be neglected), f (x ,θ ) can be constructed from g(x ,τ) by letting θ = τ. The parameters τ, in turn, can be directly found by estimating conditional probabilities P(x l | x l−1l−K) on some training sample T . This observation forms the basis of the technique called Prediction by Partial Matching (PPM). Aside from differences in strategies of approximating probabilities for character-context pairs that do not occur in a given sample, PPM algorithms work as 11 simple frequency estimators setting θi j ≈ Ni jNi , (3.16) for Ni j = ∑ x∈T ni j(x), (3.17) Ni = ∑ a j∈A Ni j. (3.18) For versions of PPM estimators and the details of their implementation, see (Cleary & Witten, 1984; Teahan, 1995; Cleary & Teahan, 1997; Moffat, 1990). 3.2. Adversary Problem As we just seen above, in the classifier problem (3.11) the goal was to find an optimal statistical model f (x ,θ ) for messages of some class, given a fixed input source defined by probabilities g(x) which manifest itself in a sample T . To put it more strictly, the function g(x) was fixed (although unknown), while the probability distribution f (x ,θ ) was known up to the vector θ which were the parameters in question. It is also of interest to consider the inverse problem statement where given fixed statistical model f (x ,θ ) of some class, it is required to find the source distribution g(x) which is the most favourable for certain classification outcome. In this setting, g(x) becomes the function in question, while f (x ,θ) is fixed through a known vector of parameters θ . One example of such inverse objective is the problem of determining g(x) generating messages that are as close to Ham messages as possible in terms of probability of passing the spam filter. Another version of the problem that also falls into this category is the following adversary problem (or, in case of spam filtering, the spammer problem). 12 For a given string z from some set of base messages Z , find probability distribution for generating strings x t such that the result of some transformation ψ(z, x t) combining them complies with some statistical requirement, e.g. being classified as Ham on average. This setting is especially practical for a spammer when z by itself has low chances of passing the filter. To state the spammer problem more formally, we assume the following. There is a generator algorithm which plays the role of a source of strings x t(τ) for a specified vector of parameters τ. Strings x t(τ) are considered to be generated randomly and independently, and have the same distribution in the space of strings X . The generated strings x t(τ) are then used to obtain new messages ut = ψ(z, x t(τ)) from a given message z according to the predetermined transformation ψ. In general, the function ψ(z, x) can associate a pair of strings with any string. One such transformation that is simple but still keeps the problem non-trivial is string concatenation, ψ(z, x) = zx . Even though our method does not sufficiently depend on a particular transformation, for illustration purposes, when necessary, we will not be using other definitions of ψ except for the concatenation. 
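Since both the frequency estimators (3.16)-(3.18) and the concatenation transformation ψ(z, x) = zx recur throughout the rest of this chapter, a minimal sketch of the former is given here for reference; the additive smoothing stands in for the escape mechanisms of actual PPM implementations, so this is a simplification of the algorithms cited in section 3.1.4 rather than a faithful reimplementation:

```python
from collections import defaultdict

def estimate_theta(training_texts, k: int = 3, smoothing: float = 0.1) -> dict:
    """Frequency estimate theta_ij ~= N_ij / N_i over a training sample.

    N_ij counts how often character a_j follows context c_i in the sample;
    additive smoothing is a simplification of PPM's escape probabilities.
    """
    counts = defaultdict(lambda: defaultdict(float))  # context -> char -> N_ij
    for x in training_texts:
        for l, ch in enumerate(x):
            counts[x[max(0, l - k):l]][ch] += 1.0
    theta = {}
    for context, char_counts in counts.items():
        total = sum(char_counts.values()) + smoothing * len(char_counts)
        theta[context] = {c: (n + smoothing) / total for c, n in char_counts.items()}
    return theta

# Example usage with a hypothetical sample: theta_ham = estimate_theta(ham_training_messages)
```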
The objective of the inverse problem itself remains the same: G(τ)≡ −∑ x∈X 1 |x | g(x ,τ) log f (x)→minτ , (3.19) with the exception of that optimization is done for the parameters τ of the source distribution g(x ,τ), not the class distribution f (x ,θ ) = f (x). The decision to search in the parametrized space of distributions g(x ,τ) is justified by the necessity to obtain a generative (rather than discriminative) model of the desired message source. As in the case of the classifier problem (3.11), it is well known that, in non- parametrical form, the inverse problem (3.19) also has an analytical solution. Any 13 function g(x) such that ∑ x∈X fmax g(x) = 1, (3.20) g(x) = 0, for all x ∈ X \ X fmax , (3.21) where X fmax = argmax x f (x) |x | , (3.22) minimizes the cross entropy for a given f (x). These solutions for the non-parametric problem, however, does not solve the spammer problem. None of the functions g(x) satisfying the above properties is guaranteed to be represented in the space of parametrized functions g(x ,τ) which makes them useless for generating x t(τ). Moreover, even if this difficulty did not exist, the diversity of the generated messages would be extremely low, because any of such g(x) leads to generating the same few messages from X fmax over and over again which makes spammer easily detectable. Empirical analog of the criterion (3.19) is bG(τ)≡∑ xk∈T 1 |x | g(xk,τ)→minτ , (3.23) where the sample T is obtained from a distribution P(ξ = x)∝ log 1f (x) . Therefore, in order to approach the inference problem in the form (3.23), it is necessary to have an auxiliary instrumental sample which, unlike training samples for the classes or the combined sample for the input source, cannot be observed in practice. 14 3.3. Instrumental Sampling Approach Let us introduce new parameters wi j such that τi j = exp(wi j)∑ a j∈A exp(wi j) , (3.24) where, as before, subscripts i and j correspond to some context ci ∈ AK and character a j ∈ A, respectively. For any values of wi j, the required conditions on τi j hold automatically: 0< τi j < 1 and ∑ a j∈A τi j = 1 (3.25) (0≤ τi j ≤ 1, if wi j = ±∞ are allowed). For the new parameters, the probability g(x ,τ) = ∏ ci∈AK a j∈A τ ni j(x) i j (3.26) changes to g(x ,τ(w)) = ∏ ci∈AK a j∈A  exp(wi j)∑ a j′∈A exp(wi j′) ni j(x) = ∏ ci∈AK 1 Zni(x)i exp ∑ a j∈A wi jni j(x)  = ∏ ci∈AK  1 Zi exp ∑ a j∈A wi j ni j(x) ni(x) ni(x) , (3.27) 15 where ni j(x) is, as usual, the number of occurrences of a substring cia j in x , and ni(x) = ∑ a j∈A ni j(x), (3.28) Zi(w) = ∑ a j∈A exp(wi j). (3.29) Now, let us calculate the gradient of the function g(x ,τ(w)) using the equality ∂ g(x ,τ(w)) ∂ wlk = g(x ,τ(w)) ∂ ∂ wlk log g(x ,τ(w)), (3.30) where log g(x ,τ(w)) = ∑ ci∈AK a j∈A ni j(x) log  exp(wi j) Zi(w)  = ∑ ci∈AK a j∈A wi jni j(x)− ∑ ci∈AK a j∈A ni j(x) log Zi(w). (3.31) Then, ∂ g(x ,τ(w)) ∂ wlk = g(x ,τ(w)) ∂ ∂ wlk •∑ ci∈AK a j∈A wi jni j(x)− ∑ ci∈AK a j∈A ni j(x) log Zi(w) ˜ = g(x ,τ(w))  nlk(x)− nl(x)Zl exp(wlk) ‹ = g(x ,τ(w))nl(x) bτlk(x)−τlk(w), (3.32) where bτlk(x) = nlk(x)nl(x) . (3.33) 16 Now, consider a problem of the form Eξ  F(ξ, w) ≈∑ xk∈T F(xk, w)→minw , (3.34) where the random variable ξ(w) is distributed and strings xk from an instrumental sample T are generated according to the probabilities g(x ,τ(w)). The problem (3.23) that has motivated us to consider this approach is a special case for F(ξ, w) = F(x) = 1 |x | log 1 f (x) . 
(3.35) Given that ∂ ∂ wlk Eξ  F(ξ, w)  = ∂ ∂ wlk •∑ x∈X F(x , w) g(x ,τ(w)) ˜ = ∑ x∈X ∂ ∂ wlk  F(x , w) g(x ,τ(w))  = ∑ x∈X • ∂ ∂ wlk F(x , w) g(x ,τ(w)) + F(x , w) ∂ ∂ wlk g(x ,τ(w)) ˜ = ∑ x∈X • ∂ ∂ wlk F(x , w) + F(x , w)nl(x) bτlk(x)−τlk(w)˜ g(x ,τ(w)) = Eξ • F(ξ, w)  ∂ ∂ wlk  log F(ξ, w)  + nl(ξ) bτlk(ξ)−τlk(w)‹˜ = Eξ • ∂ ∂ wlk F(ξ, w) + F(ξ, w)nl(ξ) bτlk(ξ)−τlk(w)˜, (3.36) from the necessary condition of extremum, we see that optimal w satisfies the equation Eξ • ∂ ∂ wlk F(ξ, w) + F(ξ, w)nl(ξ) bτlk(ξ)−τlk(w)˜= 0, (3.37) 17 which, in the case when F(x , w) = F(x) is independent of parameters, simplifies to Eξ ” F(ξ)nl(ξ) bτlk(ξ)−τlk(w)—= 0, (3.38) for all cl ∈ AK , ak ∈ A. (Both nl(x) and nlk(x) are random variables and, consequently, cannot be factored out of the expectation.) Since ξ ∼ g(x ,τ(w)), as the size of instrumental sample T grows, frequencies bτlk(x) converge to the current estimations τlk(w) that were used to generate the sample in the first place. For this reason, any attempt of iterative optimization of (3.34) turns into a random walk around initial values of wi j. Moreover, for many practical generation procedures it is true that Eξ bτlk(ξ)= τlk(w). (3.39) In a simplified case of both F(x) and nl(x) being independent of parameters w, which takes place when, for example, generation procedure stops after reaching the same length of x chosen beforehand, the equation (3.38) simply degenerates, and the problem becomes meaningless. If the function F preserves some dependence on parameters—either in the general form F(x , w), or in a weaker variant F(x(w))—the problem (3.38) is not strictly meaningless. However, for sufficiently long samples, as the difference |bτlk(ξ)−τlk(w)| approaches zero, the influence of the F(x , w)-multiplier becomes effectively eliminated making the expectation (3.38) almost independent of F . For this reason, we consider approaching problem (3.19) as (3.34) unpromising. 18 3.4. Importance Sampling Approach Formally, we consider a vector of parameters τ to be a solution to the inverse problem, if FD[q(u,θ )]≡F [q(u,θ ) | u =ψ(z, x), x ∈ D]→max τ , (3.40) where D is a set of text strings, the domain, and F (·) is an ensemble operation defined on D. For example, the domain D might be the set of all strings of some bounded length, or some subset of that set. An empirical sample of strings produced by the generator used by the adversary can also be taken as the domain Dτ = {x t(τ)}t . The choice of ensemble operation depends on what criterion of success aligns best with the goals of the spammer in a particular problem setting. Let us consider some of them. (a) For all x ∈ Dτ, messages u =ψ(z, x) are successfully pass the spam filter: FDτ  q(u,θ )  = min x∈Dτ 1(H)(u)→max τ , (3.41) where 1(H)(u)≡ 1q(u,θ )< α= 1log f (S)(u)− log f (H)(u)< α|u|. (3.42) (b) As many messages x ∈ Dτ,l = {x ∈ Dτ | |x | ≤ l} of a bounded length l are successfully pass the spam filter: FDτ,l  q(u,θ )  = ∑ x∈Dτ,l 1(H)(u)→max τ . (3.43) 19 (c) Empirical frequency of passing the spam filter successfully estimated over a sample Dτ is maximal: FDτ  q(u,θ )  = 1 |Dτ| ∑ x∈Dτ 1(H)(u)→max τ . (3.44) (d) The average logarithmic ratio of probabilities q(u,θ ) estimated over a sample Dτ is as minimal as possible: FDτ  q(u,θ )  = −∑ x∈Dτ q(u,θ )→max τ , (3.45) or ∑ x∈Dτ q(u,θ )→min τ . (3.46) Criterion (a) is too optimistic and requires the acceptance of the implicit assumption that there exists a vector τ which guarantees that all messages pass the filter. 
Solution of the problem in the sense of this criterion, generally, is unlikely to exist (unless the generator algorithm is complemented with constraints significantly restricting diversity of generated strings). Criterion (b) does not take probabilities of strings x into account, while it would be worthwhile to ignore strings with zero or close to zero probabilities. We consider criteria (c) and (d) to be more appropriate. Let us discuss the latter objective before the former. 3.4.1. Entropy-Based Criterion Empirical criterion (3.45) is equivalent to the optimization problem Fq(u,θ )=∑ x∈X q(u,θ ) g(x | z,τ) = Eξ  q(u,θ ) →min τ , (3.47) 20 where the expected value is taken over the probability distribution g(x | z,τ) of text x being generated for the base string z and parameters τ. Let us now rearrange the sum in (3.47) using the well known technique of importance sampling: R(τ | z) = Eξ  q(u,θ )  = ∑ x∈X q(u,θ ) g(x | z,τ)  γ(H) p(H) f (H)(x) p(H) f (H)(x) + γ(S) p(S) f (S)(x) p(S) f (S)(x)  = ∑ κ∈{H,S} p(κ) E(κ) ξ  γ(κ) q(ξ | z,θ ) g(ξ | z,τ) p(κ) f (κ)(ξ)  = Eξ,κ  q(ξ | z,θ ) g(ξ | z,τ) p(κ) f (κ)(ξ)  = Eξ,κ ” W (κ)(ξ | z,θ ) g(ξ | z,τ)—→min τ , (3.48) for W (κ)(x | z,θ ) = γ(κ) q(x | z,θ ) p(κ) f (κ)(x) , (3.49) where for each class κ ∈ {H,S}, expected value E(κ) ξ [ · ] denotes conditional expectation Eξ[ · | ξ ∼ f (κ)(x)], p(κ) stands for the a priori probability of the class κ, and γ(H), γ(S) are arbitrary splitting weights such that γ(H) + γ(S) = 1 (for example, γ(H) = γ(S) = 12 or γ(H) = p(H), γ(S) = p(S)). In this problem setting, all statistical information that can be available to the adversary—that is, both samples T (H) and T (S) of Ham and Spam messages—are used: R(τ | z)≈ bR(τ | z) =∑ (xk ,κk)∈T W (κk)(xk | z, θ ) g(xk | z,τ)→min τ . (3.50) 21 Here κk are true labellings of messages xk from the sample T which is the union of samples T (H) and T (S) that are assumed to be drawn from distributions f (H)(x) and f (S)(x), respectively. Due to the necessary condition of extremum, we have the equation: ∂ ∂ τi j R(τ | z) = ∂ ∂ τi j Eξ,κ ” W (κ)(ξ | z,θ ) g(ξ | z,τ)— = Eξ,κ ” W (κ)(ξ | z,θ ) ∂ ∂ τi j g(ξ | z,τ)—= 0. (3.51) Since it has the form E[ · ] = 0, it is natural for us to apply the method of stochastic optimization (Robbins & Monro, 1951). Switching to the parameters wi j that introduced in (3.24), we obtain the stochastic algorithm w(t+1)i j = w (t) i j − γtW (κk(t))(z, xk(t)) g(xk(t) | z,τ(w(t))) · ni(xk(t) | z) bτi j(xk(t) | z)−τi j(w(t)), (3.52) where γt is a series satisfying the properties γt ≥ 0, for all t ≥ 0, (3.53) ∞∑ t=0 γt =∞, (3.54) ∞∑ t=0 γ2t <∞, (3.55) and xk(t) and κk(t) run through the sample T (potentially repeatedly) in some order defined by k(t). 22 3.4.2. Probability-Based Criterion Obviously, objective function (3.44) is the empirical version of the criterion Fq(u,θ )=∑ x∈X 1(H)(u) g(x | z,τ) = Eξ  1(H)(u) →max τ , (3.56) where ξ ∼ g(x | z,τ) and, as in the previous section, g(x | z,τ) is generational probability distribution for base string z and parameters τ. This criterion, in turn, makes the problem be equivalent to maximizing the probability of the transformed message ψ(z,ξ) passing the spam filter: R(τ) = Pr  1(H)(ψ(z,ξ)) | z,τ→max τ . (3.57) Since we have only two classes, maximization of the criterion (3.56), R(H)(τ) = ∑ x∈X 1(H)(ψ(z, x)) g(x | z,τ), (3.58) is equivalent to minimization of the dual criterion R(S)(τ) = ∑ x∈X 1(S)(ψ(z, x)) g(x | z,τ), (3.59) where 1(S)(u) = 1− 1(H)(u). 
Let us combine both of them into a single problem R(τ)≡ γ(H)R(H)(τ)− γ(S)R(S)(τ) (3.60) = ∑ x∈X € γ(H)1(H)(ψ(z, x))− γ(S)1(S)(ψ(z, x))Šg(x | z,τ)→max τ , (3.61) where γ(H) and γ(S) are some splitting weights such that γ(H) + γ(S) = 1. 23 3.4.2.1. Supervised Learning Formally rearranging the criterion function (3.60) into two sums and applying the importance sampling for the distribution of the pair (ξ,κ), we see that R(τ) = ∑ x∈X € γ(H)1(H)(ψ(z, x))− γ(S)1(S)(ψ(z, x))Šg(x | z,τ) = ∑ κ∈{H,S} p(κ) ∑ x∈X W (κ)(z, x) f (κ)(x) g(x | z,τ) = Eξ  W (κ)(z,ξ) g(ξ | z,τ)→max τ , (3.62) for the random variable ξ distributed according to f (κ)(x) and W (κ)(z, x) = γ(H)1(H)(ψ(z, x))− γ(S)1(S)(ψ(z, x)) f (κ)(x) = 1(H)(ψ(z, x))− γ(S) f (κ)(x) . (3.63) Assuming that a sample of messages xk ∈ T is available together with true labeling of classes κk = κ(xk), the criterion R(τ) can be estimated as R(τ)≈ bR(τ)≡∑ (xk ,κk)∈T W (κk)(z, xk) g(xk | z,τ). (3.64) For the parameters wi j that have been introduced earlier this chapter through the equality τi j = exp(wi j)∑ a j∈A exp(wi j) , (3.65) 24 we obtain that ∂ R(w) ∂ wi j = Eξ  W (κ)(z,ξ) g(ξ | z,τ(w))  ni j(ξ | z)− ni(ξ | z)exp(wi j)∑ al∈A exp(wil)  = Eξ ” W (κ)(z,ξ) g(ξ | z,τ(w))ni(ξ | z) bτi j(ξ | z)−τi j(w)—, (3.66) where, as in previous sections, bτlk(x) = nlk(x)nl(x) , (3.67) and ni(x | z) and ni j(x | z) stand for the number of occurrences of the context ci ∈ AK followed by any character and followed by the character a j ∈ A, respectively, in the text x appended to the message z. Thus, the stochastic approximation algorithm for the necessary condition of the extremum, ∂ R(w) ∂ wi j = 0, (3.68) takes the form w(t+1)i j = w (t) i j + γtW (κk(t))(z, xk(t)) g(xk(t) | z,τ(w(t))) · ni(xk(t) | z) bτi j(xk(t) | z)−τi j(w(t)). (3.69) 3.4.2.2. Unsupervised Learning In case when true labeling κk of messages xk from sample T is unknown, we can alternatively do the importance sampling for the distribution f (x) = p(H) f (H)(x) + p(S) f (S)(x). (3.70) 25 Then R(τ) = ∑ x∈X γ(H)1(H)(ψ(z, x))− γ(S)1(S)(ψ(z, x)) f (x) f (x) g(x | z,τ) = Eξ  W (z,ξ) g(ξ | z,τ)→max τ , (3.71) where the random variable ξ is distributed in accordance with f (x), and W (z, x) = γ(H)1(H)(ψ(z, x))− γ(S)1(S)(ψ(z, x)) f (x) = 1(H)(ψ(z, x))− γ(S) p(H) f (H)(x) + p(S) f (S)(x) . (3.72) Since the criteria (3.71) and (3.62) differ only in definition of the weights W (z, x) which are independent of wi j, the resulting stochastic optimization algorithm is exactly the same as in (3.69) (again, up to differences between W (z, x) and W (κ)(z, x)). 3.5. Likelihood-Based Criterion Let us again consider the transformation u =ψ(z, x) of a message z with an arbitrary string x . Entropy per character of the resulting string u can be estimated empirically as H(u | τ) = − 1|u| log  |u|∏ l=1 g(ul | ul−1l−K ,τ) ‹ = − 1|u| ∑ ci∈AK a j∈A ni j(u) logτi j. (3.73) Averaged over random transformed messages u, it is equal to H(τ) = Eu  H(u | τ)= −∑ ci∈AK pi ∑ a j∈A p j|i logτi j = − ∑ ci∈AK pi Hi(τ), (3.74) 26 where Hi(τ) = ∑ a j∈A p j|i logτi j, (3.75) pi is the probability of the context ci occurring in a random transformed message u, and p j|i is the conditional probability of character a j occurring in u after the context ci. The function Hi(τ), in turn, can be estimated on a single message u as Hi(τ)≈ ÒHi(u | τ) =∑ a j∈A ni j(u) ni(u) logτi j. 
(3.76) Assuming that we have available a sample T of messages x out of the universal space X , we can split T into auxiliary samples depending on to what class ψ(z, x) is assigned by the classifier: T (κ) = {xk ∈ T | 1(κ)(ψ(z, xk) | θ ) = 1}, (3.77) where, as in previous sections, 1(H)(x | θ ) = 1[q(x ,θ )< α], (3.78) 1(S)(x | θ ) = 1[q(x ,θ )≥ α]. (3.79) That is, T (H) and T (S) consist of messages xk ∈ T that make the base message z being recognized as Ham and Spam, respectively. 27 Considering these samples, we can generalize the estimate ÒH(u | τ) to the estimates over samples T (H) and T (S): R(H)(τ | z) = − 1|T (H)| ∑ xk∈T (H) ∑ ci∈AK pi ÒHi(ψ(z, xk) | τ), (3.80) R(S)(τ | z) = − 1|T (S)| ∑ xk∈T (S) ∑ ci∈AK pi ÒHi(ψ(z, xk) | τ). (3.81) Then, we can state our goal in a new way: Find parameters τ such that the entropy estimate R(H)(τ | z) becomes low, while the estimate R(S)(τ | z) remains high. One way to achieve these goals simultaneously is to formalize them as a problem of minimization the difference of the above objective functions: R(τ) = R(H)(τ | z)− R(S)(τ | z)→min τ , (3.82) subject to usual normalization requirements τi j ≥ 0 and ∑ a j∈A τi j = 1, (3.83) for all contexts ci ∈ AK and all characters a j ∈ A. 28 Substituting the entropy estimation (3.76) definition into the criterion (3.82), we have: R(τ | z) = R(H)(τ | z)− R(S)(τ | z) = − 1|T (H)| ∑ uk∈U (H) ∑ ci∈AK pi ÒHi(u | τ) + 1|T (S)| ∑ uk∈U (S) ∑ ci∈AK pi ÒHi(u | τ) = −∑ ci∈AK pi  1 |T (H)| ∑ uk∈U (H) ÒHi(u | τ)− 1|T (S)| ∑ uk∈U (S) ÒHi(u | τ)‹ = −∑ ci∈AK pi ∑ a j∈A (ν(H)i j − ν(S)i j ) logτi j →minτ , (3.84) for U (κ) = {ψ(z, xk) | xk ∈ T (κ)}, (3.85) ν (κ) i j = 1 |T (κ)| ∑ uk∈U (κ) ni j(uk) ni(uk) . (3.86) Since parameters τi j occur only in summands for the context ci, optimization (3.84) naturally falls into |AK | smaller problems: Ri(τ | z) = R(H)i (τ | z)− R(S)i (τ | z)→ min{τi1,τi2,...,τi|A|}, (3.87) where R(κ)i (τ | z) = − ∑ a j∈A ν (κ) i j logτi j. (3.88) Lemma 1. The objective function Ri(τ) = − ∑ j∈J νi j logτi j, (3.89) 29 where weights νi j ≥ 0 for any j ∈ J, and parameters τi j are subject to constraints τi j ≥ 0 and ∑ j∈J τi j = s > 0, (3.90) reaches its minimum value at τ∗i j = s νi j νi , (3.91) where νi = ∑ j∈J νi j. (3.92) Proof. Considering that logε≤ (ε− 1) for any ε > 0, we see that for an arbitrary vector of parameters τ, Ri(τ ∗)− Ri(τ) = − ∑ j∈J νi j log sνi j νi + ∑ j∈J νi j logτi j = ∑ j∈J νi j log τi jνi sνi j ≤∑ j∈J νi j  τi jνi sνi j − 1 ‹ = νi s ∑ j∈J τi j − ∑ j∈J νi j = 0, (3.93) by definition of νi and normalization requirements on τ. From the obtained relation it immediately follows that Ri(τ∗)≤ Ri(τ) for any τ. Lemma 2. The objective function Ri(τ) = − ∑ j∈J νi j log 1 τi j , (3.94) 30 where weights νi j ≥ 0 for any j ∈ J, and parameters τi j are subject to constraints τi j ≥ 0 and ∑ j∈J τi j = s > 0, (3.95) reaches its minimum value at τ∗i j =  s |Jmini | , if j ∈ J min i ; 0, if j ∈ J \ Jmini ; (3.96) where Jmini = { j ∈ J | νi j = minj∈J νi j}. (3.97) Proof. Let us consider the following values of parameters under the temporary assumption that τi j ≥ " for some arbitrarily small " > 0 and all j ∈ J . τ∗i j(") =  s 1− (|J | − |Jmini |)" |Jmini | , if j ∈ J min i ; s", if j ∈ J \ Jmini . (3.98) We can assume that ∑ j∈J\Jmini νi j > 0, which is always true unless J min i = J . It is clear that for smaller values of " criterion function Ri(τ∗(")) also gets smaller values. Ri(τ ∗(")) = − ∑ j∈Jmini νi j log |Jmini | s 1− (|J | − |Jmini |)"  − ∑ j 6∈Jmini νi j log 1 s" . 
(3.99) 31 Therefore, passing to the limit, we can make the criterion arbitrarily small while approaching the desired solution τ∗: lim "→0 Ri(τ ∗(")) = −∞, (3.100) lim "→0τ ∗(") = τ∗. (3.101) Generally speaking, this solution is not unique: any distribution of the probability mass across τ∗i j for j ∈ Jmini minimizes the criterion. However, one solution is sufficient for our purposes. Theorem 3. The criterion function Ri(τ | z) = − ∑ a j∈A (ν(H)i j − ν(S)i j ) logτi j (3.102) subject to constraints τi j ≥ 0 and ∑ a j∈A τi j = 1, (3.103) reaches its minimum value at τ∗i j = µi j µi , (3.104) where µi j = max{0,ν(H)i j − ν(S)i j }, (3.105) µi = ∑ a j∈A µi j. (3.106) Proof. Let us divide the sum in the objective function Ri(τ | z) into the following three sums over disjoint subsets of indices according to the sign of the difference δi j ≡ 32 ν (H) i j − ν(S)i j : Ri(τ | z) = − ∑ a j∈A δi j logτi j = −∑ j∈J+1i δi j logτi j − ∑ j∈J−1i δi j logτi j − ∑ j∈J0i δi j logτi j = −∑ j∈J+1i δi j logτi j − ∑ j∈J−1i (−δi j) log 1 τi j , (3.107) where Jσi = { j | a j ∈ A∧ sgn(ν(H)i j − ν(S)i j ) = σ}. (3.108) Similarly to Jσi , let sσi = ∑ j∈Jσi τi j, (3.109) s+i + s − i + s 0 i = 1. (3.110) Clearly, the problem of finding optimal τi j can be solved separately for each of the sums in (3.107). – For the first sum R+i (τ) = − ∑ j∈J+1i δi j logτi j, (3.111) conditions of the Lemma 1 hold for J = J+1i , νi j = δi j, and s = s + i . Consequently, the function R+i (τ) is minimized for τ∗i j = s+i δi j∑ l∈J+1i δil , j ∈ J+1i . (3.112) Notice that the greater the sum s+i becomes, the lesser is the minimal value R + i (τ ∗). 33 – For the second sum R−i (τ) = − ∑ j∈J+1i (−δi j) log 1 τi j , (3.113) conditions of the Lemma 2 hold for J = J−1i , νi j = −δi j, and s = s−i . As we have shown in Lemma 2, when parameters are bounded below by some arbitrarily small " > 0, the function R−i (τ) is minimized for τ∗i j(") =  s−i 1− (|A| − |Jmini |)" |Jmini | , if j ∈ J min i ; s−i ", if j ∈ J−1i \ Jmini ; (3.114) Jmini = { j ∈ J−1i | −δi j = min l∈J−1i (−δil) = −max l∈J−1i (δil)}. (3.115) Notice that, since τi j occurs in R − i (τ) inversed, unlike in R + i (τ), the lesser the sum s−i becomes, the lesser is the minimal value R − i (τ ∗). – For the indices j ∈ J0i , the choice of τi j is irrelevant and does not change the value of Ri(τ | z) regardless of the magnitude of s0i . In order to combine the independent solutions (3.112) and (3.114) optimizing the separate sums, it is necessary to determine in which proportion should the probability mass be distributed between parameters belonging to J+1i , J −1 i , and J 0 i . As we have seen above, for the minimal value (as a function of the bound ") to be the smallest, s+i has to be as large as possible, while both s−i and s 0 i , to the contrary, have to be as small as 34 possible. Therefore, the optimal proportion for the parameters bounded below by " is s0i = |J0i |", (3.116) s−i = |J−1i |", (3.117) s+i = 1− s−i − s0i . (3.118) The corresponding parameters are then τ∗i j(") =  δi j∑ l∈J+1i δil 1− (|A| − |J+1i |)"  , if j ∈ J+1i ; ", if j ∈ J−1i ∪ J0i . (3.119) Passing to the limit for "→ 0, we finally obtain the parameters that deliver minimum to the function Ri(τ | z): τ∗i j = lim"→0τ ∗ i j(") =  δi j∑ l∈J+1i δil , if j ∈ J+1i ; 0, if j ∈ J−1i ∪ J0i ; = max{0,δi j}∑ a j∈A max{0,δi j} = µi j µi . (3.120) 3.6. 
Generalization for Multiple Base Messages Throughout this chapter we have considered the method for a single arbitrary base message z that is chosen beforehand. However, with minor modifications, the presented reasoning holds for the same criteria but averaged over multiple base messages zl ∈ Z . Indeed, for both approaches described in section 3.4 resulting in stochastic optimization, 35 the only change averaging over Z makes is that the variable z, as well as x , runs over a sample on iterations: w(t+1)i j = w (t) i j + γtW (κk(t))(zl(t), xk(t)) g(xk(t) | zl(t),τ(w(t))) · ni(xk(t) | zl(t)) bτi j(xk(t) | zl(t))−τi j(w(t)). (3.121) The same is true for the likelihood-based approach discussed in section 3.5. If the criterion (3.82) is averaged over a set of base messages Z , the order of summation can be seamlessly changed so that the new outer sum over zl ∈ Z , together with the sum over uk ∈ U , is taken before the sum over ci ∈ AK and a j ∈ A. Consequently, the resulting objective function takes the same form (3.84) but for ν (κ) i j = 1 |Z | |T (κ)| ∑ zl∈Z ∑ uk∈T (κ) ni j(ψ(zl , xk)) ni(ψ(zl , xk)) . (3.122) 36 CHAPTER IV EVALUATION 4.1. Methodology In order to validate the method proposed in this research, first we implemented the entropy classifier for the problem of spam filtering. Following the definition of the problem given in section 3.1.4, our implementation uses the algorithm of prediction by partial matching (PPM) to learn the finite memory Markov models for each of the two classes, and then makes classifying decisions depending on for which class entropy per character is the minimal. We also implemented all three of the algorithms proposed in sections 3.4.1, 3.4.2, and 3.5. Our numerical experiments were organized as follows. For each run of evaluation, first, a combined sample T of both legitimate (T (H)) and spam (T (S)) messages was drawn out of the SpamAssassin public corpus (Apache SpamAssassin Project, 2005). Each message in xk ∈ T was accompanied with the true class labelling κk ∈ {H,S}. The sample T was additionally temporarily split at random in proportion seven to three into the training and testing samples, respectively. The former was used to train the classifier, the latter was used to ensure that performance of the classifier is within the expected boundaries (as compared, for example, to (Bratko, Cormack, et al., 2006)). All of the spam messages in T that were recognized as such according to the obtained class parameters θ (H) and θ (S), were remembered and declared to be the set of base messages Z . Then, our algorithms (3.52), (3.69), and (3.104) were run on the combined sample T in order to obtain transformation parameters τ(E), τ(P), and τ(L), correspondingly. The 37 first two algorithms based on the stochastic optimization were repeatedly run over all pairs (zl , xk) ∈ Z × T , where the index k was incremented first. To control the convergence, after each pass over T (i.e. every |T | iterations), the value of the criterion function corresponding to the current algorithm was estimated using a ten percent subsample of Z × T . This estimation together with the total number of iterations performed by the moment were used to make a stopping decision. Once in a several passes over T (between |T | and 10 |T | iterations, depending on the size of the problem), the current parameters τ(t) = τ(w(t)) were supplied to the Markov chain generator. 
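For reference, a single iteration of the generalized update (3.121) can be sketched as follows; the scalar W^(κ)(z_l, x_k) · g(x_k | z_l, τ(w)) is taken as an input, and the computation of that weight, the handling of its very small magnitude in practice, and the choice of the step schedule γ_t are deliberately left out, so this is a schematic illustration rather than the exact implementation used in the experiments:

```python
import math

def softmax_row(w_i: dict) -> dict:
    """tau_ij = exp(w_ij) / sum_j' exp(w_ij') for a single context c_i, as in (3.24)."""
    m = max(w_i.values())
    exps = {j: math.exp(v - m) for j, v in w_i.items()}
    z = sum(exps.values())
    return {j: e / z for j, e in exps.items()}

def stochastic_step(w: dict, appendix_counts: dict, weight: float, gamma_t: float) -> dict:
    """One iteration of the update (3.121).

    w:               context -> char -> w_ij (assumed initialized for every relevant context)
    appendix_counts: context -> char -> n_ij(x_k | z_l) counted in the sampled appendix
    weight:          the scalar W^(kappa)(z_l, x_k) * g(x_k | z_l, tau(w)) for this sample
    gamma_t:         step size from a schedule satisfying (3.53)-(3.55)

    The sign is "+" as in the maximization form (3.69)/(3.121); the minimization
    form (3.52) uses "-" instead.
    """
    for ci, char_counts in appendix_counts.items():
        n_i = sum(char_counts.values())
        tau_i = softmax_row(w[ci])
        for aj in w[ci]:
            tau_hat = char_counts.get(aj, 0.0) / n_i
            w[ci][aj] += gamma_t * weight * n_i * (tau_hat - tau_i[aj])
    return w
```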
For each base message z ∈ Z, the generator produced a continuation stream of characters distributed according to g(x, τ(t)), which was stopped as soon as the string x̃ of characters produced so far was enough to get the transformed message u = ψ(z, x̃) = zx̃ past the classifier's spam filter. If the length of x̃ exceeded 20 · |z|, the generator was forcefully stopped. This way, for each z, a thousand continuations x̃ were generated to estimate a secondary evaluation measure, the average length of x̃ required to make z legitimate to the classifier.

The third algorithm (3.104) required less work, since it provides an analytical solution once the values µij are calculated. To do so, a single pass of averaging nij(ψ(zl, xk)) over the samples Z × T(H) and Z × T(S) was performed. After that, the same generation procedure described above was run, so that there was an auxiliary measure for comparing this algorithm with the other two and with the baseline strategy. The role of the baseline strategy in our experiments was played by the same generation procedure called with the vector of parameters τ(H) = θ(H) estimated during the training of the classifier on the sample of legitimate messages T(H). That same vector θ(H) also served as the initial estimate for the stochastic optimization.

TABLE 4.1. Performance of the classifier on the full dataset.

  True class    Classified as Ham    Classified as Spam
  Ham           68.0% (1645)         0.5% (13)
  Spam          1.5% (36)            30.0% (725)

FIGURE 4.1. Histogram of lengths of all messages in the full dataset (message length in bytes versus frequency).

4.2. Results

Due to limited computational resources, during all evaluation runs the memory of the Markov models was fixed to three characters for the classifier as well as for the adversary. In practice, entropy-based spam filters demonstrate the best performance for Markov model orders between six and eight characters (Bratko, Cormack, et al., 2006). However, even for K = 3, our implementation of the classifier based on the algorithm of prediction by partial matching has an error rate of approximately 2% on the SpamAssassin dataset. Table 4.1 shows statistics of one such run, in which all 6046 bodies of email messages were split into 3627 training and 2419 testing messages. The distribution of the lengths of the messages is shown in Figure 4.1.

For the order K = 3, the size of the space of parameters τ, representing conditional probabilities of a one-byte character given a context of at most K other one-byte characters, is bounded by

    ∑_{k=0}^{K} |A|^{k+1} = 256^1 + 256^2 + 256^3 + 256^4 ≈ 2^{32}.    (4.1)

The total number of character-context combinations for K = 3 that actually occur in all messages from the SpamAssassin dataset is approximately 524 000. To avoid memory pressure and to achieve faster convergence, the algorithms (3.52) and (3.69), which require stochastic optimization, were run on a series of small subsets of the original dataset. Each time, approximately one percent of the messages were sampled at random from the full dataset.

Let us present evaluations for a typical such run on a 1% dataset, carried out for the three algorithms as described in the previous section. The failure rate of the chosen concatenation-based transformation was zero for all spam messages and for the parameters τ obtained from all three algorithms as well as for the Ham baseline τ(H). That is, it was possible to generate an appendix xk for each base spam message zl such that their concatenation uk = ψ(zl, xk) = zl xk was classified as legitimate.
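As for the third algorithm, a minimal sketch of the single averaging pass (3.122) and of the closed-form parameters (3.104) mentioned above could look as follows; the split of the sample into appendices whose concatenation with a base message is accepted (T(H)) or rejected (T(S)) by the classifier is assumed to have been computed beforehand, and all names are illustrative:

```python
from collections import defaultdict

def averaged_nu(bases, appendices, k: int = 3) -> dict:
    """nu_ij of (3.122): n_ij / n_i in psi(z, x) = zx, averaged over all pairs (z, x)."""
    nu = defaultdict(lambda: defaultdict(float))
    pairs = [(z, x) for z in bases for x in appendices]
    for z, x in pairs:
        u = z + x
        counts = defaultdict(lambda: defaultdict(float))
        for l, ch in enumerate(u):
            counts[u[max(0, l - k):l]][ch] += 1.0
        for ci, cc in counts.items():
            n_i = sum(cc.values())
            for aj, n_ij in cc.items():
                nu[ci][aj] += n_ij / (n_i * len(pairs))
    return nu

def likelihood_optimal_tau(bases, appendices_ham, appendices_spam, k: int = 3) -> dict:
    """Closed-form parameters (3.104): tau_ij = max(0, nu_H - nu_S) / normalizer."""
    nu_h = averaged_nu(bases, appendices_ham, k)
    nu_s = averaged_nu(bases, appendices_spam, k)
    tau = {}
    for ci in set(nu_h) | set(nu_s):
        chars = set(nu_h[ci]) | set(nu_s[ci])
        mu = {aj: max(0.0, nu_h[ci].get(aj, 0.0) - nu_s[ci].get(aj, 0.0)) for aj in chars}
        total = sum(mu.values())
        if total > 0.0:
            tau[ci] = {aj: m / total for aj, m in mu.items()}
    return tau
```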
Since the failure rate alone does not distinguish between the methods, to compare performance for different parameters we used a supplementary index: the ratio |uk|/|zl| of the length of each transformed message to the length of its base message. Note that none of the methods proposed in this work was constructed to optimize this length ratio directly.

Figures 4.2, 4.3, and 4.4 depict the distributions of length ratios averaged over transformation appendices xk generated according to the parameters τ(E), τ(P), and τ(L) that were optimized for the entropy-based, probability-based, and likelihood-based criteria, respectively. As is easy to see from the cumulants provided to the right of each histogram, the algorithms using the probability-based and likelihood-based criteria are preferable to the one using the entropy-based criterion in terms of the length ratio. However, comparing these graphs with Figure 4.5, which shows the histogram and cumulant for the parameters τ(H), all three techniques provide noticeably better performance than the baseline of generating a Ham-like appendix. A short summary of the statistics from these figures is given in Table 4.2.

FIGURE 4.2. (a) Histogram of the length ratio |ψ(zl, xk)|/|zl| averaged over appendices xk generated using the parameters τ(E) optimized for the entropy-based criterion (3.48) on a 1% dataset; (b) its cumulant.

FIGURE 4.3. (a) Histogram of the length ratio |ψ(zl, xk)|/|zl| averaged over appendices xk generated using the parameters τ(P) optimized for the probability-based criterion (3.60) on a 1% dataset; (b) its cumulant.

FIGURE 4.4. (a) Histogram of the length ratio |ψ(zl, xk)|/|zl| averaged over appendices xk generated using the optimal parameters τ(L) (3.104) for the likelihood-based criterion (3.84) on a 1% dataset; (b) its cumulant.

FIGURE 4.5. (a) Histogram of the length ratio |ψ(zl, xk)|/|zl| averaged over appendices xk generated using the baseline parameters τ(H) estimated from θ(H) on a 1% dataset; (b) its cumulant.

TABLE 4.2. Summary statistics for the performance of different generation parameters on 1% datasets.

  Index                    Entropy    Probability    Likelihood    Ham baseline
  Failure rate             0%         0%             0%            0%
  Averaged length ratio:
    Minimum                1.21       1.15           1.16          1.32
    5% quantile            1.29       1.26           1.26          1.52
    Median                 1.58       1.49           1.52          2.06
    Mean                   1.65       1.59           1.57          2.13
    95% quantile           2.26       2.35           2.13          3.27
    Maximum                2.89       3.14           2.61          4.27

Figure 4.6 shows several examples of generated transformation texts for a few short spam messages from the dataset. Each of the original spam messages (typeset on a white background) is followed by strings produced by the generation procedure according to the optimal parameters (highlighted with a gray background). Any of the presented appendices is sufficient to make the corresponding spam message look like a legitimate text, and cannot be shortened without changing the class of the transformed message back to spam.
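The generation procedure of section 4.1 that produces such appendices can be sketched as follows; the back-off to shorter contexts for unseen contexts is an illustrative simplification rather than the exact implementation used in the experiments:

```python
import random

def generate_appendix(z: str, tau: dict, classify, k: int = 3, max_factor: int = 20) -> str:
    """Sample a continuation character by character from g(. | context, tau)
    until the concatenation u = z + appendix passes the filter or the length cap is hit.

    tau maps a context (up to k characters) to a dict of character probabilities;
    classify(u) returns "H" or "S".
    """
    u = z
    while len(u) - len(z) < max_factor * len(z):
        context = u[-k:]
        dist = None
        for start in range(len(context) + 1):   # back off to shorter contexts if necessary
            dist = tau.get(context[start:])
            if dist:
                break
        if not dist:
            break                                # nothing left to sample from
        chars, probs = zip(*dist.items())
        u += random.choices(chars, weights=probs, k=1)[0]
        if classify(u) == "H":
            break
    return u[len(z):]
```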
For the purpose of comparing the systems of probability distributions resulting from the aforementioned attack techniques, we measured the difference between the two probability distributions defined by parameters τ(1)i and τ(2)i for each context ci ∈ A^K. To do so, we used the distance function

    d(\tau^{(1)}_i, \tau^{(2)}_i) = \sum_{a_j \in A} \bigl| \tau^{(1)}_{ij} - \tau^{(2)}_{ij} \bigr|.    (4.2)

    Hi we are luke's secret following we love luke fictitious! We are also your
    long lost friend! Hi This email has nothing to do with lukefictitious.com
    We wil be putting up our very own fan site soon and wanted to let you know
    in advance! Have a beautifull day! Joseph

    Regard E __________ Exm

    Hey, I just wanted to tell you about a GREAT website. http://www.metrojokes.com
    Features lots of jokes! Extremly unique features and classified in categories.
    I appriciate your time. Thank you

    your loved one out of your typical diam - No, the out their until your typi
    from you're decent? I ass, but the sun doesn't fat able to be and wonderful

    >> (suddenlysusan@Stoolmail.zzn.com) on Tuesday, July 30, 2002 at 17:07:56 :
    Why Spend upwards of $4000 on a DVD Burner when we will show you an
    alternative that will do the exact same thing for just a fraction of the
    cost? Copy your DVD's NOW.

    This? I It spamassin-dev -- "If you."
    This a multi-part, surround you're du
    This a must IM. Build searcharself with think?
    This a deady to be looking of some merge.net

    DON'T MISS OUT ON AN AMAZING BUSINESS OPPORTUNITY AND WEIGHT LOSS PRODUCT!
    PLEASE VISIT www.good4u.autodreamteam.com THERE IS NO OBLIGATION AND IT'S
    WORTH A LOOK!

    Remore > OK guys -- I r md: rules. > > with smart_0088 [evel Yet emailename="smime
    > OK guys -- I reck_f > > BSMTP-support people
    > OK guy

FIGURE 4.6. Examples of original spam messages zl (white background) and several appendices xk corresponding to each zl, generated using parameters τ(E) optimized on a 1% dataset (gray background).

TABLE 4.3. Summary statistics for the performance of the generation parameters optimal for the likelihood-based criterion on the full dataset.

Index                      Likelihood optimization   Ham baseline
Failure rate               0%                        0.16%
Averaged length ratio
  Minimum                  1.001                     1.001
  5% quantile              1.068                     1.248
  Median                   1.184                     1.704
  Mean                     1.232                     1.976
  95% quantile             1.594                     4.101
  Maximum                  2.929                     8.498

Figure 4.9 features the distributions of the distances d(τ(1)i, τ(2)i) for all pairs of parameters τ(1)i, τ(2)i ∈ {τ(E)i, τ(P)i, τ(L)i, τ(H)i} fitted on a 1% dataset using the three algorithms as well as the baseline Ham parameters. A distance of d(τ(1)i, τ(2)i) = 0 indicates that the distributions τ(1)i and τ(2)i are identical, while larger distances correspond to greater disagreement between them.

Since the likelihood-based algorithm (3.104) does not have computational requirements as high as the other two algorithms, which resort to stochastic optimization, it was possible for us to run it on the full dataset. The resulting distributions of the length ratio for the parameters optimal for the likelihood-based criterion and for the baseline Ham parameters are presented in Figures 4.7 and 4.8, in the same fashion as for the 1% dataset shown above. Table 4.3 lists the same five quantiles of the length ratio as well as its mean value. Comparing these statistics with those in Table 4.2, we can conclude that the gap between the Ham-like generation and the likelihood-based algorithm is even greater (≈ 70% vs. ≈ 50%).
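A direct implementation of the distance (4.2) defined above might look as follows. For illustration only, each conditional distribution is assumed to be stored as a dictionary mapping characters to probabilities for a fixed context; this representation is ours and not taken from the thesis code.

    def distance(tau1_i, tau2_i):
        """Distance (4.2) between two conditional character distributions for a single
        context c_i; each argument maps a character a_j to its probability tau_ij."""
        alphabet = set(tau1_i) | set(tau2_i)
        return sum(abs(tau1_i.get(a, 0.0) - tau2_i.get(a, 0.0)) for a in alphabet)

    # Toy example with a small slice of the alphabet:
    p = {"a": 0.7, "b": 0.2, "c": 0.1}
    q = {"a": 0.5, "b": 0.3, "d": 0.2}
    print(distance(p, q))   # approximately 0.6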
FIGURE 4.7. (a) Histogram of the length ratio |ψ(zl, xk)|/|zl| averaged over appendices xk generated using the optimal parameters τ(L) (3.104) for the likelihood-based criterion (3.84) on the full dataset; (b) its cumulant.

FIGURE 4.8. (a) Histogram of the length ratio |ψ(zl, xk)|/|zl| averaged over appendices xk generated using the baseline parameters τ(H) estimated from θ(H) on the full dataset; (b) its cumulant.

FIGURE 4.9. Histograms of the distances between parameters obtained through the different adversary algorithms as well as the baseline parameters (panels: Entropy vs. Ham, Probability vs. Ham, Likelihood vs. Ham, Entropy vs. Probability, Entropy vs. Likelihood, Probability vs. Likelihood).

CHAPTER V

CONCLUSIONS AND FUTURE WORK

We introduced three formalizations of possible adversarial objectives for classifiers that use cross-entropy as the deciding criterion. Each of the three approaches proved its efficiency compared to the baseline approach of using the probability distributions estimated only on legitimate messages to define a transformation source. The third technique showed itself to be the most efficient of the three, both in terms of transformation and computational requirements. Although the first two techniques involve some implementation difficulties, after appropriate calibration they showed comparable performance. Together, all three methods have demonstrated the feasibility of statistically attacking compression-based classifiers using a relatively limited extent of transformation (for the SpamAssassin corpus, on average, approximately 20% of the original message's length is to be generated and appended).

Future work includes three directions in which this research could be extended. To begin with, it is of interest to explore how the methods of parametrized optimization considered in this work compare with other algorithms that optimize the contents of transformation texts directly, on a per-character basis. This problem setting can potentially increase the number of different criterion functions that can be considered to formalize the adversary problem. The methods that might be applied in that setting include Markov chain Monte Carlo methods, genetic algorithms, simulated annealing, and others.

Another promising direction consists in analyzing the dynamics of the classifier-adversary system for the particular case of entropy-based classification considered in this thesis. Although it is hard to attack this problem in general, it might be feasible to derive some useful properties for the specific algorithms that we have discussed in our work.

Last but not least, it is important to research ways of improving the performance of compression-based classifiers that might be possible given the knowledge of potential adversary attacks.
REFERENCES CITED

Apache SpamAssassin Project. (2005). The SpamAssassin public mail corpus. Available at https://spamassassin.apache.org/publiccorpus/.

Biggio, B., Nelson, B., & Laskov, P. (2011). Support vector machines under adversarial label noise. In Journal of Machine Learning Research: Workshop and Conference Proceedings (Vol. 20, pp. 97–112). MIT.

Biggio, B., Nelson, B., & Laskov, P. (2012). Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning. Omnipress.

Bratko, A., Cormack, G. V., Filipič, B., Lynam, T. R., & Zupan, B. (2006). Spam filtering using statistical data compression models. The Journal of Machine Learning Research, 7, 2673–2698.

Bratko, A., Filipič, B., & Zupan, B. (2006). Towards practical PPM spam filtering: Experiments for the TREC 2006 Spam Track. In Proceedings of the 15th Text REtrieval Conference (TREC 2006). Gaithersburg, MD.

Cleary, J. G. & Teahan, W. J. (1997). Unbounded length contexts for PPM. The Computer Journal, 40(2–3), 67–75.

Cleary, J. G. & Witten, I. H. (1984). Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4), 396–402.

Cormack, G. V. & Horspool, R. N. S. (1987). Data compression using dynamic Markov modelling. The Computer Journal, 30(6), 541–550.

Frank, E., Chui, C., & Witten, I. H. (2000). Text categorization using compression models. In Proceedings of DCC-00, IEEE Data Compression Conference (pp. 200–209). Snowbird, US: IEEE Computer Society Press, Los Alamitos, US.

Goodman, J., Heckerman, D., & Rounthwaite, R. (2005). Stopping spam. Scientific American, 292(4), 42–49.

Lowd, D. & Meek, C. (2005a). Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 641–647). ACM.

Lowd, D. & Meek, C. (2005b). Good word attacks on statistical spam filters. In Proceedings of the Second Conference on Email and Anti-Spam. Palo Alto, CA.

Moffat, A. (1990). Implementing the PPM data compression scheme. IEEE Transactions on Communications, 38(11), 1917–1921.

Robbins, H. & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22, 400–407.

Teahan, W. J. (1995). Probability estimation for PPM. In New Zealand Computer Science Research Students' Conference, 1995. University of Waikato, Hamilton, New Zealand.