AUTOMATED ATTACKS ON COMPRESSION-BASED CLASSIFIERS by IGOR BURAGO A THESIS Presented to the Department of Computer and Information Science and the Graduate School of the University of Oregon in partial fulfillment of the requirements for the degree of Master of Science June 2014 THESIS APPROVAL PAGE Student: Igor Burago Title: Automated Attacks on Compression-Based Classifiers This thesis has been accepted and approved in partial fulfillment of the requirements for the Master of Science degree in the Department of Computer and Information Science by: Daniel Lowd Chair Dejing Dou Core Member Christopher Wilson Core Member and Kimberly Andrews Espy Vice President for Research & Innovation/Dean of the Graduate School Original approval signatures are on file with the University of Oregon Graduate School. Degree awarded June 2014 ii c© 2014 Igor Burago iii THESIS ABSTRACT Igor Burago Master of Science Department of Computer and Information Science June 2014 Title: Automated Attacks on Compression-Based Classifiers Methods of compression-based text classification have proven their usefulness for various applications. However, in some classification problems, such as spam filtering, a classifier confronts one or many adversaries willing to induce errors in the classifier’s judgment on certain kinds of input. In this thesis, we consider the problem of finding thrifty strategies for character-based text modification that allow an adversary to revert classifier’s verdict on a given family of input texts. We propose three statistical statements of the problem that can be used by an attacker to obtain transformation models which are optimal in some sense. Evaluating these three techniques on a realistic spam corpus, we find that an adversary can transform a spam message (detectable as such by an entropy-based text classifier) into a legitimate one by generating and appending, in some cases, as few additional characters as 20% of the original length of the message. iv CURRICULUM VITAE NAME OF AUTHOR: Igor Burago GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED: University of Oregon, Eugene, Oregon Far Eastern Federal University, Vladivostok, Russia DEGREES AWARDED: Master of Science, Computer & Information Science, 2014, University of Oregon Bachelor of Science, Applied Mathematics and Informatics, 2011, Far Eastern Federal University AREAS OF SPECIAL INTEREST: Artificial Intelligence Machine Learning Adversarial Machine Learning PROFESSIONAL EXPERIENCE: Part-time software developer and system administrator, Pacific Fisheries Research Center (TINRO-Center), Vladivostok, Russia, 2007–2009 Volunteer jury member of the Far-Eastern regional stages of individual and team All- Russian high school programming competitions, Far Eastern Federal University, Vladivostok, Russia, 2006–2011 Volunteer software developer, Department of Computer Science, Institute of Mathematics and Computer Science, Far Eastern Federal University, Vladivostok, Russia, 2008 GRANTS, AWARDS AND HONORS: Graduate Research Fellowship, Information Services, 2012–2014 Summa cum Laude, Far Eastern Federal University, 2011 v PICES Ocean Monitoring Service Award, North Pacific Marine Science Organization (along with Igor I. Shevchenko and Olga N. 
Vasik, TINRO-Center), 2009 Scholarship of the President of the Far Eastern Federal University, 2009 Scholarship of the Administration of Vladivostok, 2009 Scholarship of the Governor of Primorsky Krai, 2009 Scholarship of the President of the Russian Federation, 2009 Scholarship of the President of the Far Eastern Federal University, 2008 Scholarship of the Governor of Primorsky Krai, 2008 Award of the President of the Russian Federation for Support of Talented Youth, 2006 PUBLICATIONS: Burago, I. V., Vasik, O. N., Moiseenko, G. S., & Shevchenko, I. I. (2013). Metadata Exchange as a Preliminary Step in Creating a Common Information Space. (In Russian). In Proceedings of the industry seminar “Mathematical Modeling and Information Technology in the Research of Biological Resources of the World’s Oceans,” September 25–27, 2013 (pp. 53–55). Vladivostok, Russia: TINRO- Center. Burago, I. V. & Shevchenko, I. I. (2010). Automatic Generation of Enumeration Problems. In The First Russia and Pacific Conference on Computer Technology and Applications (RPC 2010), September 6–9, 2010. Vladivostok, Russia: IACP FEB RAS. ISBN: 978-0-9803267-3-4 (CD). Burago, I. V. & Shevchenko, I. I. (2009). Automatic generation of problems using the method of constraint propagation. (In Russian). In The XXXIV Academician E. V. Zolotov Far Eastern Mathematical School-Seminar, “Fundamental Problems of Mathematics and Information Sciences,” June 25–30, 2009 (pp. 162–163). Khabarovsk, Russia: Pacific National University Press. Burago, I. V., Vasik, O. N., Moiseenko, G. S., & Shevchenko I. I. (2007). An infrastructure for the Metadata Exchange of Ecosystem Observations. (In Russian). In Proceedings of the industry seminar “Mathematical Modeling and Information Technology in the Research of Biological Resources of the World’s Oceans,” October 1–3, 2007 (pp. 22–24). Vladivostok, Russia: TINRO-Center. vi ACKNOWLEDGEMENTS I would like to thank my advisor, Daniel Lowd, for the amount of time he dedicated to helpful disscussions on the subject of this research. vii TABLE OF CONTENTS Chapter Page I. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 II. BACKGROUND AND RELATED WORK . . . . . . . . . . . . . . . . . . . . . . 4 III. METHOD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.1. Classifier Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.2. Adversary Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.3. Instrumental Sampling Approach . . . . . . . . . . . . . . . . . . . . 15 3.4. Importance Sampling Approach . . . . . . . . . . . . . . . . . . . . . 19 3.5. Likelihood-Based Criterion . . . . . . . . . . . . . . . . . . . . . . . . 26 3.6. Generalization for Multiple Base Messages . . . . . . . . . . . . . . 35 IV. EVALUATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.1. Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.2. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 V. CONCLUSIONS AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . 48 REFERENCES CITED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 viii LIST OF FIGURES Figure Page 4.1. Histogram of lengths of all messages in the full dataset. . . . . . . . . . . . 39 4.2. 
(a) Histogram of the length ratio |ψ(zl , xk)|/|zl | averaged over appendices xk generated using the parameters τ (E) optimized for the entropy-based criterion (3.48) on a 1% dataset; (b) its cumulant. . . . . . . . . . . . . . 41 4.3. (a) Histogram of the length ratio |ψ(zl , xk)|/|zl | averaged over appendices xk generated using the parameters τ (P) optimized for the probability- based criterion (3.60) on a 1% dataset; (b) its cumulant. . . . . . . . . . 41 4.4. (a) Histogram of the length ratio |ψ(zl , xk)|/|zl | averaged over appendices xk generated using the optimal parameters τ (L) (3.104) for the likelihood-based criterion (3.84) on a 1% dataset; (b) its cumulant. . . . 42 4.5. (a) Histogram of the length ratio |ψ(zl , xk)|/|zl | averaged over appendices xk generated using the baseline parameters τ (H) estimated from θ (H) on a 1% dataset; (b) its cumulant. . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.6. Examples of original spam messages zl (white background) and several appendices xk corresponding to each zl that are generated using parameters τ(E) optimized on a 1% dataset (gray background). . . . . . 44 4.7. (a) Histogram of the length ratio |ψ(zl , xk)|/|zl | averaged over appendices xk generated using the optimal parameters τ (L) (3.104) for the likelihood-based criterion (3.84) on the full dataset; (b) its cumulant. . 46 4.8. (a) Histogram of the length ratio |ψ(zl , xk)|/|zl | averaged over appendices xk generated using the baseline parameters τ (H) estimated from θ (H) on the full dataset; (b) its cumulant. . . . . . . . . . . . . . . . . . . . . . . . 46 4.9. Histograms of the distances between parameters obtained through different adversary algorithms as well as the baseline parameters. . . . . . . . . . 47 ix LIST OF TABLES Table Page 4.1. Performance of the classifier on the full dataset. . . . . . . . . . . . . . . . . 39 4.2. Summary statistics for the performance of different generation parameters on 1% datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.3. Summary statistics for the performance of the generation parameters optimal for the likelihood-based criterion on the full datasets. . . . . . 45 x CHAPTER I INTRODUCTION Automatic text classification is a widespread problem occurring in different applications. An important example of such application originates from the necessity of filtering unsolicited or undesired messages in email or other messaging systems. One of the effective methods for filtering messages is based on the observation that different kinds of texts often have distinctive compression characteristics. In this approach, messages are seen as finite chunks of characters originating from streams that have probabilistic nature. More strictly, each kind of text, or class, is assumed to be independently defined by a probability distribution that describes the chances of a character to appear at each of the states of the stream producing texts of this class. Then, if that class distributions are known or can be reconstructed with sufficient accuracy from an available sample, any text is classified by comparing its compression characteristics for each of the class distributions. One metric directly connected with compressibility is the entropy which is usually calculated per character to make strings of unequal lengths comparable. 
If, for the text in question, the entropy per character, estimated under the assumption that the text originated from one source, turns out to be lower than the same entropy estimated for another source, then it is assumed that the text is more likely to belong to the former class than to the latter. This way, the objective of a compression-based classifier resolves into the statistical problem of learning unknown class models by observing the messages arriving at the classifier's input.

In this thesis, we make an attempt to look at the classifier problem from the adversarial point of view. In the abstract, the goal of an adversary is to trick the classifier into making decisions advantageous for a certain adversarial objective, e.g., finding a way to modify a spam message so that it would not be classified as such and would instead be labeled as legitimate. Unsurprisingly, the adversarial problem is, in some sense, inverse to the classifier problem: While the goal of the classifier is to model class sources by analyzing the input stream of texts, the goal of an adversary is to affect the input stream of the classifier so as to skew the resulting class models. That is, what was output for the classifier becomes input for the adversary, and vice versa.

We approach the adversary problem as a statistical one. That is, for the example of message filtering, we assume that the adversary does not aim to obtain a beneficial verdict from the classifier once, for a single message, but rather wants to find a whole family of messages that would be classified in a certain way (while, optionally, satisfying some additional conditions aligned with the adversary's goals). In other words, the adversary is interested in methodically changing the statistical properties of the classifier's input stream.

To validate the methods discussed in this work, we concentrate on a typical application of entropy-based text classification, the problem of spam filtering. It allows us to evaluate our algorithms on a sufficient amount of real data, while remaining in the simplest case of binary decisions. Considering the potential interests of an adversary in this problem, we introduce three different adversary problem settings that meaningfully formalize the objectives of a spammer. In doing so, we have tried to keep a balance between the generality of a mathematically stated objective and the feasibility of its analysis. We evaluate the effectiveness of each strategy on the SpamAssassin public corpus of legitimate and spam email messages (Apache SpamAssassin Project, 2005), which is widely used for this purpose.

The remainder of the thesis is structured as follows. Chapter II provides background on the problem of entropy-based classification and related work on some of the corresponding adversarial problems for this type of classifier. Chapter III introduces the formal definition of the adversary problem and describes our approach to solving it. Chapter IV describes the experimental results obtained from the evaluation of our method on a public email corpus. Finally, Chapter V concludes.

CHAPTER II
BACKGROUND AND RELATED WORK

The starting point of our research is the problem of compression-based text classification. Fundamentally, it rests on the assumption that when a pair of texts compress well together and, consequently, share some structural homogeneity, it is more likely that they belong to the same category. This assumption forms the basis for a whole class of methods using measures of compressibility under various compression models as measures of text similarity.
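As a purely illustrative aside (not the construction used later in this thesis, which works with explicit character-level probability models), the intuition can be demonstrated with any off-the-shelf compressor: a pair of texts that compresses better jointly than separately is likely to share structure. A minimal sketch in Python, with made-up example strings:

```python
import zlib

def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 encoding of a text."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def joint_compression_gain(a: str, b: str) -> int:
    """Bytes saved by compressing the two texts together rather than separately;
    larger values suggest shared structure (and, heuristically, the same class)."""
    return compressed_size(a) + compressed_size(b) - compressed_size(a + b)

spam_a = "Buy cheap meds online now! Lowest prices, no prescription required."
spam_b = "Cheapest meds online! No prescription required, lowest prices ever."
ham = "The seminar moved to Thursday at noon; please update your calendars."
print(joint_compression_gain(spam_a, spam_b))  # typically larger: similar texts
print(joint_compression_gain(spam_a, ham))     # typically smaller: dissimilar texts
```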
This approach has been studied in a number of works targeting particular models, specific algorithms for building them, and their efficiency for estimating the similarity of different types of texts (Cormack & Horspool, 1987; Frank, Chui, & Witten, 2000; Goodman, Heckerman, & Rounthwaite, 2005; Bratko, Cormack, Filipič, Lynam, & Zupan, 2006; Bratko, Filipič, & Zupan, 2006). The focus of our research is specifically on entropy-based classifiers, which define the similarity measure to be the cross entropy. At the same time, our approach is mostly agnostic to the particular choice of algorithm used for learning the probability models of the classes. For the purposes of evaluation, to estimate class probability distributions, we use the algorithm of prediction by partial matching (PPM) (Cleary & Witten, 1984; Moffat, 1990; Cleary & Teahan, 1997; Teahan, 1995), which has been shown to suit well the special case of text classification considered here, spam filtering (Bratko, Cormack, et al., 2006; Bratko, Filipič, & Zupan, 2006).

While a significant amount of effort has been devoted to studying the effectiveness of compression models in selected applications of text classification, there is still no complete understanding of how robust such algorithms are to different kinds of adversarial noise.

In (Lowd & Meek, 2005b), it has been shown that a word-based attack is effective against a maximum entropy spam filter (and a naive Bayes one, as well). The authors propose the attack of inserting and appending freestanding words to spam messages that are detectable as such by the classifier. To guide the selection of words, they consider two sets of heuristics depending on whether the adversary is able to determine that a modified message has been filtered out (active attack) or not (passive attack). In the former case, the knowledge of the classifier's decisions is used as a supervisory signal. In essence, random words that, when included in a barely-spam message, make it legitimate are promoted for future additions; those that make a barely-legitimate message spam are suppressed. In the case of a passive attack, the authors use frequency-based heuristics requiring the availability of training samples. According to this strategy, words with a higher frequency of occurrence in legitimate messages, or, alternatively, with a higher ratio of frequencies in legitimate messages over spam messages, are assigned a higher probability of being used for an attack. Although these strategies prove the feasibility of attacking maximum entropy classifiers, they are too simplistic. In both the active and passive cases, information about the message being modified is unnecessary for approaching the optimal transformation procedure; that procedure is global due to the nature of the considered classifiers.

The work (Lowd & Meek, 2005a) expands the subject to devising an attack on binary linear classifiers in the case of incomplete or missing information about the parameters of the probabilistic models of those classifiers. This time, however, the focus is not on the ways of constructing an attack, but rather on methods of retrieving enough knowledge about the classifier to make such a construction possible. The authors consider the cases of Boolean and continuous features for a few adversarial cost functions estimating the cost, or penalty, of each instance sent by an adversary to the input of a classifier.
5 For both types of features, they propose algorithms that for a linear cost function is able to discover classifier’s weights in polynomial number of queries to the classifier. In (Biggio, Nelson, & Laskov, 2011, 2012), the question of robustness of another linear classifier, support vector machines, is researched. The authors propose a poisoning algorithm iteratively improving the attack point in attempt to optimize the classification error which, as they show in evaluation, can be made as large as about one third. The problem considered in these works, however, is less relevant to the one we are working with in this thesis. The key distinction consists in that the authors concentrate on a different kind of adversarial position, where an attacker is supposed to be able to inject prepared inputs into the classifier’s training data, thus poisoning them. In contrast, we focus on the adversary’s task of disguising their activity in the input of the classifier given a fixed training data which is assumed to be unchanged during the time of the interaction. Additionally, the obtained results significantly depend on properties of support vector machines, and cannot be directly applied to maximum entropy classifiers. 6 CHAPTER III METHOD 3.1. Classifier Problem 3.1.1. Preliminaries Let X ⊆ A∗ be a space of arbitrary text strings over some finite alphabet A. On this space, we consider sources or classes of strings that are defined by probability distributions over the set X . In particular, from now on, whenever we discuss a classification problem, we assume that there exists a single input source of strings from X that come on the input of the classifier. The input source is described by the probability g(x)≡ P(ξ = x) assigned to values x ∈ X of the discrete random variable ξ standing for the input strings. The classifier reconstructs the probability distributions f (κ)(x) corresponding to one or more classes κ. Formally, we define probability f (κ)(x) ≡ P(ξ = x | C (κ)) for C (κ) being the event {ξ belongs to class κ}. In this work we concentrate on the case of two classes of strings: legitimate Ham messages and unsolicited Spam messages that are designated with κ = H and κ= S, respectively. 3.1.2. Finite Memory Markov Model The probability of a string x ∈ X originating from the class κ is equal to f (κ)(x) = |x |∏ l=1 P(x l | x l−11 ,κ), (3.1) 7 where x l denotes the l-th character of the string x , and x l k the substring of x starting from the k-th and ending with the l-th character (if k > l, x lk is empty). For the sake of brevity, P(x l | x l−11 ,κ) stands for the probability P(ξl = x l | ξl−11 = x l−11 , C (κ)) of character x l following the context x l−1 1 . Naturally, we can parametrize distributions f (κ)(x) using these probabilities: f (κ)(x) = f (x ,θ (κ)) = |x |∏ l=1 θ (κ) i(x l−11 ), j(x l ) , (3.2) where i(x l−11 ) and j(x l) denote the ordinal numbers of the context x l−1 1 = ci ∈ A∗ and the character x l = a j ∈ A for some orderings on the sets A and A∗, and parameters θ (κ)i j are the probabilities P(ξl = a j | ξl−11 = ci, C (κ)). From this point on, we will also assume that each class κ can be modelled as a stationary and ergodic Markov chain which memory is bounded by certain order K ≥ 1. 
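To make the bounded-memory parametrization concrete, the following minimal sketch evaluates log f(x, θ) for an order-K model; the dictionary theta (an illustrative layout, not the thesis implementation) maps a context of at most K characters to a distribution over the next character, and unseen pairs are given a small floor probability so that the logarithm stays finite:

```python
import math

K = 3  # Markov order: each character is conditioned on at most K preceding characters

def log_prob(x: str, theta: dict, k: int = K, floor: float = 1e-9) -> float:
    """log f(x, theta) under a finite-memory Markov model.

    theta[context][char] approximates P(char | context), where the context
    is the string of up to k characters preceding the current position.
    """
    logp = 0.0
    for l, ch in enumerate(x):
        context = x[max(0, l - k):l]
        p = theta.get(context, {}).get(ch, floor)
        logp += math.log(p)
    return logp
```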
Under the assumption that limited memory K is sufficient for evaluating probability (3.2), we can rewrite it for our convenience as f (x ,θ ) = |x |∏ l=1 θi(x l−1l−K ), j(x l ) = ∏ ci∈AK a j∈A θ ni j(x) i j (3.3) for ni j(x) being the number of times character a j follows context ci in string x (or, alternatively, substring cia j occurs in x), where ∑ a j∈A θi j = 1, for all ci ∈ AK , (3.4)∑ ci∈AK a j∈A ni j(x) = |x |. (3.5) 8 This way, any string x is viewed as a set of overlapping (K + 1)-grams with frequencies ni j(x), and parameters θi j characterize a class of strings as a whole. Reasoning completely analogously for the probability g(x), we obtain the same parametrized form: g(x ,τ) = ∏ ci∈AK a j∈A τ ni j(x) i j . (3.6) To avoid confusion, we use the letter τ to denote the vector of parameters of the input source as distinguished from vectors of class parameters θ (κ). 3.1.3. Problem Statement The above parametrization following from finite memory Markov models allows us to view the mathematical problem of inferring a class model as an optimization problem in the space of parameters: R(θ ) = Eξ  r(ξ,θ ) →min θ (3.7) for some measure function r(ξ,θ) evaluating the “loss” or “penalty” of classifying message ξ as belonging to the class described by the probability distribution with parameters θ . In other words, the objective of the problem (3.7) for each class κ is to find parameters θ (κ) giving the least losses on average according to r(ξ,θ (κ)). The expectation is taken over the probability distribution g(x) of strings ξ from input source. Generally, probability g(x) is supposed to be unknown for all classes. For this reason, a version of the problem (3.7) for empirical averaging is considered: bR(θ ) = ∑ xk∈T r(xk,θ )→min θ , (3.8) 9 where T stands for a training sample of messages corresponding to the class in question. Hereinafter, for consistency, training samples of Ham and Spam classes are labeled as T (H) and T (S) accordingly. When the inference problem is solved and the vectors of parameters θ (H) and θ (S) are estimated for each class, they can be used to make classifying decision based on the same principle of least loss: q(x ,θ ) = r(x ,θ (H))− r(x ,θ (S)), (3.9) κ(x) =  H, if q(x ,θ )< α(H); S, if q(x ,θ )≥ α(S). (3.10) In case of α(H) ≤ q(x ,θ) < α(S), additional measures are needed to decide the class (for example, increasing the length of the message in question). Most commonly, both parameters are set the same value, α(H) = α(S) = α. The choice of parameter α is guided by its influence on the number of type I and type II errors. 3.1.4. Entropy Classification Let us consider the measure function r(ξ,θ) = − 1|ξ| log f (ξ,θ). As it is obvious from the above definitions, the general criterial function (3.7) specializes to the cross entropy R(θ ) = H(θ )≡ −Eξ • 1 |ξ| log f (ξ,θ ) ˜ = −∑ x∈X 1 |x | g(x) log f (x ,θ )→minθ . (3.11) We will refer to this specialization of the problem (3.7) as the classifier problem. 10 Similarly, the empiric version (3.8) becomes bR(θ ) = ÒH(θ )≡ −∑ xk∈T 1 |xk| log f (xk,θ )→minθ , (3.12) where T , of course, is assumed to be a sample of strings distributed according to g(x). Decision rule (3.10) can be rewritten as follows. q(x ,θ ) = 1 |x | log f (S)(x)− 1|x | log f (H)(x) = 1 |x | log f (S)(x) f (H)(x) , (3.13) κ(x) =  H, if q(x ,θ )< α; S, if q(x ,θ )≥ α. (3.14) In practice, parameter α is often set to zero. 
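A sketch of the resulting decision rule (3.13)-(3.14), reusing the illustrative log_prob helper from the previous listing:

```python
def classify(x: str, theta_ham: dict, theta_spam: dict, alpha: float = 0.0) -> str:
    """Entropy-based decision rule: q(x) = (1/|x|) * log(f_S(x) / f_H(x)).

    Returns "H" (Ham) if q(x) < alpha and "S" (Spam) otherwise.
    """
    q = (log_prob(x, theta_spam) - log_prob(x, theta_ham)) / max(len(x), 1)
    return "H" if q < alpha else "S"
```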
It is well known that if the function g(x) is given and f (x ,θ) > 0 for all x such that g(x)> 0, then f (x ,θ )∝ g(x)|x | (3.15) is an exact solution of the problem (3.11). Because, as we have seen above, both f (x) and g(x) can be parametrized identically, at least in the case when all texts x have the same length (or the variation in lengths can be neglected), f (x ,θ ) can be constructed from g(x ,τ) by letting θ = τ. The parameters τ, in turn, can be directly found by estimating conditional probabilities P(x l | x l−1l−K) on some training sample T . This observation forms the basis of the technique called Prediction by Partial Matching (PPM). Aside from differences in strategies of approximating probabilities for character-context pairs that do not occur in a given sample, PPM algorithms work as 11 simple frequency estimators setting θi j ≈ Ni jNi , (3.16) for Ni j = ∑ x∈T ni j(x), (3.17) Ni = ∑ a j∈A Ni j. (3.18) For versions of PPM estimators and the details of their implementation, see (Cleary & Witten, 1984; Teahan, 1995; Cleary & Teahan, 1997; Moffat, 1990). 3.2. Adversary Problem As we just seen above, in the classifier problem (3.11) the goal was to find an optimal statistical model f (x ,θ ) for messages of some class, given a fixed input source defined by probabilities g(x) which manifest itself in a sample T . To put it more strictly, the function g(x) was fixed (although unknown), while the probability distribution f (x ,θ ) was known up to the vector θ which were the parameters in question. It is also of interest to consider the inverse problem statement where given fixed statistical model f (x ,θ ) of some class, it is required to find the source distribution g(x) which is the most favourable for certain classification outcome. In this setting, g(x) becomes the function in question, while f (x ,θ) is fixed through a known vector of parameters θ . One example of such inverse objective is the problem of determining g(x) generating messages that are as close to Ham messages as possible in terms of probability of passing the spam filter. Another version of the problem that also falls into this category is the following adversary problem (or, in case of spam filtering, the spammer problem). 12 For a given string z from some set of base messages Z , find probability distribution for generating strings x t such that the result of some transformation ψ(z, x t) combining them complies with some statistical requirement, e.g. being classified as Ham on average. This setting is especially practical for a spammer when z by itself has low chances of passing the filter. To state the spammer problem more formally, we assume the following. There is a generator algorithm which plays the role of a source of strings x t(τ) for a specified vector of parameters τ. Strings x t(τ) are considered to be generated randomly and independently, and have the same distribution in the space of strings X . The generated strings x t(τ) are then used to obtain new messages ut = ψ(z, x t(τ)) from a given message z according to the predetermined transformation ψ. In general, the function ψ(z, x) can associate a pair of strings with any string. One such transformation that is simple but still keeps the problem non-trivial is string concatenation, ψ(z, x) = zx . Even though our method does not sufficiently depend on a particular transformation, for illustration purposes, when necessary, we will not be using other definitions of ψ except for the concatenation. 
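Since both the frequency estimators (3.16)-(3.18) and the concatenation transformation ψ(z, x) = zx recur throughout the rest of this chapter, a minimal sketch of the former is given here for reference; the additive smoothing stands in for the escape mechanisms of actual PPM implementations, so this is a simplification of the algorithms cited in section 3.1.4 rather than a faithful reimplementation:

```python
from collections import defaultdict

def estimate_theta(training_texts, k: int = 3, smoothing: float = 0.1) -> dict:
    """Frequency estimate theta_ij ~= N_ij / N_i over a training sample.

    N_ij counts how often character a_j follows context c_i in the sample;
    additive smoothing is a simplification of PPM's escape probabilities.
    """
    counts = defaultdict(lambda: defaultdict(float))  # context -> char -> N_ij
    for x in training_texts:
        for l, ch in enumerate(x):
            counts[x[max(0, l - k):l]][ch] += 1.0
    theta = {}
    for context, char_counts in counts.items():
        total = sum(char_counts.values()) + smoothing * len(char_counts)
        theta[context] = {c: (n + smoothing) / total for c, n in char_counts.items()}
    return theta

# Example usage with a hypothetical sample: theta_ham = estimate_theta(ham_training_messages)
```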
The objective of the inverse problem itself remains the same: G(τ)≡ −∑ x∈X 1 |x | g(x ,τ) log f (x)→minτ , (3.19) with the exception of that optimization is done for the parameters τ of the source distribution g(x ,τ), not the class distribution f (x ,θ ) = f (x). The decision to search in the parametrized space of distributions g(x ,τ) is justified by the necessity to obtain a generative (rather than discriminative) model of the desired message source. As in the case of the classifier problem (3.11), it is well known that, in non- parametrical form, the inverse problem (3.19) also has an analytical solution. Any 13 function g(x) such that ∑ x∈X fmax g(x) = 1, (3.20) g(x) = 0, for all x ∈ X \ X fmax , (3.21) where X fmax = argmax x f (x) |x | , (3.22) minimizes the cross entropy for a given f (x). These solutions for the non-parametric problem, however, does not solve the spammer problem. None of the functions g(x) satisfying the above properties is guaranteed to be represented in the space of parametrized functions g(x ,τ) which makes them useless for generating x t(τ). Moreover, even if this difficulty did not exist, the diversity of the generated messages would be extremely low, because any of such g(x) leads to generating the same few messages from X fmax over and over again which makes spammer easily detectable. Empirical analog of the criterion (3.19) is bG(τ)≡∑ xk∈T 1 |x | g(xk,τ)→minτ , (3.23) where the sample T is obtained from a distribution P(ξ = x)∝ log 1f (x) . Therefore, in order to approach the inference problem in the form (3.23), it is necessary to have an auxiliary instrumental sample which, unlike training samples for the classes or the combined sample for the input source, cannot be observed in practice. 14 3.3. Instrumental Sampling Approach Let us introduce new parameters wi j such that τi j = exp(wi j)∑ a j∈A exp(wi j) , (3.24) where, as before, subscripts i and j correspond to some context ci ∈ AK and character a j ∈ A, respectively. For any values of wi j, the required conditions on τi j hold automatically: 0< τi j < 1 and ∑ a j∈A τi j = 1 (3.25) (0≤ τi j ≤ 1, if wi j = ±∞ are allowed). For the new parameters, the probability g(x ,τ) = ∏ ci∈AK a j∈A τ ni j(x) i j (3.26) changes to g(x ,τ(w)) = ∏ ci∈AK a j∈A  exp(wi j)∑ a j′∈A exp(wi j′) ni j(x) = ∏ ci∈AK 1 Zni(x)i exp ∑ a j∈A wi jni j(x)  = ∏ ci∈AK  1 Zi exp ∑ a j∈A wi j ni j(x) ni(x) ni(x) , (3.27) 15 where ni j(x) is, as usual, the number of occurrences of a substring cia j in x , and ni(x) = ∑ a j∈A ni j(x), (3.28) Zi(w) = ∑ a j∈A exp(wi j). (3.29) Now, let us calculate the gradient of the function g(x ,τ(w)) using the equality ∂ g(x ,τ(w)) ∂ wlk = g(x ,τ(w)) ∂ ∂ wlk log g(x ,τ(w)), (3.30) where log g(x ,τ(w)) = ∑ ci∈AK a j∈A ni j(x) log  exp(wi j) Zi(w)  = ∑ ci∈AK a j∈A wi jni j(x)− ∑ ci∈AK a j∈A ni j(x) log Zi(w). (3.31) Then, ∂ g(x ,τ(w)) ∂ wlk = g(x ,τ(w)) ∂ ∂ wlk •∑ ci∈AK a j∈A wi jni j(x)− ∑ ci∈AK a j∈A ni j(x) log Zi(w) ˜ = g(x ,τ(w))  nlk(x)− nl(x)Zl exp(wlk) ‹ = g(x ,τ(w))nl(x) bτlk(x)−τlk(w), (3.32) where bτlk(x) = nlk(x)nl(x) . (3.33) 16 Now, consider a problem of the form Eξ  F(ξ, w) ≈∑ xk∈T F(xk, w)→minw , (3.34) where the random variable ξ(w) is distributed and strings xk from an instrumental sample T are generated according to the probabilities g(x ,τ(w)). The problem (3.23) that has motivated us to consider this approach is a special case for F(ξ, w) = F(x) = 1 |x | log 1 f (x) . 
(3.35) Given that ∂ ∂ wlk Eξ  F(ξ, w)  = ∂ ∂ wlk •∑ x∈X F(x , w) g(x ,τ(w)) ˜ = ∑ x∈X ∂ ∂ wlk  F(x , w) g(x ,τ(w))  = ∑ x∈X • ∂ ∂ wlk F(x , w) g(x ,τ(w)) + F(x , w) ∂ ∂ wlk g(x ,τ(w)) ˜ = ∑ x∈X • ∂ ∂ wlk F(x , w) + F(x , w)nl(x) bτlk(x)−τlk(w)˜ g(x ,τ(w)) = Eξ • F(ξ, w)  ∂ ∂ wlk  log F(ξ, w)  + nl(ξ) bτlk(ξ)−τlk(w)‹˜ = Eξ • ∂ ∂ wlk F(ξ, w) + F(ξ, w)nl(ξ) bτlk(ξ)−τlk(w)˜, (3.36) from the necessary condition of extremum, we see that optimal w satisfies the equation Eξ • ∂ ∂ wlk F(ξ, w) + F(ξ, w)nl(ξ) bτlk(ξ)−τlk(w)˜= 0, (3.37) 17 which, in the case when F(x , w) = F(x) is independent of parameters, simplifies to Eξ ” F(ξ)nl(ξ) bτlk(ξ)−τlk(w)—= 0, (3.38) for all cl ∈ AK , ak ∈ A. (Both nl(x) and nlk(x) are random variables and, consequently, cannot be factored out of the expectation.) Since ξ ∼ g(x ,τ(w)), as the size of instrumental sample T grows, frequencies bτlk(x) converge to the current estimations τlk(w) that were used to generate the sample in the first place. For this reason, any attempt of iterative optimization of (3.34) turns into a random walk around initial values of wi j. Moreover, for many practical generation procedures it is true that Eξ bτlk(ξ)= τlk(w). (3.39) In a simplified case of both F(x) and nl(x) being independent of parameters w, which takes place when, for example, generation procedure stops after reaching the same length of x chosen beforehand, the equation (3.38) simply degenerates, and the problem becomes meaningless. If the function F preserves some dependence on parameters—either in the general form F(x , w), or in a weaker variant F(x(w))—the problem (3.38) is not strictly meaningless. However, for sufficiently long samples, as the difference |bτlk(ξ)−τlk(w)| approaches zero, the influence of the F(x , w)-multiplier becomes effectively eliminated making the expectation (3.38) almost independent of F . For this reason, we consider approaching problem (3.19) as (3.34) unpromising. 18 3.4. Importance Sampling Approach Formally, we consider a vector of parameters τ to be a solution to the inverse problem, if FD[q(u,θ )]≡F [q(u,θ ) | u =ψ(z, x), x ∈ D]→max τ , (3.40) where D is a set of text strings, the domain, and F (·) is an ensemble operation defined on D. For example, the domain D might be the set of all strings of some bounded length, or some subset of that set. An empirical sample of strings produced by the generator used by the adversary can also be taken as the domain Dτ = {x t(τ)}t . The choice of ensemble operation depends on what criterion of success aligns best with the goals of the spammer in a particular problem setting. Let us consider some of them. (a) For all x ∈ Dτ, messages u =ψ(z, x) are successfully pass the spam filter: FDτ  q(u,θ )  = min x∈Dτ 1(H)(u)→max τ , (3.41) where 1(H)(u)≡ 1q(u,θ )< α= 1log f (S)(u)− log f (H)(u)< α|u|. (3.42) (b) As many messages x ∈ Dτ,l = {x ∈ Dτ | |x | ≤ l} of a bounded length l are successfully pass the spam filter: FDτ,l  q(u,θ )  = ∑ x∈Dτ,l 1(H)(u)→max τ . (3.43) 19 (c) Empirical frequency of passing the spam filter successfully estimated over a sample Dτ is maximal: FDτ  q(u,θ )  = 1 |Dτ| ∑ x∈Dτ 1(H)(u)→max τ . (3.44) (d) The average logarithmic ratio of probabilities q(u,θ ) estimated over a sample Dτ is as minimal as possible: FDτ  q(u,θ )  = −∑ x∈Dτ q(u,θ )→max τ , (3.45) or ∑ x∈Dτ q(u,θ )→min τ . (3.46) Criterion (a) is too optimistic and requires the acceptance of the implicit assumption that there exists a vector τ which guarantees that all messages pass the filter. 
Solution of the problem in the sense of this criterion, generally, is unlikely to exist (unless the generator algorithm is complemented with constraints significantly restricting diversity of generated strings). Criterion (b) does not take probabilities of strings x into account, while it would be worthwhile to ignore strings with zero or close to zero probabilities. We consider criteria (c) and (d) to be more appropriate. Let us discuss the latter objective before the former. 3.4.1. Entropy-Based Criterion Empirical criterion (3.45) is equivalent to the optimization problem Fq(u,θ )=∑ x∈X q(u,θ ) g(x | z,τ) = Eξ  q(u,θ ) →min τ , (3.47) 20 where the expected value is taken over the probability distribution g(x | z,τ) of text x being generated for the base string z and parameters τ. Let us now rearrange the sum in (3.47) using the well known technique of importance sampling: R(τ | z) = Eξ  q(u,θ )  = ∑ x∈X q(u,θ ) g(x | z,τ)  γ(H) p(H) f (H)(x) p(H) f (H)(x) + γ(S) p(S) f (S)(x) p(S) f (S)(x)  = ∑ κ∈{H,S} p(κ) E(κ) ξ  γ(κ) q(ξ | z,θ ) g(ξ | z,τ) p(κ) f (κ)(ξ)  = Eξ,κ  q(ξ | z,θ ) g(ξ | z,τ) p(κ) f (κ)(ξ)  = Eξ,κ ” W (κ)(ξ | z,θ ) g(ξ | z,τ)—→min τ , (3.48) for W (κ)(x | z,θ ) = γ(κ) q(x | z,θ ) p(κ) f (κ)(x) , (3.49) where for each class κ ∈ {H,S}, expected value E(κ) ξ [ · ] denotes conditional expectation Eξ[ · | ξ ∼ f (κ)(x)], p(κ) stands for the a priori probability of the class κ, and γ(H), γ(S) are arbitrary splitting weights such that γ(H) + γ(S) = 1 (for example, γ(H) = γ(S) = 12 or γ(H) = p(H), γ(S) = p(S)). In this problem setting, all statistical information that can be available to the adversary—that is, both samples T (H) and T (S) of Ham and Spam messages—are used: R(τ | z)≈ bR(τ | z) =∑ (xk ,κk)∈T W (κk)(xk | z, θ ) g(xk | z,τ)→min τ . (3.50) 21 Here κk are true labellings of messages xk from the sample T which is the union of samples T (H) and T (S) that are assumed to be drawn from distributions f (H)(x) and f (S)(x), respectively. Due to the necessary condition of extremum, we have the equation: ∂ ∂ τi j R(τ | z) = ∂ ∂ τi j Eξ,κ ” W (κ)(ξ | z,θ ) g(ξ | z,τ)— = Eξ,κ ” W (κ)(ξ | z,θ ) ∂ ∂ τi j g(ξ | z,τ)—= 0. (3.51) Since it has the form E[ · ] = 0, it is natural for us to apply the method of stochastic optimization (Robbins & Monro, 1951). Switching to the parameters wi j that introduced in (3.24), we obtain the stochastic algorithm w(t+1)i j = w (t) i j − γtW (κk(t))(z, xk(t)) g(xk(t) | z,τ(w(t))) · ni(xk(t) | z) bτi j(xk(t) | z)−τi j(w(t)), (3.52) where γt is a series satisfying the properties γt ≥ 0, for all t ≥ 0, (3.53) ∞∑ t=0 γt =∞, (3.54) ∞∑ t=0 γ2t <∞, (3.55) and xk(t) and κk(t) run through the sample T (potentially repeatedly) in some order defined by k(t). 22 3.4.2. Probability-Based Criterion Obviously, objective function (3.44) is the empirical version of the criterion Fq(u,θ )=∑ x∈X 1(H)(u) g(x | z,τ) = Eξ  1(H)(u) →max τ , (3.56) where ξ ∼ g(x | z,τ) and, as in the previous section, g(x | z,τ) is generational probability distribution for base string z and parameters τ. This criterion, in turn, makes the problem be equivalent to maximizing the probability of the transformed message ψ(z,ξ) passing the spam filter: R(τ) = Pr  1(H)(ψ(z,ξ)) | z,τ→max τ . (3.57) Since we have only two classes, maximization of the criterion (3.56), R(H)(τ) = ∑ x∈X 1(H)(ψ(z, x)) g(x | z,τ), (3.58) is equivalent to minimization of the dual criterion R(S)(τ) = ∑ x∈X 1(S)(ψ(z, x)) g(x | z,τ), (3.59) where 1(S)(u) = 1− 1(H)(u). 
Let us combine both of them into a single problem R(τ)≡ γ(H)R(H)(τ)− γ(S)R(S)(τ) (3.60) = ∑ x∈X € γ(H)1(H)(ψ(z, x))− γ(S)1(S)(ψ(z, x))Šg(x | z,τ)→max τ , (3.61) where γ(H) and γ(S) are some splitting weights such that γ(H) + γ(S) = 1. 23 3.4.2.1. Supervised Learning Formally rearranging the criterion function (3.60) into two sums and applying the importance sampling for the distribution of the pair (ξ,κ), we see that R(τ) = ∑ x∈X € γ(H)1(H)(ψ(z, x))− γ(S)1(S)(ψ(z, x))Šg(x | z,τ) = ∑ κ∈{H,S} p(κ) ∑ x∈X W (κ)(z, x) f (κ)(x) g(x | z,τ) = Eξ  W (κ)(z,ξ) g(ξ | z,τ)→max τ , (3.62) for the random variable ξ distributed according to f (κ)(x) and W (κ)(z, x) = γ(H)1(H)(ψ(z, x))− γ(S)1(S)(ψ(z, x)) f (κ)(x) = 1(H)(ψ(z, x))− γ(S) f (κ)(x) . (3.63) Assuming that a sample of messages xk ∈ T is available together with true labeling of classes κk = κ(xk), the criterion R(τ) can be estimated as R(τ)≈ bR(τ)≡∑ (xk ,κk)∈T W (κk)(z, xk) g(xk | z,τ). (3.64) For the parameters wi j that have been introduced earlier this chapter through the equality τi j = exp(wi j)∑ a j∈A exp(wi j) , (3.65) 24 we obtain that ∂ R(w) ∂ wi j = Eξ  W (κ)(z,ξ) g(ξ | z,τ(w))  ni j(ξ | z)− ni(ξ | z)exp(wi j)∑ al∈A exp(wil)  = Eξ ” W (κ)(z,ξ) g(ξ | z,τ(w))ni(ξ | z) bτi j(ξ | z)−τi j(w)—, (3.66) where, as in previous sections, bτlk(x) = nlk(x)nl(x) , (3.67) and ni(x | z) and ni j(x | z) stand for the number of occurrences of the context ci ∈ AK followed by any character and followed by the character a j ∈ A, respectively, in the text x appended to the message z. Thus, the stochastic approximation algorithm for the necessary condition of the extremum, ∂ R(w) ∂ wi j = 0, (3.68) takes the form w(t+1)i j = w (t) i j + γtW (κk(t))(z, xk(t)) g(xk(t) | z,τ(w(t))) · ni(xk(t) | z) bτi j(xk(t) | z)−τi j(w(t)). (3.69) 3.4.2.2. Unsupervised Learning In case when true labeling κk of messages xk from sample T is unknown, we can alternatively do the importance sampling for the distribution f (x) = p(H) f (H)(x) + p(S) f (S)(x). (3.70) 25 Then R(τ) = ∑ x∈X γ(H)1(H)(ψ(z, x))− γ(S)1(S)(ψ(z, x)) f (x) f (x) g(x | z,τ) = Eξ  W (z,ξ) g(ξ | z,τ)→max τ , (3.71) where the random variable ξ is distributed in accordance with f (x), and W (z, x) = γ(H)1(H)(ψ(z, x))− γ(S)1(S)(ψ(z, x)) f (x) = 1(H)(ψ(z, x))− γ(S) p(H) f (H)(x) + p(S) f (S)(x) . (3.72) Since the criteria (3.71) and (3.62) differ only in definition of the weights W (z, x) which are independent of wi j, the resulting stochastic optimization algorithm is exactly the same as in (3.69) (again, up to differences between W (z, x) and W (κ)(z, x)). 3.5. Likelihood-Based Criterion Let us again consider the transformation u =ψ(z, x) of a message z with an arbitrary string x . Entropy per character of the resulting string u can be estimated empirically as H(u | τ) = − 1|u| log  |u|∏ l=1 g(ul | ul−1l−K ,τ) ‹ = − 1|u| ∑ ci∈AK a j∈A ni j(u) logτi j. (3.73) Averaged over random transformed messages u, it is equal to H(τ) = Eu  H(u | τ)= −∑ ci∈AK pi ∑ a j∈A p j|i logτi j = − ∑ ci∈AK pi Hi(τ), (3.74) 26 where Hi(τ) = ∑ a j∈A p j|i logτi j, (3.75) pi is the probability of the context ci occurring in a random transformed message u, and p j|i is the conditional probability of character a j occurring in u after the context ci. The function Hi(τ), in turn, can be estimated on a single message u as Hi(τ)≈ ÒHi(u | τ) =∑ a j∈A ni j(u) ni(u) logτi j. 
(3.76) Assuming that we have available a sample T of messages x out of the universal space X , we can split T into auxiliary samples depending on to what class ψ(z, x) is assigned by the classifier: T (κ) = {xk ∈ T | 1(κ)(ψ(z, xk) | θ ) = 1}, (3.77) where, as in previous sections, 1(H)(x | θ ) = 1[q(x ,θ )< α], (3.78) 1(S)(x | θ ) = 1[q(x ,θ )≥ α]. (3.79) That is, T (H) and T (S) consist of messages xk ∈ T that make the base message z being recognized as Ham and Spam, respectively. 27 Considering these samples, we can generalize the estimate ÒH(u | τ) to the estimates over samples T (H) and T (S): R(H)(τ | z) = − 1|T (H)| ∑ xk∈T (H) ∑ ci∈AK pi ÒHi(ψ(z, xk) | τ), (3.80) R(S)(τ | z) = − 1|T (S)| ∑ xk∈T (S) ∑ ci∈AK pi ÒHi(ψ(z, xk) | τ). (3.81) Then, we can state our goal in a new way: Find parameters τ such that the entropy estimate R(H)(τ | z) becomes low, while the estimate R(S)(τ | z) remains high. One way to achieve these goals simultaneously is to formalize them as a problem of minimization the difference of the above objective functions: R(τ) = R(H)(τ | z)− R(S)(τ | z)→min τ , (3.82) subject to usual normalization requirements τi j ≥ 0 and ∑ a j∈A τi j = 1, (3.83) for all contexts ci ∈ AK and all characters a j ∈ A. 28 Substituting the entropy estimation (3.76) definition into the criterion (3.82), we have: R(τ | z) = R(H)(τ | z)− R(S)(τ | z) = − 1|T (H)| ∑ uk∈U (H) ∑ ci∈AK pi ÒHi(u | τ) + 1|T (S)| ∑ uk∈U (S) ∑ ci∈AK pi ÒHi(u | τ) = −∑ ci∈AK pi  1 |T (H)| ∑ uk∈U (H) ÒHi(u | τ)− 1|T (S)| ∑ uk∈U (S) ÒHi(u | τ)‹ = −∑ ci∈AK pi ∑ a j∈A (ν(H)i j − ν(S)i j ) logτi j →minτ , (3.84) for U (κ) = {ψ(z, xk) | xk ∈ T (κ)}, (3.85) ν (κ) i j = 1 |T (κ)| ∑ uk∈U (κ) ni j(uk) ni(uk) . (3.86) Since parameters τi j occur only in summands for the context ci, optimization (3.84) naturally falls into |AK | smaller problems: Ri(τ | z) = R(H)i (τ | z)− R(S)i (τ | z)→ min{τi1,τi2,...,τi|A|}, (3.87) where R(κ)i (τ | z) = − ∑ a j∈A ν (κ) i j logτi j. (3.88) Lemma 1. The objective function Ri(τ) = − ∑ j∈J νi j logτi j, (3.89) 29 where weights νi j ≥ 0 for any j ∈ J, and parameters τi j are subject to constraints τi j ≥ 0 and ∑ j∈J τi j = s > 0, (3.90) reaches its minimum value at τ∗i j = s νi j νi , (3.91) where νi = ∑ j∈J νi j. (3.92) Proof. Considering that logε≤ (ε− 1) for any ε > 0, we see that for an arbitrary vector of parameters τ, Ri(τ ∗)− Ri(τ) = − ∑ j∈J νi j log sνi j νi + ∑ j∈J νi j logτi j = ∑ j∈J νi j log τi jνi sνi j ≤∑ j∈J νi j  τi jνi sνi j − 1 ‹ = νi s ∑ j∈J τi j − ∑ j∈J νi j = 0, (3.93) by definition of νi and normalization requirements on τ. From the obtained relation it immediately follows that Ri(τ∗)≤ Ri(τ) for any τ. Lemma 2. The objective function Ri(τ) = − ∑ j∈J νi j log 1 τi j , (3.94) 30 where weights νi j ≥ 0 for any j ∈ J, and parameters τi j are subject to constraints τi j ≥ 0 and ∑ j∈J τi j = s > 0, (3.95) reaches its minimum value at τ∗i j =  s |Jmini | , if j ∈ J min i ; 0, if j ∈ J \ Jmini ; (3.96) where Jmini = { j ∈ J | νi j = minj∈J νi j}. (3.97) Proof. Let us consider the following values of parameters under the temporary assumption that τi j ≥ " for some arbitrarily small " > 0 and all j ∈ J . τ∗i j(") =  s 1− (|J | − |Jmini |)" |Jmini | , if j ∈ J min i ; s", if j ∈ J \ Jmini . (3.98) We can assume that ∑ j∈J\Jmini νi j > 0, which is always true unless J min i = J . It is clear that for smaller values of " criterion function Ri(τ∗(")) also gets smaller values. Ri(τ ∗(")) = − ∑ j∈Jmini νi j log |Jmini | s 1− (|J | − |Jmini |)"  − ∑ j 6∈Jmini νi j log 1 s" . 
(3.99) 31 Therefore, passing to the limit, we can make the criterion arbitrarily small while approaching the desired solution τ∗: lim "→0 Ri(τ ∗(")) = −∞, (3.100) lim "→0τ ∗(") = τ∗. (3.101) Generally speaking, this solution is not unique: any distribution of the probability mass across τ∗i j for j ∈ Jmini minimizes the criterion. However, one solution is sufficient for our purposes. Theorem 3. The criterion function Ri(τ | z) = − ∑ a j∈A (ν(H)i j − ν(S)i j ) logτi j (3.102) subject to constraints τi j ≥ 0 and ∑ a j∈A τi j = 1, (3.103) reaches its minimum value at τ∗i j = µi j µi , (3.104) where µi j = max{0,ν(H)i j − ν(S)i j }, (3.105) µi = ∑ a j∈A µi j. (3.106) Proof. Let us divide the sum in the objective function Ri(τ | z) into the following three sums over disjoint subsets of indices according to the sign of the difference δi j ≡ 32 ν (H) i j − ν(S)i j : Ri(τ | z) = − ∑ a j∈A δi j logτi j = −∑ j∈J+1i δi j logτi j − ∑ j∈J−1i δi j logτi j − ∑ j∈J0i δi j logτi j = −∑ j∈J+1i δi j logτi j − ∑ j∈J−1i (−δi j) log 1 τi j , (3.107) where Jσi = { j | a j ∈ A∧ sgn(ν(H)i j − ν(S)i j ) = σ}. (3.108) Similarly to Jσi , let sσi = ∑ j∈Jσi τi j, (3.109) s+i + s − i + s 0 i = 1. (3.110) Clearly, the problem of finding optimal τi j can be solved separately for each of the sums in (3.107). – For the first sum R+i (τ) = − ∑ j∈J+1i δi j logτi j, (3.111) conditions of the Lemma 1 hold for J = J+1i , νi j = δi j, and s = s + i . Consequently, the function R+i (τ) is minimized for τ∗i j = s+i δi j∑ l∈J+1i δil , j ∈ J+1i . (3.112) Notice that the greater the sum s+i becomes, the lesser is the minimal value R + i (τ ∗). 33 – For the second sum R−i (τ) = − ∑ j∈J+1i (−δi j) log 1 τi j , (3.113) conditions of the Lemma 2 hold for J = J−1i , νi j = −δi j, and s = s−i . As we have shown in Lemma 2, when parameters are bounded below by some arbitrarily small " > 0, the function R−i (τ) is minimized for τ∗i j(") =  s−i 1− (|A| − |Jmini |)" |Jmini | , if j ∈ J min i ; s−i ", if j ∈ J−1i \ Jmini ; (3.114) Jmini = { j ∈ J−1i | −δi j = min l∈J−1i (−δil) = −max l∈J−1i (δil)}. (3.115) Notice that, since τi j occurs in R − i (τ) inversed, unlike in R + i (τ), the lesser the sum s−i becomes, the lesser is the minimal value R − i (τ ∗). – For the indices j ∈ J0i , the choice of τi j is irrelevant and does not change the value of Ri(τ | z) regardless of the magnitude of s0i . In order to combine the independent solutions (3.112) and (3.114) optimizing the separate sums, it is necessary to determine in which proportion should the probability mass be distributed between parameters belonging to J+1i , J −1 i , and J 0 i . As we have seen above, for the minimal value (as a function of the bound ") to be the smallest, s+i has to be as large as possible, while both s−i and s 0 i , to the contrary, have to be as small as 34 possible. Therefore, the optimal proportion for the parameters bounded below by " is s0i = |J0i |", (3.116) s−i = |J−1i |", (3.117) s+i = 1− s−i − s0i . (3.118) The corresponding parameters are then τ∗i j(") =  δi j∑ l∈J+1i δil 1− (|A| − |J+1i |)"  , if j ∈ J+1i ; ", if j ∈ J−1i ∪ J0i . (3.119) Passing to the limit for "→ 0, we finally obtain the parameters that deliver minimum to the function Ri(τ | z): τ∗i j = lim"→0τ ∗ i j(") =  δi j∑ l∈J+1i δil , if j ∈ J+1i ; 0, if j ∈ J−1i ∪ J0i ; = max{0,δi j}∑ a j∈A max{0,δi j} = µi j µi . (3.120) 3.6. 
Generalization for Multiple Base Messages Throughout this chapter we have considered the method for a single arbitrary base message z that is chosen beforehand. However, with minor modifications, the presented reasoning holds for the same criteria but averaged over multiple base messages zl ∈ Z . Indeed, for both approaches described in section 3.4 resulting in stochastic optimization, 35 the only change averaging over Z makes is that the variable z, as well as x , runs over a sample on iterations: w(t+1)i j = w (t) i j + γtW (κk(t))(zl(t), xk(t)) g(xk(t) | zl(t),τ(w(t))) · ni(xk(t) | zl(t)) bτi j(xk(t) | zl(t))−τi j(w(t)). (3.121) The same is true for the likelihood-based approach discussed in section 3.5. If the criterion (3.82) is averaged over a set of base messages Z , the order of summation can be seamlessly changed so that the new outer sum over zl ∈ Z , together with the sum over uk ∈ U , is taken before the sum over ci ∈ AK and a j ∈ A. Consequently, the resulting objective function takes the same form (3.84) but for ν (κ) i j = 1 |Z | |T (κ)| ∑ zl∈Z ∑ uk∈T (κ) ni j(ψ(zl , xk)) ni(ψ(zl , xk)) . (3.122) 36 CHAPTER IV EVALUATION 4.1. Methodology In order to validate the method proposed in this research, first we implemented the entropy classifier for the problem of spam filtering. Following the definition of the problem given in section 3.1.4, our implementation uses the algorithm of prediction by partial matching (PPM) to learn the finite memory Markov models for each of the two classes, and then makes classifying decisions depending on for which class entropy per character is the minimal. We also implemented all three of the algorithms proposed in sections 3.4.1, 3.4.2, and 3.5. Our numerical experiments were organized as follows. For each run of evaluation, first, a combined sample T of both legitimate (T (H)) and spam (T (S)) messages was drawn out of the SpamAssassin public corpus (Apache SpamAssassin Project, 2005). Each message in xk ∈ T was accompanied with the true class labelling κk ∈ {H,S}. The sample T was additionally temporarily split at random in proportion seven to three into the training and testing samples, respectively. The former was used to train the classifier, the latter was used to ensure that performance of the classifier is within the expected boundaries (as compared, for example, to (Bratko, Cormack, et al., 2006)). All of the spam messages in T that were recognized as such according to the obtained class parameters θ (H) and θ (S), were remembered and declared to be the set of base messages Z . Then, our algorithms (3.52), (3.69), and (3.104) were run on the combined sample T in order to obtain transformation parameters τ(E), τ(P), and τ(L), correspondingly. The 37 first two algorithms based on the stochastic optimization were repeatedly run over all pairs (zl , xk) ∈ Z × T , where the index k was incremented first. To control the convergence, after each pass over T (i.e. every |T | iterations), the value of the criterion function corresponding to the current algorithm was estimated using a ten percent subsample of Z × T . This estimation together with the total number of iterations performed by the moment were used to make a stopping decision. Once in a several passes over T (between |T | and 10 |T | iterations, depending on the size of the problem), the current parameters τ(t) = τ(w(t)) were supplied to the Markov chain generator. 
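For reference, a single iteration of the generalized update (3.121) can be sketched as follows; the scalar W^(κ)(z_l, x_k) · g(x_k | z_l, τ(w)) is taken as an input, and the computation of that weight, the handling of its very small magnitude in practice, and the choice of the step schedule γ_t are deliberately left out, so this is a schematic illustration rather than the exact implementation used in the experiments:

```python
import math

def softmax_row(w_i: dict) -> dict:
    """tau_ij = exp(w_ij) / sum_j' exp(w_ij') for a single context c_i, as in (3.24)."""
    m = max(w_i.values())
    exps = {j: math.exp(v - m) for j, v in w_i.items()}
    z = sum(exps.values())
    return {j: e / z for j, e in exps.items()}

def stochastic_step(w: dict, appendix_counts: dict, weight: float, gamma_t: float) -> dict:
    """One iteration of the update (3.121).

    w:               context -> char -> w_ij (assumed initialized for every relevant context)
    appendix_counts: context -> char -> n_ij(x_k | z_l) counted in the sampled appendix
    weight:          the scalar W^(kappa)(z_l, x_k) * g(x_k | z_l, tau(w)) for this sample
    gamma_t:         step size from a schedule satisfying (3.53)-(3.55)

    The sign is "+" as in the maximization form (3.69)/(3.121); the minimization
    form (3.52) uses "-" instead.
    """
    for ci, char_counts in appendix_counts.items():
        n_i = sum(char_counts.values())
        tau_i = softmax_row(w[ci])
        for aj in w[ci]:
            tau_hat = char_counts.get(aj, 0.0) / n_i
            w[ci][aj] += gamma_t * weight * n_i * (tau_hat - tau_i[aj])
    return w
```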
For each base message z ∈ Z, the generator produced a continuation stream of characters distributed according to g(x, τ(t)), which was stopped as soon as the string x̃ of characters produced so far was enough to get the transformed message u = ψ(z, x̃) = zx̃ past the classifier's spam filter. If the length of x̃ exceeded 20 · |z|, the generator was forcefully stopped. This way, for each z, a thousand continuations x̃ were generated to estimate a secondary evaluation measure, the average length of x̃ required to make z legitimate to the classifier.

The third algorithm (3.104) required less work, since it provides an analytical solution once the values µij are calculated. To do so, a single pass of averaging nij(ψ(zl, xk)) over the samples Z × T(H) and Z × T(S) was performed. After that, the same generation procedure described above was run, so that there was an auxiliary measure for comparing this algorithm with the other two and with the baseline strategy. The role of the baseline strategy in our experiments was played by the same generation procedure called with the vector of parameters τ(H) = θ(H) estimated during the training of the classifier on the sample of legitimate messages T(H). That same vector θ(H) also served as the initial estimate for the stochastic optimization.

TABLE 4.1. Performance of the classifier on the full dataset.

  True class    Classified as Ham    Classified as Spam
  Ham           68.0% (1645)         0.5% (13)
  Spam          1.5% (36)            30.0% (725)

FIGURE 4.1. Histogram of lengths of all messages in the full dataset (message length in bytes versus frequency).

4.2. Results

Due to limited computational resources, during all evaluation runs the memory of the Markov models was fixed to three characters for the classifier as well as for the adversary. In practice, entropy-based spam filters demonstrate the best performance for Markov model orders between six and eight characters (Bratko, Cormack, et al., 2006). However, even for K = 3, our implementation of the classifier based on the algorithm of prediction by partial matching has an error rate of approximately 2% on the SpamAssassin dataset. Table 4.1 shows statistics of one such run, in which all 6046 bodies of email messages were split into 3627 training and 2419 testing messages. The distribution of the lengths of the messages is shown in Figure 4.1.

For the order K = 3, the size of the space of parameters τ, representing conditional probabilities of a one-byte character given a context of at most K other one-byte characters, is bounded by

    ∑_{k=0}^{K} |A|^{k+1} = 256^1 + 256^2 + 256^3 + 256^4 ≈ 2^{32}.    (4.1)

The total number of character-context combinations for K = 3 that actually occur in all messages from the SpamAssassin dataset is approximately 524 000. To avoid memory pressure and to achieve faster convergence, the algorithms (3.52) and (3.69), which require stochastic optimization, were run on a series of small subsets of the original dataset. Each time, approximately one percent of the messages were sampled at random from the full dataset.

Let us present evaluations for a typical such run on a 1% dataset, carried out for the three algorithms as described in the previous section. The failure rate of the chosen concatenation-based transformation was zero for all spam messages and for the parameters τ obtained from all three algorithms as well as for the Ham baseline τ(H). That is, it was possible to generate an appendix xk for each base spam message zl such that their concatenation uk = ψ(zl, xk) = zl xk was classified as legitimate.
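As for the third algorithm, a minimal sketch of the single averaging pass (3.122) and of the closed-form parameters (3.104) mentioned above could look as follows; the split of the sample into appendices whose concatenation with a base message is accepted (T(H)) or rejected (T(S)) by the classifier is assumed to have been computed beforehand, and all names are illustrative:

```python
from collections import defaultdict

def averaged_nu(bases, appendices, k: int = 3) -> dict:
    """nu_ij of (3.122): n_ij / n_i in psi(z, x) = zx, averaged over all pairs (z, x)."""
    nu = defaultdict(lambda: defaultdict(float))
    pairs = [(z, x) for z in bases for x in appendices]
    for z, x in pairs:
        u = z + x
        counts = defaultdict(lambda: defaultdict(float))
        for l, ch in enumerate(u):
            counts[u[max(0, l - k):l]][ch] += 1.0
        for ci, cc in counts.items():
            n_i = sum(cc.values())
            for aj, n_ij in cc.items():
                nu[ci][aj] += n_ij / (n_i * len(pairs))
    return nu

def likelihood_optimal_tau(bases, appendices_ham, appendices_spam, k: int = 3) -> dict:
    """Closed-form parameters (3.104): tau_ij = max(0, nu_H - nu_S) / normalizer."""
    nu_h = averaged_nu(bases, appendices_ham, k)
    nu_s = averaged_nu(bases, appendices_spam, k)
    tau = {}
    for ci in set(nu_h) | set(nu_s):
        chars = set(nu_h[ci]) | set(nu_s[ci])
        mu = {aj: max(0.0, nu_h[ci].get(aj, 0.0) - nu_s[ci].get(aj, 0.0)) for aj in chars}
        total = sum(mu.values())
        if total > 0.0:
            tau[ci] = {aj: m / total for aj, m in mu.items()}
    return tau
```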
Since the failure rate alone does not distinguish between the methods, to compare performance for different parameters we used a supplementary index: the ratio |uk|/|zl| of the length of each transformed message to the length of its base message. Note that none of the methods proposed in this work was constructed to optimize this length ratio directly.

Figures 4.2, 4.3, and 4.4 depict the distributions of length ratios averaged over transformation appendices xk generated according to the parameters τ(E), τ(P), and τ(L) that were optimized for the entropy-based, probability-based, and likelihood-based criteria, respectively. As is easy to see from the cumulants provided to the right of each histogram, the algorithms using the probability-based and likelihood-based criteria are preferable to the one using the entropy-based criterion in terms of the length ratio. However, comparing these graphs with Figure 4.5, which shows the histogram and cumulant for the parameters τ(H), all three techniques provide noticeably better performance than the baseline of generating a Ham-like appendix. A short summary of the statistics from these figures is given in Table 4.2.

FIGURE 4.2. (a) Histogram of the length ratio |ψ(zl, xk)|/|zl| averaged over appendices xk generated using the parameters τ(E) optimized for the entropy-based criterion (3.48) on a 1% dataset; (b) its cumulant.

FIGURE 4.3. (a) Histogram of the length ratio |ψ(zl, xk)|/|zl| averaged over appendices xk generated using the parameters τ(P) optimized for the probability-based criterion (3.60) on a 1% dataset; (b) its cumulant.

FIGURE 4.4. (a) Histogram of the length ratio |ψ(zl, xk)|/|zl| averaged over appendices xk generated using the optimal parameters τ(L) (3.104) for the likelihood-based criterion (3.84) on a 1% dataset; (b) its cumulant.

FIGURE 4.5. (a) Histogram of the length ratio |ψ(zl, xk)|/|zl| averaged over appendices xk generated using the baseline parameters τ(H) estimated from θ(H) on a 1% dataset; (b) its cumulant.

TABLE 4.2. Summary statistics for the performance of different generation parameters on 1% datasets.

  Index                    Entropy    Probability    Likelihood    Ham baseline
  Failure rate             0%         0%             0%            0%
  Averaged length ratio:
    Minimum                1.21       1.15           1.16          1.32
    5% quantile            1.29       1.26           1.26          1.52
    Median                 1.58       1.49           1.52          2.06
    Mean                   1.65       1.59           1.57          2.13
    95% quantile           2.26       2.35           2.13          3.27
    Maximum                2.89       3.14           2.61          4.27

Figure 4.6 shows several examples of generated transformation texts for a few short spam messages from the dataset. Each of the original spam messages (typeset on a white background) is followed by strings produced by the generation procedure according to the optimal parameters (highlighted with a gray background). Any of the presented appendices is sufficient to make the corresponding spam message look like a legitimate text, and cannot be shortened without changing the class of the transformed message back to spam.
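The generation procedure of section 4.1 that produces such appendices can be sketched as follows; the back-off to shorter contexts for unseen contexts is an illustrative simplification rather than the exact implementation used in the experiments:

```python
import random

def generate_appendix(z: str, tau: dict, classify, k: int = 3, max_factor: int = 20) -> str:
    """Sample a continuation character by character from g(. | context, tau)
    until the concatenation u = z + appendix passes the filter or the length cap is hit.

    tau maps a context (up to k characters) to a dict of character probabilities;
    classify(u) returns "H" or "S".
    """
    u = z
    while len(u) - len(z) < max_factor * len(z):
        context = u[-k:]
        dist = None
        for start in range(len(context) + 1):   # back off to shorter contexts if necessary
            dist = tau.get(context[start:])
            if dist:
                break
        if not dist:
            break                                # nothing left to sample from
        chars, probs = zip(*dist.items())
        u += random.choices(chars, weights=probs, k=1)[0]
        if classify(u) == "H":
            break
    return u[len(z):]
```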
For the purpose of comparing the systems of probability distributions resulting from the aforementioned attack techniques, we measured the difference between the two probability distributions defined by parameters τ(1)i and τ(2)i for each context ci ∈ A^K. To do so, we used the distance function

    d(\tau^{(1)}_i, \tau^{(2)}_i) = \sum_{a_j \in A} \bigl| \tau^{(1)}_{ij} - \tau^{(2)}_{ij} \bigr|.    (4.2)

    Hi we are luke's secret following we love luke fictitious! We are also your
    long lost friend! Hi This email has nothing to do with lukefictitious.com
    We wil be putting up our very own fan site soon and wanted to let you know
    in advance! Have a beautifull day! Joseph

    Regard E __________ Exm

    Hey, I just wanted to tell you about a GREAT website. http://www.metrojokes.com
    Features lots of jokes! Extremly unique features and classified in categories.
    I appriciate your time. Thank you

    your loved one out of your typical diam - No, the out their until your typi
    from you're decent? I ass, but the sun doesn't fat able to be and wonderful

    >> (suddenlysusan@Stoolmail.zzn.com) on Tuesday, July 30, 2002 at 17:07:56 :
    Why Spend upwards of $4000 on a DVD Burner when we will show you an
    alternative that will do the exact same thing for just a fraction of the
    cost? Copy your DVD's NOW.

    This? I It spamassin-dev -- "If you."
    This a multi-part, surround you're du
    This a must IM. Build searcharself with think?
    This a deady to be looking of some merge.net

    DON'T MISS OUT ON AN AMAZING BUSINESS OPPORTUNITY AND WEIGHT LOSS PRODUCT!
    PLEASE VISIT www.good4u.autodreamteam.com THERE IS NO OBLIGATION AND IT'S
    WORTH A LOOK!

    Remore > OK guys -- I r md: rules. > > with smart_0088 [evel Yet emailename="smime
    > OK guys -- I reck_f > > BSMTP-support people
    > OK guy

FIGURE 4.6. Examples of original spam messages zl (white background) and several appendices xk corresponding to each zl, generated using parameters τ(E) optimized on a 1% dataset (gray background).

TABLE 4.3. Summary statistics for the performance of the generation parameters optimal for the likelihood-based criterion on the full dataset.

Index                      Likelihood optimization   Ham baseline
Failure rate               0%                        0.16%
Averaged length ratio
  Minimum                  1.001                     1.001
  5% quantile              1.068                     1.248
  Median                   1.184                     1.704
  Mean                     1.232                     1.976
  95% quantile             1.594                     4.101
  Maximum                  2.929                     8.498

Figure 4.9 features the distributions of the distances d(τ(1)i, τ(2)i) for all pairs of parameters τ(1)i, τ(2)i ∈ {τ(E)i, τ(P)i, τ(L)i, τ(H)i} fitted on a 1% dataset using the three algorithms as well as the baseline Ham parameters. A distance of d(τ(1)i, τ(2)i) = 0 indicates that the distributions τ(1)i and τ(2)i are identical, while larger distances correspond to greater disagreement between them.

Since the likelihood-based algorithm (3.104) does not have computational requirements as high as the other two algorithms, which resort to stochastic optimization, it was possible for us to run it on the full dataset. The resulting distributions of the length ratio for the parameters optimal for the likelihood-based criterion and for the baseline Ham parameters are presented in Figures 4.7 and 4.8, in the same fashion as for the 1% dataset shown above. Table 4.3 lists the same five quantiles of the length ratio as well as its mean value. Comparing these statistics with those in Table 4.2, we can conclude that the gap between the Ham-like generation and the likelihood-based algorithm is even greater (≈ 70% vs. ≈ 50%).
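A direct implementation of the distance (4.2) defined above might look as follows. For illustration only, each conditional distribution is assumed to be stored as a dictionary mapping characters to probabilities for a fixed context; this representation is ours and not taken from the thesis code.

    def distance(tau1_i, tau2_i):
        """Distance (4.2) between two conditional character distributions for a single
        context c_i; each argument maps a character a_j to its probability tau_ij."""
        alphabet = set(tau1_i) | set(tau2_i)
        return sum(abs(tau1_i.get(a, 0.0) - tau2_i.get(a, 0.0)) for a in alphabet)

    # Toy example with a small slice of the alphabet:
    p = {"a": 0.7, "b": 0.2, "c": 0.1}
    q = {"a": 0.5, "b": 0.3, "d": 0.2}
    print(distance(p, q))   # approximately 0.6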
FIGURE 4.7. (a) Histogram of the length ratio |ψ(zl, xk)|/|zl| averaged over appendices xk generated using the optimal parameters τ(L) (3.104) for the likelihood-based criterion (3.84) on the full dataset; (b) its cumulant.

FIGURE 4.8. (a) Histogram of the length ratio |ψ(zl, xk)|/|zl| averaged over appendices xk generated using the baseline parameters τ(H) estimated from θ(H) on the full dataset; (b) its cumulant.

FIGURE 4.9. Histograms of the distances between parameters obtained through the different adversary algorithms as well as the baseline parameters (panels: Entropy vs. Ham, Probability vs. Ham, Likelihood vs. Ham, Entropy vs. Probability, Entropy vs. Likelihood, Probability vs. Likelihood).

CHAPTER V

CONCLUSIONS AND FUTURE WORK

We introduced three formalizations of possible adversarial objectives for classifiers that use cross-entropy as the deciding criterion. Each of the three approaches proved its efficiency compared to the baseline approach of using the probability distributions estimated only on legitimate messages to define a transformation source. The third technique showed itself to be the most efficient of the three, both in terms of transformation and computational requirements. Although the first two techniques involve some implementation difficulties, after appropriate calibration they showed comparable performance. Together, all three methods have demonstrated the feasibility of statistically attacking compression-based classifiers using a relatively limited extent of transformation (for the SpamAssassin corpus, on average, approximately 20% of the original message's length is to be generated and appended).

Future work includes three directions in which this research could be extended. To begin with, it is of interest to explore how the methods of parametrized optimization considered in this work compare with other algorithms that optimize the contents of transformation texts directly, on a per-character basis. This problem setting can potentially increase the number of different criterion functions that can be considered to formalize the adversary problem. The methods that might be applied in that setting include Markov chain Monte Carlo methods, genetic algorithms, simulated annealing, and others.

Another promising direction consists in analyzing the dynamics of the classifier-adversary system for the particular case of entropy-based classification considered in this thesis. Although it is hard to attack this problem in general, it might be feasible to derive some useful properties for the specific algorithms that we have discussed in our work.

Last but not least, it is important to research ways of improving the performance of compression-based classifiers that might be possible given the knowledge of potential adversary attacks.
REFERENCES CITED

Apache SpamAssassin Project. (2005). The SpamAssassin public mail corpus. Available at https://spamassassin.apache.org/publiccorpus/.

Biggio, B., Nelson, B., & Laskov, P. (2011). Support vector machines under adversarial label noise. In Journal of Machine Learning Research: Workshop and Conference Proceedings (Vol. 20, pp. 97–112). MIT.

Biggio, B., Nelson, B., & Laskov, P. (2012). Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning. Omnipress.

Bratko, A., Cormack, G. V., Filipič, B., Lynam, T. R., & Zupan, B. (2006). Spam filtering using statistical data compression models. The Journal of Machine Learning Research, 7, 2673–2698.

Bratko, A., Filipič, B., & Zupan, B. (2006). Towards practical PPM spam filtering: Experiments for the TREC 2006 Spam Track. In Proceedings of the 15th Text REtrieval Conference (TREC 2006). Gaithersburg, MD.

Cleary, J. G. & Teahan, W. J. (1997). Unbounded length contexts for PPM. The Computer Journal, 40(2–3), 67–75.

Cleary, J. G. & Witten, I. H. (1984). Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4), 396–402.

Cormack, G. V. & Horspool, R. N. S. (1987). Data compression using dynamic Markov modelling. The Computer Journal, 30(6), 541–550.

Frank, E., Chui, C., & Witten, I. H. (2000). Text categorization using compression models. In Proceedings of DCC-00, IEEE Data Compression Conference (pp. 200–209). Snowbird, US: IEEE Computer Society Press, Los Alamitos, US.

Goodman, J., Heckerman, D., & Rounthwaite, R. (2005). Stopping spam. Scientific American, 292(4), 42–49.

Lowd, D. & Meek, C. (2005a). Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 641–647). ACM.

Lowd, D. & Meek, C. (2005b). Good word attacks on statistical spam filters. In Proceedings of the Second Conference on Email and Anti-Spam. Palo Alto, CA.

Moffat, A. (1990). Implementing the PPM data compression scheme. IEEE Transactions on Communications, 38(11), 1917–1921.

Robbins, H. & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22, 400–407.

Teahan, W. J. (1995). Probability estimation for PPM. In New Zealand Computer Science Research Students' Conference, 1995. University of Waikato, Hamilton, New Zealand.