
Citation 
 Permanent Link:
 https://ufdc.ufl.edu/UFE0019813/00001
Material Information
 Title:
 Network Centric Traffic Analysis
 Creator:
 Fan, Jieyan
 Place of Publication:
 [Gainesville, Fla.]
Florida
 Publisher:
 University of Florida
 Publication Date:
 2007
 Language:
 english
 Physical Description:
 1 online resource (140 p.)
Thesis/Dissertation Information
 Degree:
 Doctorate ( Ph.D.)
 Degree Grantor:
 University of Florida
 Degree Disciplines:
 Electrical and Computer Engineering
 Committee Chair:
 Wu, Dapeng
 Committee Members:
 Li, Tao
Yang, Liuqing Chen, Shigang
 Graduation Date:
 12/14/2007
Subjects
 Subjects / Keywords:
 Algorithms ( jstor )
Analyzers ( jstor ) Arithmetic mean ( jstor ) Distance functions ( jstor ) Feature extraction ( jstor ) Machine learning ( jstor ) Mathematical vectors ( jstor ) Simulations ( jstor ) Spectral energy distribution ( jstor ) Voice data ( jstor ) Electrical and Computer Engineering  Dissertations, Academic  UF classification, network, security, traffic
 Genre:
 bibliography ( marcgt )
theses ( marcgt ) government publication (state, provincial, terriorial, dependent) ( marcgt ) borndigital ( sobekcm ) Electronic Thesis or Dissertation Electrical and Computer Engineering thesis, Ph.D.
Notes
 Abstract:
 Over the past few years, the Internet infrastructure has become a critical part of the global communications fabric. Emergence of new applications and protocols (such as voiceover Internet Protocol, peertopeer, and video on demand) also increases the complexity of Internet. All these trends increase the demand for more reliable and secure service. This has affected the interest of Internet service providers (ISP) in network centric traffic analysis. Our study considers network centric traffic analysis from the two perspectives that most interest ISPs: network centric anomaly detection, and network entric traffic classification. In the first part, we focus on network centric anomaly detection. Despite the rapid advance in networking technologies, detection of network anomalies at highspeed switches/routers is still far from maturity. To push the frontier, two major technologies need to be addressed. The first is efficient featureextraction algorithms/hardware that can match a line rate in the order of Gb/s. The second is fast and effective anomaly detection schemes. Our study addresses both issues. The novelties of our scheme are the following. First, we design an edgerouter based framework that detects network anomalies as they first enter an ISP?s network. Second, we propose the socalled twoway matching features, which are effective indicators of network anomalies. We also design data structure to extract the features efficiently. Our detection scheme exploits both temporal and spatial correlations among network traffic. Simulation results show that our scheme can detect network anomalies with high accuracy, even if the volume of abnormal traffic on each link is extremely small. In the second part, we focus on network centric traffic classification. Nowadays, VoIP and IPTV become increasingly popular. To tap the potential profits that VoIP and IPTV offer, carrier networks must efficiently and accurately manage and track the delivery of IP services. Yet, the emergence of a bloom of new zeroday voice and video applications such as Skype, Google Talk, and MSN pose tremendous challenges for ISPs. The traditional approach of using port numbers to classify traffic is infeasible because it uses a dynamic port number. The proliferation of proprietary protocols and usage of encryption techniques make applicationlevel analysis infeasible. Our study focus on a statistical pattern classification technique to identify multimedia traffic. In particular, we focus on detecting and classifying voice and video traffic. We propose a system (VOVClassifier ) for voice and video traffic classification that uses the regularities residing in multimedia streams. Experimental results demonstrate the effectiveness and robustness of our approach. ( en )
 General Note:
 In the series University of Florida Digital Collections.
 General Note:
 Includes vita.
 Bibliography:
 Includes bibliographical references.
 Source of Description:
 Description based on online resource; title from PDF title page.
 Source of Description:
 This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
 Thesis:
 Thesis (Ph.D.)University of Florida, 2007.
 Local:
 Adviser: Wu, Dapeng.
 Statement of Responsibility:
 by Jieyan Fan.
Record Information
 Source Institution:
 UFRGP
 Rights Management:
 Copyright Fan, Jieyan. Permission granted to the University of Florida to digitize, archive and distribute this item for nonprofit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
 Resource Identifier:
 663881414 ( OCLC )
 Classification:
 LD1780 2007 ( lcc )

Downloads 
This item has the following downloads:

Full Text 
10 ~ E[Ta]=4
SE[Ta]=6
1085
1083
1081
103 102 n110.0156 101
Figure 312. Space complexity vs. collision probability for fixed time complexity.
results in Section 3.6.1, we use the same 96bit signature, i.e., SA, DA, SP, and DP, and
let R=250,000 packets/second and F=80 seconds, which translates to 250,000 x 80 20M
input signatures for each simulation. These signatures are preloaded into memory before
the beginning of simulations so that I/O speed of hard drive does not affect the execution
time of simulations.
For each simulation run of the hash table algorithm, we specify the memory size
i and measure the algorithm performance in terms of the average number of hash
function calculations per signature query request, denoted by Th, and the execution
time. Due to the Law of Large Numbers, Th approaches the expected number of hash
function calculations per signature query request, i.e., E[Th] in Equation (36), if we run
the simulation many times with the same .i In our simulations, we run the hash table
algorithm ten times; each time with a different set of input signatures but with the same
1,
For each simulation run of the BFA algorithm, we specify the memory size ma and
the number of hash functions IC, and measure the algorithm performance in terms of
the collision frequency, denoted by t, and the execution time. The collision frequency is
defined as the ratio of the number of collision occurrences in BloomFilterSearch to the
The optimal partition (see Equation (639)), in terms of minimum coding length
criteria, should minimize coding length of the segmented data, i.e.,
min Lc (T; 1 m) min Ic ()+ k log2 (645)
k1 kL J
where 1I denotes the partition scheme. The first term in Equation (645) is the summation
of coding length of each group, and the second one is the number of bits needed to
encoding membership of each item of T in the K groups.
The optimal partition is achieved in the following way. Let the segmentation scheme
be represented by the membership matrix,
Ilk diag ([lk, T2k,... Nk]) E RNXN, (646)
where 7 k denotes the probability that vector Q(n) belongs to subset k, such that
K
YTnk = 1, for Vn 1,..., N (647)
k= 1
and diag() denotes converting a vector to a diagonal matrix.
Hi i. [r, page 34] proved that the coding length is bounded as follows.
< r tr (f1k) + K logKdet (I +tr () K q4lk ')
kI 2 tr (Ik 2
f tr (u~k) 10t2
k=
A(q; II), (648)
where tr (.) denotes the trace of a matrix, and det (.) denotes matrix determinant.
Combining Equations (645) and (648), one achieves a minimax criterion
S= argmin max ~c (1; 1)] arg min, ('; 1). (649)
n c n 
nodes, which represent whether an edge router is in abnormal state. Next, we explain the
algorithm in detail.
1. function ViterbiDecodeHMT(...)
2. Argument 1: transition probabilities, {P(R ,(i)); i E \ 0o}.
3. Argument 2: likelihood, {p(il0i); i c E}.
4. Argument 3: feature, 0.
5. Return: estimated node states, {ui; i E E}.
{P(Q ,(i) ), P(A i, QP() i); i 2 \ Eo}
BP {P(  ( Q,()); i e \ o} {p( I i); i e } (
P (Qi, QPy) r
P (Q~i )
(441)
for V i E 0o
Ui = arg n ::
end for
for l  ,...,L
P(A = u ,l)
, for V i C EE
ii =argmaxs(ui()P (OQ
u,
SII
(442)
13. end for
Figure 411. Viterbi algorithm for HMT decoding.
By B ,i, ii formula, the terms in Equation (439) can be computed by
P (Q6i Ui," QPW() UP(W)4
P Qv I ( )
(443)
(440)
P (Qi I (i), )
UjjQ 6i)itPhi)'I) 
O(i,)P (Q/,
P (Qi, Ui, I QPW) UPW), )
6.5 Experiment Results
In this section, we demonstrate the experiment results of applying the system
presented in Figure 61 to network traffic classification. Before that, we first describe
experiment settings in Section 6.5.1.
6.5.1 Experiment Settings
We perform four sets of experiments. In Section 6.5.2, two sets of experiments are
conducted on traffic generated by Skype. In Section 6.5.3, other two sets of experiments
are conducted on traffic generated by Skype, MSN, and GTalk.
For each set of the experiments, we use Receiver Operating C'.,,,,i. I/ ristics (ROC)
(< iL [2! page 107] as the performance metric. ROC curve is a curve of detection
probability, PD, vs. false alarm probability, PFA, where,
PDniH P (The estimated state of nature is H The true state of nature is '), (659)
PFA\H =P (The estimated state of nature is IThe true state of nature is not H), (6 60)
where H can be voice, video, file+voice, and file+video. By tuning parameters 8,, OA, and
Ov ( see Figure 67), one is able to generate ROC curve.
During the experiments, we collected network traffic from three applications, i.e.,
Skype, MSN, and GTalk. For each application, traffic was collected in two scenarios. For
the first scenario, two lab computers located in University A and University B respectively
were communicating with each other. There was direct connection between two peers. For
the second one, we used a firewall to block direct connection between the two peers such
that the application was forced to use relay nodes.
To do classification, we chose first 10 seconds of each flow, i.e., Ama, < 10 seconds.
We set T, 0.5 milliseconds. Hence, Id = 20, 000.
099 099
098 098
097 097 
096 096
0 95 095 
094 094
093 093
092 092
091 091
09 09
0 02 04 06 08 1 0 02 04 06 08 1
PFA PFA
(a) (b)
Figure 610. The ROC curves of singletyped flows generated by Skype, MSN, and GTalk:
(a) VOICE and (b) VIDEO.
Table 61: Typical PD and PFA values.
Skype Skype+MSN+GTalk
PFA (PD) Single Hybrid Single Hybrid
VOICE 0(1) 0(1) 0(.995) .002(.986)
VIDEO 0(.993) 0(.965) 0(.952) 0(.948)
Voice flows vs. video flows. From Table 61, one notes that classification of
VOICE traffic is more accurate than that of VIDEO. Specifically, we can achieve 10' .
accurate classification of Skype voice flows. This is due to the fact that voice traffic has
higher regularity than video does (Figure 58 and Figure 59). One can immediately tell
the dominant periodic component at 33Hz in the voice flows. This frequency corresponds
to the 30millisecond IPD of the employ, 1 voice coding. On the other hand, Video PSDs
have peaks at 0. It means that nonperiodic component dominates in video flows. One can
see that PSDs of the two video flows are close to each other. That is the reason why our
approach achieves high classification accuracy by using PSD features.
Singletyped flows vs. hybrid flows. From Table 61, one can see that the
classification of singletyped flows is more accurate than that of hybrid flows. Mixing
multiple types of traffic together is like increasing noise. Hence, it is not surprising that
classification accuracy is reduced.
To formulate the model, we define a random variable Q : H  Z2, such that
Q(H,) = u, u e Z2. (410)
Furthermore, denote by A the traffic observed by traffic monitors. Since network
state induces stochastic process of traffic, we employ the widelyused graphic model
representation [31] to depict this causeeffect relationship in Figure 41.
(a) (b)
Figure 42. Extended generative model including traffic feature vectors: (a) original model
and (b) simplified model.
Denote by 4, I E c the feature extracted from traffic, where ED is the feature space.
Most importantly, in selection of the optimal features, we seek for the most discriminative
statistical properties of the traffic. Also note that it is possible to employ multiple features
in the detection procedure, in which case ) is a vector. Since features are succinct
representations of the voluminous traffic, we extend the above model in Figure 41 to the
one illustrated in Figure 42(a). Once ) is extracted from A, we assume that ) represents
A well. It means that we may operate only over lowerdimensional 4, which reduces
computational complexity. Therefore, we simplify the model in Figure 42(a) to that
illustrated in Figure 42(b), where A is dismissed.
Since the feature is measurable, it is called observable random vector, and is depicted
by a rectangular node in Figure 42(b). The network state generating the traffic feature
is to be estimated. We call it hidden random variable, and depict it by a round node in
Figure 42(b). Now, the goal becomes to estimate the hidden state f given the observable
where F, . Equation (6 33) shows the relationship between the periodic components
of a stochastic process in the continuoustime domain and the shape of its PSD in the
frequency domain.
f (f; pd) is a continuous function in frequency domain. To handle it in a computer,
we need to do sampling in frequency domain. In other words, we select a series of
frequencies,
0< fl < fi
2' (6
and define the PSD feature vector as
S1T
= f (fl;'Pd) f (f2;Pd), f (fM;Pd) (635)
c e RM is the feature vector we use to perform classification.
In the next Section, we introduce a new technique that we use to translate the
characteristic of these highdimensional feature vectors into a more tractable low
dimensional space.
6.3 Subspace Decomposition and Bases Identification on PSD Features
In _r inl,, scientific and engineering problems, the data of interest can be viewed
as drawn from a mixture of geometric or statistical models instead of a single one.
Such data are often referred to in different contexts as "mixed," or "multimodal," or
"multimodel," or "heterogeneous," or "hybrid." Subspace decomposition is a general
method for modeling and segmenting such mixed data using a collection of subspaces,
also known in mathematics as a subspace arrangement. By introducing certain new
algebraic models and techniques into data clustering, traditionally a statistical problem,
the subspace decomposition methodology offers a new spectrum of algorithms for data
modeling and clustering that are in many aspects more efficient and effective than (or
complementary to) traditional methods, e.g., principle component ii 1, i (PCA),
Expectation Maximization (EM), and KMeans clustering.
N(0,1) distribution
04
O~
035
03
025
S02
015
01
005
0
4
1 2 3 4
Figure 46. Probability density function of the univariate Gaussian distribution A/ (x; 0, 1).
Histogram
0 100 200 300 400 500 600
Figure 47. Histogram of the twoway matching features measured at a real network
during network anomalies.
However, multiple peaks may exist in the empirical distribution of l Qi. For
example, Figure 47 shows the histogram of the twoway matching features measured
in a real network during DDoS attacks. It has two peaks. Hence, the unimodal Gaussian
distribution is not suitable. In the paper, we adopt the Gaussian mixture model (GMM)
to model the likelihood distribution.
3 2 1 0
X
250
200
150
100
0 1000 2000 3000 4000 5000 6000 7000 8000 9000
Time Slot
SYN (Link
2) 
0 1000 2000 3000 4000 5000
Time Slot
6000 7000 8000 9000
0 1000 2000 3000 4000 5000 6000 7000 8000 9000
Time Slot
300 
UMSYN (
250
200
150
100
50
^Jji.JF
0 1000 2000 3000 4000 5000 6000 7000 8000 9000
Time Slot
Figure 316. Feature data: (a) Number of SYN packets (link 1), (b) Number of unmatched
SYN packets (link 1), (c) Number of SYN packets (link 2), and (d) Number of unmatched
SYN packets (link 2).
m 111
N
I
u
rather than the source addresses, are distributed. However, the paper did not resolve an
important problem, i.e., how to extract features efficiently to match a high line rate in
the order of Gb/s. We proposed a data structure called Bloom filter array to address this
problem.
3.2 Hierarchical Feature Extraction Architecture
Network anomaly detection is not an easy task, especially at highspeed routers.
One of the main difficulties arises from the fact that the data rate is too high to afford
complicated data processing. An anomaly detection algorithm usually works with traffic
features instead of the original traffic data itself. Traffic features can be regarded as
succinct representations of the voluminous traffic, e.g., the traffic data rate is a feature of
the traffic.
We focus on presenting our feature extraction architecture for network anomaly
detection. We also cover extraction schemes for some simple features, such as data rate
and SYN/FIN(RST) ratio. The more advanced features, the socalled twoway matching
features, are discussed later.
3.2.1 ThreeLevel Design
Incoming packets
Level 1 Filter
TCP SYN TCP FIN TCP RST TCP SYN/ACK Level 2 Filter
 SYN Rate  YN/FIN(RST) Ratio 2D Matching Feature Extraction
Figure 31. Hierarchical structure for feature extraction.
To efficiently extract features from traffic, we design a threelevel hierarchical
structure (Figure 31), where incoming packets are processed by levelone filters, then
by leveltwo filters, and finally by (levelthree) feature extraction modules. Levelone filters
machine learning algorithm exploits the spatial correlation among traffic on multiple links,
while the thresholdbased scheme only uses the traffic on one link.
Single CUSUM
Dual CUSUM
Threshold 
0.8 Machine Learning / /
0.6
2
5 0.4
0.2
0 ......4 "i r ........
0.0001 0.001 0.01 0.1 1
False Alarm Probability
Figure 414. Performance of four detection algorithms
In Figure 414, we compare the ROC performance of four detection algorithms (the
thresholdbased, the singleCUSUM, the dualCUSUM described in Section 4.1.3, and our
machine learning algorithm) under the same feature, i.e., the number of unmatched SYN
packets. For the singleCUSUM and the dualCUSUM algorithm, the detection delay D is
chosen from 1 to 10 slots and the parameter ai of link i is determined by
ai = (Dattack Dormal) V 1 < < 16,
17
where Dattack and Dnormal are the average number of unmatched inbound SYN packets in
attack and normal conditions, respectively.
The ROC performance (Figure 414) of our machine learning algorithm is the
best among all the algorithms. We also see that the dualCUSUM outperforms the
simple thresholdbased algorithm and the singleCUSUM algorithm has the worst ROC
performance.
4.5.3 Discussion
We would like to point out that, besides detecting network anomalies with low
volume traffic, our machine learning algorithm is also able to detect high volume anomaly,
probability, which give the ROC curve [28]. In this paper, we will use the ROC curve
to compare the performance of different detection algorithms.
Next, we introduce some basic classification algorithms.
4.1.2 ThresholdBased Algorithm
The idea of the thresholdbased algorithm is that if the feature value exceeds a
preset threshold, declare 'abnormal'; otherwise, declare 'normal'. Note that the detection
operation is conducted in each slot. By tuning the threshold for the feature value, we
can obtain different pairs of false alarm probability and detection probability, resulting
in the ROC curve. Given the ROC curve and the desired false alarm probability, one can
determine the value of the threshold for detection operation.
Although it is the simplest method, thresholdbased algorithm can only be used
when the features make significant difference between normal and abnormal conditions.
Therefore, thresholdbased algorithm is not suitable to detect low volume network
anomalies.
4.1.3 ChangePoint Algorithm
In the literature, a simple changepoint algorithm nonparametric Cumulative
Summation (CUSUM) algorithm has been widely used [1113, 17]. However, existing
studies only consider the change from normal state to abnormal state, which means that
the number of false alarms can be very large after network anomalies end. To facilitate the
discussion, we define the following parameters used in CUSUM in Table 41:
Table 41: Parameters used in CUSUM
Parameter Description
S(ti) The observed traffic feature at the end of time slot i.
4n The expectation of K(ti) in normal states.
io The expectation of +(ti) in abnormal states. Without losing generality,
here we assume that 4, < 4a
K(ti) The adjusted variable, defined as i(ti) = i(ti) a, where a is a parameter such that
< a < 4,
router Y, then the socalled twoway matching features between subnet A and server B
shall be obtained at the global analyzer, which has the routing information of the ISP
network.
The advantages of our framework design are that:
1. it is deploy, .1 on edge routers instead of systems of end users, such that it can detect
network anomalies in the first place they enter an AS;
2. it has no burden on core routers;
3. it is flexible in that detection of network anomalies can be made both locally and
globally;
4. it is capable of detecting low volume network anomalies accurately by exploiting
spatial correlations among edge routers.
The framework is designed to be an addon service provided by ISP to protect end
users from network anomalies.
2.3 Summary
This chapter is concerned with design of network anomaly detection frameworks.
There are two types of frameworks (i.e., hostbased and network based. Our design is of
the second type). Specifically, we designed a framework deploy, .1 on edge routers. It is
composed of three components, traffic monitors, local i i liv. i1, and global analyzers.
This framework is flexible in that it can detect network anomalies from both local view
and global view of the network. By exploiting spatial correlations among edge routers, our
framework is capable of detecting low volume network anomalies.
6.2.2 Power Spectral Density (PSD) Computation . ..
6.3 Subspace Decomposition and Bases Identification on PSD Features
6.3.1 Subspace Decomposition Based on Minimum Coding Length
6.3.2 Subspace Bases Identification . ......
6.4 Voice/Video Classifier . ..............
6.5 Experiment Results . ...............
6.5.1 Experiment Settings . ...........
6.5.2 Skype Flow Classification . ........
6.5.3 General Flow Classification . .......
6.5.4 D discussion . . . . . .
6.6 Sum m ary . . . . . . .
7 CONCLUSION AND FUTURE WORK . .......
. 108
. 115
. 117
. . . 120
. . . 12 1
. . . 123
. . . 123
. . . 124
. . . 124
. . . 125
. . . 128
. . . 129
Summary of Network Centric Anomaly Detection
Summary of Network Centric Traffic Classification
APPENDIX
A PROOFS . . .
Equation (431) . ...
Equation (432) . ...
Equation (433) . ...
Equation (434) . ...
133
133
134
134
REFERENCES ...............
BIOGRAPHICAL SKETCH .............................. .......
1.2 Introduction to Network Centric Traffic Classification
Besides network anomaly detection, classification of normal network traffic is also
of practical significance to both enterprise network administrators and ISPs. Along
with the rapid emergence of new types of network applications such as VoIP, VoD, and
P2P file exchange, quality of service (QoS) becomes a more and more important issue.
For example, transmission of realtime voice and video has bandwidth, delay, and loss
requirements. However, there is no QoS guarantee for these realtime applications over the
current besteffort network. Many schemes are proposed to address this problem. On the
other hand, enterprise network administrators may want to restrict network bandwidth
used by disallowed VoIP, VoD, or P2P applications, if not totally block, which might be
too rude. That is, they want to limit the QoS of specific network traffic.
Wu et al.[4] summarized techniques for QoS provision for realtime streams from
the point of view of end hosts. These techniques include coding methods, protocols,
and requirements on stream servers. Another effective solution is from the point of view
of network carriers or ISPs. For example, ISPs can assign different forwarding priority
to different types of network traffic on routers. This is the motivation of differentiated
services (DiffServ)[5, 6].
DiffServ is a method designed to guarantee different levels of QoS for different classes
of network traffic. It is achieved by setting the "type of service" (TOS)[7] field, which
hence is also called DiffServ code point(DSCP)[5], in the IP header according to the class
of the network data, so that the better classes get higher numbers. Unfortunately, such
design highly depends on network protocols, especially proprietary protocols, observing
DiffServ regulations. In the worst case, if all protocols set TOS to the highest number, it
is even worse to employ DiffServ method.
For this reason, we believe a proper DiffServ scheme should be able to classify
network traffic on the fly, instead of relying on any tags in packet header. Thus, the
difficulty lies in accurate classification of network traffic in realtime.
Table 42: Notations for hidden markov tree model
Notation Description
fi The random variable representing the state of node i.
ti The random variable/vector representing the features) measured at node i.
4r {4i; i E T}, where T is a subtree of the HMT.
L The number of levels of the HMT.
2 The set of all nodes in the HMT.
El The set of nodes at level 1, 1 E ZL, in the HMT. Specifically, Eo represents the set
of root nodes and EL1 leaf nodes.
B The number of children nodes of each node, except leaves. For example,
B = 4 for quadHMT, as illustrated in Figure 45.
p(i) The parent node of node i, where i o0.
v(i) The set of children nodes of node i, where i EL_1.
T' The set of ancestor nodes of node i, where i Eo, including node i.
7(i) The root node of the subtree containing node i, where i E B.
% The subtree whose root is node i, where i E .
Ti TKY) \ 7>
nodes represent clusters of edge routers. Features of nodes are defined in Equation (416).
Features measured at the corresponding edge router i E EL_l(i.e., leaf node)
1 jE,(i) 4 i i ZEL(i.e., nonleaf node)
(416)
One notes that only features of leaf nodes have physical nii iii:. i.e., features measured
at corresponding edge routers. Features of a nonleaf node are assumed to be average of
features of its child nodes.
We have two assumptions for the HMT:
1. Node state only depends on state of its parent, if it is known, i.e.,
P(QilQ,j e E,j / i) =P(QQQ(i)), i E \ So (417)
2. Features measured at a node only depends on state of that node, if it is known, i.e.,
p(AN A) =p(AN ), i E B (418)
to sell and deliver IP services to their customers. Unfortunately the emergence of a bloom
of new zerod, voice and video applications over IP, like Skype, Google Talk (Gtalk),
MSN, etc, the proliferation of new peertopeer protocols that now allow the usage of voice
and video among other applications, and the continuing growth of the usage of encryption
techniques to protect data confidentiality, lead to tremendous revenue leakage for ISPs due
to their inefficiency in detecting these new applications and thus lack of proper actions.
The result from unmanaged commercial traffic adds up to loss of hundreds of millions of
dollars annually and poses a solid road block to the profitability of ISPs' VoIP and IPTV
services.
As a consequence, it is imperative for ISPs to identify robust solutions for detecting
voice and video over IP datastreams. The most common approach for identifying
applications on an IP network is to associate the observed traffic with an application
based on TCP or UDP port numbers [43, 44]. In principle the TCP and UDP server
port numbers can be used to identify the higher 1 v.r application, by simply identifying
the server port and mapping this port to an application using the Internet Assigned
Numbers Authority (IANA) list of registered ports [45]. However, portbased application
classification has limitations due to the emergence of new applications that no longer
use fixed, predictable port numbers. For example, nonprivileged users often have to use
ports above 1024 to circumvent operating system access control restrictions; or common
applications like FTP allows the negotiation of unknown and unpredictable server ports to
be used for the data transfer; or proprietary applications may deliberately try to hide their
existence or bypass portbased filters by using standard ports. For example, server port 80
is being used by a large variety of nonweb applications to circumvent firewalls which do
not filter port80 traffic; others (e.g., Skype) tend to use dynamic ports.
A more reliable technique involves stateful reconstruction of session and application
information from packet contents [4648]. Although this avoids reliance on fixed port
numbers, it imposes significant complexity and processing load on the classification device,
Assume the length of a slot is 7. Then, we have F = w x 7. The data structure of BFA is
as follows:
* An array of bit vectors {IVj} (j E Z+), where IVj is the jth insertion vector holding
inbound signatures in slot [rj, Tj+l), where Tj+l = j + 7.
* An array of bit vectors {RVj} (j E Z+), where RVj is the jth removal vector holding
outbound signatures in slot [rT, Tj+i).
* An array of counters {C,} (j E Z+), where Cj is used to count the number of UIF in
slot [Tj, Tj+,).
Since the twoway flows need to be matched within a time interval of length F, we
only need to keep information within a time window of length F. That is, if the current
slot is [7j, j+l), only {IVj_,w+,... ,IVj}, {RVj_w+,... ,RVj}, and {Cj_,+I,...,Cj} are
kept in memory.
3.5.2 Algorithm
Our algorithm for BFA (Figure 38) consists of three functions, namely, ProcInbound,
ProcOutbound and Sample, which are described as below.
Function ProcInbound is to process inbound packets. It works as below. When
an inbound packet arrives during [ T, 9Tj+), we increase Cj by 1 and insert its inbound
signature s into IVj if none of the following conditions is satisfied:
1. s is stored in at least one RVj,, where j w + 1 < j' < j;
2. s is stored in IVj.
Condition 1 being true means that the corresponding outbound flow of this inbound
packet has been observed previously; so we should not count it as an unmatched inbound
packet. Condition 2 being true means that the inbound flow, to which this inbound packet
belongs, has been observed during the current slot j; so we should not count the same
inbound flow again. If both conditions are false, we increase Cj by one to indicate a new
potential UIF (line 7 to 10).
1. function ProcInbound(s)
2. a false, b false
3. if 3j', j w + 1 < < j, such that BloomFilterSearch(RVj,,s) returns true then
4. a  true
5. if BloomFilterSearch(IVj,s) returns true then
6. b true
7. if a and b are both false
8. Cj Cj + 1
9. BloomFilterInsert(IVj, s)
10. end if
11. end function
12. function ProcOutbound(s')
13. for j' j to j w + 1
14. if BloomFilterSearch(RVj,, s') returns true
15. break
16. if BloomFilterSearch(IVjy, s') returns true
17. cj, j, 1
18. end for
19. BloomFilterInsert(RVj, s')
20. end function
21. function Sample(j)
22. return Cj,w+l
23. end function
Figure 38. Bloom Filter Array Algorithm
Function ProcOutbound is to process outbound packets. It works as below. When an
outbound packet arrives during [rj, Tj+,), we check whether we need to update counter Cy
for each j' (j w + 1 < j' < j). Specifically, for each j' (j w + 1 < j < j), decrease Cj
by one if its outbound signature s' satisfies both of the following conditions:
1. s' is not contained in RVj,;
2. s' is contained in IVy'.
Condition 1 being true means that no packet from the outbound flow to which this
outbound packet belongs arrives during the j'th time slot. Condition 2 being true means
that the matched inbound flow of this outbound packet has been observed in the j'th slot.
Satisfying both conditions means that its matched inbound flow has been counted as a
potential UIF; hence, upon the arrival of the outbound packet, we need to decrease C, by
ACKNOWLEDGMENTS
First of all, thank my advisor Professor Dapeng Wu for his great inspiration, excellent
guidance, deep thoughts, and friendship. I also thank my supervisory committee members,
Professors Shigang C'!i, i, Liuqing Yang, and Tao Li, for their interest in my work.
I also express my appreciation to all of the faculty, staff, and my fellow students
in the Department of Electrical and Computer Engineering. In particular, I extend my
thanks to Dr. Kejie Lu for his helpful discussions.
CHAPTER 4
MACHINE LEARNING ALGORITHM FOR NETWORK ANOMALY DETECTION
4.1 Introduction
The third issue in network anomaly detection is the classification algorithm. In
this section, we introduce three basic detection algorithms, thresholdbased algorithm,
changepoint algorithm, and B iv i i, decision theory. Our machine learning algorithm
derives from the B i, i ,i decision theory.
This section is organized as below. Section 4.1.1 introduces the Receiver Operating
CIi o ':teristics curve. It is used in Section 4.5 as the metrics to compare performance of
different classification methods. We describe the thresholdbased algorithm, changepoint
algorithm, and B ,i, i oi decision theory in Sections 4.1.2, 4.1.3, and 4.1.4, respectively.
4.1.1 Receiver Operating Characteristics Curve
Receiver Operating C'li,,,. /. ristics (ROC) curve [28] is a typical method to quantify
the performance of detection algorithms. It is a plot of detection probability vs.
false alarm probability. In practice, we estimate detection probability and false alarm
probability by the fraction of true positives and the fraction of true negatives, respectively.
Hence, to obtain an ROC curve, one needs to measure the following quantities
* Af: the number of false alarms, i.e., the number of slots in which the detection
algorithm declares 'abnormal' given that no anomaly actually happens in these slots;
* A,: the number of slots in which no anomaly happens;
* Ad: the number of slots in which the detection algorithm declares 'abnormal' given
that network anomalies actually happen in these slots;
* Aa: the number of slots in which network anomalies happen.
The false alarm probability and the detection probability of the detection algorithm
can be estimated by Af/A, and Ad/Aa, respectively. By varying parameters of detection
algorithms, we can obtain different pairs of false alarm probability and detection
For the same scenario that one needs to estimate AR model from order 1 up to pmax,
the time complexity of LDA is 0 (p~,), much better than the direct solution given by
Equations (617) and (618).
1. function PSDEstimate(...)
2. Argument 1: data, {y(i)} i.
3. Argument 2: order, p.
4. Return: PSD [w(~; y).
5.
[ap, _C ] LDA({y(i)}0 ,p) (630)
6.
2
(; y) = t2 (631)
It + EY atea
Figure 64. Parametric PSD Estimate using LevinsonDurbin Algorithm.
Once the AR model is estimated, one can estimate PSD of signal {y(i)} 1. The
procedure is given in Figure 64.
PSD feature vector.
According to the above discussion, we now define the PSD feature vector of a flow
as the following. Let us assume {Pd(i)}d, (see Equation (64)) to be secondorder
stationary. Then its PSD can be estimated as:
b (7;Pd) PSDEstimate ({Pd(i)} d ,) (632)
where w E [7r, 7) and p is the prespecified order.
Recall that {Pd(i)}!I are obtained by sampling a continuoustime signal Ph(t) at
time interval T, (see Figure 62). Thus, one can further formulate the PSD in terms of real
frequency f as
bf(f; 'Pd) ( ; P ed) ( F s ) (633)
F, 2 2
A.3 Equation (4 33)
Proof:
Tr(u)' (u)
E., Ti(u')v (u')
p OQ = U, pv) P (?=v I U)
K,, P (Q U u) p (^  U)
p Qi = u,
K,, P (j U1,
p [Qi u,
p( )
P(p(i) = u) (A3)
A.4 Equation (434)
Proof:
(, u)Vp(i)( ')P (,, = U, p( ) p R ') p(j(')
[ ,,, Ti(u(u")] [ ,( P (P (=2 u" Q) = u') v (u")]
P ( i u)P (TK\. Qp() U') P (Q UQp() U) p (ip()) U',)
p (i =u (T I Qp(i) U') P (pp(j) u', p?\^)) P KW  p(i) U')
[E i=P (i U', )I [YP (Q i Ip(i ) = u)
P Hp (fe p(i) /)
L p (i = U ", pp() = U', U p(i) Ti, p(i) = U
p ( ui = Qp() = U',
P(Qj = u, Qp(i) = u'\^) (A4)
goals, we propose to use a hierarchical model, the hidden Markov tree (HMT) model, as
depicted in Figure 45.
1+1
I1
Figure 45. Hidden Markov tree model. For an node i, p(i) denotes its parent node and
v(i) denotes the set of its children nodes.
The motivation of applying HMT model is that we assume edge routers are not
equally correlated. Instead, edge routers topologically close to each other have high mutual
correlations. Based on this assumption, we cluster edge routers according to the topology
of AS and form a tree structure, as depicted in Figure 45. Without loss of generality,
Figure 45 plots a quadtree structure, i.e., each node, except leaf nodes, has four children.
To facilitate further discussion, each node in the HMT is assigned an integer number,
beginning with 0, from top to bottom. That is, node 0 is ah,i a root node2 Table 42
lists the notations used in the rest of the paper for HMT.
In the HMT, each leaf node stands for an edge router. Zeropadding virtual edge
routers are introduced when the number of edge routers is not a power of B. States of
these zeropadding virtual nodes are alx i normal and features are ah,l 0. Nonleaf
2 A HMT might have multiple roots, depending on the number of edge routers and the
number of levels.
i.e., E2, a .. ., ap. Its solution is
1
a,1 (0; /) ... r( + p ;l/) (l)
S (617)
p (p ; y) ... (0; y) r(p)
(2 =(0; y) + (t; y)at. (618)
t=1
LevinsonDurbin algorithm (LDA). The direct solution of YuleWalker method,
i.e., Equations (617) and (618), is not good enough in terms of time complexity.
Equation (617) computes the inversion of covariance matrix, whose time complexity is
O (p3)[60, page 755]. In addition, in most applications, there is no a priori information
about the true order p. To cope with that, the YuleWalker system of equations,
Equation (615), has to be solved for p = 1 up to p = pmax, where pmax is some
prespecified maximum order. The time complexity is O (p4ax).
In this paper, we use the LevinsonDurbin Algorithm (LDA)[59]to reduce time
complexity. It estimates AR signal coefficients recursively in the order p. To facilitate
further discussion and to emphasize the order p, we denote
r(0; Y) r( ; y) r(p; y)
A l r (1; y) r(0; y) r(p+ 1;y) (69)
,p+i (619)
r(p;y}) r (1;y) r(0;y)
a1
ap^ (620)
ap
We show the features extracted from the two traces, specifically, the number of SYN
packet arrivals and the number of unmatched inbound SYN packet arrivals during a slot
(Figure 316).
From Figure 316, it can be observed that the features are rather noisy, especially for
the feature of the number of SYN packets. From Figs. 316(a) and 316(c), we can hardly
distinguish the slots under the low volume synchronized attacks from the slots without
attacks (by visual inspection). In comparison, it is much easier to identify the slots under
the synchronized attacks (by visual inspection) when the number of unmatched SYN
packets is used as the feature (see Slot 1400 1600 and Slot 2800 3200 in Figs. 316(b)
and 316(d).
3.8 Summary
This chapter is concerned with design of data structure and algorithms for network
anomaly detection, more specifically, feature extraction for network anomaly detection.
Our objective is to design efficient data structure and algorithms for feature extraction,
which can cope with a link with a line rate in the order of Gbps. We proposed a novel
data structure, namely the Bloom filter array, to extract the socalled twoway matching
features, which are shown to be effective indicators of network anomalies. Our key
technique is to use a Bloom filter array to trade off a small amount of accuracy in feature
extraction, for much less space and time complexity. Different from the existing work, our
data structure has the following properties: 1) .;,ir i.' Bloom filter, 2) combination of a
sliding window with the Bloom filter, and 3) using an insertionremoval pair to enhance
the Bloom filter with a removal operation. Our analysis and simulation demonstrate
that the proposed data structure has a better space/time tradeoff than conventional
algorithms.
Next, we discuss classification algorithm based on extracted features.
Equation (44) is equivalent to
=argmin X(H* H)p( H)P(H). (46)
HEH
Equation (46) gives the B li i ,i criterion for pattern classification.
A simple loss function is defined to be
X(H,., H,) = (u)(u* = u), for V u e Zu, (47)
where <(u) is the gain factor, typically positive, representing the gain obtained by
correctly detecting H,, and Z() is the indicator function such that
1 if x is true
I(x) = (48)
0 if x is false
Equation (47) specifies that misclassification induces zero loss and correct classification
induces negative loss, which actually achieves gain. Applying Equation (47) to Equation (46),
the B li,i ,i criterion is simplified to,
S= arg max ((u)p( H,) P(H,). (49)
uEZU
We call Equation (49) the maximum gain criterion. Further note that, scaling all gain
factors by a same factor does not change the criterion specified by Equation (49). Hence,
we can ah,ix set x(0) = 1. By tuning other gain factors, we can generate ROC curve for
B i, i io decision theory.
In this chapter, we extended the B i, i i, decision theory for network anomaly
detection. The remainder of this chapter is organized as follows. In Section 4.2, we
establish B li, i models for network anomaly detection. Sections 4.3 and 4.4 solve
two fundamental problems of B i, i i, model, i.e., training problem and classification
problem, respectively. Section 4.5 shows our simulation results and Section 4.6 concludes
this chapter.
In our experiment, we use two 24hour traces as the background traffic. We simulate
network anomalies caused by TCP SYN flood attacks [18] by randomly inserting TCP
SYN packets with random source IP addresses into the background trace during specified
time periods. Specifically, synchronized attacks are simulated during 14000 16000 second
and 28000 32000 second in both traces. In addition, .ivnchronous attacks are launched
during 5000052000 second in trace 1 and during 5700059000 second in trace 2. The
average attack rate is 1 of the packet rate of the background traffic during the same
period.
To detect TCP SYN flood attacks, we choose < SA, DA, SP, DP > as the signature
of inbound packets and < DA, SA, DP, SP > for outbound ones. Thus the twoway
matching feature is the number of unmatched inbound TCP SYN packets in one time slot.
The average flow rate is 2480 flows/second. Therefore, we set R = 2480. We further set
S= 0.1 C = 8, w = 8, and 7 = 10. Then, by solving Equation (318) for ifl.. and
requiring i .. to be a power of 2, we obtain f.. = 215 bits. The computer used for our
experiments has one 2.4G Hz CPU and 1GB memory. For comparison, we also extract the
number of inbound SYN packets in a slot.
Performance.
The average processing rate is measured to be 265, 000 packets/second. Hence,
the algorithm can deal with a line rate of 1 Gbps since the average Internet packet
size is about 500 bytes. Note that our test is offline and data is read from hard disk,
whose access speed is much lower than that of memory. In a real implementation,
data is captured by a highspeed network interface and maintained in the memory;
so the processing speed can be increased. Furthermore, in our test, a hash function is
implemented by software, which is also much slower than a dedicated hardware. Therefore,
it is reasonable to anticipate a higher processing rate if a dedicated hardware is used.
[31] B. J. Frey, Graphical Models for Machine Learning and D.:.I:l.i Communication.
Cambridge, MA: MIT Press, 1998.
[32] M. C. Nechyba, l\! ::iiiiiiiilikelihood estimation for mixture models:
the em algorithm,," 2003, course note. [Online]. Available: http:
//mil.ufl.edu/~nechyba/__eel6825.f2003/coursematerials/t4.em_theory/emnotes.pdf
[33] Y. Weiss, "Correctness of local probability propagation in graphical models with
loops," Neural Computation, vol. 12, pp. 14, 2000.
[34] J. S. Yedidia, W. T. Freeman, and Y. Weiss, "Generalized belief propagation,"
Advances in Neural Information Processing S,'l. mi, vol. 13, pp. 689695, Dec. 2000.
[35] J. Pearl, Probabilistic Reasoning in Intelligent SI. I" Networks of Plausible
Inference. Morgan Kaufmann, Sept. 1988.
[36] S. M. Aji and R. J. McEliece, "The generalized distributive law," IEEE Trans. I,,1,1
Th(.,;, vol. 46, pp. 325343, Mar. 2000.
[37] T. Richardson, "The geometry of turbodecoding dynamics," IEEE Trans. InfT, 11
Th(.,;, vol. 46, pp. 923, Jan. 2000.
[38] R. J. McEliece, D. J. C. McKay, and J. F. C'!. i: "Turbo decoding as an instance
of pearls belief propagation algorithm," IEEE J. Select. Areas Commun., vol. 16, pp.
14052, Feb. 1998.
[39] F. Kschischang and B. Frey, 11I I, i i. decoding of compound codes by probability
propagation in graphical models," IEEE J. Select. Areas Commun., vol. 16, pp.
219230, Feb. 1998.
[40] A. J. Viterbi, "Error bounds for convolutional codes and an .,iipll ically optimum
decoding algorithm," IEEE Trans. Inform. Tht.' ; vol. 13, pp. 260269, Apr. 1967.
[41] J. G. DAVID FORNEY, "The viterbi algorithm," in Proceedings of the IEEE, vol. 61,
Mar. 1973, pp. 268278.
[42] L. R. RABINER, "A tutorial on hidden markov models and selected applications in
speech recognition," in Proceedings of the IEEE, vol. 77, Feb. 1989, pp. 257286.
[43] D. Moore, K. Keys, R. Koga, E. Lagache, and k claffy, "The CoralReef software suite
as a tool for system and network administrators," in Usenix LISA. (2001), Dec. 2001.
[Online]. Available: citeseer.ist.psu.edu/moore01coralreef.html
[44] C. Logg, "C'!i i:terization of the traffic between slac and the internet," July
2003. [Online]. Available: http://www.slac.stanford.edu/comp/net/slacnetflow/html/
SLACnetflow.html
[45] I. A. N. Authority, "Port numbers," Aug. 2006. [Online]. Available:
http://www.iana.org/assignments/portnumbers
Usually it is difficult to estimate prior probabilities. In the paper, we simply assume
P(Q = 0) = P(Q = 1) = c i E 0o. That is, states of root nodes are equally likely to
be normal or abnormal. Other parameters such as likelihood and transition probabilities
are estimated from training data. This is covered in Section 4.3. After that, classification
using maximum gain criterion is described in Section 4.4.
4.3 Estimation of HMT Parameters
In this section, we describe estimation of HMT parameters. It is organized as follows.
In Section 4.3.1, we describe estimation of likelihood p(pI Q), i E B. Estimation of
transition probabilities P(Q2lR p()), i cE \ 0o are presented in Section 4.3.2.
4.3.1 Likelihood Estimation
For the purpose of likelihood estimation, we collect two sets of training data.
* The set of features sampled in normal states, {1 k); k {1,..., Ko}}, where Ko is
(k) 0(k) j,(k) denotes the kth feature
the number of normal samples and 0k) )' i denotes te feature
measured at node i;
* The set of features sampled in abnormal states, {(k); k c {1,..., Ki}}, where K1 is
the number of abnormal samples and j { 'i; i E denotes the kth feature
measured at node i.
Gaussian mixture model. In order to effectively estimate the likelihood, we assume
that the random variables/vectors, ilti, i E follow a statistical distribution model.
Then, likelihood estimation translates to model parameters estimation. We establish the
statistical model in the following.
Because of its good properties, Gaussian (normal) distribution [30] is widely emploiv, 1
in many applications. The pdf of a ddimensional multivariate Gaussian distribution with
mean vector p and variance matrix E is
(X; (2 P, d 2 exp (x P)tl(x ) (4 23)
Figure 46 plots the pdf of the univariate Gaussian distribution AV (x; 0, 1). It is observed
that Gaussian distribution is a unimodal distribution[30], i.e., its pdf only has one peak.
processes, we transform P(t) to a discretetime sequence by applying sampling at
frequency F, = .
Because the signal defined in Equation (62) is represented as a summation of delta
functions, its spectrum spans the whole frequency domain. In order to correctly reshape
the spectrum of Ph(t) to avoid aliasing when it is sampled at interval T,, we apply a
low pass filter (LPF) characterized by its impulse response hLpF(t). Ph(t) can then be
mathematically described as following:
Ph(t) =P(t) hLpp(t)= Ph(t A). (63)
EFs
After sampling at interval T, we obtain the following discretetime sequence:
Pd(i) =Ph(iT) Ph(iT A), (64)
<'P,A> E5s
where i 1,..., Id A A + 1, Am is the arrival time of the last packet in the flow.
We note that the sampling interval T, cannot be arbitrarily chosen. If T, is too large,
then the spectrum of the flow F, contains information only related to low frequencies and
thus lacks of information related to the high frequency spectrum. On the other hand, if T,
is too small, then the length Id of the resulting discretetime sequence will be very large,
resulting in very high complexity in computing the PSD of Fs. After an extensive analysis
of widelyused voice and video applications such as Skype, MSN, and GTalk, we observed
that choosing T, = 0.5 milliseconds is sufficient to extract all useful information for our
purpose.
Next, we provide a methodology to extract the regularities residing in the signal P(t).
We achieve this by studying the extracted digital signal Pd(i) in the frequency domain
applying power spectral density analysis.
6.2.2 Power Spectral Density (PSD) Computation
Power spectral density definition.
Features.
To detect distributed TCP SYN attacks, we use the twoway matching features
described in '!O ipter 3, i.e., the number of unmatched inbound SYN packets in one
time slot. The parameter setting of twoway matching features extraction is same as in
Section 3.7.2. For convenience, we summarize the parameters in Table 43.
Table 43: Parameter setting of feature extraction for network anomaly detection
Notation Description
R 2480
1l 0.1.
/C 8
w 8
7 10 seconds
1i1. 215 bits
For comparison purpose, we also measure the number of SYN packets and SYN/FIN
ratio[11] in a slot.
4.5.2 Performance Comparison
Table 44: Performance of different schemes.
Feature Detection algorithm Detection probability False alarm probability
SYN/FIN ratio CUSUM 0.174 0.129
SYN CUSUM 0.52 0.129
SYN Machine learning 0.656 0.123
Unmatched SYN CUSUM 0.690 0.130
Unmatched SYN Machine learning 0.973 0.115
Table 44 compares the performance of different schemes, where the benchmark is
the scheme in [11], i.e., the CUSUM scheme with SYN/FIN ratio as the feature; for the
benchmark scheme, we use the same parameter setting as that in [11]; we compare the
benchmark with CUSUM and our machine learning algorithm under different features.
To make fair comparison, we make the false alarm probability of each scheme almost
the same and compare the detection probability. From Table 44, it can be seen that,
the benchmark scheme ('SYN/FIN ratio'+CUSUM) performs very poorly in detecting
low volume DDoS attacks. In contrast, a CUSUM algorithm with the number of SYN
achieve 1C(,' detection rate of both voice and video in the case of singletyped flow, e.g.
one application per 5tuple, and 98.,' and 94.>' respectively for voice and video when
dealing with the more complex scenario of hybrid flows, e.g. voice, video and file transfer
bundled together in the same 5tuple flow.
The rest of the chapter is organized as follows. In Section 5.2, we introduce the
related work in the area of pattern classification methodologies. Section 5.3 describes
the weaknesses of metrics previously used by other works when applied in our context
and highlights the new traffic features that constitute the foundation of our approach.
Section 5.4 summarizes this chapter.
5.2 Related Work
Existing work on traffic classification uses discriminating criteria such as the packet
size distribution per flow, the interarrival times between packets within the same flow
and other statistics captured across multiple flows. For example, in Ref. [49], the authors
proposed the combination of average packet size within a flow and the interarrival
variability metric, e.g. defined as the ratio of the variance to the average interarrival
times of packets within a flow, as a powerful metric to define fairly distinct boundaries for
three groups of applications: (i) bulk data lii.r er like FTP, (ii) interactive like HTTP,
and (iii)streaming like voice, video, ,nii.r etc. Several classification techniques, like
nearestneighbor and Knearestneighbor, were then tested using the above traffic features.
Although this preliminary study has proved that the approach of pattern classification has
great potential for a proper application classification, it proves that much more work still
remains, e.g. exploiting other alternative for traffic features and classification techniques.
Moreover, although the features extracted are simple and feasible to be implemented
onthefly, the learning algorithm is complex and the outcome boundaries among the three
families of applications are heavily nonlinear and timedependent.
Similar to Ref. [49], Karagiannis et al.[50] proposed a novel approach, called BLINC,
that exploits networkrelated properties and characteristics. The novelty of this approach
REFERENCES
[1] P. Mockapetris, "Domain names concepts and facilities," RFC 1034.
[2] P. Mockapetris, "Domain names implementation and specification," RFC 1035.
[3] "Video on demand," Wikipedia. [Online]. Available: http://en.wikipedia.org/wiki/
Video_on_demand
[4] D. Wu, Y. T. Hou, W. Zhu, Y.Q. Z!i ii and J. M. Peha, "Streaming video over
the internet: Approaches and directions," IEEE Trans. Circuits Syst. Video Technol.,
vol. 11, pp. 282300, Mar. 2001.
[5] K. Nichols, S. Blake, F. Baker, and D. Black, "Definition of the differentiated services
field (ds field) in the ipv4 and ipv6 headers," RFC 2474.
[6] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss, "An architecture
for differentiated services," RFC 2475.
[7] P. Almquist, "Type of service in the internet protocol suite," RFC 1349.
[8] S. S. Kim, A. L. N. Reddy, and M. Vannucci, "Detecting traffic anomalies using
discrete wavelet transform," in Proceedings of International Conference on Information
Networking (ICOIN), vol. III, Busan, Korea, Feb. 2004, pp. 13751384.
[9] C.M. C(! i;,, H. T. Kung, and K.S. Tan, "Use of spectral analysis in defense against
dos attacks," in Proceedings of IEEE Globecom 2002, vol. 3, Taipei, Taiwan, Nov.
2002, pp. 21432148.
[10] A. Hussain, J. Heidemann, and C. Papadopoulos, "A framework for classifying denial
of service attacks," in Proceedings of AC'If SIGCOMM, Karlsruhe, Germany, Aug.
2003.
[11] H. Wang, D. Z!h Ii: and K. G. Shin, "Detecting SYN flooding attacks," in Proc. IEEE
INFOCOM'02, New York City, NY, June 2002, pp. 15301539.
[12] T. Peng, C. Leckie, and K. Ramamohanarao, "Detecting distributed denial of service
attacks using source IP address monitoring," Department of Computer Science and
Software Engineering, The University of Melbourne, Tech. Rep., 2002. [Online].
Available: http://www.cs.mu.oz.au/~tpeng
[13] R. B. Blazek, H. Kim, B. Rozovskii, and A. Tartakovsky, "A novel approach to
detection of "denialofservice" attacks via adaptive sequential and batchsequential
changepoint detection methods," in Proc. IEEE Workshop on Information Assurance
and S.. ;,, West Point, NY, June 2001, pp. 220226.
[14] S. Mukkamala and A. H. Sung, "Detecting denial of service attacks using support
vector machines," in Proceedings of IEEE International Conference on F;,..; S.l,/; ii
May 2003.
Yet, the emergence of a bloom of new zerodv voice and video applications such as
Skype, Google Talk, and MSN poses tremendous challenges for ISPs. The traditional
approach of using port numbers to classify traffic is infeasible due to the usage of
dynamic port number. In the second part of our research, we focus on a statistical
pattern classification technique to identify multimedia traffic. Based on the intuitions that
voice and video data streams show strong regularities in the packet interarrival times
and the associated packet sizes when combined together in one single stochastic process,
we propose a system, called VOVClassifier, for voice and video traffic classification.
VOVClassifier is an automated selflearning system that classifies traffic data by extracting
features from frequency domain using Power Spectral Density analysis and grouping
features using Subspace Decomposition. We applied VOVClassifier to real packet traces
collected from different network scenarios. Results demonstrate the effectiveness and
robustness of our approach.
simulation results
numerical results
2 3 4 5 6 7
Average number of hash function calculations
simulation results
numerical results
2 3 4 5 6 7
Average number of hash function calculations
Figure 315. Comparison of numerical and simulation results. (a) Hash table algorithm.
(b) BFA algorithm with =1.
equally likely to be accessed. Hence, Equation (32) does not hold perfectly, neither does
Equation (33). As a result, the average number of hash function calculations per query in
simulation is larger than that predicted by Equation (36).
Figure 315(b) shows that the numerical result agrees well with the simulation result
for all the values of the average number of hash function calculations per query under our
study.
3.7.2 Experiment on Feature Extraction System
In this section, we show the performance of the complete feature extraction system
implemented on traffic monitors and local analyzers, which uses the BFA algorithm.
The reason of conducting this experiment is that we would like to know the
performance of the whole hierarchical feature extraction architecture presented in
Section 3.2. In contrast, the experiment in Section 3.7.1 does not involve the interaction
among the three levels.
Experiment settings. We use the trace data provided by Auckland University [27]
as the background traffic. This data set consists of packet header information of traffic
between the Internet and Auckland University. The connection is OC3 (155 Mb/s) for
both directions.
BIOGRAPHICAL SKETCH
Jieyan Fan was born on Jul 26, 1979 in Shanghai, ('!i i, The only child in the family,
he grew up mostly in his home town, graduating from the High School Affiliated to Fudan
University in 1997. He earned his B.S. and M.S. in electrical engineering from Shanghai
Jiao Tong University, Shanghai, C('l, in 2001 and 2004, respectively. He is currently
a Ph.D. candidate with electrical and computer engineering, University of Florida,
Gainesville, FL. His research interests are network security and pattern classification.
Upon completion of his Ph.D. program, Jieyan will be working in Yahoo! Inc,
Sunnyvale, CA.
The rest of this chapter is organized as the following. Sections 6.2, 6.3, and 6.4
describe the components, Feature Extractor, Voice / Video Subspace Generator, and Voice
/ Video Classifier, respectively. In Section 6.5, we conduct experiments on traffic collected
between two universities using Skype, MSN, and GTalk. Section 6.6 summarizes this
chapter.
6.2 Feature Extractor (FE) Module via Power Spectral Density (PSD)
As explained in Section 5.3 the extraction and processing of simple traffic features
does not solve the problem of detecting and separating voice and video data streams from
other applications. In this section we first introduce the preliminary steps that we take to
transform each generic flow Ts obtained from FSG into a stochastic process that combines
the interarrival times and packet sizes. Then we describe how to use power spectral
density (PSD) analysis as a powerful methodology to extract such hidden key regularities
residing in realtime multimedia network traffic.
6.2.1 Modeling the network flow as a stochastic digital process
Preprocessing
________, 1, ___]LPF ] Th
Stochastic aqtF LPF Sample Estimate
Model hut Ts PSO
Figure 62. Power spectral density features extraction module. Cascade of processing
steps.
Each flow Fs extracted from the FSG is forwarded to the FE module that applies
several steps in cascade (Figure 62). First, any Fs extracted (see Equation (61)) is
modeled as a continuous stochastic process as illustrated in Equation (62):
P(t)= P6 (t A), (6 2)
< P,A> Es
where 6(.) denotes the delta function. As the reader can notice, our model combines
packet arrival times and packet sizes together as a single stochastic process. Because
digital computers are more suitable to deal with discretetime sequences than continuoustime
4. combining packet interarrival time and packet size in frequency domain.
It turns out that the last one is the most distinctive feature for classifying voice and
video traffic.
We then presented the VOVClassifier system to classify voice and video traffic.
VOVClassifier is composed of four 1i i, ri" modules that operate in cascade, flow summary
generator, feature extractor, voice / video subspace generator, and voice / video classifier.
The novelty of VOVClassifier is that
* we combine packet interarrival times and packet sizes of a network flow and model it
by a stochastic process;
* we estimate PSD feature vector to extract regularities residing in voice and video
traffic;
* we use minimum coding length to decompose subspaces from the training feature
vectors and principal component analysis to identify bases of each subspace;
* we use minimum distance to subspaces as the similarity metric to perform classification.
The experiment results demonstrate the effectiveness and robustness of our approach.
the results of which are not shown here due to the space limit. The machine learning
algorithm is shown to be robust under realistic timevarying traffic patterns such as
the Auckland data traffic [27]. We tested our machine learning algorithms for a large
IP address space, i.e., the IP address space can be the whole IP address space for the
Internet.
4.6 Summary
In this chapter, we propose a novel machine learning detection algorithm, based on
B i, i in decision theory and hidden Markov tree model. The key idea of our algorithm is
to exploit spatial correlation of network anomalies. Our detection scheme has the following
nice properties:
* In addition to detecting network anomalies having highdatarate on a link, our
scheme is also capable of accurately detecting attacks having lowdatarate on
multiple links. This is due to exploitation of spatial correlation of network anomalies.
* Our scheme is robust against timevarying traffic patterns, owing to powerful machine
learning techniques.
* Our scheme can be deploy, 1 in largescale highspeed networks, thanks to use of
Bloom filter array to efficiently extract features.
With the proposed techniques, our scheme can effectively detect network anomalies
without modifying existing IP forwarding mechanisms at routers. Our simulation results
show that the proposed framework can detect DDoS attacks even if the volume of attack
traffic on each link is extremely small (i.e., 1 .). Especially, for the same false alarm
probability, our scheme has a detection probability of 0.97, whereas the existing scheme
has a detection probability of 0.17, which demonstrates the superior performance of our
scheme.
Similar to Sections 4.2.1 and 4.2.2, we employ maximum gain criterion (see
Equation (49)) to estimate node states, i.e.,
ui = arg max ((u)P(Q = iu)
Sarg max I (u/,)P(A,'= ui/,QP(i,) = up(i),/)
i^ T e' \{iR(i)
SI (;T,,) P(A ) W ,), (419)
for V i E L_1. Applying Viterbi algorithm to solving Equation (419), we reduce the
computation complexity from O (2") (see Section 4.2.2) to O (Ba). This is the ni ii r
advantage of introducing HMT model. The details are given in Section 4.4.
Solving Equation (419) requires knowledge of P(Q~ i p(i), for Vi E E \ Eo, and
P(j = usll), for Vi e Zo. By B ,, i o, formula[30],
P(Q= us 1 P ,,Q) U)p(\
P(li = Ui\l,(i) = Up(i), A) =  (420)
P(Qp(i) = Up(i) )
for Vi E E \ Eo. Therefore, solving Equation (419) translates to estimating
P(A0, Qp() ), Vi B \o, (4 21)
and
P(Q
Estimating Equations (421) and (422) in closed form is difficult. We proposed a
belief propagation (BP) algorithm, described in Section 4.3, to estimate them efficiently
given knowledge of
* prior probabilities: P(, = 0), i E Eo;
* likelihood: p(o( j ), i E E;
* transition probabilities: P(Q ISQ(i)), i E E \ o0.
have global information of the whole AS, detection approach employing spatial correlation
can only be deploy, ,1 in global analyzers.
When network anomaly happens, usually more than one edge router exhibits
abnormal symptoms. For example, when DDoS attacks are launched toward a victim
in an AS, the attack traffic enters the AS from multiple edge routers as the attack sources
are distributed. At each of those edge routers, the monitored traffic volume may be
low. That is, each traffic monitor or local analyzer observes a small deviation of traffic
features from normal distribution. However, the global analyzer, upon obtaining reports
from local analyzers, will observe small deviations of features from multiple edge routers
simultaneously. Employing spatial correlation contributes to low traffic network anomaly
detection.
Oi 02 a K
Figure 44. Generative dependent model that describes dependencies among edge routers.
Introducing the spatial correlation into the independent model in Figure 43 results
in the dependent model as illustrated in Figure 44. The difference between two models
is that, from the view point of a global analyzer, edge routers are no long independent.
As a result, statistical dependence among states of edge routers is represented by the
nondirectional connections. Note that the independent model can be regarded as a special
case of the dependent one. Also note that we still assume that features extracted from one
edge router are independent of the states of other edge routers.
007
 AUDIO
VIDEO
006
0 05 /
0 04
n
003
002
^ j I ) / \
0 01
0 100 200 300 400 500 600 700
Packet Size (Bytes)
Figure 53. Packet size distribution for voice and video traffic
traveling, packets might experience random d 1 iv due to congestion at routers' interfaces.
As a consequence, the interarrival times between packets at the receiver might be severely
affected by random noise, e.g. jitter, and thus this metric might not represent a reliable
candidate feature for a robust classification methodology. Although this problem does
exist, we note how the interarrival times between packets within the same flow still shows
a strong iiolai;, when studied in the frequency domain at the receiver side. As an
example, in Figure 52, we show the distributions of the interarrival packet times at the
receiver side when using Skype to transmit respectively voice only and video only between
two hosts, one located in University A in east coast and the other in University B in
west coast of USA. As we can see, the distributions for both video and voice are centered
around 0.03 second. On the other hands, Figure 53 shows the distributions of the packet
sizes for both voice and video. As you can see, both voice and video are characterized by
similar distribution for packet size less than 200 bytes. Although video traffic generates
packet size of larger than 200, these larger packets cannot be reliably used to separate
video from voice since other applications such as chat or file transfer might also generate
these larger packets. As a consequence, packet interarrival time or packet size is a weak
feature when considered in the temporal domain.
* extracting complicated features from traffic information obtained at a single edge
router;
* making decisions based on local traffic information (i.e., one edge router);
* reporting decisions, feature data, and summary of traffic information (if necessary) to
a global analyzer.
The local analyzer can utilize temporal correlation of traffic to generate feature data.
2.2.3 Global Analyzer
A global analyzer is responsible for:
* extracting complicated features that require global information, such as routing
information, from traffic;
* analyzing feature data obtained from multiple local analyzers; and
* making decisions with global information obtained from multiple edge routers.
Global Analyzer
'B
IX Y
Local Analyzer Local Analyzer
A
Figure 24. Example of .ivii, i ic traffic whose feature extraction is done by the global
analyzer.
The global analyzer has a global view of the whole network. Hence, it exploits
both temporal correlation and spatial correlation of traffic. Here it is important to note
that, some feature data must be obtained at the global analyzer if global information
is required. For example, in Figure 24, if the traffic from subnet A to server B passes
through edge router X, and the traffic from server B to subnet A passes through edge
As illustrated in Figure 61, we collect voice and video training flows during the
training phase. After processing the raw packet data through the feature extraction
module via PSD, one obtains two sets of feature vectors,
) { (1), (2),. .., (Ni)}, (636)
which is obtained through voice training data, where N1 is the number of voice flows; and
A (2) {' (t), (2), .. ,, (N2)} (637)
which is obtained through video training data, where N2 is the number of video flows.
To facilitate further discussion, let us also regard (i) as a M x Ni matrix, for i = 1, 2,
where each column is a feature vector. In other words, 1(i) e RMXN.
In this section, we present techniques to identify the low dimensional subspaces
embedded in RM, for both T1 and T2.
There are a lot of low dimensional subspace identification schemes, such as Principal
Components Analysis (PCA) [61] and Metric Multidimensional Scaling (Il )S) [62], which
identify linear structure, and ISOMAP [63] and Locally Linear Embedding (LLE)[64],
which identify nonlinear structure.
Unfortunately, all these methods assume that data are embedded in one single
lowdimensional subspace. This assumption is not alvx true. For example, as different
software uses different voice coding, it is more reasonable to assume that the PSD feature
vector of voice traffic is a random vector generated from a mixture model than a single
model. In such case, it is more likely that there are several subspaces in which the feature
vectors are embedded. The same holds for video feature vectors.
As a result, a better scheme is to first, cluster the trained feature vectors into several
groups, known as subspace decomposition; and second, to identify the subspace structure
of each group, known as subspace bases identification. We describe the two steps in the
following sections.
33 seconds for 30 frames/s; hence, the interarrival time between two packets in an Iframe
may span a large range, resulting in a flat PSD.
45
Trace 1
40 Trace 2
S35
S30
C
) 25
U 2020
20
10
5
0
0 10 20 30 40 50
Frequency (Hz)
Figure 58. Power spectral density of two sequences of continuoustime packet sizes for
voice traffic
100 200 300
Frequency (Hz)
Trace 1
Trace 2
400 500
Figure 59. Power spectral density of two sequences of continuoustime packet sizes for
video traffic
5.4 Summary
In this chapter, we motivated the importance and presented challenges faced by
network traffic classification, specifically, detecting and classifying voice and video traffic.
To those who sparked my interest in science, opening for me the door to discovering
nature and letting me walk through it in my own way.
Then for 'Y' C T, ,' e k after segmentation if and only if
k =arg max lnk. (650)
k
1. function MCLPartition(...)
2. Argument 1: set of feature vectors, T = (1), (N)
3. Return: partition of T,
II {Il,..., K; l1 U... U\ K Tt, ri n J 0 for Vi j,}
4. Initialization:
II= { (1)} (2)},. ..,{(N),
5. while true do
6.
(71,72) arg min C ( Uw ) (F ,*) (651)
7rl En,7rF EHn
7. if (TT U TT) (TT*, TT) > 0 then
8. break
9. else
10. II (II\{ri, r2}) U {ri U 72}
11. end if
12. end while
13. return TT
Figure 65. Pairwise steepest descent method to achieve minimal coding length.
There is no closed form solution for Equation (649). IT. .4[r, page41] proposed a
pairwise steepest descent method to solve it (Figure 65). It works in a bottomup way.
It starts with a partition scheme that assigns each element of T to a partition. Then, at
each iteration, the algorithm finds two subsets of feature vectors such that by merging
these two subsets, one can decrease the coding length the most (Equation (651)). This
procedure stops when no further decrease of coding length can be achieved by merging any
two subsets.
Using the above method, we obtain a partition of voice feature vector set T(1),
[15] S. Savage, D. Wetherall, A. Karlin, and T. Anderson, "Practical network support for
ip traceback," in Proc. of ACIf SIGCOMM'2000, Aug. 2000.
[16] A. Lakhina, M. Crovella, and C. Diot, "C('!i i:terization of networkwide anomalies
in traffic flows," in Proc. AC'I[ SIGCOMM Conference on Internet Measurement '04,
Oct. 2004.
[17] H. Wang, D. Zi ir. and K. G. Shin, "C'!I i,i:,.point monitoring for the detection
of dos attacks," IEEE Transactions on Dependable and Secure Conri'l.:,' no. 4, pp.
193208, Oct. 2004.
[18] J. Mirkovic and P. Reiher, "A i:;:nm, of ddos attacks and ddos defense mechanisms,"
in Proc. ACI[ SIGCOMM Computer Communications Review '04, vol. 34, Apr. 2004,
pp. 3953.
[19] J. B. Postel and J. Reynolds, "File transfer protocol," RFC 959, Oct. 1985. [Online].
Available: http://www.faqs.org/rfcs/rfc959.html
[20] K. Lu, J. Fan, J. Greco, D. Wu, S. Todorovic, and A. Nucci, "A novel antiddos system
for largescale internet," in AC'I SIGCOMM 2005, Philadelphia, PA, Aug. 2005.
[21] B. H. Bloom, "Space/time tradeoffs in hash coding with allowable errors," Commun.
AC(M, vol. 13, no. 7, pp. 422426, July 1970.
[22] L. Fan, P. Cao, J. Almeida, and A. Z. Broder, "Summary cache: A scalable widearea
web cache sharing protocol." IEEE/ACM11 Trans. Netw., vol. 8, no. 3, June 2000.
[23] F. C'!I i!;, W. chang Feng, and K. Li, "Approximate caches for packet classification,"
in IEEE INFOCOM 2004, vol. 4, Mar. 2004, pp. 21962207.
[24] R. Rivest, "The md5 messagedigest algorithm," RFC 1321, Apr. 1992. [Online].
Available: http://www.faqs.org/rfcs/rfcl321.html
[25] MD5 CRYPTO CORE FAMILY, HDL Design House, 2002. [Online]. Available:
http://www.hdldh.com/pdf/hcr_7910.pdf
[26] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The
Hardware/Software In.I, li ., San Francisco, CA: Morgan Kaufmann, 1998, ch. 5,6.
[27] "AucklandIV trace data," 2001. [Online]. Available: http://wand.cs.waikato.ac.nz/
wand/wits/auck/4/
[28] L. L. Scharf, Statistical i..i:,l processing: detection, estimation, and time series
analysis. Addison Wesley, 1991.
[29] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Cla1..:;.'.n, 2nd ed.
WileyInterscience, Oct. 2000.
[30] G. Casella and R. L. Berger, Statistical Inference, 2nd ed. Duxbury Press, June 2001.
CHAPTER 7
CONCLUSION AND FUTURE WORK
7.1 Summary of Network Centric Anomaly Detection
In the first part of our study, we presented our achievement on network centric
anomaly detection. We first proposed a novel edgerouter based framework to robustly
and efficiently detect network anomalies in the first place they happen. The key idea is
to exploit spatial and temporal correlation of abnormal traffic. The motivation of our
framework design is to use both spacial and temporal correlation among edge routers. The
framework consists of three types of components (i.e., traffic monitors, local analyzers,
and a global i, &iv. r). Traffic monitors summarize traffic information on each single link
between edge router and user subnet. Local analyzers collect information on edge routers,
which is provided by traffic monitors, and reports to the global analyzer. The global
analyzer has a global view of the whole autonomous system and makes final decision. The
advantages of our framework design are the following.
1. It is deploy, .1 on edge routers instead of systems of end users, such that it can detect
network anomalies in the first place they enter an AS.
2. It has no burden on core routers;
3. It is flexible in that detection of network anomalies can be made both locally and
globally;
4. It is capable of detecting low volume network anomalies accurately by exploiting
spatial correlations among edge routers.
We then presented feature extraction for network anomaly detection. Based on
the framework, we designed the hierarchical feature extraction architecture. Different
components extract different features. For example, traffic monitors can extract features
such as packet rate, data rate, and SYN/FIN ratio. Local analyzers are able to extract
features such as SYN/SYNACK ratio, roundtrip time, and twoway matching features
on one edge router. The global analyzer can extract twoway matching features from
the whole autonomous system. Specifically, we focus on the novel type of features,
6.5.2 Skype Flow Classification
In this section, we conduct experiments on Skype traffic. We first consider the
scenario when each Skype flow carries one type of traffic. In other words, in this set of
experiments, one flow is of type VOICE, VIDEO, or none of the above.
Figure 68 shows the ROC curves of classifying voice and video flows.
1 1
099 099
098 098
097 097
096 096
0 95 6 095
094 094
093 093
092 092
091 091
09 09
0 02 04 06 08 1 0 02 04 06 08
PFA PFA
(a) (b)
Figure 68. The ROC curves of singletyped flows generated by Skype, (a) VOICE and (b)
VIDEO.
We then conduct experiments on hybrid Skype flows. In other words, each flow
may be of type VOICE, VIDEO, FILE+VOICE, FILE+VIDEO, or none of the above.
Figure 69 plots the ROC curves of these five types, respectively.
6.5.3 General Flow Classification
Now, let us do the same experiments on network traffic generated by Skype, MSN,
and GTalk, as these are very common applications at present. In other words, a voice flow
now can be a VoIP flow generated by Skype, MSN, or GTalk. So are video flows. Note
that, GTalk does not support video conference. Similarly, two sets of experiments are
conducted, one on singletyped flows and the other on hybrid flows.
which must be kept powerful enough to perform concurrent analysis of a large number
of flows while applying techniques to search very complex protocol signatures that might
require processing of a large chunk of packet 1 ivload. The proliferation of proprietary
protocols, coupled with the growing trend in the usage of encryption techniques to ensure
data confidentiality, makes this approach infeasible. For example, Skype does not run on
any standard port, but randomly selects ports for its communication and use either the
TCP, or UDP or both for the data transfer. Furthermore, its use of a 256bit encryption
algorithm and no visibility into neither the algorithm nor its keys makes its detection
even harder. All the above makes the general problem of the detection of VoIP and video
datastreams over IP challenging and yet it is of huge business interest.
A new emerging tendency in the research community to approach this problem is
to rely on pattern classification techniques. This new family of techniques formulate
the application detection problem as a statistical problem that develop discriminating
criteria based on statistical observations and distributions of various flow properties in the
packet traces. A few papers [49, 50] have taken this statistical approach to classify traffic
into p2p, multimedia 1r ii:1i1 interactive application, and bulk transfer application.
Unfortunately, although these papers addressed the problem of distinguishing multimedia
traffic from other applications, they have not addressed the problem of distinguishing
voice traffic from video traffic. One problem is to separate streaming traffic from other
applications, and a different problem is to detect and correctly classify voice and video and
clearly separate the two applications from each other. In the extreme case, voice and/or
video data streams might even be bundled together in the same exact flow with other
applications. These problems are common for many applications like Skype, Gtalk and
MSN that allow users to mix voice and/or video streams with chat and/or file transfer
traffic in the same exact 5tuple flow, defined as source IP address, destination IP address,
source port, destination port, and protocol numbers. In such cases, one flow may carry
extracted by global analyzers when an AS is not symmetric. However, the feature
extraction approaches used by local analyzers and global analyzers are same.
3.4 Basic Algorithms
This section presents two basic algorithms to process and store the twoway matching
features, namely, the hash table algorithm and the Bloom filter algorithm.
3.4.1 Hash Table Algorithm
The general procedure to extract the twoway matching features from traffic at an
local analyzer is:
1. The local analyzer maintains a buffer in memory;
2. When the traffic monitor captures an inbound packet, if its inbound signature is not
in the buffer, the local analyzer creates one entry for its signature and set the state of
that entry to "UNMATCHED";
3. When the traffic monitor captures an outbound packet, if its outbound signature is in
the buffer, the local analyzer sets the state of that entry to \ IATCHED";
4. At time ti+l, the local analyzer assigns the number of entries with state "UNMATCHED"
to D(ti).
So typically we need three operations: insertion, search and removal4
A basic algorithm to do this is to use a hash table. Suppose the signature extracted
from a packet is b bits long. We organize the buffer into a table, V, with f cells of b + 1
bits each. The extra one bit is the state bit. We also have C hash functions hi:S Z+ ,
where i E Zc {0, 1,..., /C 1}, and S is the data set of interest, e.g., signature domain.
The symbol Z stands for the set {0,..., 1}, where is an integer.
The operations of hash table algorithm are listed in Figure 35, where the argument s
is the signature extracted from a packet.
4 Setting the state to \! ATCHED" is actually the removal operation.
CHAPTER 1
INTRODUCTION
Over the past few years, the Internet infrastructure has become a critical part of the
global communications fabric. A survey by the Internet Systems Consortium (ISC) shows
that the number of hosts advertised in domain name system (DNS)[1, 2] has risen from
approximately 9,472,000 in January 1996 to 394,991,609 in January 2006. In addition, the
emergence of new applications and protocols, such as voice over Internet Protocol (VoIP),
peartopear (P2P), and video on demand (VoD)[3], also increases the complexity of the
Internet. Accompanying this trend is an increasing demand for more reliable and secure
service. A 1i ir challenge for Internet service providers (ISP) is to better understand the
network state by analyzing network traffic in real time. Thus ISPs are very interested in
the problem of network centric traffic analysis.
We consider the network centric traffic analysis problem from two perspectives: 1)
network anomaly detection and 2) network centric traffic classification. We introduce the
two perspectives in the next two sections.
1.1 Introduction to Network Anomaly Detection
With the rapid growth of Internet, detection of network anomalies becomes a i, i i 1r
concern in both industry and academia since it is critical to maintain availability of
network services. Abnormal network behavior is usually the symptom of potential
unavailability in that:
* Network anomaly is usually caused by malicious behavior, such as denialofservice
(DoS) attacks, distributed denialofservice (DDoS) attacks, worm propagation,
network scans, or email spams;
* Even if it is caused by unintentional reasons, network anomaly is often accompanied
with network congestion or router failures.
However, detecting network anomalies is not an easy task, especially at highspeed
routers. One of the main difficulties arises from the fact that the data rate is too high to
afford complicated data processing. An anomaly detection algorithm usually works with
and EF the power of noise of the AR(p) signal. Thus, one can rewrite Equation (615) as
RP+1[
function LDA(...)
Argument 1: data, {y(i)} i1
Argument 2: order, p.
Return: parameters of AR(p) model, a, and F2
p,
I
r(k; y) = y7iS*
i=k+l
r = al
2
I r(0; y)
k), for Vk 0, ,...,p
r(1; y)
r (0; y)
Ir(1; y) 12
r(0; y)
A
rt A
at
rt+1
[r*(t; ),r*(t ; y),.. r*(l;y)]T
[at, at_, al]T
r(t + 1; y) + rtat
(625)
(626)
(627)
(628)
(629)
i2 (2 t l / 2' )
Ct+l C t t+
at t1 / 2
at+ = o + rt+l 1
10 ondr for
Figure 63. LevinsonDurbin Algorithm.
Figure 63 gives the LDA algorithm to estimate coefficients of AR(p) model given
data {y(i)} i.
1 E2
dp 0
(621)
(622)
(623)
(624)
8. for t < 1,...,p
Now define variable S(ti) by
0 i=0
S(ti) (41)
max(0,S(tji_) + )(ti)) i > 0
In the CUSUM algorithm, if S(ti) is smaller than a threshold 'HCUSUM, declare that
the network state is normal; otherwise, declare that the state is abnormal.
From the discussion above, we note that two parameters, i.e., a and 'HCUSUM, need
to be determined. However, we cannot uniquely determine these two parameters. To
overcome this problem, we shall introduce another parameter, i.e., the detection delay,
denoted as V. According to the changepoint theory, we have
V 1 1
S(42)
'HCUSUM (oa nr) n a o aa
From Equation (42), we can obtain
'HcusuM x (.a a). (43)
Hence, once V and a are given, we can determine 'HCUSUM through Equation (43).1
Given a and 'HCUSUM, we can use the CUSUM algorithm to detect network anomaly.
We notice that the existing CUSUM algorithms [1113] only consider one change,
i.e., from the normal state to the abnormal state. In practice, this approach may lead to
a large number of false alarms after the end of attacks. To mitigate the high false alarm
issue of the existing algorithms, which we call singleCUSUM algorithms, we develop a
dualCUSUM algorithm. In this algorithm, one CUSUM will be used to detect the change
from the normal to the abnormal state, while another CUSUM is responsible for detecting
the change from the abnormal to the normal state. The method of setting parameters for
1 In Refs. [11] and [12], a = (4a ,)/2; thus only the detection delay is needed.
Although it does not guarantee to find the global optimal solution to Equation (439),
Viterbi algorithm is efficient and has good performance empirically.
The computation complexity of Viterbi algorithm is O (Ba), much better than
the dependent model, as illustrated in Figure 44, whose complexity is O (2"). The
performance improvement results from the fact that Viterbi algorithm does not exhaustively
test all possible node state combinations. Instead, we decompose the decoding problem
into multiple stages, each of which decodes node states at one level in HMT. At level 1, we
estimate node states of level 1, i.e., {ji; i E EE}, based on results obtained during previous
stages, i.e. {Qj; i E {Bo0,..., 1i}}. In such a procedure, each node is only accessed for
twice. Hence the complexity is linear to the number of nodes, which is
1 1 L
Spose sl< B aa.
1 B1 1 B1
Till now, we described the HMT model, including its parameter training and
classification approaches. Next, we show the simulation results of applying the HMT
model to network anomaly detection.
4.5 Simulation Results
In this section, we evaluate the performance of the proposed schemes through
simulation.
4.5.1 Experiment Setting
In our study, we develop a testbed to 1) extract various feature information, and 2)
analyze feature data by our machine learning algorithm and CUSUM algorithms. Next,
we describe the setting for networks, traffic traces, and feature extraction used in our
experiments.
Network.
In our experiment, we assume that the ISP network consists of a core AS, a victim
subnet, and 16 edge routers that connect to 16 subnets, as illustrated in Figure 412. At
each edge router, two monitors are placed to measure the inbound and outbound traffic
6.3.1 Subspace Decomposition Based on Minimum Coding Length
The purpose of subspace decomposition is to partition the data set
T= {(1), (2),..., ^(N)} (638)
into nonoverlapping K subsets such that
S=iU T2 U .. U 'K. (639)
IH.' [rf] proposed a method to decompose subspaces according to the minimum coding
length criteria. The idea is to view the data segmentation problem from the perspective of
data coding/compression.
Suppose one wants to find a coding scheme, C, which maps data in T E RMxN to
bit sequence. As all elements are real numbers, infinite long bit sequence is needed to
decode without error. Hence, one has to specify the tolerable decoding error, C, to obtain a
mapping with finite coding length, i.e.,
C1 (~()) < 2, for Vn 1,...,N. (640)
Then the coding length of the coding scheme C is a function
Lc : RMxN _ Z+. (641)
It is proven [65] that the coding length is up bounded by
N + K K N K p N
c ) < = N l2 det + )+ o2 (642)
where
N
P =N () (643)
i= 1
4=[ (1) p,...J(N) (644)
Applying Equation (35) to Equation (34), we obtain the expectation of Th
1 R ,
E[Th] R 1 (3 6)
RF +t 11, n(b+ t1)'
Since the time to insert a signature into or remove a signature from a given entry is
much shorter than that to find the proper entry, the time complexities of insertion and
removal operations are almost the same as that of the search operation. Equation (36)
gives the space/time tradeoff (i.e., .1, vs. Th) of the hash table method.
Analysis for Bloom filter. First of all, we consider the space complexity of Bloom
filter. Denote by ./,, the length of the vector V used by Bloom filter (see Section 3.4.2).
The choice of ,. will affect the accuracy of the search function, BloomFilterSearch (see
Figure 36). The reason is the following.
When signatures of N flows are stored in V, 0, denoting the percentage of entries of
V with value 0, is
= C (37)
where KC is the number of hash functions. Assuming KC < 1 as is certainly the case, we
can approximate 0 as
O exp( I ). (38)
Function BloomFilterSearch(V, s) falsely identifies s to be stored in V if and only if
results of all KC hash functions point to bits with value 1, which is known as a collision.
Denote by T]N the collision probability under the condition that N flows have been
recorded. Then
= ( = exp  (39)
( I"'* /KN
CHAPTER 6
NETWORK CENTRIC TRAFFIC CLASSIFICATION SYSTEM
6.1 System Architecture
Voice Training
Flows Flow Summary P Feature Extractor Voice Subspace
Generator (PSD) Generator
Video Training
Flows Flow Summary Feature Extractor Voice Subspace
Generator (PSD) Generator
Training Phase
Raw Packets
Raw Packets Flow Summary_ Feature Extractor_ Voice/Video
Generator (PSD) Classifier
Classification Phase
Voice/Video/Other
Figure 61. VOVClassifier System Architecture
We first present the overall architecture of our system (VOVCl ,I... r Figure 61),
and provide a highlevel description of the functionalities of each of its modules. Generally
speaking, VOVClassifier is an automated learning system that uses packet headers from
raw packets collected off the wire, organize them into transport network flows and process
them in realtime to search for voice and video applications. VOVClassifier first trains voice
and video data streams separately before being used in realtime for classification. During
the training phase, VOVClassifier extracts feature vectors, which is a summary (also
known as a statistic) of raw traffic bit stream, and maintain their statistics in memory.
During the online classification phase, a classifier makes decision by measuring similarity
metrics between the feature vector extracted from onthefly network traffic and the
feature vectors extracted from training data. Flows with high values of similarity metric
with the voice (or video) features are classified as voice(or video); data streams with low
values of similarity with voice/video are classified as other applications.
In general, VOVClassifier is composed of four 1 i, Prw modules that operate in cascade:
(i) Flow Suni,,,;,,, e Generator (FSG),(ii) Feature Extractor (FE) via Power Spectral
resides in twofold. First, the authors shift the focus from classifying individual flows to
associating Internet hosts with applications, and then classifying their flows accordingly.
Second, BLINC follows a different philosophy from previous methods attempting to
capture the inherent behavior of a host at three levels: (i) social level, e.g. how each host
interacts with other hosts, (ii) functional level, e.g. role 1p1 i, d by each host in the network
as a provider of an application or a consumer of the application, and finally (iii) the
application level, e.g. ports used by each host during its communication with other hosts.
Although the approach proposed in [50] is interesting from a conceptual perspective and
proved to perform reasonably well for a variety of different applications, it is still prone
to large estimation errors for streaming applications. Moreover, its high complexity and
large memory consumption remains an open issue for highspeed application classification.
Other papers using pattern classification appeared lately in literature but more focused
on specific application detection like PeertoPeer [46] and chat [51]. More importantly, to
the best of our knowledge, none of the existing work has been able to separate voice traffic
from video traffic or to indicate the presence of voice traffic or video traffic in a hybrid
flow that contains traffic from both voice/video and other applications such as file transfer.
5.3 Intuitions Behind a Proper Detection of Voice and Video Streams
Generally 1'" i1:;i the problem of voice and video detection can be formulated as
a complex pattern Ia I .: /,l.:.n problem that has to deal with curse of i.:,,. ,:..:"',:,l.':l,
e.g. discrimination of voice and video data streams when dealing with hidden traffic
patterns and too many interrelated features. A critical step toward the solution is to
identify traffic features that correctly represent the characteristics of the data streams of
interest and uniquely isolate them from other applications. In order to achieve this, in this
section we start by showing how simple metrics presented in the past are not applicable
in our context and we conclude with some observations that constitute the essence of
our approach. In Figure 51 we show the results obtained when using the combination
of average packet size and the interarrival variability metric proposed by Roughan et
6.1.3 Voice/Video CLassifer (CL)
During the classification phase, the data are processed by one extra module, named
Voice/Video Cl. *.:,' r, that compares the feature vectors extracted from current data
streams entering the system to the voice and video subspaces generated during training
in order to classify the stream as voice, video or other. The problem of data stream
classification requires the implementation of a similarity metric. In literatures, there
are many similarity metrics. For example, Bi;, classifier uses cost function, and
nearestneighbor (1NN) and Knearestneighbor (KNN) use Euclidean distance. In
general, no similarity metric is guaranteed to be the best for all applications. For example,
Bi, classifier is applicable only when the likelihood probabilities are well estimated,
which requires the number of training samples to be much larger than the number of
feature dimensions. As a consequence, it is not suitable for classification based on an high
dimensional feature vector, such as the PSD feature vector. Furthermore, both 1NN and
KNN are proved to be optimal only under the assumption that data of the same category
are clustered together. Unfortunately, this is not alv, the case. We overcome the above
problem by employing a similarity metric based on the normalized distance from feature
vector representing the ongoing flow to the two subspaces obtained during training phase.
The subspace at minimum distance will be elected as candidate only if the distance is
below specific thresholds.
We conclude this section by highlighting one minor limitation of our approach. Our
system is unable to distinguish a flow containing video only from a flow containing video
packets pi.:ybacked by voice data (when video and voice applications are simultaneously
launched in Skype, voice data is pi.:.vbacked on video packets). This is because the
feature for video packets piggybacked by voice data is very similar to that for video only.
Hence, our traffic classifier will declare a flow containing video packets pi:2yvbacked by
voice data as "video".
NETWORK CENTRIC TRAFFIC ANALYSIS
By
JIEYAN FAN
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2007
The power spectral density of a digital signal represents its energy distribution in
the frequency domain. Regularities in time domain translate into dominant periodic
components in its autocorrelation function and finally to peaks in its power spectral
density.
For a general secondorder stationary sequence {y(i)}~ the power spectral density
(defined in [59]) can be computed as:
00
k=oo
where {r(k; y); k E Z} represents the autocovariance sequence of the signal {yi; i E Z}, i.e.,
r(k; y) = [y(i)y*( k)]. (66)
Although w in Equation (65) can take any value, we restrict its domain to be within
[T, 7) because i(j7; y) = i(w + 27; y).
According to Equations (65) and (66), the computation of the PSD for a digital
signal theoretically requires to have access to an infinite long time sequence. Since in
reality, we cannot assume to have infinite digital sequences at our disposal, we need to
face the problem on which technique can be used in our context to estimate the power
spectral density with an admissible accuracy. In literature, two different families of PSD
estimation are available: parametric and nonparametric. Parametric methods have
shown to perform better under the assumption that the underlying model is correct
and accurate. Furthermore, these methods are more interesting from a computational
complexity perspective as they require the estimation of fewer variables when compared
with nonparametric methods.
In our research, we employ parametric method to estimate PSD. The details are
presented in the next section.
PSD estimation based on parametric method.
The most commonly used approach to estimate model parameters is the maximumlikelihood
(\! I) method. Given the training features at node i with network state u, { l ,... ..)},
the ML method chooses the parameters to maximize
K.
p( (lk) i = u), (428)
k=1
where p(. = u) is given in Equation (426). Unfortunately, Nechyba[32] showed
that ML method for Gstate GMM with G > 1 has no closed form solution. In
addition, a Gstate GMM has a 3Gdimensional continuous parameter space. Exhaustive
searching numerical solution for ML estimate in such a parameter space is computational
intractable.
1. Input: k(k), k e {1,..., ; 7 (g), ) (g), and 0(g), g {1,..G}.
2. Output: ri,u(g), pi,(ug), and i,u(g), g c {1,..., G}.
3. j = 0.
4. repeat
9. jKu ( j
C=1" (,' )"
9. j j+ 1
10. until converge
11. iji,u(g) (g), k^,(g) iu (g), i,u(g) = )(g), Vg e {1,..., G}.
Figure 48. The EM algorithm for estimating p(oi I = u), i E t, u E {0, 1}.
A practical solution to this issue is the expectationmaximization (EM) algorithm[29,
30]. Nechyba[32] derived EM algorithm for GMM in detail. Figure 48 illustrates the
algorithm.
The EM algorithm requires initial values for the parameters, as denoted by (0)(g),
o)(g), and Eio(g) in Figure 48. At each iteration j, the EM algorithm uses parameters
estimated at iteration j 1 to calculate new estimates. Although both EM and ML
one to uncount it. In Function ProcOutbound, Line 13 starts a loop to iterate j' from j to
j w + 1. Condition 1 is checked in lines 14 to 15 and Condition 2 is checked in lines 16
to 17. Note that the loop exits (line 15) if RVj, contains s'; this is because an outbound
packet of the same flow arrived in that j'th slot and hence the buffer of the jth slot (for
each j < j') has already been checked.
Function Sample is to extract the twoway matching features. When we execute
Function Sample at the end of the jth slot (i.e., at time Tj+i), the output is D(rjw+i)
instead of D(j) since a time lag of F (w slots) is needed for twoway matching.
3.5.3 Round Robin Sliding Window
1. function ProcInbound(s)
2. a  false, b  false
3. if j',j' e {(I w + 1).,"',(I w + 2) .',....,I .,'},such that
BloomFilterSearch(RVjy,s) returns true then
4. a  true
5. if BloomFilterSearch(IVI,s) returns true then
6. b true
7. if a and b are both false then
8. C C +
9. BloomFilterInsert (IV,, s)
10. end if
11. end function
12. function ProcOutbound(s')
13. for j'  I to (I w + 1) .
14. if BloomFilterSearch(RVj,, s') returns true then
15. break
16. if BloomFilterSearch(IVjy, s') returns true then
17. CY < C 1
18. end for
19. BloomFilterInsert(RVI, s')
20. end function
21. function Sample()
22. I < (I + 1)
23. return CI
24. end function
Figure 39. Bloom Filter Array Algorithm using sliding window
The algorithm presented in Section 3.5.2 has a drawback in memory allocation.
Specifically, at epoch Tj+i, we sample D(rj_w+i), and then we need to throw away the
3.6.1 Space/Time Tradeoff
Space/time tradeoff for both Hash table and Bloom filter algorithms was 1i, lv. I 1 by
Bloom [21]. However, the analysis by Bloom[21] is not directly applicable to our setting
due to the following reasons:
1. A static data set was assumed by Bloom[21]. However, our feature extraction deals
with a dynamic data set, i.e., the number of elements in the data set changes over
time. Hence, new analysis for a dynamic data set is needed. In addition, Bloom[21]
only considered the search operation due to the assumption of static data sets. Our
feature extraction, on the other hand, requires three operations, i.e., insertion, search,
and removal, for dynamic data sets.
2. Bloom[21] assumed bitcomparison hardware in time complexity analysis. However,
current computers usually use word (or multiplebit) comparison, which is more
efficient than bitcomparison hardware. Hence, it is necessary to analyze the
complexity based on word comparison.
3. The time complexity obtained by Bloom[21] did not include hash function calculations.
However, hash function calculation dominates the overall time complexity, e.g.,
calculating one hash function based on MD5 takes 64 clock cycles [25], while one
wordcomparison usually takes less than 8 clock cycles [26].
For the above reasons, we develop new i ,1', i ; for the hash table and Bloom filter,
respectively. In addition, we analyze the performance of BFA and use numerical results to
compare the three algorithms. Table 32 lists the notations used in the analysis.
Table 32: Notations for complexity analysis
Notation Description
N Random variable representing the number of different flows recorded.
SEmpty ratio.
'q Collision probability, i.e., the probability that an item is falsely identified
to be in the buffer.
R Flow arrival rate, which is assumed to be constant.
Analysis for hash table. Denote by i the size of a hash table in bits (i.e., space
complexity) and by Th the random variable representing the number of hash function
calculations for an unsuccessful search (i.e., time complexity).
Let us consider search operation first. Upon the arrival of an inbound packet, the
HashTableSearch (see Figure 35) checks if its inbound signature s is in the table. Because
LIST OF FIGURES
Figure page
21 An ISP network architecture. .................. ...... 19
22 Network anomaly detection framework. .................. .... 19
23 Responsibilities of and interactions among the traffic monitor, local analyzer,
and global analyzer. .................. ... ......... 20
24 Example of .,vmmetric traffic whose feature extraction is done by the global
analyzer . .................... ................ 21
31 Hierarchical structure for feature extraction. ................ .... 24
32 Network in normal condition. ............... ....... 28
33 Sourceaddressspoofed packets. ............... ........ 29
34 Reroute ............... ................ .. 29
35 Hash Table Algorithm ............... .......... .. 33
36 Bloom Filter Operations ............... ........... .. 34
37 Scenarios of the problems caused by Bloom filter. (a) Boundary problem. (b)
An outbound packet arrives before its matched inbound packet with 2 tl < F. 34
38 Bloom Filter Array Algorithm .................. ......... .. 37
39 Bloom Filter Array Algorithm using sliding window .............. 38
310 Space/time tradeoff for the hash table, BFA with q = 0.1 and BFA with q =
1 ... ............... .................. .... .. 48
311 Relation among space complexity, time complexity, and collision probability.
(a) 3. : vs. q. (b) E [T,]* vs. . . . ... . . .. 50
312 Space complexity vs. collision probability for fixed time complexity. ...... ..52
313 Memory size (in bits) vs. average processing time per query (in ps) ...... ..53
314 Average processing time per query (in ps) vs. average number of hash function
calculations per query ................ .......... .. .. 54
315 Comparison of numerical and simulation results. (a) Hash table algorithm. (b)
BFA algorithm with rq1 .................. .......... .. 55
316 Feature data: (a) Number of SYN packets (link 1), (b) Number of unmatched
SYN packets (link 1), (c) Number of SYN packets (link 2), and (d) Number of
unmatched SYN packets (link 2). .................. ..... 58
1. function TransProbEstimate(. .)
2. Argument 1: likelihood, {p(Qk Q); ie E}.
3. Argument 2: training data, {(k); k E {1,..., K}}.
4. Return: transition probability estimate: P( ul f,()), iE \ So.
5. P(o)(Qi u l, u') = for V i eE \ o, V u, u' {0, 1}.
6. j = 0.
7. repeat
8. fork< 1 toK
9.
{P(')(Qp() (k)), p(+l(1) i, Q(i) (k));i e \ 0o}
BP ({PJ)(Q Q(); i e E \ Bo} {p(i D); i }, i f) (429)
10. end for
11. forVi E \ Eo
12.
PQ(i+1)( K1 P+I) (Ai, p(i) (k))
K( P(j+l). p(i) (k)
k1l
SP(J+I) (i P(), (430)
k1
13. end for
14. j<j+1
15. until converge
16. P(QiQP(i))=PF)(QiP(i)).
Figure 49. Iteratively estimate transition probabilities.
equally likely. Then, at each iteration, we update the estimate of transition probabilities
until it converges. The update procedure is the following.
First, we iterate the training feature set. For each feature, we use BP algorithm (see
Figure 410) to estimate the posterior probabilities given that feature. The details of the
BP algorithm is discussed in the next section.
Three sets of arguments are passed to the BP algorithm:
estimate of transition probabilities obtained at the previous iteration, {P()( (iR,)); i E \ Eo};
1. likelihood, {p( ijp); i E}, which is the argument passed to function TransProbEsti
mate;
In order to overcome the above problem, in this section we exploit different metrics
that might have great potential to serve our purpose: strong regularities of interarrival
times between packets within the same flow and packet sizes residing in voice and video
data streams. Specifically, we consider four types of metrics, i.e.,
1. packet interarrival time and packet size in time domain;
2. packet interarrival time in frequency domain;
3. packet size in frequency domain;
4. combining packet interarrival time and packet size in frequency domain.
These metrics are discussed later.
025
AUDIO
VIDEO
02
015
n
01
0 05
0 001 002 003 004 005 006
IAT (seconds)
Figure 52. Interarrival time distribution for voice and video traffic
5.3.1 Packet InterArrival Time and Packet Size in Time Domain
The intuitions behind such metrics reside in the observation that any protocol used
for voice and video applications specifies a constant time between two consecutive packets
at the transmitter side, also known as InterPacket Delay (IPD).For example, Table 51
lists some speech codec standards and the associated IPDs that are required for a correct
implementation of those protocols. Packets leaving the transmitter might traverse a
large number of links in the Internet before reaching the proper destination. Along this
the twoway matching features, proposed by us. This type of features uses both the
temporal and spacial information carried in network traffic. It is a very effective indicator
of network anomaly associated with spoofed source IP address. We designed a novel
data structure, referred to as Bloom filter array, to efficiently extract twoway matching
features. Different from the existing works, our data structure has the following properties:
1) i,',in.' Bloom filter, 2) combination of a sliding window with the Bloom filter, and
3) using insertionremoval pairs to enhance the Bloom filter with a removal operation.
Our analysis and simulation demonstrate that the proposed data structure has a better
space/time tradeoff than conventional algorithms.
In the end, we applied the machine learning technology to network anomaly detection.
Specifically, we used B li, i in, model to determine the state of each edge router, normal
or abnormal. Traditionally, edge routers are regarded as independent. It is incapable to
detect lowtraffic network anomaly. Straightforward improvement over this independent
model is to regard edge routers to be dependent on each other. However, this method has
an exponential time complexity to determine the edge router states, i.e., O (2"), where
r is the number of edge routers. We proposed the hidden Markov tree (HMT) to model
the correlations among edge routers. It takes advantages of employing dependence among
edge routers while has almost linear time complexity, i.e., O (Ba), where B is the number
of child nodes of each nonleaf node in the HMT. Our machine learning scheme has the
following nice properties:
* In addition to detecting network anomalies having highdatarate on a link, our
scheme is also capable of accurately detecting attacks having lowdatarate on
multiple links. This is due to exploitation of spatial correlation of network anomalies.
* Our scheme is robust against timevarying traffic patterns, owing to powerful machine
learning techniques.
* Our scheme can be deploy, 1 in largescale highspeed networks, thanks to use of
Bloom filter array to efficiently extract features.
and leveltwo filters are placed in traffic monitors. A feature extraction module can be
placed in either a traffic monitor or a local analyzer, depending on the type of the feature.
Levelone filters select a packet based on its sourcedestination pair, which is defined
by the source IP address (SA), the source network mask (SNM), the destination IP
address (DA) the destination network mask (DNM). For example, if we are interested in
packets from 172.10.5.28 to 210.33.68.102, we can choose 255.255.255.255 as both the SNM
and the DNM; if we are interested in packets from 172.10.x.x to 208.33.1.x, we can use
255.255.0.0 as the SNM and 255.255.255.0 as the DNM. In this way, we selectively monitor
an endhost or a subnet, giving much flexibility in framework configuration. The output of
a levelone filter is packets with the same sourcedestination pair, which are cor,:' .1 to
leveltwo filters.
A leveltwo filter classifies the packets coming from levelone filters, based on
the upper iv.r' data fields, e.g., TCP SYN or FIN. The packets of interest will be
forwarded to one or multiple feature extraction modules. For example, the number of
TCP SYN packets can be used to generate both the TCP SYN rate feature and the TCP
SYN/FIN(RST) ratio feature; hence, TCP SYN packets are conveyed to both the TCP
SYN rate module and the TCP SYN/FIN(RST) ratio module (Figure 31). On the other
hand, a feature module may need packets from multiple leveltwo filters. For example, the
SYN/FIN(RST) ratio feature extraction requires packets from three filters (Figure 31).
Compared to the packet classification schemes developed by Wang et al.[ll] and
Peng et al.[12], our hierarchical structure for feature extraction is more general and
efficient.
Next, we describe the most important module in the threelevel hierarchical structure,
the feature extraction module.
1 Here, the upper livr can be either Livr 4 or Lir 7.
When these flows are mixed together, classification accuracy is reduced. But the
accuracy reduction is acceptable. Specifically, for hybrid voice traffic at PFA w 0, PD is
reduced from 1 to 0.986; for hybrid video traffic, it is from 0.965 to 0.948.
This shows that our approach is robust. The robustness results from the fact that the
subspace identification module, as presented in Section 6.3, decomposes multiple subspaces
in the original highdimensional feature space. As a result, PSD feature vectors of Skype
and GTalk are likely to be within different subspaces than those of MSN. Therefore, we
can still classify traffic accurately.
6.6 Summary
In this chapter, we describe the VOVClassifier system to do network classification.
VOVClassifier is composed of four components, feature summary generator, feature
extractor, subspace generator, and voice/video classifier. The novelty of VOVClassifier is
* modeling a network flow by a stochastic process;
* estimating PSD feature vector to extract the regularities residing in voice and video
traffic;
* decomposing subspaces of training feature vectors followed by bases identification;
* using minimum distance to subspace as the similarity metric to perform classification.
Experiment results demonstrate the effectiveness and robustness of our approach.
Specifically, we show that classification of voice traffic is more accurate than that of video
traffic; classification of singletyped flows is more accurate than that of hybrid flows;
and classification of pure Skype flows is more accurate than that of flows generated from
multiple applications (e.g., Skype, MSN, and GTalk).
Bloom filter stores data in a vector V of M elements, each of which consists of one
bit. Bloom filter also uses IC hash functions hi:S v+ ZM, where i E c. Figure 36
describes the insertion and search operations of Bloom filter.
1. function BloomFilterInsert(V, s)
2. for Vi e Z;c do
3. V[hi(s)] 1
4. end function
5. function BloomFilterSearch(V, s)
6. for Vi E Z;K do
7. if V[hi(s)] / 1 then
8. return false
9. end for
10. return true
11. end function
Figure 36. Bloom Filter Operations
P p' P' p
ti t ti+1 t '2 ti+2 ti t 1 t '2 ti+1
(a) (b)
Figure 37. Scenarios of the problems caused by Bloom filter. (a) Boundary problem. (b)
An outbound packet arrives before its matched inbound packet with t2 t1 < F.
Although Bloom filter has better performance in the sense of space/time tradeoff, it
cannot be directly applied to our application because of the following problems:
1. Bloom filter does not provide removal functionality. Since one bit in the vector may
be mapped by more than one item, it is unsuitable to remove the item by setting all
bits indexed by its hash results to 0.
2. Bloom filter does not have counting functionality. Although the counting Bloom filter
[22] can be used for counting, it replaces a bit with a counter, which significantly
increases the space complexity.
[59] P. Stoica and R. Moses, Spectral A,i,.l.: of S.:i,.'l 1st ed. Upper Saddle River, NJ:
Prentice Hall, 2005.
[60] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms,
2nd ed. Duxbury Press, Sept. 2001.
[61] L. I. Smith, "A tutorial on principal components analysis," Feb. 2002.
[Online]. Available: http://www.cs.otago.ac.nz/cosc453/studenttutorials/
principal_components.pdf
[62] K. V. Deun and L. Delbeke, \!ill dimensional scaling," University of Leuven.
[Online]. Available: http://www.mathpsyc.unibonn.de/doc/delbeke/delbeke.htm
[63] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for
nonlinear dimensionality reduction," Science, vol. 290, pp. 23192323, Dec. 2000.
[64] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear
embedding," Science, vol. 290, pp. 23232326, Dec. 2000.
[65] W. Hong, "Hybrid models for representation of imagery data," Ph.D. dissertation,
University of Illinois at UrbanaC'l I' p ,in Aug. 2006.
10
'5 15
O 20
5 25
0 30
35 \ /\3 5 / \ F\ fx
40
45
0 02 04 06 08 1
Normalized Frequency (xTc rad/sample)
Figure 55. Power spectral density of two sequences of timevarying interarrival times for
video traffic
and/or combinations of the two. Its decoding needs reference to previously decoded
frames. Packets containing Iframes are larger than those containing Pframes. Usually,
the number of Pframes between two consecutive Iframes is constant. Hence, one can
observe a strong periodic variation of packet size due to the interleaving of Iframes and
Pframes composing video data streams. Voice streams have similar phenomenon if Linear
Prediction Coding (LPC), e.g., code excited linear prediction (CELP) voice coder, is
employ, ,1 As an example, Figs. 56 and 57 show the power spectral density of voice and
video packet sizes, respectively.
5.3.4 Combining Packet InterArrival Time and Packet Size in Frequency
Domain
Figs 58 and 59 show how the regularities hidden in voice and video data streams
can be amplified when combining the two features together in one single stochastic process
that will be described later in the paper. Note how the two important frequencies are
amplified and clearly visible in the PSD plots. The reason why there is a peak in the
PSD for voice (see Figure 58) is that voice applications usually produce closetoconstant
packet rate due to constant interpacket delay of the widely used speech codecs listed
in Table 51; e.g., the peak of 33 Hz in Figure 58 corresponds to 30 ms interpacket
CHAPTER 2
NETWORK ANOMALY DETECTION FRAMEWORK
2.1 Introduction
The first issue of network anomaly detection is to design a framework. There are
two types of network anomaly detection frameworks, i.e., hostbased frameworks and
networkbased frameworks. Hostbased frameworks are deploy, .1 on endhosts. These
frameworks typically use firewall and intrusion detection systems (IDS), and/or balance
the load among multiple (geographically dispersed) servers to defend against network
anomalies. The hostbased approaches can help protect the server system; but it may not
be able to protect legitimate access to the server, because highvolume abnormal traffic
may congest the incoming link to the server.
On the other hand, networkbased frameworks are deploy, .1 inside networks, e.g.,
on routers. These frameworks are responsible for detecting network anomalies and
identifying abnormal packets/flows or anomaly sources. To detect network anomalies,
signal processing techniques (e.g., wavelet [8], spectral analysis [9, 10], statistical methods
[1113]), and machine learning techniques [14] can be used. To identify network anomaly
sources, IP traceback [15] is typically used. The IP traceback techniques can help contain
the attack sources; but it requires largescale deployment of the same IP traceback
technique and needs modification of existing IP forwarding mechanisms (e.g., IP header
processing).
This chapter presents our network anomaly detection framework, which is of the
networkbased category. We present our framework design in Section 2.2 and summarize
this chapter in Section 2.3.
2.2 EdgeRouter Based Network Anomaly Detection Framework
To detect network anomalies in an ISP network, we designed an edgerouter based
network anomaly detection framework. The motivation results from an ISP network
architecture (Figure 21). It consists of two types of IP routers, i.e., core routers and edge
YuleWalker method. Now, we describe methods to estimate coefficients of an AR
signal. Given the fact that q = 0, Equation (610) can be written as
P
y(i) + ,,(i t) =(i). (611)
t=1
Multiplying Equation (612) by y*(i k) and taking expectation at both sides, one obtains
P
r(k; y) + Y atr(k t; y) =E {E(i)y*( k)}. (612)
t= 1
Noting that
0 ifi k
E{ (i)y*( k)} (613)
E2 ifi k
one obtains the equation system
r(0; y) + Et, atr(t; y) 2 (
< (614)
r(k;y) + EY atr(k t;y) = 0, k 1,...,p.
Equation (614) can be rewritten in matrix form, i.e.,
r(0;y) r( ;y) r(p; y) 1 2
r(1; y) r(; y) al 0
(615)
r( ;) :
r(p; y) (. r(0;y) 0, O
Equation (615) is known as the YuleWalker method for AR spectral estimation[59].
Given data {y(i)} 1i, one first estimates the autocovariance sequence
r(k; y) 7= L =k+ y(i)y*(i k)
S, k = 0,...,p. (616)
(k; y) = *(k; y)
When {r(k; y); k = p,...,p} is replaced by its estimate {r(k; y); k = p,...,p},
Equation (615) becomes a system of p + 1 linear equations with p + 1 unknown variables,
Finally, we obtain the posterior probabilities by Equations (433) and (434). These
two equations are proven in Appendix A.3 and A.4, respectively. The estimated posterior
probabilities are used in Function TransProbEstimate to update estimates of transition
probabilities (see Figure 49).
Till now, we established the HMT model and described approaches to estimate
its model parameters from training data. Next, we present network anomaly detection
approaches using the fully determined HMT model.
4.4 Network Anomaly Detection Using HMT
In this section, we present the network anomaly detection using HMT model. This
is equivalent to a decoding problem in terms of pattern classification. That is, given an
observation sequence, i.e., extracted features ( {1(; iE E}, and a HMT model defined by
* prior probabilities: P(RQ), i E Bo;
* likelihood: p(Q ij), i E B;
* transition probabilities: P(QjI2,(i)), i E B \ "o,
we need to compute the "b' I state combination = {2i; i E E}. Here, the word
"b. I is in terms of maximum gain criterion, as illustrated in Equation (419). We
rewrite the criterion in Equation (439),
i = arg max f (u,)P(A, = u, I Qp(i) = p(i'), )
I IP(Q ) I"c ` (439)
for V i E B. Function ViterbiDecodeHMT, as illustrated in Figure 411, shows the
pseudocode for the classification algorithm.
Function ViterbiDecodeHMT takes three arguments. The first two arguments, the
transition probabilities and the likelihood, are estimated during training phase. The
last one is the extracted features, based on which we perform anomaly detection. It
returns the estimates of node states. Among them, we are only interested in states of leaf
2. current training feature, (k).
It returns estimate of posterior probabilities
P (+ O ( ) I(k)) p(j+) i p(i) I (k)) \ 70
With the posterior probabilities obtained through BP algorithm, we update the
estimates of transition probabilities by Equation (430) for V i E E \ Bo and step to the
next iteration. When the estimates converge, iteration stops and Function TransProbEsti
mate returns estimates obtained at the last iteration.
The validity of Equation (430) is shown in the following. For V i E E \ 0o,
P(Q, QP)) J P(QAI Q, = )p b)dQ
= f [P((Q, Q(), ))] (435)
where [.] represents statistical expectation and the subscript stands for the random
variable over which expectation is taken. Because sample average is alvx the best
unbiased estimate of statistical expectation[30], we estimate
[P((QI Q(,),))]
by
k 1 k 1
Combining Equations (435) and (436), we obtain Equation (430).
Next, we discuss the belief propagation algorithm, which is called by function
TransProbEstimate.
Belief propagation algorithm. The BP algorithm [3335], also known as the
sumproduct i,'lj .:thm, e.g., [3639], is an important method for computing approximate
marginal distributions. In this paper, we apply the BP algorithm to estimating posterior
probabilities (Figure 410).
m
60
c
o 50
 40
S30
20
10
0 02 04 06 08
Normalized Frequency (xrc rad/sample)
Figure 56. Power spectral density of two sequences of discretetime packet sizes for voice
traffic
80
Trace 1
75 Trace 2
m 70
6 65
C
a 60
55
50
45
S40
35
30
0 02 04 06 08
Normalized Frequency (xrc rad/sample)
Figure 57. Power spectral density of two sequences of discretetime packet sizes for video
traffic
delay. Compared to voice, video applications have a flatter PSD. The reason is as below.
The number of bits in an Iframe of video depends on the texture of the image (e.g.,
the Iframe of a blackboard image produces much less bits than that of a complicated
flower image), resulting in a large range in the number of packets in an Iframe, e.g., from
1 packet to a few hundred packets produced by an Iframe. The frame rate is usually
constant (e.g., 30 frames/s is a standard rate in USA), i.e., a frame will be generated every
6 BFA (=0 001)
3
S 2 3 4 5 6 7 8
Average number of hash function calculations
Figure 314. Average processing time per query (in ps) vs. average number of hash
function calculations per query.
processing time per query), in Section 3.6.1, we used the average number of hash function
calculations per query to represent the time complexity of the hash table algorithm and
the BFA algorithm.
Performance comparison between numerical and simulation results.
Figure 315 compares the simulations results and the numerical results obtained from
the analysis in Section 3.6, for both hash table algorithm and BFA algorithm in terms of
space complexity vs. time complexity.
In Figure 315(a), the numerical result agrees well with the simulation result, except
when the average number of hash function calculations per query is close to 1. From
Equation (3 6), if the expected number of hash function calculations approaches 1, the
required memory size approaches infinity; in contrast, simulations with a large i A, may
not give accurate results, due to limited memory size of a computer. This causes the big
discrepancy between the numerical result and the simulation result when the average
number of hash function calculations per query is close to 1. When the average number
of hash function calculations per query is greater than or equal to two, it is observed that
simulation ah,~ requires more memory than the numerical result. This is due to the
fact that practical hash function is not perfect. That is, entries in the hash table are not
or 95'. The algorithm returns 5 variables. j7 represents the sampled mean of all feature
vectors. It is the origin of the identified subspace. The columns of U are the bases with
dominant energy (i.e.,variance), whose corresponding variances are denoted by E. These
bases determine the identified low dimensional subspace spanned by T. The columns of
U compose the null space of the previous subspace, whose corresponding variances are E.
The last two outputs are required to calculate the distance of an ongoing feature vector to
the subspace, which will be described in Section 6.4.
Applying the function II. ,l:fyBases on all segmentations, we obtain
k ik k) IdentifyBases (T ) (656)
for Vk = 1,..., K, i = 1, 2. These are the outputs of subspace identification module, and
hence the results of training phase, in Figure 61.
During the classification phase, these outputs are used as system parameters, which
will be presented in the next section.
6.4 Voice/Video Classifier
In Section 6.3, we presented an approach to identify subspaces spanned by PSD
feature vectors of training voice and video flows. Specifically, one obtains the following
parameters:
[l~,^)j i) ,u ) U i)] ( 57)
pk Uk k k k (657)
for Vk = 1,...,K, i, = 1, 2. In this section, we use these parameters to do classification.
During the classification phase, for each ongoing flow F, one composes a subflow,
Fs, by extracting small packets, i.e., packets smaller than Op, and passes it through PSD
feature extraction module to generate PSD feature vector Q. This is the input to the
voice/video classifier.
The voice/video classifier works in the following way. It first calculates the normalized
distances between p and all subspaces of both categories. Then it chooses minimum
Subnet
Edge routers
SSubnet
S Autonomous
System
SSubnet
Core routers Subnet
Figure 21. An ISP network architecture.
routers. Core routers interconnect with one another to form a highspeed autonomous
system (AS). In contrast, edge routers are responsible for connecting subnets (i.e.,
customer networks or other ISP networks) with the AS. In this paper, a subnet can be
either a customer network or an ISP network.
Global Analyzer  Local Analyzer
ubne
Autonomous Systehinm
SLocal Analyzer
0 Traffic monitor
Directional link between autonomous system and subnet
Figure 22. Network anomaly detection framework.
CHAPTER 3
FEATURES FOR NETWORK ANOMALY DETECTION
3.1 Introduction
Given the network anomaly detection framework we have established, the second
issue of network anomaly detection is feature extraction. Features for network anomaly
detection have been studied extensively in recent years. For example, Peng et al.[12]
proposed the number of new source IP addresses to detect DDoS attacks, under the
assumption that source addresses of IP packets observed at an edge router were relatively
static in normal conditions than those during DDoS attacks. Peng further pointed out
that the feature could differentiate DDoS attacks from the flash crowd, which represents
the situation when many legitimate users start to access one service at the same time.
For example, when many people watch a live sports broadcast over the Internet at the
same time. In both cases (DDoS attacks and the flash crowd), the traffic rate is high. But
during DDoS attacks, the edge routers will observe many new source IP addresses because
attackers usually spoof source IP addresses of attacking packets to hide their identities.
Therefore, this feature improves those DDoS detection schemes that rely on traffic rate
only. However, Peng et al.[12] focused on detection of DDoS attacks. It did not mention
other types of network anomalies. For example, when malicious users are scanning the
network, we can also observe high traffic rate but few new source IP addresses. It is
very important to differentiate network scanning from flash crowd because the former is
malicious but the latter is not. The twoway matching feature on different network l1 iri
(Section 3.3.1) can tell not only the presence of network anomalies but also their cause.
Lakhina et al.[16] summarized the characteristics of network anomalies under different
causes. Its contribution is to help identify causes of network anomalies. For example,
during DDoS attacks, we can observe high bit rate, high packet rate, and high flow rate.
The source addresses are distributed over the whole IP address space. On the other hand,
during network scanning, all the three rates are high, but the destination addresses,
packets or the number of unmatched SYN packets as the feature can achieve much higher
detection probability. More importantly, our machine learning algorithm can significantly
outperform CUSUM, given the same feature data, no matter whether the feature is the
number of SYN packets or the number of unmatched SYN packets.
0.8
I 0.6
S 0.4
0.2 Threshold (SYN)
Machine Learning (SYN)
Threshold (UMSYN)
0 Machine Learning (UMSYN)
0 0.2 0.4 0.6 0.8 1
False Alarm Probability
Figure 413. Performance of thresholdbased and machine learning algorithms with
different feature data
Figure 413 compares the ROC curve of the thresholdbased scheme described in
Section 4.1.2 and our machine learning algorithm under two different features, i.e., the
number of SYN packets (denoted by 'SYN') and the number of unmatched SYN packets
(denoted by 'UMSYN'). We observe that, for the same detection algorithm, using the
number of unmatched SYN packets can significantly improve the ROC performance,
compared to using the number of SYN packets. In other words, given the same false alarm
probability, the detection probability is much higher when using the number of unmatched
SYN as feature.
Another important observation from Figure 413 is that given the same feature data,
our machine learning algorithm can (significantly) improve the ROC, compared to the
thresholdbased scheme; e.g., for the same false alarm probability of 0.05, our machine
learning algorithm achieves a detection probability of 0.93, while the thresholdbased
scheme only achieves a detection probability of 0.72. This is due to the fact that our
1. function HashTableInsert(V, s)
2. for i  0 to C 1
3. if V[hi(s)] is empty
4. insert s to V[hi(s)], set state bit of V[hi(s)] to "UNMATCHED"
5. return
6. end if
7. end for
8. report insertion operation error
9. end function
10. function HashTableSearch(V, s)
11. for i  0 to C 1
12. if V[hi(s)] is empty
13. return false
14. if V[hi(s)] holds s
15. return true
16. end for
17. return false;
18. end function
19. function HashTableRemove(V, s)
20. for i  0 to C 1
21. if V[hi(s)] is empty
22. return;
23. if V[hi(s)] holds s
24. set state bit of V[hi(s)] to \lATCHED"
25. return true;
26. end if
27. end for
28. end function
Figure 35. Hash Table Algorithm
3.4.2 Bloom Filter
The hash table algorithm can be used for offline traffic a n i1 i; or analysis of low
datarate traffic but it cannot catch up with a high data rate at edge routers. To address
this limitation, one can use Bloom filter algorithm[21]. Compared to the hash table
algorithm, Bloom filter algorithm reduces space/time complexity by allowing small degree
of inaccuracy in membership representation, i.e., a packet signature, which does not
appear before, may be falsely identified as present.
methods scan the parameter space, EM works in a better way. It is proven that after each
iteration, EM algorithm guarantees to generate estimates of parameters which increase
Equation (428). As a result, EM algorithm converges much faster than numerical ML
method.
However, the disadvantage of EM algorithm is that it converges to a local maxima
rather than the global one. Specifically, initial values of parameters determine the local
maxima to which the EM algorithm converges. In practice, we have prior knowledge of
network features, which helps to choose initial values of parameters.
Till now, we present schemes to estimate likelihood of HMT. In next section, we
estimate transition probabilities.
4.3.2 Transition Probability Estimation
In this section, we estimate P(QilSp,()), i EE \ 0o. Since closed form representation of
the transition probabilities is not available, we also estimate them in an iterative way.
Denote by {Q(k); k {1,... K}} the set of training features for transition probability
estimation, where (k) 0(k); ci E }. Figs. 49 and 410 show the pseudocode for
transition probability estimation. In next two sections, we explain the two figures.
Iteratively estimate transition probabilities. Figure 49 shows the pseudocode
to estimate transition probabilities. The function TransProbEstimate takes three sets of
arguments:
1. likelihood estimated in Section 4.3.1, {p( Qj); i E B};
2. training features, {((k); k e {1,..., K}}.
It returns the estimate of transition probabilities, i.e.,
P(, Q~,(Q)),V i \ o.
Before the iterations, we set the initial transition probabilities to be 1 at line 5 of
Figure 49. This is equivalent to assume normal state and abnormal state to be initially
space/time tradeoff than the hash table. We also see that the curve of BFA with r = 1
is below the curve of BFA with l = 0.1 This shows the relationship between space/time
and collision probability. Specifically, to reach a lower collision probability or more
accurate detection, we need to either calculate more hash functions or use more storage
space.
To see the gain of using BFA, let us look at an example. Suppose E[T] = 5, i.e., in
each slot, 5 hash function calculations is needed on average. Then, the memory required
by the hash table scheme, BFA with q = 0.1 t and BFA with = 1. is 1.01G bits,
115.3M bits, and 62.9M bits, respectively. It can be seen that our BFA with = 1 can
save storage by a factor of 16, compared to the hash table scheme.
Figure 310 shows that for the hash table scheme, i. is a monotonic decreasing
function of E[Th]. The observation matches our intuition that the larger table, the smaller
collision probability for hash functions, resulting in less hash function calculations. Further
note that i., approaches RF(b + 1) when E[Th] increases. This is the minimum space
required to tolerate up to RF flows.
For BFA, .1, is not a monotonic function of E[T,], which approximately equals /C.
We have the following observations.
* Case A: For fixed storage size, the smaller IC, the larger the probability that all /C
hash functions of two different inputs return the same outputs, which is the collision
probability. In other words, the smaller IC, the larger storage size required to achieve
a fixed collision probability. That is, KC = i= M, T.
* Case B: Since an input to BFA may set KC bits to "1" in a vector V, hence the larger
IC, the more bits in V will be set to "1" (nonempty), which translates into a larger
collision probability. In other words, the larger IC, the larger storage size required to
achieve a fixed collision probability. That is, K/ T=> .l, T.
Combining Cases A and B, it can be argued that there exists a value of K/ or E[Ta]
that achieves the minimum value of i ,, given a fixed collision probability. This minimum
property can be used to guide the parameter setting for BFA, which will be addressed in
Section 3.6.2.
traffic from multiple types of applications (such as voice, video, chat, and file transfer),
referred to as i,;l';.: flow in the remainder of this paper.
Our research focuses on detecting and classifying voice and video traffic and further
deal with its more general formulation that considers the presence of hybrid flows.
Based on the intuitions that voice and video data streams show strong regularities in
the interarrival times of packets within the flow and the associated packet sizes when
combined together in one single stochastic process and analyzed in the frequency domain,
we propose a system, called VOVCla' '.,. r for voice and video traffic classification.
VOVClassifier is an automated selflearning system that is composed of four
ii i P.i modules operating in cascade. The system first is trained with voice and video
data streams and afterwards enters the classification phase. During the training
period, all packets belonging to the same flow are extracted and used to generate a
stochastic model that captures the features of interest. Then all flows are processed in
the frequency domain using Power Spectral D. u;.:/ (PSD) analysis in order to extract a
highdimensional space of frequencies that carry the 1 Pi ii y of energy of the signal. All
features extracted from each flow are grouped into a I. II ire vector". Due to the wide
usage of different codecs for voice and video, we propose a second module that clusters the
feature vectors into several groups using Subspace Decomposition (SD) and then identifies
its subspace structure, e.g. bases of the subspace, using Principal Component A,.l;.i..:
(PCA). These two steps are applied to all flows during the training period and produce
lowdimensional spaces, referred in the paper as voice subspace and video subspace.
After training, all flows are processed by the PSD module and the associated feature
vector is compared with the voice and video spaces obtained during training. The space at
minimum normalized distance from the feature vector is selected as candidate and chosen
if and only if its distance is below a specific predetermined threshold.
We applied VOVClassifier to real packet traces collected into two different network
scenarios. Results demonstrate the effectiveness and robustness of our approach, able to
1 r
095
09
085
08
075
07
065
06
055
05
0 02
095
09
085
08
075
07
065
06
055
05
0
04 06 08
PF
FA
(a)
02 04 06 08
09
08
07
06
05
04
03
02
0
0 02 04 06 08
09
08
07
06
05
04
03
02
01
0
0
02 04 06 08
PFA PFA
(c) (d)
Figure 611. The ROC curves of hybrid flows generated by Skype, MSN, and GTalk: (a)
VOICE, (b) VIDEO, (c) FILE+VOICE, and (d) FILE+VIDEO.
One application vs. multiple applications. One further notes that classification
of Skype flows is more accurate than that of flows generated from general applications, i.e.,
Skype, MSN, and GTalk.
Empirically, we found that Skype flows are similar to GTalk flows, but quite different
from MSN. For example, both Skype and GTalk voice flows have about 33millisecond
interarrival time, whereas MSN voice flow has approximately 25millisecond interarrival
time.
In this section, we use the formulae derived in above sections to compare the hash
table scheme with BFA algorithm through numerical calculations. The setting of our
numerical study is the following:
1. Traces captured from an ISP's edge router shows that the average number of
flows during one second is around 250, 000. So, we let R=250, 000. To reduce the
probability of false alarms caused by normal packets with long RTT, we choose F
large enough such that more than 9'' packets have RTT less than F. For the same
traces, F 80 seconds.
2. Suppose we want to detect TCP traffic anomaly. Thus the signature captured from
each packet is composed of 32bit SA, 32bit DA, 16bit SP, and 16bit DP. So b = 96
bits.
3. In the BFA algorithm, we use 40 time slots (i.e., w = 40), each of which is 2 seconds
(i.e., 7 2). Also suppose each counter is a 32bit integer (i.e., L = 32).
1011
Hash Table
 BFA ( = 0 001)
SBFA ( = 0 01)
1010
107
1 2 3 4 5 6 7 8 9 10
12345678910
Time E[T]
Figure 310. Space/time tradeoff for the hash table, BFA with = 0.1 and BFA with
Figure 310 shows M vs. E[T] for the hash table scheme, BFA with collision
probability 1 and BFA with collision probability 0.1 In Figure 310, X axis represents
the time complexity (i.e., the expected number of hash function calculations) and Y
axis represents the space complexity (i.e., the number of bits needed for storage). From
Figure 310, we can see that the curve of BFA is below the curve of the hash table. It
means BFA uses less space for a given time complexity. Therefore, BFA achieves better
=( 2Q1, 22, Q ), a (412)
u U( U2, U ,) (413)
1 Q02, )O (414)
where fi is the random variable representing state of edge router i, i E {1,..., }, which
is defined in the same way as in Equation (410). We further assume gain factors are
independent of node index, i.e., <(ui) = X(uv,) whenever ui = u, (0 or 1), no matter
whether i is equal to i' or not. Then the maximum gain criterion (see Equation (49)) for
the dependent model is
u = argmax (i)p()JQ = ui)P(Q = i)
argmax [ixUi)P(Qi 2 Ut) P(l it). (415)
As the dependent model takes spatial correlation into consideration, it can make more
accurate detection, especially when traffic volume is low. However, it is a computationally
intractable model. That is because solving Equation (415) directly, we need to exhaustively
compute p(l)P() for each possible combination of Q, which results in a O (2")
complexity. For a large AS, it is intractable.
We introduced a hierarchical structure to reduce computation complexity, which is the
topic of the next section.
4.2.3 Hidden Markov Tree (HMT) Model for Global Analyzer
The reason that the dependent model illustrated in Figure 44 becomes computationally
intractable is that we assume edge routers are fully dependent. A rough understanding
is that, if we break some dependence in Figure 44, we can reduce the computation
complexity. On the other hand, we would like to account for the dependencies among as
many nodes as possible to provide accurate detection. To balance these two conflicting
4.2 Bayesian Model for Network Anomaly Detection
In this section, we model the network anomaly detection issue in terms of B li, i i,
decision theory. This section is organized as follows. Section 4.2.1 generalizes the
B li, i i model for network anomaly detection on traffic monitors and local analyzers.
In section 4.2.2, we extend this model to the whole autonomous system. Section 4.2.3
introduces the hidden Markov tree model to decrease the computation complexity of the
general model defined in Section 4.2.2.
4.2.1 Bayesian Model for Traffic Monitors and Local Analyzers
As described earlier, both the traffic monitor and the local ,i 1v. r have local
information of one edge router. They are able to detect network anomalies through one
single edge router. That is, a traffic monitor makes anomaly declaration when it observes
abnormal features extracted from one link of an edge router, such as large data rate, large
SYN/FIN(RST) ratio, and so on. Similarly, a local analyzer detects network anomaly
by observing the twoway matching features on one edge router. Next, we formulate the
detection problem in terms of the B i, i in decision theory introduced in Section 4.1.4.
Figure 41. Generative process in graphical representation, in which the traffic state
generates the stochastic process of traffic.
In the context of network anomaly detection, there are two states of nature of an edge
router, i.e., H= {Ho, H1}, where
* Ho represents normal state, in which case no abnormal network traffic enters the AS
through that edge router;
* Hi represents abnormal state, in which case abnormal network traffic enters the AS
through the edge router.
where i' E E\Eo. Therefore, it also requires to calculate posterior probabilities. We employ
BP algorithm (see Equations (440) and (441) in Figure 411) to solve Equation (443) in
a way similar to HMT transition probability estimation(see Figure 410).
Obtaining solutions to Equation (443) for all nodes, we can solve Equation (439). A
brute force solution to Equation (439) is to exhaustively compute
IH I)P(/ Ui,\Q^) U), V I P () i\) W (444)
i'ETiV'\{R(i)}
for V i e E \ Eo, V Ui e {Space of 0} and select the "b. i one. A HMT with L levels
modeling an AS with K edge routers has
1 L
1 B1
nodes, each of which has two states, normal and abnormal. Then the 0 space has
possible values. The computation complexity of the brute force method is
even worse than the dependent model (see Figure 44), whose complexity is
0 (2").
In this paper, we applied Viterbi algorithm [4042] to solving Equation (439) in an
iterative manner, reducing the computational complexity to O (Ba).
The motivation of Viterbi algorithm is the following. It iterates from top level of a
HMT to bottom level. At each iteration, it estimates the node states in that level in a way
that, when combined with states estimated in upper levels, "b, 1 explains the observed
features. That is, at each iteration, Viterbi algorithm albvi selects the local maxima.
3.6.2 Optimal Parameter Setting for Bloom Filter Array
This section addresses how to determine parameters of BFA under two criteria,
namely, minimum space criterion and competitive optimality criterion.
Minimum space criterion. According to Equation (325), three parameters, ii,,
E[Ta], and r7, are coupled. Since the collision probability Tr critically affects the detection
error rate in our network anomaly detection, a network operator may want to choose an
upper bound r on the acceptable collision probability r and then minimize the storage
required, i.e.,
According to Equation (3
min ii,, subject to T < T
E[Ta h
25), the solution of (3 26) is as below
i f* = minif, = min w [2aR (qT, E[Ta]) + L],
E[Ta] E[Ta]
E[T] = arg minf ., arg min aR (T7, E[T,]).
E[Ta] E[Ta]
10 10
(a)
10 10
10" 10
(b)
Figure 311. Relation among space
(a) .l: vs. T1. (b) E[T]a* vs. T7.
complexity, time complexity, and collision probability.
Figure 311 shows .1 1 vs. T7, and E[T,]* vs. Tr under the same setting as that
in Section 3.6.1. From Figure 311(a), it can be observed that .1i decreases when T
(326)
(327)
(328)
10 10
(652)
and a partition of video feature vector set T(2)
(2) ,T2) U .. U (2) (653)
Next, we describe the method to identify subspace bases in each of the segmentations.
6.3.2 Subspace Bases Identification
In this section, we use PCA[61] algorithm to identify subspace bases for each
segmentation,
{ (i k 1,..., K i 1,2}, (654)
obtained in the previous section. The basic idea is to identify uncorrelated bases and
choose those bases with dominant energy. Figure 66 shows the algorithm.
1. function L,,zsU ,1] II I. IfyBases(T E R MXN6)
2. 4 '1
3. P,', ^ b, im 
4. Do eigenvalue decomposition on 4T such that
TT =UUT', (655)
where U [ul, M], E L diag ([ar ,..., a ]), and a2 > 7g > > a.
5. j argmin5 EY 12(r) > MEM 172(M)
6. U = [uit, u2,..., i1
7. U [j, ij+ ,... ,UMI
8. E diag( 91 ,..., 7 _J)
9. E= diag ( r.... c.r )
10. end function
Figure 66. Function II. ,l.:fyBases identifies bases of subspace.
In Figure 66, argument T represents the feature vector set of one segmentation and
6 is a user defined parameter which specifies the percentage of energy retained, e.g., 91'
5.3.2 Packet InterArrival Time in Frequency Domain
We show how the same feature becomes a key reliable feature when observed in
the frequency domain. In this new domain, we are interested whether it does exist any
frequency component, e.g. interarrival time, that captures the 1i, ii, i ily of the energy of
this stochastic process at the receiver side. We exploit the above by computing the power
spectral density (PSD) analysis of the packet interarrival time process of two traces, each
of which is of length 10 second, in Figures 54 and 55 respectively for voice and video. We
can see that some regularity for both voice and video exist for different traces, although
the regularity is not quite strong. This result holds true for all experiments conducted
when transmitting Skype voice and video packets over the Internet from University A to
University B.
0
Trace 1
Trace 2
10
m
n
S20
(/)
C
o 30
P 40
a 50
O
60
70
0 02 04 06 08 1
Normalized Frequency (xTc rad/sample)
Figure 54. Power spectral density of two sequences/traces of timevarying interarrival
times for voice traffic
5.3.3 Packet Size in Frequency Domain
Somewhat stronger regularity is visible for voice and video packet sizes. Indeed,
most video coding schemes use two types of frames[58], i.e., Intra frames (Iframe)
and Predicted frames (Pframe). An Iframe is a frame coded without reference to any
frame except itself. It serves as the starting point for a decoder to reconstruct the video
stream. A Pframe may contain both image data and motion vector displacements
[46] T. Karagiannis, A. Broido, N. Brownlee, kc claffy, and M. Faloutsos, "Is p2p dying or
just hiding?" in IEEE Globecom 2004, 2004.
[47] S. Sen, O. Spatscheck, and D. Wang, "Accurate, scalable in network identification of
p2p traffic using application signatures," in WWW, 2004.
[48] K. Wang, G. Cretu, and S. J. Stolfo, "Anomalous p ,iloadbased network intrusion
detection," in 7th International Symposium on Recent Advanced in Intrusion
Detection, Sept. 2004, pp. 201 222.
[49] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, "Classofservice mapping for qos:
A statistical signaturebased approach to ip traffic classification," in AC I[ Internet
Measurement Conference, Taormina, Italy, 2004.
[50] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, "Blinc: multilevel traffic
classification in the dark," in SIGCOMM '05: Proceedings of the 2005 conference on
Applications, technologies, architectures, and protocols for computer communications.
New York, NY, USA: AC\ I Press, 2005, pp. 229240.
[51] C. Dewes, A. Wichmann, and A. Feldmann, "An analysis of internet chat
systems," in IMC '03: Proceedings of the 3rd ACi f SIGCOMM conference on Internet
measurement. New York, NY, USA: AC \! Press, 2003, pp. 51 64.
[52] D. Wu, T. Hou, and Y.Q. Zhang, "Transporting realtime video over the internet:
C'! i11! y and approaches," Proceedings of the IEEE, vol. 88, no. 12, pp. 1855 1875,
December 2000.
[53] ITUT, "G.711: Pulse code modulation (pcm) of voice frequencies," ITUT
Recommendation G.711, 1989. [Online]. Available: http://www.itu.int/rec/TRECG.
711/e
[54] ITUT, "G.726: 40, 32, 24, 16 kbit/s adaptive differential pulse code
modulation (adpcm)," ITUT Recommendation G.726, 1990. [Online]. Available:
http://www.itu.int/rec/TRECG.726/e
[55] ITUT, "G.728: Coding of speech at 16 kbit/s using lowd,'1li code excited
linear prediction," ITUT Recommendation G.728, 1992. [Online]. Available:
http://www.itu.int/rec/TRECG.728/e
[56] ITUT, "G.729: Coding of speech at 8 kbit/s using conjugatestructure
algebraiccodeexcited linear prediction (csacelp)," ITUT Recommendation G.729,
1996. [Online]. Available: http://www.itu.int/rec/TRECG.729/e
[57] ITUT, "G.723.1: Dual rate speech coder for multimedia communications transmitting
at 5.3 and 6.3 kbit/s," ITUT Recommendation G.723.1, 2006. [Online]. Available:
http://www.itu.int/rec/TRECG.723.1/en
[58] Y. Wang, J. Ostermann, and Y.Q. Zhang, Video Processing and Communications,
1st ed. Prentice Hall, 2002.

Full Text 
PAGE 1
1
PAGE 2
2
PAGE 3
3
PAGE 4
Firstofall,thankmyadvisorProfessorDapengWuforhisgreatinspiration,excellentguidance,deepthoughts,andfriendship.Ialsothankmysupervisorycommitteemembers,ProfessorsShigangChen,LiuqingYang,andTaoLi,fortheirinterestinmywork.Ialsoexpressmyappreciationtoallofthefaculty,sta,andmyfellowstudentsintheDepartmentofElectricalandComputerEngineering.Inparticular,IextendmythankstoDr.KejieLuforhishelpfuldiscussions. 4
PAGE 5
page ACKNOWLEDGMENTS ................................. 4 LISTOFTABLES ..................................... 8 LISTOFFIGURES .................................... 9 ABSTRACT ........................................ 12 CHAPTER 1INTRODUCTION .................................. 14 1.1IntroductiontoNetworkAnomalyDetection ................. 14 1.2IntroductiontoNetworkCentricTracClassication ............ 16 2NETWORKANOMALYDETECTIONFRAMEWORK ............. 18 2.1Introduction ................................... 18 2.2EdgeRouterBasedNetworkAnomalyDetectionFramework ........ 18 2.2.1TracMonitor ............................. 20 2.2.2LocalAnalyzer ............................. 20 2.2.3GlobalAnalyzer ............................. 21 2.3Summary .................................... 22 3FEATURESFORNETWORKANOMALYDETECTION ............ 23 3.1Introduction ................................... 23 3.2HierarchicalFeatureExtractionArchitecture ................. 24 3.2.1ThreeLevelDesign ........................... 24 3.2.2FeatureExtractioninaTracMonitor ................ 26 3.2.3FeatureExtractioninaLocalAnalyzeroraGlobalAnalyzer .... 27 3.3TwoWayMatchingFeatures .......................... 27 3.3.1Motivation ................................ 27 3.3.2DenitionofTwoWayMatchingFeatures .............. 30 3.4BasicAlgorithms ................................ 32 3.4.1HashTableAlgorithm ......................... 32 3.4.2BloomFilter ............................... 33 3.5BloomFilterArray(BFA) ........................... 35 3.5.1DataStructure ............................. 35 3.5.2Algorithm ................................ 36 3.5.3RoundRobinSlidingWindow ..................... 38 3.5.4RandomKeyedHashFunctions .................... 39 3.6ComplexityAnalysis .............................. 40 3.6.1Space/TimeTradeo .......................... 41 3.6.2OptimalParameterSettingforBloomFilterArray .......... 50 5
PAGE 6
............................... 51 3.7.1TheBFAAlgorithmvs.theHashTableAlgorithm .......... 51 3.7.2ExperimentonFeatureExtractionSystem .............. 55 3.8Summary .................................... 57 4MACHINELEARNINGALGORITHMFORNETWORKANOMALYDETECTION ..................................... 59 4.1Introduction ................................... 59 4.1.1ReceiverOperatingCharacteristicsCurve ............... 59 4.1.2ThresholdBasedAlgorithm ...................... 60 4.1.3ChangePointAlgorithm ........................ 60 4.1.4BayesianDecisionTheory ........................ 62 4.2BayesianModelforNetworkAnomalyDetection ............... 64 4.2.1BayesianModelforTracMonitorsandLocalAnalyzers ...... 64 4.2.2BayesianModelforGlobalAnalyzers ................. 66 4.2.3HiddenMarkovTree(HMT)ModelforGlobalAnalyzer ....... 68 4.3EstimationofHMTParameters ........................ 72 4.3.1LikelihoodEstimation .......................... 72 4.3.2TransitionProbabilityEstimation ................... 76 4.4NetworkAnomalyDetectionUsingHMT ................... 81 4.5SimulationResults ............................... 84 4.5.1ExperimentSetting ........................... 84 4.5.2PerformanceComparison ........................ 86 4.5.3Discussion ................................ 88 4.6Summary .................................... 89 5NETWORKCENTRICTRAFFICCLASSIFICATION:ANOVERVIEW .... 90 5.1Introduction ................................... 90 5.2RelatedWork .................................. 94 5.3IntuitionsBehindaProperDetectionofVoiceandVideoStreams ..... 95 5.3.1PacketInterArrivalTimeandPacketSizeinTimeDomain ..... 97 5.3.2PacketInterArrivalTimeinFrequencyDomain ........... 99 5.3.3PacketSizeinFrequencyDomain ................... 99 5.3.4CombiningPacketInterArrivalTimeandPacketSizeinFrequencyDomain .................................. 100 5.4Summary .................................... 102 6NETWORKCENTRICTRAFFICCLASSIFICATIONSYSTEM ........ 104 6.1SystemArchitecture .............................. 104 6.1.1FlowSummaryGenerator(FSG) ................... 105 6.1.2FeatureExtactor(FE)andVoice/VideoSubspaceGenerator(SG) 105 6.1.3Voice/VideoCLassifer(CL) ...................... 106 6.2FeatureExtractor(FE)ModuleviaPowerSpectralDensity(PSD) ..... 107 6.2.1Modelingthenetworkowasastochasticdigitalprocess ...... 107 6
PAGE 7
............. 108 6.3SubspaceDecompositionandBasesIdenticationonPSDFeatures .... 115 6.3.1SubspaceDecompositionBasedonMinimumCodingLength .... 117 6.3.2SubspaceBasesIdentication ...................... 120 6.4Voice/VideoClassier ............................. 121 6.5ExperimentResults ............................... 123 6.5.1ExperimentSettings ........................... 123 6.5.2SkypeFlowClassication ........................ 124 6.5.3GeneralFlowClassication ....................... 124 6.5.4Discussion ................................ 125 6.6Summary .................................... 128 7CONCLUSIONANDFUTUREWORK ...................... 129 7.1SummaryofNetworkCentricAnomalyDetection .............. 129 7.2SummaryofNetworkCentricTracClassication .............. 131 APPENDIX APROOFS ....................................... 133 A.1Equation( 4{31 ) ................................. 133 A.2Equation( 4{32 ) ................................. 133 A.3Equation( 4{33 ) ................................. 134 A.4Equation( 4{34 ) ................................. 134 REFERENCES ....................................... 135 BIOGRAPHICALSKETCH ................................ 140 7
PAGE 8
Table page 31Notationsfortwowaymatchingfeatures ...................... 31 32Notationsforcomplexityanalysis .......................... 41 33Space/timecomplexityforhashtable,Bloomlter,andBFA ........... 47 41ParametersusedinCUSUM ............................. 60 42Notationsforhiddenmarkovtreemodel ...................... 70 43Parametersettingoffeatureextractionfornetworkanomalydetection ...... 86 44Performanceofdierentschemes. .......................... 86 51Commonlyusedspeechcodecandtheirspecications ............... 96 61TypicalPDandPFAvalues. ............................. 126 8
PAGE 9
Figure page 21AnISPnetworkarchitecture. ............................ 19 22Networkanomalydetectionframework. ....................... 19 23Responsibilitiesofandinteractionsamongthetracmonitor,localanalyzer,andglobalanalyzer. ................................. 20 24Exampleofasymmetrictracwhosefeatureextractionisdonebytheglobalanalyzer. ........................................ 21 31Hierarchicalstructureforfeatureextraction. .................... 24 32Networkinnormalcondition. ............................ 28 33Sourceaddressspoofedpackets. ........................... 29 34Reroute. ........................................ 29 35HashTableAlgorithm ................................ 33 36BloomFilterOperations ............................... 34 37ScenariosoftheproblemscausedbyBloomlter.(a)Boundaryproblem.(b)Anoutboundpacketarrivesbeforeitsmatchedinboundpacketwitht2t1<. 34 38BloomFilterArrayAlgorithm ............................ 37 39BloomFilterArrayAlgorithmusingslidingwindow ................ 38 310Space/timetradeoforthehashtable,BFAwith=0:1%,andBFAwith=1% ........................................... 48 311Relationamongspacecomplexity,timecomplexity,andcollisionprobability.(a)Mavs..(b)E[Ta]vs.. ........................... 50 312Spacecomplexityvs.collisionprobabilityforxedtimecomplexity. ....... 52 313Memorysize(inbits)vs.averageprocessingtimeperquery(ins) ....... 53 314Averageprocessingtimeperquery(ins)vs.averagenumberofhashfunctioncalculationsperquery. ................................ 54 315Comparisonofnumericalandsimulationresults.(a)Hashtablealgorithm.(b)BFAalgorithmwith=1%. ............................. 55 316Featuredata:(a)NumberofSYNpackets(link1),(b)NumberofunmatchedSYNpackets(link1),(c)NumberofSYNpackets(link2),and(d)NumberofunmatchedSYNpackets(link2). .......................... 58 9
PAGE 10
........................... 64 42Extendedgenerativemodelincludingtracfeaturevectors:(a)originalmodeland(b)simpliedmodel. .............................. 65 43Generativeindependentmodelthatdescribesdependenciesamongtracstatesandtracfeaturevectors. .............................. 66 44Generativedependentmodelthatdescribesdependenciesamongedgerouters. 67 45HiddenMarkovtreemodel.Forannodei,(i)denotesitsparentnodeand(i)denotesthesetofitschildrennodes. ........................ 69 46ProbabilitydensityfunctionoftheunivariateGaussiandistributionN(x;0;1). 73 47Histogramofthetwowaymatchingfeaturesmeasuredatarealnetworkduringnetworkanomalies. .................................. 73 48TheEMalgorithmforestimatingp(iji=u),i2,u2f0;1g. ......... 75 49Iterativelyestimatetransitionprobabilities. .................... 77 410Beliefpropagationalgorithm. ............................ 78 411ViterbialgorithmforHMTdecoding. ........................ 82 412ExperimentNetwork ................................. 85 413Performanceofthresholdbasedandmachinelearningalgorithmswithdierentfeaturedata ...................................... 87 414Performanceoffourdetectionalgorithms ...................... 88 51Averagepacketsizeversusinterarrivalvariabilitymetricfor5applications:voice,video,letransfer,mixofletransferwithvoiceandvideo. ........... 96 52Interarrivaltimedistributionforvoiceandvideotrac .............. 97 53Packetsizedistributionforvoiceandvideotrac ................. 98 54Powerspectraldensityoftwosequences/tracesoftimevaryinginterarrivaltimesforvoicetrac .................................... 99 55Powerspectraldensityoftwosequencesoftimevaryinginterarrivaltimesforvideotrac ...................................... 100 56Powerspectraldensityoftwosequencesofdiscretetimepacketsizesforvoicetrac ......................................... 101 10
PAGE 11
......................................... 101 58Powerspectraldensityoftwosequencesofcontinuoustimepacketsizesforvoicetrac ......................................... 102 59Powerspectraldensityoftwosequencesofcontinuoustimepacketsizesforvideotrac ......................................... 102 61VOVClassierSystemArchitecture ......................... 104 62Powerspectraldensityfeaturesextractionmodule.Cascadeofprocessingsteps. 107 63LevinsonDurbinAlgorithm. ............................. 113 64ParametricPSDEstimateusingLevinsonDurbinAlgorithm. ........... 114 65Pairwisesteepestdescentmethodtoachieveminimalcodinglength. ....... 119 66FunctionIdentifyBasesidentiesbasesofsubspace. ................ 120 67FunctionVoiceVideoClassifydetermineswhetheraowwithPSDfeaturevector~isoftypevoiceorvideoorneither.1are2aretwouserspeciedthresholdarguments.FunctionvoicevideoClassifyusesFunctionNormalizedDistancetocalculatenormalizeddistancebetweenafeaturevectorandasubspace. ..... 122 68TheROCcurvesofsingletypedowsgeneratedbySkype,(a)VOICEand(b)VIDEO. ........................................ 124 69TheROCcurvesofhybridowsgeneratedbySkype,(a)VOICE,(b)VIDEO,(c)FILE+VOICE,and(d)FILE+VIDEO. ..................... 125 610TheROCcurvesofsingletypedowsgeneratedbySkype,MSN,andGTalk:(a)VOICEand(b)VIDEO. ............................. 126 611TheROCcurvesofhybridowsgeneratedbySkype,MSN,andGTalk:(a)VOICE,(b)VIDEO,(c)FILE+VOICE,and(d)FILE+VIDEO. .............. 127 11
PAGE 12
12
PAGE 13
13
PAGE 14
1 2 ]hasrisenfromapproximately9,472,000inJanuary1996to394,991,609inJanuary2006.Inaddition,theemergenceofnewapplicationsandprotocols,suchasvoiceoverInternetProtocol(VoIP),peartopear(P2P),andvideoondemand(VoD)[ 3 ],alsoincreasesthecomplexityoftheInternet.Accompanyingthistrendisanincreasingdemandformorereliableandsecureservice.AmajorchallengeforInternetserviceproviders(ISP)istobetterunderstandthenetworkstatebyanalyzingnetworktracinrealtime.ThusISPsareveryinterestedintheproblemofnetworkcentrictracanalysis.Weconsiderthenetworkcentrictracanalysisproblemfromtwoperspectives:1)networkanomalydetectionand2)networkcentrictracclassication.Weintroducethetwoperspectivesinthenexttwosections. 14
PAGE 15
15
PAGE 16
4 ]summarizedtechniquesforQoSprovisionforrealtimestreamsfromthepointofviewofendhosts.Thesetechniquesincludecodingmethods,protocols,andrequirementsonstreamservers.AnothereectivesolutionisfromthepointofviewofnetworkcarriersorISPs.Forexample,ISPscanassigndierentforwardingprioritytodierenttypesofnetworktraconrouters.Thisisthemotivationofdierentiatedservices(DiServ)[ 5 6 ].DiServisamethoddesignedtoguaranteedierentlevelsofQoSfordierentclassesofnetworktrac.Itisachievedbysettingthe\typeofservice"(TOS)[ 7 ]eld,whichhenceisalsocalledDiServcodepoint(DSCP)[ 5 ],intheIPheaderaccordingtotheclassofthenetworkdata,sothatthebetterclassesgethighernumbers.Unfortunately,suchdesignhighlydependsonnetworkprotocols,especiallyproprietaryprotocols,observingDiServregulations.Intheworstcase,ifallprotocolssetTOStothehighestnumber,itisevenworsetoemployDiServmethod.Forthisreason,webelieveaproperDiServschemeshouldbeabletoclassifynetworktraconthey,insteadofrelyingonanytagsinpacketheader.Thus,thedicultyliesinaccurateclassicationofnetworktracinrealtime. 16
PAGE 17
17
PAGE 18
8 ],spectralanalysis[ 9 10 ],statisticalmethods[ 11 { 13 ]),andmachinelearningtechniques[ 14 ]canbeused.Toidentifynetworkanomalysources,IPtraceback[ 15 ]istypicallyused.TheIPtracebacktechniquescanhelpcontaintheattacksources;butitrequireslargescaledeploymentofthesameIPtracebacktechniqueandneedsmodicationofexistingIPforwardingmechanisms(e.g.,IPheaderprocessing).Thischapterpresentsournetworkanomalydetectionframework,whichisofthenetworkbasedcategory.WepresentourframeworkdesigninSection 2.2 andsummarizethischapterinSection 2.3 21 ).ItconsistsoftwotypesofIProuters,i.e.,coreroutersandedge 18
PAGE 19
AnISPnetworkarchitecture. routers.Coreroutersinterconnectwithoneanothertoformahighspeedautonomoussystem(AS).Incontrast,edgeroutersareresponsibleforconnectingsubnets(i.e.,customernetworksorotherISPnetworks)withtheAS.Inthispaper,asubnetcanbeeitheracustomernetworkoranISPnetwork. Figure22. Networkanomalydetectionframework. 19
PAGE 20
Responsibilitiesofandinteractionsamongthetracmonitor,localanalyzer,andglobalanalyzer. GivensuchISPnetworkarchitecture,wedesignaframeworktodetectnetworkanomalies.Ourframework(Figure 22 )consistsofthreetypesofcomponents:tracmonitors,localanalyzers,andaglobalanalyzer.Figure 23 summarizesthefunctionalitiesofeachtypeofcomponentsandtheirinteractions.Next,wediscussthefunctionalitiesoftracmonitors,localanalyzers,andglobalanalyzerinSections 2.2.1 2.2.2 ,and 2.2.3 ,respectively. 22 )isresponsiblefor:scanningpartialorallpacketsofasingleunidirectionallink;summarizingtraccharacteristics;extractingsimplefeaturesfromthetraccharacteristic;makingdecisions(e.g.,declarenetworkanomalyorclassifytypeofnormaltrac)ononesingleunidirectionallink;andreportingthesummaryoftracinformation,simplefeaturedata,anddecisionstoalocalanalyzer. 20
PAGE 21
Figure24. Exampleofasymmetrictracwhosefeatureextractionisdonebytheglobalanalyzer. Theglobalanalyzerhasaglobalviewofthewholenetwork.Hence,itexploitsbothtemporalcorrelationandspatialcorrelationoftrac.Hereitisimportanttonotethat,somefeaturedatamustbeobtainedattheglobalanalyzerifglobalinformationisrequired.Forexample,inFigure 24 ,ifthetracfromsubnetAtoserverBpassesthroughedgerouterX,andthetracfromserverBtosubnetApassesthroughedge 21
PAGE 22
22
PAGE 23
12 ]proposedthenumberofnewsourceIPaddressestodetectDDoSattacks,undertheassumptionthatsourceaddressesofIPpacketsobservedatanedgerouterwererelativelystaticinnormalconditionsthanthoseduringDDoSattacks.PengfurtherpointedoutthatthefeaturecoulddierentiateDDoSattacksfromtheashcrowd,whichrepresentsthesituationwhenmanylegitimateusersstarttoaccessoneserviceatthesametime.Forexample,whenmanypeoplewatchalivesportsbroadcastovertheInternetatthesametime.Inbothcases(DDoSattacksandtheashcrowd),thetracrateishigh.ButduringDDoSattacks,theedgerouterswillobservemanynewsourceIPaddressesbecauseattackersusuallyspoofsourceIPaddressesofattackingpacketstohidetheiridentities.Therefore,thisfeatureimprovesthoseDDoSdetectionschemesthatrelyontracrateonly.However,Pengetal.[ 12 ]focusedondetectionofDDoSattacks.Itdidnotmentionothertypesofnetworkanomalies.Forexample,whenmalicioususersarescanningthenetwork,wecanalsoobservehightracratebutfewnewsourceIPaddresses.Itisveryimportanttodierentiatenetworkscanningfromashcrowdbecausetheformerismaliciousbutthelatterisnot.Thetwowaymatchingfeatureondierentnetworklayers(Section 3.3.1 )cantellnotonlythepresenceofnetworkanomaliesbutalsotheircause.Lakhinaetal.[ 16 ]summarizedthecharacteristicsofnetworkanomaliesunderdierentcauses.Itscontributionistohelpidentifycausesofnetworkanomalies.Forexample,duringDDoSattacks,wecanobservehighbitrate,highpacketrate,andhighowrate.ThesourceaddressesaredistributedoverthewholeIPaddressspace.Ontheotherhand,duringnetworkscanning,allthethreeratesarehigh,butthedestinationaddresses, 23
PAGE 24
Hierarchicalstructureforfeatureextraction. Toecientlyextractfeaturesfromtrac,wedesignathreelevelhierarchicalstructure(Figure 31 ),whereincomingpacketsareprocessedbylevelonelters,thenbyleveltwolters,andnallyby(levelthree)featureextractionmodules.Levelonelters 24
PAGE 25
31 ).Ontheotherhand,afeaturemodulemayneedpacketsfrommultipleleveltwolters.Forexample,theSYN/FIN(RST)ratiofeatureextractionrequirespacketsfromthreelters(Figure 31 ).ComparedtothepacketclassicationschemesdevelopedbyWangetal.[ 11 ]andPengetal.[ 12 ],ourhierarchicalstructureforfeatureextractionismoregeneralandecient.Next,wedescribethemostimportantmoduleinthethreelevelhierarchicalstructure,thefeatureextractionmodule. 25
PAGE 26
11 12 ],wegeneratefeaturesinadiscretemanner,i.e.,ourfeatureextractionmodulewillgeneratea(feature)valueoravectorattheendofeachtimeslot.Intuitively,shorterslotdurationmayreducethedetectiondelay,whichisdenedastheintervalfromtheepochwhentheanomalystartstotheepochwhentheanomalyisdetected;butasmallerdurationmayincreasethecomputationalcomplexity,sincethedetectionalgorithmneedstoanalyzemorefeaturedataforthesametimeinterval.Ontheotherhand,ifafeatureisrepresentedbyaratio,theslotdurationmustbesucientlylargetoavoiddivisionbyzero.Forexample,ifwewanttousetheSYN/FIN(RST)ratioasinRef.[ 11 ]todetectTCPSYNood,thentheslotdurationcannotbetoosmall,becausethenumberofFINpacketsinashortperiodcanbe0,whichwillresultinafalsealarmevenifthenumberofSYNpacketsisnotlarge.Featureextractioncanbedoneinatracmonitor,alocalanalyzer,andaglobalanalyzer,whichwillbedescribedinSections 3.2.2 and 3.2.3 ,respectively. 12 ].Soisdatarate.Datarate:denedbythetotalnumberofbitsofallpacketsthatarriveinonetimeslot.SYN/FIN(RST)ratio 17 ]. 26
PAGE 27
17 ]andthepercentageofnewIPaddressesproposedinRef.[ 12 ].However,theexistingfeaturessuchastheSYN/SYNACKratio[ 17 ]andthepercentageofnewIPaddresses[ 12 ]eitherdonotleadtogoodperformanceofdetectors,orrequirehighstorage/timecomplexity(Section 3.1 ).Toaddressthesedeciencies,weproposeanewtypeoffeaturescalledtwowaymatchingfeatures,whichcanmakedistinctfeaturesbetweennormalandattacktrac,therebyimprovingaccuracyofdetectingattacks.Next,wediscussthetwowaymatchingfeaturesandtheextractionscheme. 3.3.1MotivationThemotivationofusingtwowaymatchingfeaturesarisesfromthefactthat,formostInternetapplications,packetsaregeneratedfrombothendhoststhatareengagedincommunication.Informationcarriedbypacketsononedirectionshallmatchthecorrespondinginformationcarriedbypacketsontheotherdirection.Bymonitoringthedegreeofmismatchbetweenowsoftwodirections,wecandetectnetworkanomalies.Toillustratethis,letusconsiderthebehaviorsofthetwowaytracinthreescenarios,namely,1)normalconditions,2)DDoSattacks,and3)reroute.Intherstscenario,whenthenetworkofanISPworksnormally,informationcarriedonbothdirectionsofcommunicationmatches(Figure 32 ).Hostaandhostvaretwo 27
PAGE 28
Networkinnormalcondition. endsofcommunication(assumethathostviswithintheautonomoussystemoftheISPwhilehostaisnot).Hostasendsapackettohostvandvrespondsapacketbacktohosta.BothpacketspasstheedgerouterA.Fromthepointofviewofthelocalanalyzer1attachedtoedgerouterA,wedenetherstpacketasaninboundpacket,andthesecondpacketasanoutboundpacket.ThesourceIPaddress(SA)anddestinationIPaddress(DA)oftheinboundpacketmatchtheDAandSAoftheoutboundpacket.IfthecommunicationisbasedonUDPorTCP,wecanfurtherobservethatthesourceport(SP)anddestinationport(DP)oftheinboundpacketmatchtheDPandSPoftheoutboundpacket.Therefore,thelocalanalyzer1canobservematchedinboundandoutboundpacketsinnormalconditions.IntheexampleofFigure 32 ,itisassumedthatthebordergatewayprotocol(BGP)routingmakestheinboundpacketsandthecorrespondingoutboundpacketspassthroughthesameedgerouter.IftheBGProutingmakestheinboundpacketsandthecorrespondingoutboundpacketsgothroughdierentedgerouters(Figure 24 ),thematchingcanstillbeachievedbyaglobalanalyzer(Section 2.2.3 ),i.e.,multiplelocalanalyzersconveytheunmatchedinboundpacketsandthecorrespondingoutboundpacketstotheglobalanalyzer,whichhastheroutinginformationofthewholeautonomoussystem. 28
PAGE 29
Sourceaddressspoofedpackets. Inthesecondscenario,whenattackerslaunchspoofedsourceIPaddressDDoSattacks[ 18 ],thelocalanalyzer1observesmanyunmatchedinboundpackets(Figure 33 ).Sincesourceaddressesofinboundpacketsarespoofed,theoutboundpacketsareroutedtothenominaldestinations,i.e.,bandcinFigure 33 ,whichdonotpassthroughedgerouterAanymore.Inthiscase,localanalyzer1willobservemanyunmatchedinboundpackets. Figure34. Reroute. Inthethirdscenario(Figure 34 ),thenumberofunmatchedinboundpacketsobservedbylocalanalyzer1isincreasedduetoafailureoftheoriginalrouteandrerouteofoutboundpacketstoanotheredgerouter.Aglobalanalyzercanaddressthisproblemsimilartotheasymmetriccaseintherstscenario. 29
PAGE 30
32 ,ifhostaisaclientuploadingalargeleusingtheFileTransferProtocol(FTP)[ 19 ]tohostv,therewillbemuchmorepacketsfromatovthanthosefromvtoa.UploadingletoanFTPserverisanormalbehaviorbutthenumberofunmatchedinboundpacketsisveryhighinthiscase.Therefore,itismoreappropriatetouseowlevelquantities(insteadofpacketlevelquantities)asfeaturesfornetworkanomalydetection.AsintheaboveFTPcase,whenaTCPconnectionisestablished,allpacketsononedirectionconstituteoneowandpacketsonthereversedirectionconstituteanotherow.Nomatterhowmanypacketsaresentoneachdirection,thereareonlyoneinboundowandonlyoneoutboundow.TheymatchinIPaddressesandportnumbers.Therefore,wecallthenumberofunmatchedinboundowsasatwowaymatchingfeature.Twowaymatchingfeaturesareshowntobeeectiveindicatorsofnetworkanomalies[ 20 ]. 3.4 and 3.5 .Next,wedenethetwowaymatchingfeatures. 30
PAGE 31
Table31: Notationsfortwowaymatchingfeatures NotationDescription Basedontheabovedenitions,wedenethetwowaymatchingfeaturestobethenumberofUIF.Table 31 liststhenotationsusedintherestofthepaper,whereZ+standsforthenonnegativeintegerset.Inthefollowingsections,wepresentalgorithmstoextracttwowaymatchingfeaturesfromthetracatlocalanalyzers.Notethattwowaymatchingfeaturesshouldbe 31
PAGE 32
35 ,wheretheargumentsisthesignatureextractedfromapacket. 32
PAGE 33
functionHashTableInsert(V,s) fori0toK1 3. ifV[hi(s)]isempty 4. insertstoV[hi(s)],setstatebitofV[hi(s)]to\UNMATCHED" 5. return 6. endif 7. endfor 8. reportinsertionoperationerror 9. endfunction 10. functionHashTableSearch(V,s) fori0toK1 12. ifV[hi(s)]isempty 13. returnfalse 14. ifV[hi(s)]holdss returntrue 16. endfor 17. returnfalse; 18. endfunction 19. functionHashTableRemove(V,s) fori0toK1 21. ifV[hi(s)]isempty 22. return; 23. ifV[hi(s)]holdss setstatebitofV[hi(s)]to\MATCHED" 25. returntrue; 26. endif 27. endfor 28. endfunction Figure35. HashTableAlgorithm 21 ].Comparedtothehashtablealgorithm,Bloomlteralgorithmreducesspace/timecomplexitybyallowingsmalldegreeofinaccuracyinmembershiprepresentation,i.e.,apacketsignature,whichdoesnotappearbefore,maybefalselyidentiedaspresent. 33
PAGE 34
36 describestheinsertionandsearchoperationsofBloomlter. functionBloomFilterInsert(V,s) for8i2ZKdo 3. 4. endfunction 5. functionBloomFilterSearch(V,s) for8i2ZKdo 7. ifV[hi(s)]6=1then 8. returnfalse 9. endfor 10. returntrue 11. endfunction BloomFilterOperations (b)Figure37. ScenariosoftheproblemscausedbyBloomlter.(a)Boundaryproblem.(b)Anoutboundpacketarrivesbeforeitsmatchedinboundpacketwitht2t1<. AlthoughBloomlterhasbetterperformanceinthesenseofspace/timetradeo,itcannotbedirectlyappliedtoourapplicationbecauseofthefollowingproblems:1.Bloomlterdoesnotprovideremovalfunctionality.Sinceonebitinthevectormaybemappedbymorethanoneitem,itisunsuitabletoremovetheitembysettingallbitsindexedbyitshashresultsto0.2.Bloomlterdoesnothavecountingfunctionality.AlthoughthecountingBloomlter[ 22 ]canbeusedforcounting,itreplacesabitwithacounter,whichsignicantlyincreasesthespacecomplexity. 34
PAGE 35
37(a) ).Aninboundpacketarrivesattimet012[ti;ti+1)whereasitsmatchedoutboundpacketarriveswithinnextperiod.Theinboundpacketiscountedasanunmatchedinboundpacketeventhought02t01<.Therefore,boundaryeectincreasesthefalsealarmrate.4.Inpreviousdiscussion,wedidnotconsiderthescenariothatanoutboundpacketmayarrivebeforeitsmatchedinboundpacket(Figure 37(b) ).Whentheoutboundpacketarrivesattimet01,itssignatureisnotinthebuer,sowedonothing.Attimet02,itsmatchedinboundpacketarrives,whoseinboundsignaturewillberecorded.Asaresult,thelatterisregardedasanunmatchedinboundpacketduringperiod[ti;ti+1).Thisearlyarrivalproblemalsoincreasesthefalsealarmrate.Next,weproposeaBloomlterarrayalgorithmtoaddresstheaboveproblems. 3.4.2 .OurideaistodesignaBloomlterarray(BFA)withthefollowingfunctionalities,notavailableintheoriginalBloomlter[ 21 23 ]:1.Removalfunctionality:Weimplementinsertionandremovaloperationssynergisticallybyusinginsertionremovalpairvectors.Thetrickisthat,ratherthanremovinganoutboundsignaturefromtheinsertionvector,wecreatearemovalvectorandinserttheoutboundsignatureintotheremovalvector.2.Countingfunctionality:WeimplementthisbyintroducingcountersinBloomlterarray.Thevalueofacounterischangedbasedonthequeryresultfromaninsertion/removaloperation.3.Boundaryeectabatement:Weusemultipletimeslotsandaslidingwindowtomitigatetheboundaryeect.4.Resolvingtheearlyarrivalproblem:whichisachievedbystoringsignatureofnotonlyinboundpacketsbutalsooutboundpackets.Inthisway,whenaninboundpacketarrivesandthesignatureofitsmatchedoutboundpacketispresent,wedonotcountthisinboundpacketasanunmatchedone. 3.5.3 ). 35
PAGE 36
38 )consistsofthreefunctions,namely,ProcInbound,ProcOutboundandSample,whicharedescribedasbelow.FunctionProcInboundistoprocessinboundpackets.Itworksasbelow.Whenaninboundpacketarrivesduring[j;j+1),weincreaseCjby1andinsertitsinboundsignaturesintoIVjifnoneofthefollowingconditionsissatised:1.sisstoredinatleastoneRVj0,wherejw+1j0j;2.sisstoredinIVj.Condition1beingtruemeansthatthecorrespondingoutboundowofthisinboundpackethasbeenobservedpreviously;soweshouldnotcountitasanunmatchedinboundpacket.Condition2beingtruemeansthattheinboundow,towhichthisinboundpacketbelongs,hasbeenobservedduringthecurrentslotj;soweshouldnotcountthesameinboundowagain.Ifbothconditionsarefalse,weincreaseCjbyonetoindicateanewpotentialUIF(line 7 to 10 ). 36
PAGE 37
functionProcInbound(s) if9j0,jw+1j0j,suchthatBloomFilterSearch(RVj0,s)returnstruethen 4. ifBloomFilterSearch(IVj,s)returnstruethen 6. ifaandbarebothfalse 8. 9. BloomFilterInsert(IVj;s) 10. endif 11. endfunction 12. functionProcOutbound(s0) forj0jtojw+1 14. ifBloomFilterSearch(RVj0;s0)returnstrue 15. break 16. ifBloomFilterSearch(IVj0;s0)returnstrue 17. 18. endfor 19. BloomFilterInsert(RVj,s0) 20. endfunction 21. functionSample(j) returnCjw+1 endfunction Figure38. BloomFilterArrayAlgorithm FunctionProcOutboundistoprocessoutboundpackets.Itworksasbelow.Whenanoutboundpacketarrivesduring[j;j+1),wecheckwhetherweneedtoupdatecounterCj0foreachj0(jw+1j0j).Specically,foreachj0(jw+1j0j),decreaseCj0byoneifitsoutboundsignatures0satisesbothofthefollowingconditions:1.s0isnotcontainedinRVj0;2.s0iscontainedinIVj0.Condition1beingtruemeansthatnopacketfromtheoutboundowtowhichthisoutboundpacketbelongsarrivesduringthej0thtimeslot.Condition2beingtruemeansthatthematchedinboundowofthisoutboundpackethasbeenobservedinthej0thslot.SatisfyingbothconditionsmeansthatitsmatchedinboundowhasbeencountedasapotentialUIF;hence,uponthearrivaloftheoutboundpacket,weneedtodecreaseCj0by 37
PAGE 38
13 startsalooptoiteratej0fromjtojw+1.Condition1ischeckedinlines 14 to 15 andCondition2ischeckedinlines 16 to 17 .Notethattheloopexits(line 15 )ifRVj0containss0;thisisbecauseanoutboundpacketofthesameowarrivedinthatj0thslotandhencethebuerofthejthslot(foreachj
PAGE 39
39 showstheimprovedversionofBFA,basedontheroundrobinslidingwindow. 39
PAGE 40
24 ]asthehashfunction.SinceMD5takesanynumberofbitsasinput,wecanorganizekeyiandxintoabitvectorandapplyMD5toit.Usingkeyedhashfunctions,therstconcern(varyingK)canbeaddressedstraightforwardly.Specically,whenKischanged,wesimplygenerateacorrespondingnumberofrandomkeys.ApplyingtheseKkeystothesamekernelhashfunction,weobtainKhashfunctions.Hence,ourmethodhastwoadvantages:1)thenumberofhashfunctionscanbespeciedonthey;2)hashfunctionsaredeterminedonthey,insteadofbeingstoredapriori,resultinginstoragesaving.Thesecondconcern(changinghashfunctions)canalsobeaddressedifthekeysareperiodicallychanged.Evenifthekernelhashfunctionisdisclosed,itisstillverydicult,ifnotimpossible,foranattackertoguessthechangingrandomkeys.Notethatthecollisionprobabilityofthehashfunctionsisnotaectedduetotheuseofkeyedhashfunctions.Inthecaseofrandomkeyedhashfunctions,thecollisionprobabilityofhi(x)dependsonnotonlythecollisionprobabilityofhbutalsothecorrelationbetweenkeyiandx.Sincerandomnumbergeneratortechniquesaresomaturethatwecanassumeindependencebetweenkeyiandx,introductionofrandomkeyshasnoeectonthecollisionprobability. 3.6.1 ,weanalyzethespace/timetradeoforthethreealgorithms.Section 3.6.2 addresseshowtooptimallychooseparametersofBFA. 40
PAGE 41
21 ].However,theanalysisbyBloom[ 21 ]isnotdirectlyapplicabletooursettingduetothefollowingreasons:1.AstaticdatasetwasassumedbyBloom[ 21 ].However,ourfeatureextractiondealswithadynamicdataset,i.e.,thenumberofelementsinthedatasetchangesovertime.Hence,newanalysisforadynamicdatasetisneeded.Inaddition,Bloom[ 21 ]onlyconsideredthesearchoperationduetotheassumptionofstaticdatasets.Ourfeatureextraction,ontheotherhand,requiresthreeoperations,i.e.,insertion,search,andremoval,fordynamicdatasets.2.Bloom[ 21 ]assumedbitcomparisonhardwareintimecomplexityanalysis.However,currentcomputersusuallyuseword(ormultiplebit)comparison,whichismoreecientthanbitcomparisonhardware.Hence,itisnecessarytoanalyzethecomplexitybasedonwordcomparison.3.ThetimecomplexityobtainedbyBloom[ 21 ]didnotincludehashfunctioncalculations.However,hashfunctioncalculationdominatestheoveralltimecomplexity,e.g.,calculatingonehashfunctionbasedonMD5takes64clockcycles[ 25 ],whileonewordcomparisonusuallytakeslessthan8clockcycles[ 26 ].Fortheabovereasons,wedevelopnewanalysisforthehashtableandBloomlter,respectively.Inaddition,weanalyzetheperformanceofBFAandusenumericalresultstocomparethethreealgorithms.Table 32 liststhenotationsusedintheanalysis. Table32: Notationsforcomplexityanalysis NotationDescription 35 )checksifitsinboundsignaturesisinthetable.Because 41
PAGE 42
3.4.1 ,thehashtablehas`cellsofb+1bitseach,suchthatMh=`(b+1).GiventheconditionthatNowshavebeenrecordedbythehashtable,theemptyratiois=`N `=MhN(b+1) 42
PAGE 43
3{5 )toEquation( 3{4 ),weobtaintheexpectationofThE[Th]=1 3{6 )givesthespace/timetradeo(i.e.,Mhvs.Th)ofthehashtablemethod. 3.4.2 ).ThechoiceofMbwillaecttheaccuracyofthesearchfunction,BloomFilterSearch(seeFigure 36 ).Thereasonisthefollowing.WhensignaturesofNowsarestoredinV,,denotingthepercentageofentriesofVwithvalue0,is=1K Mb: MbK: 43
PAGE 44
MbK; 3{5 ).FromEquation( 3{10 ),itcanbeobservedthatdecreaseswithMbifKisxed.BasedonEquation( 3{10 ),wecandenoteMbasafunctionofandKasbelowMb=R(;K): 3{11 )givesthespacecomplexityofBloomlterasafunctionofcollisionprobabilityandthenumberofhashfunctions.Now,letusconsiderthetimecomplexityofBloomlter.DenotebyTbtherandomvariablerepresentingthenumberofhashfunctioncalculations.FunctionBloomFilterInsertalwayscalculatesalltheKhashfunctions,thatis,TbjfBloomFilterInsertisexecutedgK; 3.6.1 ).Ingeneral,Pr[Tb=xjN=nandBloomFilterSearchisexecuted]=8>><>>:(1)x1x
PAGE 45
R(;K)iK R(;K)4=n(;K): 3{15 ),wegettheexpectationofTbundertheconditionthatBloomFilterSearchisexecuted,i.e.,E[TbjBloomFilterSearchisexecuted]=1 3{17 )givesthetimecomplexityofBloomlterintermsofnumberofhashfunctioncalculations. 3.6.1 canbeappliedheresinceBFAisoriginatedfromstandardBloomlter.However,therearesomedierencesbetweenthesetwoschemes.AsdescribedinSection 3.5 ,BFAhasmultiplebuerssuchasIVj,RVj,andCj,j2Zw.Therefore,thestoragesizeforBFA,denotedbyMa(inbits),isw(2Mv+L),whereMvisthesizeofeachinsertionorremovalvector,andListhesizeofeachcounterinbits. 45
PAGE 46
3{10 ),thecollisionprobabilityis=1 MvK: 3{11 ),MvisafunctionofandK.WedeneMv=R(;K): 3{20 )givesthespacecomplexityofBFA.Now,letusconsiderthetimecomplexityofBFA.DenotebyTatherandomvariablerepresentingthenumberofhashfunctioncalculationsforBFA.RecallthatBFA(Figure 39 )denesthreefunctions,ProcInbound,ProcOutbound,andSample.Obviously,TajfSampleisexecutedg0: 3{12 )).2.Otherwise,atleastoneofaandbistrue;thenatleastoneofthesearchoperations,i.e.,BloomFilterSearch(RVj0,s),j0=(Iw+1)%w,(Iw+2)%w,:::;I%w,andBloomFilterSearch(IVI,s),returnstrue.ThisalsomeansthatKhashfunctionshavebeencalculated(seeEquation( 3{13 )).Therefore,inanycase,ProcInboundcalculatesalltheKhashfunctions.Furthernotethat,althoughBloomFilterSearchexecutesuptow+1searchoperations,andatmostoneinsertionoperation,thetotalnumberofhashfunctioncalculationsintheseoperations 46
PAGE 47
3{21 ),( 3{22 ),and( 3{23 )andassuming(Rpi+Rpo)1,whichisalwaystrueinourdesignofBFA,wehaveE[Ta]=01 (Rpi+Rpo)+1+K(Rpi+Rpo) 3{24 )and( 3{20 ),weobtaintherelationshipbetweenMaandTaasbelowMa=w[2R(;E[Ta])+L]: Table33: Space/timecomplexityforhashtable,Bloomlter,andBFA AlgorithmSpacecomplexityTimecomplexity HashtableMh(freevariable)Equation( 3{6 )BloomlterEquations( 3{10 )and( 3{11 )Equations( 3{15 ),( 3{16 ),and( 3{17 )BFAEquation( 3{18 ),( 3{19 ),and( 3{20 )Equation( 3{24 ) Table 33 liststhespacecomplexityandtimecomplexityforhashtable,Bloomlter,andBFAalgorithms. 47
PAGE 48
Figure310. Space/timetradeoforthehashtable,BFAwith=0:1%,andBFAwith=1% Figure 310 showsMvs.E[T]forthehashtablescheme,BFAwithcollisionprobability1%,andBFAwithcollisionprobability0.1%.InFigure 310 ,Xaxisrepresentsthetimecomplexity(i.e.,theexpectednumberofhashfunctioncalculations)andYaxisrepresentsthespacecomplexity(i.e.,thenumberofbitsneededforstorage).FromFigure 310 ,wecanseethatthecurveofBFAisbelowthecurveofthehashtable.ItmeansBFAuseslessspaceforagiventimecomplexity.Therefore,BFAachievesbetter 48
PAGE 49
310 showsthatforthehashtablescheme,MhisamonotonicdecreasingfunctionofE[Th].Theobservationmatchesourintuitionthatthelargertable,thesmallercollisionprobabilityforhashfunctions,resultinginlesshashfunctioncalculations.FurthernotethatMhapproachesR(b+1)whenE[Th]increases.ThisistheminimumspacerequiredtotolerateuptoRows.ForBFA,MaisnotamonotonicfunctionofE[Ta],whichapproximatelyequalsK.Wehavethefollowingobservations.CaseA:Forxedstoragesize,thesmallerK,thelargertheprobabilitythatallKhashfunctionsoftwodierentinputsreturnthesameoutputs,whichisthecollisionprobability.Inotherwords,thesmallerK,thelargerstoragesizerequiredtoachieveaxedcollisionprobability.Thatis,K#)Ma".CaseB:SinceaninputtoBFAmaysetKbitsto\1"inavectorV,hencethelargerK,themorebitsinVwillbesetto\1"(nonempty),whichtranslatesintoalargercollisionprobability.Inotherwords,thelargerK,thelargerstoragesizerequiredtoachieveaxedcollisionprobability.Thatis,K")Ma".CombiningCasesAandB,itcanbearguedthatthereexistsavalueofKorE[Ta]thatachievestheminimumvalueofMa,givenaxedcollisionprobability.ThisminimumpropertycanbeusedtoguidetheparametersettingforBFA,whichwillbeaddressedinSection 3.6.2 49
PAGE 50
3{25 ),threeparameters,Ma,E[Ta],and,arecoupled.Sincethecollisionprobabilitycriticallyaectsthedetectionerrorrateinournetworkanomalydetection,anetworkoperatormaywanttochooseanupperboundontheacceptablecollisionprobabilityandthenminimizethestoragerequired,i.e., minE[Ta]Ma;subjectto AccordingtoEquation( 3{25 ),thesolutionof( 3{26 )isasbelowMa=minE[Ta]Ma=minE[Ta]w[2R(;E[Ta])+L]; (b)Figure311. Relationamongspacecomplexity,timecomplexity,andcollisionprobability.(a)Mavs..(b)E[Ta]vs.. Figure 311 showsMavs.,andE[Ta]vs.underthesamesettingasthatinSection 3.6.1 .FromFigure 311(a) ,itcanbeobservedthatMadecreaseswhen
PAGE 51
311(b) ,oneobservesthatgenerally,E[Ta]decreaseswhenincreases.ThismaybebecausethesmallerE[Ta]orK,thelargertheprobabilitythatallKhashfunctionsoftwodierentinputsreturnthesameoutputs,whichisthecollisionprobability. 3{18 ),itcanbeobservedthatdecreaseswiththeincreaseofMvifKisxed;inotherwords,MvdecreaseswiththeincreaseofifKisxed.Further,fromEquations( 3{19 )and( 3{25 ),itcanbeinferredthatMadecreaseswiththeincreaseofifE[Ta]isxed(notethatE[Ta]K).ThisisshowninFigure 312 .Fromthegure,itcanbeobservedthatthetwolinesintersectatavalueofcollisionprobability,denotedbyc.ThisvalueiscriticalfortheparametersettingofBFA.Ifanetworkoperatorhasadesirablecollisionprobability,whichisgreaterthanc,thenitshouldchooseE[Ta]=4sincethisparametersettinggivesbothsmallertimecomplexityandsmallerspacecomplexity.Wecallthisproperty`competitiveoptimality'sincethereisnotradeobetweentimecomplexityandspacecomplexityinthiscase.Ontheotherhand,ifanetworkoperatorhasadesirablecollisionprobability,whichissmallerthanc,thenitneedstomakeatradeobetweenspacecomplexityandtimecomplexity. 3.7.1 comparestheperformanceoftheBFAalgorithmwiththatofthehashtablealgorithm.InSection 3.7.2 ,weshowtheperformanceofthecompletefeatureextractionsystem,whichusestheBFAalgorithm. Simulationsettings.WeapplythehashtablealgorithmandtheBFAalgorithmtothetimeseriesofsignaturesextractedfromrealtractraces,whichwerecollectedbyAucklandUniversity[ 27 ].Tomakeafaircomparisonwithrespectivetothenumerical 51
PAGE 52
Spacecomplexityvs.collisionprobabilityforxedtimecomplexity. resultsinSection 3.6.1 ,weusethesame96bitsignature,i.e.,SA,DA,SP,andDP,andletR=250;000packets/secondand=80seconds,whichtranslatesto250;00080=20Minputsignaturesforeachsimulation.ThesesignaturesarepreloadedintomemorybeforethebeginningofsimulationssothatI/Ospeedofharddrivedoesnotaecttheexecutiontimeofsimulations.Foreachsimulationrunofthehashtablealgorithm,wespecifythememorysizeMh,andmeasurethealgorithmperformanceintermsoftheaveragenumberofhashfunctioncalculationspersignaturequeryrequest,denotedby^Th,andtheexecutiontime.DuetotheLawofLargeNumbers,^Thapproachestheexpectednumberofhashfunctioncalculationspersignaturequeryrequest,i.e.,E[Th]inEquation( 3{6 ),ifwerunthesimulationmanytimeswiththesameMh.Inoursimulations,werunthehashtablealgorithmtentimes;eachtimewithadierentsetofinputsignaturesbutwiththesameMh.ForeachsimulationrunoftheBFAalgorithm,wespecifythememorysizemaandthenumberofhashfunctionsK,andmeasurethealgorithmperformanceintermsofthecollisionfrequency,denotedby^,andtheexecutiontime.ThecollisionfrequencyisdenedastheratioofthenumberofcollisionoccurrencesinBloomFilterSearchtothe 52
PAGE 53
313 showsaverageprocessingtimeperqueryvs.memorysizeforthehashtablealgorithm,BFAalgorithmwith^=0.1%,andBFAalgorithmwith^=1%. Figure313. Memorysize(inbits)vs.averageprocessingtimeperquery(ins) FromFigure 313 ,weobservethat1)comparedtothehashtablealgorithm,theBFAalgorithmrequireslessmemoryspaceforthesametimecomplexity(averageprocessingtimeperquery),whichwaspredictedinSection 3.6 ,and2)theBFAalgorithmwith^=1%hasabetterspacecomplexity/timecomplexitytradeothantheBFAalgorithmwith^=0.1%butatcostofhighercollisionprobability,whichispredictedbythenumericalresultsinFigure 310 .Figure 314 showsaverageprocessingtimeperqueryvs.averagenumberofhashfunctioncalculationsperquery.Itcanbeobservedthattheaverageprocessingtimeperquerylinearlyincreaseswiththeincreaseoftheaveragenumberofhashfunctioncalculationsperquery.Thatis,thelargertheaveragenumberofhashfunctioncalculationsperquery,thelargertheaverageprocessingtimeperquery.Forthisreason,insteadofrunningsimulationstoobtainthetimecomplexity(i.e.,theaverage 53
PAGE 54
Averageprocessingtimeperquery(ins)vs.averagenumberofhashfunctioncalculationsperquery. processingtimeperquery),inSection 3.6.1 ,weusedtheaveragenumberofhashfunctioncalculationsperquerytorepresentthetimecomplexityofthehashtablealgorithmandtheBFAalgorithm. 315 comparesthesimulationsresultsandthenumericalresultsobtainedfromtheanalysisinSection 3.6 ,forbothhashtablealgorithmandBFAalgorithmintermsofspacecomplexityvs.timecomplexity.InFigure 315(a) ,thenumericalresultagreeswellwiththesimulationresult,exceptwhentheaveragenumberofhashfunctioncalculationsperqueryiscloseto1.FromEquation( 3{6 ),iftheexpectednumberofhashfunctioncalculationsapproaches1,therequiredmemorysizeapproachesinnity;incontrast,simulationswithalargeMhmaynotgiveaccurateresults,duetolimitedmemorysizeofacomputer.Thiscausesthebigdiscrepancybetweenthenumericalresultandthesimulationresultwhentheaveragenumberofhashfunctioncalculationsperqueryiscloseto1.Whentheaveragenumberofhashfunctioncalculationsperqueryisgreaterthanorequaltotwo,itisobservedthatsimulationalwaysrequiresmorememorythanthenumericalresult.Thisisduetothefactthatpracticalhashfunctionisnotperfect.Thatis,entriesinthehashtablearenot 54
PAGE 55
(b)Figure315. Comparisonofnumericalandsimulationresults.(a)Hashtablealgorithm.(b)BFAalgorithmwith=1%. equallylikelytobeaccessed.Hence,Equation( 3{2 )doesnotholdperfectly,neitherdoesEquation( 3{3 ).Asaresult,theaveragenumberofhashfunctioncalculationsperqueryinsimulationislargerthanthatpredictedbyEquation( 3{6 ).Figure 315(b) showsthatthenumericalresultagreeswellwiththesimulationresultforallthevaluesoftheaveragenumberofhashfunctioncalculationsperqueryunderourstudy. 3.2 .Incontrast,theexperimentinSection 3.7.1 doesnotinvolvetheinteractionamongthethreelevels. 27 ]asthebackgroundtrac.ThisdatasetconsistsofpacketheaderinformationoftracbetweentheInternetandAucklandUniversity.TheconnectionisOC3(155Mb/s)forbothdirections. 55
PAGE 56
18 ]byrandomlyinsertingTCPSYNpacketswithrandomsourceIPaddressesintothebackgroundtraceduringspeciedtimeperiods.Specically,synchronizedattacksaresimulatedduring14000{16000secondand28000{32000secondinbothtraces.Inaddition,asynchronousattacksarelaunchedduring5000052000secondintrace1andduring5700059000secondintrace2.Theaverageattackrateis1%ofthepacketrateofthebackgroundtracduringthesameperiod.TodetectTCPSYNoodattacks,wechooseasthesignatureofinboundpacketsandforoutboundones.ThusthetwowaymatchingfeatureisthenumberofunmatchedinboundTCPSYNpacketsinonetimeslot.Theaverageowrateis2480ows/second.Therefore,wesetR=2480.Wefurtherset=0:1%,K=8,w=8,and=10.Then,bysolvingEquation( 3{18 )forMvandrequiringMvtobeapowerof2,weobtainMv=215bits.Thecomputerusedforourexperimentshasone2.4GHzCPUand1GBmemory.Forcomparison,wealsoextractthenumberofinboundSYNpacketsinaslot. 56
PAGE 57
316 ).FromFigure 316 ,itcanbeobservedthatthefeaturesarerathernoisy,especiallyforthefeatureofthenumberofSYNpackets.FromFigs. 316(a) and 316(c) ,wecanhardlydistinguishtheslotsunderthelowvolumesynchronizedattacksfromtheslotswithoutattacks(byvisualinspection).Incomparison,itismucheasiertoidentifytheslotsunderthesynchronizedattacks(byvisualinspection)whenthenumberofunmatchedSYNpacketsisusedasthefeature(seeSlot14001600andSlot28003200inFigs. 316(b) and 316(d) 57
PAGE 58
(b) (c) (d)Figure316. Featuredata:(a)NumberofSYNpackets(link1),(b)NumberofunmatchedSYNpackets(link1),(c)NumberofSYNpackets(link2),and(d)NumberofunmatchedSYNpackets(link2). 58
PAGE 59
4.1.1 introducestheReceiverOperatingCharacteristicscurve.ItisusedinSection 4.5 asthemetricstocompareperformanceofdierentclassicationmethods.Wedescribethethresholdbasedalgorithm,changepointalgorithm,andBayesiandecisiontheoryinSections 4.1.2 4.1.3 ,and 4.1.4 ,respectively. 28 ]isatypicalmethodtoquantifytheperformanceofdetectionalgorithms.Itisaplotofdetectionprobabilityvs.falsealarmprobability.Inpractice,weestimatedetectionprobabilityandfalsealarmprobabilitybythefractionoftruepositivesandthefractionoftruenegatives,respectively.Hence,toobtainanROCcurve,oneneedstomeasurethefollowingquantitiesf:thenumberoffalsealarms,i.e.,thenumberofslotsinwhichthedetectionalgorithmdeclares`abnormal'giventhatnoanomalyactuallyhappensintheseslots;n:thenumberofslotsinwhichnoanomalyhappens;d:thenumberofslotsinwhichthedetectionalgorithmdeclares`abnormal'giventhatnetworkanomaliesactuallyhappenintheseslots;a:thenumberofslotsinwhichnetworkanomalieshappen.Thefalsealarmprobabilityandthedetectionprobabilityofthedetectionalgorithmcanbeestimatedbyf=nandd=a,respectively.Byvaryingparametersofdetectionalgorithms,wecanobtaindierentpairsoffalsealarmprobabilityanddetection 59
PAGE 60
28 ].Inthispaper,wewillusetheROCcurvetocomparetheperformanceofdierentdetectionalgorithms.Next,weintroducesomebasicclassicationalgorithms. 11 { 13 17 ].However,existingstudiesonlyconsiderthechangefromnormalstatetoabnormalstate,whichmeansthatthenumberoffalsealarmscanbeverylargeafternetworkanomaliesend.Tofacilitatethediscussion,wedenethefollowingparametersusedinCUSUMinTable 41 : Table41: ParametersusedinCUSUM ParameterDescription (ti)Theobservedtracfeatureattheendoftimesloti.nTheexpectationof(ti)innormalstates.aTheexpectationof(ti)inabnormalstates.Withoutlosinggenerality,hereweassumethatn
PAGE 61
(4{1)IntheCUSUMalgorithm,ifS(ti)issmallerthanathresholdHCUSUM,declarethatthenetworkstateisnormal;otherwise,declarethatthestateisabnormal.Fromthediscussionabove,wenotethattwoparameters,i.e.,aandHCUSUM,needtobedetermined.However,wecannotuniquelydeterminethesetwoparameters.Toovercomethisproblem,weshallintroduceanotherparameter,i.e.,thedetectiondelay,denotedasD.Accordingtothechangepointtheory,wehaveD aa: 4{2 ),wecanobtainHCUSUM=D(aa): 4{3 ). 11 { 13 ]onlyconsideronechange,i.e.,fromthenormalstatetotheabnormalstate.Inpractice,thisapproachmayleadtoalargenumberoffalsealarmsaftertheendofattacks.Tomitigatethehighfalsealarmissueoftheexistingalgorithms,whichwecallsingleCUSUMalgorithms,wedevelopadualCUSUMalgorithm.Inthisalgorithm,oneCUSUMwillbeusedtodetectthechangefromthenormaltotheabnormalstate,whileanotherCUSUMisresponsiblefordetectingthechangefromtheabnormaltothenormalstate.Themethodofsettingparametersfor 11 ]and[ 12 ],a=(an)=2;thusonlythedetectiondelayisneeded. 61
PAGE 62
4.5.2 ).OurmachinelearningalgorithmisbasedonBayesiandecisiontheory,whichisintroducedinnextsection. 29 ].Itiscomposedofthefeaturespace,D,whichmightbeamultidimensionalEuclideanspace;Ustatesofnature,~H=fHu;u2ZUg;priorprobabilitydistribution,P(H),H2~H;likelihoodprobabilities,p(jH),2D,H2~H;lossfunction,(H;H),H;H2ZU,whichdescribesthelossincurredforclassifyinganobjecttobeofclassHwhenitsstateofnatureisclassH.Notethat,inthepaper,P()representsaprobabilitymassfunction(PMF)[ 30 ]andp()aprobabilitydensityfunction(pdf)[ 30 ].Giventheobservedfeature,,ofanobject,theBayesiandecisiontheoryclassiesittobeofclass^Hsuchthat^H=argminHXH2~H(H;H)P(Hj): 30 ],P(Hj)=p(jH)P(H) 62
PAGE 63
4{4 )isequivalentto^H=argminHXH2~H(H;H)p(jH)P(H): 4{6 )givestheBayesiancriterionforpatternclassication.Asimplelossfunctionisdenedtobe(Hu;Hu)=~(u)I(u=u);for8u2ZU; 4{7 )speciesthatmisclassicationinduceszerolossandcorrectclassicationinducesnegativeloss,whichactuallyachievesgain.ApplyingEquation( 4{7 )toEquation( 4{6 ),theBayesiancriterionissimpliedto,^u=argmaxu2ZU~(u)p(jHu)P(Hu): 4{9 )themaximumgaincriterion.Furthernotethat,scalingallgainfactorsbyasamefactordoesnotchangethecriterionspeciedbyEquation( 4{9 ).Hence,wecanalwaysset~(0)=1.Bytuningothergainfactors,wecangenerateROCcurveforBayesiandecisiontheory.Inthischapter,weextendedtheBayesiandecisiontheoryfornetworkanomalydetection.Theremainderofthischapterisorganizedasfollows.InSection 4.2 ,weestablishBayesianmodelsfornetworkanomalydetection.Sections 4.3 and 4.4 solvetwofundamentalproblemsofBayesianmodel,i.e.,trainingproblemandclassicationproblem,respectively.Section 4.5 showsoursimulationresultsandSection 4.6 concludesthischapter. 63
PAGE 64
4.2.1 generalizestheBayesianmodelfornetworkanomalydetectionontracmonitorsandlocalanalyzers.Insection 4.2.2 ,weextendthismodeltothewholeautonomoussystem.Section 4.2.3 introducesthehiddenMarkovtreemodeltodecreasethecomputationcomplexityofthegeneralmodeldenedinSection 4.2.2 4.1.4 Figure41. Generativeprocessingraphicalrepresentation,inwhichthetracstategeneratesthestochasticprocessoftrac. Inthecontextofnetworkanomalydetection,therearetwostatesofnatureofanedgerouter,i.e.,~H=fH0;H1g,whereH0representsnormalstate,inwhichcasenoabnormalnetworktracenterstheASthroughthatedgerouter;H1representsabnormalstate,inwhichcaseabnormalnetworktracenterstheASthroughtheedgerouter. 64
PAGE 65
31 ]todepictthiscauseeectrelationshipinFigure 41 (b)Figure42. Extendedgenerativemodelincludingtracfeaturevectors:(a)originalmodeland(b)simpliedmodel. Denoteby,2D,thefeatureextractedfromtrac,whereDisthefeaturespace.Mostimportantly,inselectionoftheoptimalfeatures,weseekforthemostdiscriminativestatisticalpropertiesofthetrac.Alsonotethatitispossibletoemploymultiplefeaturesinthedetectionprocedure,inwhichcaseisavector.Sincefeaturesaresuccinctrepresentationsofthevoluminoustrac,weextendtheabovemodelinFigure 41 totheoneillustratedinFigure 42(a) .Onceisextractedfrom,weassumethatrepresentswell.Itmeansthatwemayoperateonlyoverlowerdimensional,whichreducescomputationalcomplexity.Therefore,wesimplifythemodelinFigure 42(a) tothatillustratedinFigure 42(b) ,whereisdismissed.Sincethefeatureismeasurable,itiscalledobservablerandomvector,andisdepictedbyarectangularnodeinFigure 42(b) .Thenetworkstategeneratingthetracfeatureistobeestimated.Wecallithiddenrandomvariable,anddepictitbyaroundnodeinFigure 42(b) .Now,thegoalbecomestoestimatethehiddenstategiventheobservable 65
PAGE 66
4{9 ))speciestheestimate,^u,tobe^u=argmaxu2Z2~(u)p(j=u)P(=u): 4.3 ). Figure43. Generativeindependentmodelthatdescribesdependenciesamongtracstatesandtracfeaturevectors. AnAShasmanyedgerouters,eachofwhichhasmultiplelinks.Tracmonitorsdeployedonlinksandlocalanalyzersonedgeroutersextractfeaturesandmakedecisionsindependently.Therefore,theonelinkmodelinFigure 42(b) isfurtherextendedtothemoregeneralmodelfortheASasillustratedinFigure 43 ,wherestandsforthenumberofedgerouters.ThelimitationofthedetectionmodelinFigure 43 isthatitassumesedgeroutersaremutuallyindependent.ThisisduetothefactthattracmonitorsandlocalanalyzersonlyhavelocalinformationofthewholeAS.Althoughitissuitabletodetectnetworkanomaliesaccompaniedwithhightracvolumeonsinglelink,itisnotsuitableforlowvolumenetworkanomalydetection.Weaddressthislimitationbyintroducingspatialcorrelationinnextsection. 66
PAGE 67
Figure44. Generativedependentmodelthatdescribesdependenciesamongedgerouters. IntroducingthespatialcorrelationintotheindependentmodelinFigure 43 resultsinthedependentmodelasillustratedinFigure 44 .Thedierencebetweentwomodelsisthat,fromtheviewpointofaglobalanalyzer,edgeroutersarenolongindependent.Asaresult,statisticaldependenceamongstatesofedgeroutersisrepresentedbythenondirectionalconnections.Notethattheindependentmodelcanberegardedasaspecialcaseofthedependentone.Alsonotethatwestillassumethatfeaturesextractedfromoneedgerouterareindependentofthestatesofotheredgerouters. 67
PAGE 68
4{10 ).Wefurtherassumegainfactorsareindependentofnodeindex,i.e.,~(ui)=~(ui0)wheneverui=ui0(0or1),nomatterwhetheriisequaltoi0ornot.Thenthemaximumgaincriterion(seeEquation( 4{9 ))forthedependentmodelis^~u=argmax~u~(~u)p(~j~=~u)P(~=~u)=argmax~u"Yi=1~(ui)p(iji=ui)#P(~=~u): 4{15 )directly,weneedtoexhaustivelycomputep(~j~)P(~)foreachpossiblecombinationof~,whichresultsinaO(2)complexity.ForalargeAS,itisintractable.Weintroducedahierarchicalstructuretoreducecomputationcomplexity,whichisthetopicofthenextsection. 44 becomescomputationallyintractableisthatweassumeedgeroutersarefullydependent.Aroughunderstandingisthat,ifwebreaksomedependenceinFigure 44 ,wecanreducethecomputationcomplexity.Ontheotherhand,wewouldliketoaccountforthedependenciesamongasmanynodesaspossibletoprovideaccuratedetection.Tobalancethesetwoconicting 68
PAGE 69
45 Figure45. HiddenMarkovtreemodel.Forannodei,(i)denotesitsparentnodeand(i)denotesthesetofitschildrennodes. ThemotivationofapplyingHMTmodelisthatweassumeedgeroutersarenotequallycorrelated.Instead,edgerouterstopologicallyclosetoeachotherhavehighmutualcorrelations.Basedonthisassumption,weclusteredgeroutersaccordingtothetopologyofASandformatreestructure,asdepictedinFigure 45 .Withoutlossofgenerality,Figure 45 plotsaquadtreestructure,i.e.,eachnode,exceptleafnodes,hasfourchildren.Tofacilitatefurtherdiscussion,eachnodeintheHMTisassignedanintegernumber,beginningwith0,fromtoptobottom.Thatis,node0isalwaysarootnode 42 liststhenotationsusedintherestofthepaperforHMT.IntheHMT,eachleafnodestandsforanedgerouter.ZeropaddingvirtualedgeroutersareintroducedwhenthenumberofedgeroutersisnotapowerofB.Statesofthesezeropaddingvirtualnodesarealwaysnormalandfeaturesarealways0.Nonleaf 69
PAGE 70
Notationsforhiddenmarkovtreemodel NotationDescription iTherandomvariablerepresentingthestateofnodei.iTherandomvariable/vectorrepresentingthefeature(s)measuredatnodei.~Tfi;i2Tg,whereTisasubtreeoftheHMT.LThenumberoflevelsoftheHMT.ThesetofallnodesintheHMT.lThesetofnodesatlevell,l2ZL,intheHMT.Specically,0representsthesetofrootnodesandL1leafnodes.BThenumberofchildrennodesofeachnode,exceptleaves.Forexample,B=4forquadHMT,asillustratedinFigure 45 .(i)Theparentnodeofnodei,wherei=20.(i)Thesetofchildrennodesofnodei,wherei=2L1.TiThesetofancestornodesofnodei,wherei=20,includingnodei.R(i)Therootnodeofthesubtreecontainingnodei,wherei2.TiThesubtreewhoserootisnodei,wherei2.TinjTinTj.TniTR(i)nTi. nodesrepresentclustersofedgerouters.FeaturesofnodesaredenedinEquation( 4{16 ).i=8>><>>:Featuresmeasuredatthecorrespondingedgerouteri2L1(i.e.,leafnode)1 (4{18) 70
PAGE 71
4.2.1 and 4.2.2 ,weemploymaximumgaincriterion(seeEquation( 4{9 ))toestimatenodestates,i.e.,^ui=argmax~u~(~u)P(~=~uj~)=argmaxfui0;i02Tig24Yi02TinfR(i)g~(ui0)P(i0=ui0j(i0)=u(i0);~)35~(uR(i))P(R(i)=uR(i)j~); 4{19 ),wereducethecomputationcomplexityfromO(2)(seeSection 4.2.2 )toO(B).ThisisthemajoradvantageofintroducingHMTmodel.ThedetailsaregiveninSection 4.4 .SolvingEquation( 4{19 )requiresknowledgeofP(ij(i);~),for8i2n0,andP(i=uij~),for8i20.ByBayesianformula[ 30 ],P(i=uij(i)=u(i);~)=P(i=ui;(i)=u(i)j~) 4{19 )translatestoestimatingP(i;(i)j~);8i2n0; 4{21 )and( 4{22 )inclosedformisdicult.Weproposedabeliefpropagation(BP)algorithm,describedinSection 4.3 ,toestimatethemecientlygivenknowledgeofpriorprobabilities:P(i=0),i20;likelihood:p(iji),i2;transitionprobabilities:P(ij(i)),i2n0. 71
PAGE 72
2,i20.Thatis,statesofrootnodesareequallylikelytobenormalorabnormal.Otherparameterssuchaslikelihoodandtransitionprobabilitiesareestimatedfromtrainingdata.ThisiscoveredinSection 4.3 .Afterthat,classicationusingmaximumgaincriterionisdescribedinSection 4.4 4.3.1 ,wedescribeestimationoflikelihoodp(iji),i2.EstimationoftransitionprobabilitiesP(ij(i)),i2n0arepresentedinSection 4.3.2 30 ]iswidelyemployedinmanyapplications.ThepdfofaddimensionalmultivariateGaussiandistributionwithmeanvectorandvariancematrixisN(x;;)4=1 (2)d=2jj1=2exp1 2(x)t1(x): 46 plotsthepdfoftheunivariateGaussiandistributionN(x;0;1).ItisobservedthatGaussiandistributionisaunimodaldistribution[ 30 ],i.e.,itspdfonlyhasonepeak. 72
PAGE 73
ProbabilitydensityfunctionoftheunivariateGaussiandistributionN(x;0;1). Figure47. Histogramofthetwowaymatchingfeaturesmeasuredatarealnetworkduringnetworkanomalies. However,multiplepeaksmayexistintheempiricaldistributionofiji.Forexample,Figure 47 showsthehistogramofthetwowaymatchingfeaturesmeasuredinarealnetworkduringDDoSattacks.Ithastwopeaks.Hence,theunimodalGaussiandistributionisnotsuitable.Inthepaper,weadopttheGaussianmixturemodel(GMM)tomodelthelikelihooddistribution. 73
PAGE 74
4{26 ),thelikelihoodestimationtranslatestoestimatingGMMparametersi;u(g),i;u(g),andi;u(g)for8i2,8u2f0;1g,and8g2f1;:::;Gg,withconstraintGXg=1i;u(g)=1: 74
PAGE 75
4{26 ).Unfortunately,Nechyba[ 32 ]showedthatMLmethodforGstateGMMwithG>1hasnoclosedformsolution.Inaddition,aGstateGMMhasa3Gdimensionalcontinuousparameterspace.ExhaustivesearchingnumericalsolutionforMLestimateinsuchaparameterspaceiscomputationalintractable. 1. Input:(k)i;u,k2f1;:::;Kug;(0)i;u(g),(0)i;u(g),and(0)i;u(g),g2f1;:::;Gg. 2. Output:^i;u(g),^i;u(g),and^i;u(g),g2f1;:::;Gg. 3. 4. repeat 5. 7. (j+1)i;u(g)=PKuk=1#(j)i;u(g;(k)i;u)k(k)i;u(j+1)i;u(g)k2 10. untilconverge 11. ^i;u(g)=(j)i;u(g),^i;u(g)=(j)i;u(g),^i;u(g)=(j)i;u(g),8g2f1;:::;Gg. Figure48. TheEMalgorithmforestimatingp(iji=u),i2,u2f0;1g. Apracticalsolutiontothisissueistheexpectationmaximization(EM)algorithm[ 29 30 ].Nechyba[ 32 ]derivedEMalgorithmforGMMindetail.Figure 48 illustratesthealgorithm.TheEMalgorithmrequiresinitialvaluesfortheparameters,asdenotedby(0)i;u(g),(0)i;u(g),and(0)i;u(g)inFigure 48 .Ateachiterationj,theEMalgorithmusesparametersestimatedatiterationj1tocalculatenewestimates.AlthoughbothEMandML 75
PAGE 76
4{28 ).Asaresult,EMalgorithmconvergesmuchfasterthannumericalMLmethod.However,thedisadvantageofEMalgorithmisthatitconvergestoalocalmaximaratherthantheglobalone.Specically,initialvaluesofparametersdeterminethelocalmaximatowhichtheEMalgorithmconverges.Inpractice,wehavepriorknowledgeofnetworkfeatures,whichhelpstochooseinitialvaluesofparameters.Tillnow,wepresentschemestoestimatelikelihoodofHMT.Innextsection,weestimatetransitionprobabilities. 49 and 410 showthepseudocodefortransitionprobabilityestimation.Innexttwosections,weexplainthetwogures. 49 showsthepseudocodetoestimatetransitionprobabilities.ThefunctionTransProbEstimatetakesthreesetsofarguments:1.likelihoodestimatedinSection 4.3.1 ,fp(iji);i2g;2.trainingfeatures,f~(k);k2f1;:::;Kgg.Itreturnstheestimateoftransitionprobabilities,i.e.,P(ij(i));8i2n0:Beforetheiterations,wesettheinitialtransitionprobabilitiestobe1 2atline 5 ofFigure 49 .Thisisequivalenttoassumenormalstateandabnormalstatetobeinitially 76
PAGE 77
functionTransProbEstimate(:::) Argument1:likelihood,fp(iji);i2g. 3. Argument2:trainingdata,f~(k);k2f1;:::;Kgg. 4. Return:transitionprobabilityestimate:P(i=uij(i)),i2n0. 5. 2,for8i2n0,8u;u02f0;1g. 6. 7. repeat 8. fork1toK 10. endfor 11. for8i2n0 (4{30) 13. endfor 14. 15. untilconverge 16. Figure49. Iterativelyestimatetransitionprobabilities. equallylikely.Then,ateachiteration,weupdatetheestimateoftransitionprobabilitiesuntilitconverges.Theupdateprocedureisthefollowing.First,weiteratethetrainingfeatureset.Foreachfeature,weuseBPalgorithm(seeFigure 410 )toestimatetheposteriorprobabilitiesgiventhatfeature.ThedetailsoftheBPalgorithmisdiscussedinthenextsection.ThreesetsofargumentsarepassedtotheBPalgorithm:estimateoftransitionprobabilitiesobtainedatthepreviousiteration,P(j)(ij(i));i2n0;1.likelihood,fp(iji);i2g,whichistheargumentpassedtofunctionTransProbEstimate; 77
PAGE 78
functionBP(:::) 2. Argument1:transitionprobabilities,P(ij(i));i2n0. 3. Argument2:likelihood,fp(iji);i2g. 4. Argument3:trainingfeature,~. 5. Return:posteriorprobabilities,nP((i)j~);P(i;(i)j~);i2n0o. 6. i(0)=i(1)=1 2,for8i20(i.e.,roots). 7. 8. Topdownpass,i.e.,fromroottoleaf: 9. forl1;:::;L1 10. for8i2l;8u2f0;1g,let 11. i(u)=Xu02f0;1gPi=uj(i)=u0(i)(u0)p((i)ju(i)=u0) (4{31) 12. endfor 13. endfor 14. Bottomuppass,i.e.,fromleaftoroot; 15. forlL2;:::;0 16. for8i2l,8u2f0;1g,let 17. 18. endfor 19. endfor 20. (4{33) 21. [Pu00i(u00)i(u00)]Pu00Pi=u00j(i)=u0i(u00) Figure410. Beliefpropagationalgorithm. 78
PAGE 79
4{30 )for8i2n0andsteptothenextiteration.Whentheestimatesconverge,iterationstopsandFunctionTransProbEstimatereturnsestimatesobtainedatthelastiteration.ThevalidityofEquation( 4{30 )isshowninthefollowing.For8i2n0,P(ij(i))=ZP(ij(i);~=~)p(~)d~=E~hP((ij(i);~))i; 30 ],weestimateE~hP((ij(i);~))iby1 4{35 )and( 4{36 ),weobtainEquation( 4{30 ).Next,wediscussthebeliefpropagationalgorithm,whichiscalledbyfunctionTransProbEstimate. 33 { 35 ],alsoknownasthesumproductalgorithm,e.g.,[ 36 { 39 ],isanimportantmethodforcomputingapproximatemarginaldistributions.Inthispaper,weapplytheBPalgorithmtoestimatingposteriorprobabilities(Figure 410 ). 79
PAGE 80
2andi(1)=1 2forallrootnodesi(seeline 6 ofFigure 410 ),becauseweassumerootnodesareequallylikelytobeinnormalorabnormalstate.Wheni2L1,i.e.,leafnodes,~Ti=~i,thereforei(u)=p(iji=u),u2f0;1g(seeline 7 ofFigure 410 ).Thenitpropagatesbeliefonthetreerootsandleavestoothernodesintopdownpassandbottomuppass,respectively.Duringthetopdownpass,FunctionBPiteratesfromroottoleaf.Ateachlevell,weupdatethetransitoryvariablesi(u)byEquation( 4{31 ),whichisproveninAppendix A.1 .Notethat,(i)(u0)inEquation( 4{31 )isobtainedinthepreviousiteration,i.e.,iterationatlevell1.Duringthebottomuppass,FunctionBPiteratesfromleaftoroot.ateachlevell,weupdatethetransitoryvariablesi(u)byEquation( 4{32 ),whichisproveninAppendix A.2 .Alsonotethat,j(u0)isobtainedinthepreviousiteration,i.e.,iterationatlevell+1. 80
PAGE 81
4{33 )and( 4{34 ).ThesetwoequationsareproveninAppendix A.3 and A.4 ,respectively.TheestimatedposteriorprobabilitiesareusedinFunctionTransProbEstimatetoupdateestimatesoftransitionprobabilities(seeFigure 49 ).Tillnow,weestablishedtheHMTmodelanddescribedapproachestoestimateitsmodelparametersfromtrainingdata.Next,wepresentnetworkanomalydetectionapproachesusingthefullydeterminedHMTmodel. 4{19 ).WerewritethecriterioninEquation( 4{39 ),^ui=argmaxfui0;i02Tig24Yi02TinfR(i)g~(ui0)P(i0=ui0j(i0)=u(i0);~)35~(uR(i))PR(i)=uR(i)j~; 411 ,showsthepseudocodefortheclassicationalgorithm.FunctionViterbiDecodeHMTtakesthreearguments.Thersttwoarguments,thetransitionprobabilitiesandthelikelihood,areestimatedduringtrainingphase.Thelastoneistheextractedfeatures,basedonwhichweperformanomalydetection.Itreturnstheestimatesofnodestates.Amongthem,weareonlyinterestedinstatesofleaf 81
PAGE 82
1. functionViterbiDecodeHMT(:::) 2. Argument1:transitionprobabilities,P(ij(i));i2n0. 3. Argument2:likelihood,fp(iji);i2g. 4. Argument3:feature,~. 5. Return:estimatednodestates,f^ui;i2g. 6. 7. 8. for8i20 ^ui=argmaxuiP(i=uij~) 10. endfor 11. forl1;:::;L1,for8i2l ^ui=argmaxui~(ui)Pi=uij(i)=^u(i);~24Yi02T(i)nR(i)~(^ui0)Pi0=^ui0j(i0)=^u(i0);~35~(^uR(i))PR(i)=^uR(i)j~ 13. endfor Figure411. ViterbialgorithmforHMTdecoding. ByBayesianformula,thetermsinEquation( 4{39 )canbecomputedbyPi0=ui0j(i0)=u(i0);~=Pi0=ui0;(i0)=u(i0)j~ 82
PAGE 83
4{40 )and( 4{41 )inFigure 411 )tosolveEquation( 4{43 )inawaysimilartoHMTtransitionprobabilityestimation(seeFigure 410 ).ObtainingsolutionstoEquation( 4{43 )forallnodes,wecansolveEquation( 4{39 ).AbruteforcesolutiontoEquation( 4{39 )istoexhaustivelycompute24Yi02TinfR(i)g~(ui0)P(i0=ui0j(i0)=u(i0);~)35~(uR(i))PR(i)=uR(i)j~; 44 ),whosecomplexityisO(2):Inthispaper,weappliedViterbialgorithm[ 40 { 42 ]tosolvingEquation( 4{39 )inaniterativemanner,reducingthecomputationalcomplexitytoO(B).ThemotivationofViterbialgorithmisthefollowing.ItiteratesfromtoplevelofaHMTtobottomlevel.Ateachiteration,itestimatesthenodestatesinthatlevelinawaythat,whencombinedwithstatesestimatedinupperlevels,\best"explainstheobservedfeatures.Thatis,ateachiteration,Viterbialgorithmalwaysselectsthelocalmaxima. 83
PAGE 84
4{39 ),Viterbialgorithmisecientandhasgoodperformanceempirically.ThecomputationcomplexityofViterbialgorithmisO(B),muchbetterthanthedependentmodel,asillustratedinFigure 44 ,whosecomplexityisO(2).TheperformanceimprovementresultsfromthefactthatViterbialgorithmdoesnotexhaustivelytestallpossiblenodestatecombinations.Instead,wedecomposethedecodingproblemintomultiplestages,eachofwhichdecodesnodestatesatonelevelinHMT.Atlevell,weestimatenodestatesoflevell,i.e.,fi;i2lg,basedonresultsobtainedduringpreviousstages,i.e.fi;i2f0;:::;l1gg.Insuchaprocedure,eachnodeisonlyaccessedfortwice.Hencethecomplexityislineartothenumberofnodes,whichisBdlogBe1BL 412 .Ateachedgerouter,twomonitorsareplacedtomeasuretheinboundandoutboundtrac 84
PAGE 85
ExperimentNetwork betweenasubnetandthevictimnetwork,respectively.Forconvenience,wedenotealinkastheroutebetweenanedgerouterandthevictimsubnet. 3.7.2 [ 27 ].Sincewedonothaverealdatatracesobtainedfrom16dierentlinks,weusetherealtractracemeasuredononelink(betweentheInternetandAucklandUniversity)in16dierentdaystocreatetractracesfor16dierentlinks.Fortheabnormaltrac,werandomlygenerateTCPSYNoodattacksintothebackgroundtrace.Specically,wegenerateseveralattackscenarios.Foreachscenario,werandomlyselecttheabnormallinksandattackdurations.AttacktraconeachlinkisgeneratedinthesamewayasinSection 3.7.2 .Thatis,werandomlyinsertTCPSYNpacketswithrandomsourceIPaddressesintothebackgroundtracofthatlink.TheaveragepacketrateofTCPSYNattacktraconeachselectedlinkis1%ofthetotalpacketrateonthelink.Foreachattackscenario,attacksoneachoftheselectedlinksarelaunchedduringalmostthesameperiodtosimulatethesynchronizedDDoSattacks.Sincetheattacktraconeachlinkislow(just1%),weeectivelysimulatelowvolumeattacktrac. 85
PAGE 86
3 ,i.e.,thenumberofunmatchedinboundSYNpacketsinonetimeslot.TheparametersettingoftwowaymatchingfeaturesextractionissameasinSection 3.7.2 .Forconvenience,wesummarizetheparametersinTable 43 Table43: Parametersettingoffeatureextractionfornetworkanomalydetection NotationDescription Forcomparisonpurpose,wealsomeasurethenumberofSYNpacketsandSYN/FINratio[ 11 ]inaslot. Performanceofdierentschemes. FeatureDetectionalgorithmDetectionprobabilityFalsealarmprobability SYN/FINratioCUSUM0.1740.129SYNCUSUM0.520.129SYNMachinelearning0.6560.123UnmatchedSYNCUSUM0.6900.130UnmatchedSYNMachinelearning0.9730.115 Table 44 comparestheperformanceofdierentschemes,wherethebenchmarkistheschemein[ 11 ],i.e.,theCUSUMschemewithSYN/FINratioasthefeature;forthebenchmarkscheme,weusethesameparametersettingasthatin[ 11 ];wecomparethebenchmarkwithCUSUMandourmachinelearningalgorithmunderdierentfeatures.Tomakefaircomparison,wemakethefalsealarmprobabilityofeachschemealmostthesameandcomparethedetectionprobability.FromTable 44 ,itcanbeseenthat,thebenchmarkscheme(`SYN/FINratio'+CUSUM)performsverypoorlyindetectinglowvolumeDDoSattacks.Incontrast,aCUSUMalgorithmwiththenumberofSYN 86
PAGE 87
Figure413. Performanceofthresholdbasedandmachinelearningalgorithmswithdierentfeaturedata Figure 413 comparestheROCcurveofthethresholdbasedschemedescribedinSection 4.1.2 andourmachinelearningalgorithmundertwodierentfeatures,i.e.,thenumberofSYNpackets(denotedby`SYN')andthenumberofunmatchedSYNpackets(denotedby`UMSYN').Weobservethat,forthesamedetectionalgorithm,usingthenumberofunmatchedSYNpacketscansignicantlyimprovetheROCperformance,comparedtousingthenumberofSYNpackets.Inotherwords,giventhesamefalsealarmprobability,thedetectionprobabilityismuchhigherwhenusingthenumberofunmatchedSYNasfeature.AnotherimportantobservationfromFigure 413 isthatgiventhesamefeaturedata,ourmachinelearningalgorithmcan(signicantly)improvetheROC,comparedtothethresholdbasedscheme;e.g.,forthesamefalsealarmprobabilityof0.05,ourmachinelearningalgorithmachievesadetectionprobabilityof0.93,whilethethresholdbasedschemeonlyachievesadetectionprobabilityof0.72.Thisisduetothefactthatour 87
PAGE 88
Figure414. Performanceoffourdetectionalgorithms InFigure 414 ,wecomparetheROCperformanceoffourdetectionalgorithms(thethresholdbased,thesingleCUSUM,thedualCUSUMdescribedinSection 4.1.3 ,andourmachinelearningalgorithm)underthesamefeature,i.e.,thenumberofunmatchedSYNpackets.ForthesingleCUSUMandthedualCUSUMalgorithm,thedetectiondelayDischosenfrom1to10slotsandtheparameteraioflinkiisdeterminedbyai=(DattackDnormal)i 414 )ofourmachinelearningalgorithmisthebestamongallthealgorithms.WealsoseethatthedualCUSUMoutperformsthesimplethresholdbasedalgorithmandthesingleCUSUMalgorithmhastheworstROCperformance. 88
PAGE 89
27 ].WetestedourmachinelearningalgorithmsforalargeIPaddressspace,i.e.,theIPaddressspacecanbethewholeIPaddressspacefortheInternet. 89
PAGE 90
5 and 6 ,wefocusonthesecondpartofourresearch,i.e.,networkcentrictracclassication.Thischaptermotivatesthesignicanceandpointsoutthechallengesofthisissue,andshowsweaknessofexistingsolutions. 90
PAGE 91
43 44 ].InprincipletheTCPandUDPserverportnumberscanbeusedtoidentifythehigherlayerapplication,bysimplyidentifyingtheserverportandmappingthisporttoanapplicationusingtheInternetAssignedNumbersAuthority(IANA)listofregisteredports[ 45 ].However,portbasedapplicationclassicationhaslimitationsduetotheemergenceofnewapplicationsthatnolongerusexed,predictableportnumbers.Forexample,nonprivilegedusersoftenhavetouseportsabove1024tocircumventoperatingsystemaccesscontrolrestrictions;orcommonapplicationslikeFTPallowsthenegotiationofunknownandunpredictableserverportstobeusedforthedatatransfer;orproprietaryapplicationsmaydeliberatelytrytohidetheirexistenceorbypassportbasedltersbyusingstandardports.Forexample,serverport80isbeingusedbyalargevarietyofnonwebapplicationstocircumventrewallswhichdonotlterport80trac;others(e.g.,Skype)tendtousedynamicports.Amorereliabletechniqueinvolvesstatefulreconstructionofsessionandapplicationinformationfrompacketcontents[ 46 { 48 ].Althoughthisavoidsrelianceonxedportnumbers,itimposessignicantcomplexityandprocessingloadontheclassicationdevice, 91
PAGE 92
49 50 ]havetakenthisstatisticalapproachtoclassifytracintop2p,multimediastreaming,interactiveapplication,andbulktransferapplication.Unfortunately,althoughthesepapersaddressedtheproblemofdistinguishingmultimediatracfromotherapplications,theyhavenotaddressedtheproblemofdistinguishingvoicetracfromvideotrac.Oneproblemistoseparatestreamingtracfromotherapplications,andadierentproblemistodetectandcorrectlyclassifyvoiceandvideoandclearlyseparatethetwoapplicationsfromeachother.Intheextremecase,voiceand/orvideodatastreamsmightevenbebundledtogetherinthesameexactowwithotherapplications.TheseproblemsarecommonformanyapplicationslikeSkype,GtalkandMSNthatallowuserstomixvoiceand/orvideostreamswithchatand/orletransfertracinthesameexact5tupleow,denedas.Insuchcases,oneowmaycarry 92
PAGE 93
93
PAGE 94
5.2 ,weintroducetherelatedworkintheareaofpatternclassicationmethodologies.Section 5.3 describestheweaknessesofmetricspreviouslyusedbyotherworkswhenappliedinourcontextandhighlightsthenewtracfeaturesthatconstitutethefoundationofourapproach.Section 5.4 summarizesthischapter. 49 ],theauthorsproposedthecombinationofaveragepacketsizewithinaowandtheinterarrivalvariabilitymetric,e.g.denedastheratioofthevariancetotheaverageinterarrivaltimesofpacketswithinaow,asapowerfulmetrictodenefairlydistinctboundariesforthreegroupsofapplications:(i)bulkdatatransferlikeFTP,(ii)interactivelikeHTTP,and(iii)streaminglikevoice,video,gaming,etc.Severalclassicationtechniques,likenearestneighborandKnearestneighbor,werethentestedusingtheabovetracfeatures.Althoughthispreliminarystudyhasprovedthattheapproachofpatternclassicationhasgreatpotentialforaproperapplicationclassication,itprovesthatmuchmoreworkstillremains,e.g.exploitingotheralternativefortracfeaturesandclassicationtechniques.Moreover,althoughthefeaturesextractedaresimpleandfeasibletobeimplementedonthey,thelearningalgorithmiscomplexandtheoutcomeboundariesamongthethreefamiliesofapplicationsareheavilynonlinearandtimedependent.SimilartoRef.[ 49 ],Karagiannisetal.[ 50 ]proposedanovelapproach,calledBLINC,thatexploitsnetworkrelatedpropertiesandcharacteristics.Thenoveltyofthisapproach 94
PAGE 95
50 ]isinterestingfromaconceptualperspectiveandprovedtoperformreasonablywellforavarietyofdierentapplications,itisstillpronetolargeestimationerrorsforstreamingapplications.Moreover,itshighcomplexityandlargememoryconsumptionremainsanopenissueforhighspeedapplicationclassication.OtherpapersusingpatternclassicationappearedlatelyinliteraturebutmorefocusedonspecicapplicationdetectionlikePeertoPeer[ 46 ]andchat[ 51 ].Moreimportantly,tothebestofourknowledge,noneoftheexistingworkhasbeenabletoseparatevoicetracfromvideotracortoindicatethepresenceofvoicetracorvideotracinahybridowthatcontainstracfrombothvoice/videoandotherapplicationssuchasletransfer. 51 weshowtheresultsobtainedwhenusingthecombinationofaveragepacketsizeandtheinterarrivalvariabilitymetricproposedbyRoughanet
PAGE 96
49 ].Althoughthismetricperformedverywellinseparatingstreaming,letransfer,transactionalandinteractiveapplications,itperformspoorlywhenusedtofurtherseparateapplicationswithinthesamefamily,asvoice,videoorvoiceandvideomixedwithotherapplicationslikeletransfer,e.g.hybridows.Figure 51 clearlyhighlightsthecompleteabsenceofanydistinctboundaryandheavyoverlappingbetweenvoiceandvideotrac.Thereasonswhythepair(averagepacketsize,interarrivalvariabilitymetric)cannotseparatevideofromvoiceareasbelow.First,thepacketsizeforvideo/voiceiscontrolledbythepacketizationstrategyofthevideo/voiceapplicationdesigner[ 52 ];hence,avideoapplicationmayproducesimilaraveragepacketsizetothatforvoice(Figure 51 ).Second,randomendtoenddelayintheInternetcauseslargevariationsintheinterarrivalvariabilitymetricfordierentvideo/voiceows. Figure51. Averagepacketsizeversusinterarrivalvariabilitymetricfor5applications:voice,video,letransfer,mixofletransferwithvoiceandvideo. Table51: Commonlyusedspeechcodecandtheirspecications StandardCodecMethodInterPacketDelay(ms) G.711[ 53 ]PCM.125G.726[ 54 ]ADPCM.125G.728[ 55 ]LDCELP.625G.729[ 56 ]CSACELP10G.729A[ 56 ]CSACELP10G.723.1[ 57 ]MPMLQ30G.723.1[ 57 ]ACELP30 96
PAGE 97
Figure52. Interarrivaltimedistributionforvoiceandvideotrac 51 listssomespeechcodecstandardsandtheassociatedIPDsthatarerequiredforacorrectimplementationofthoseprotocols.PacketsleavingthetransmittermighttraversealargenumberoflinksintheInternetbeforereachingtheproperdestination.Alongthis 97
PAGE 98
Packetsizedistributionforvoiceandvideotrac traveling,packetsmightexperiencerandomdelayduetocongestionatrouters'interfaces.Asaconsequence,theinterarrivaltimesbetweenpacketsatthereceivermightbeseverelyaectedbyrandomnoise,e.g.jitter,andthusthismetricmightnotrepresentareliablecandidatefeatureforarobustclassicationmethodology.Althoughthisproblemdoesexist,wenotehowtheinterarrivaltimesbetweenpacketswithinthesameowstillshowsastrongregularitywhenstudiedinthefrequencydomainatthereceiverside.Asanexample,inFigure 52 ,weshowthedistributionsoftheinterarrivalpackettimesatthereceiversidewhenusingSkypetotransmitrespectivelyvoiceonlyandvideoonlybetweentwohosts,onelocatedinUniversityAineastcoastandtheotherinUniversityBinwestcoastofUSA.Aswecansee,thedistributionsforbothvideoandvoicearecenteredaround0.03second.Ontheotherhands,Figure 53 showsthedistributionsofthepacketsizesforbothvoiceandvideo.Asyoucansee,bothvoiceandvideoarecharacterizedbysimilardistributionforpacketsizelessthan200bytes.Althoughvideotracgeneratespacketsizeoflargerthan200,theselargerpacketscannotbereliablyusedtoseparatevideofromvoicesinceotherapplicationssuchaschatorletransfermightalsogeneratetheselargerpackets.Asaconsequence,packetinterarrivaltimeorpacketsizeisaweakfeaturewhenconsideredinthetemporaldomain. 98
PAGE 99
54 and 55 respectivelyforvoiceandvideo.Wecanseethatsomeregularityforbothvoiceandvideoexistfordierenttraces,althoughtheregularityisnotquitestrong.ThisresultholdstrueforallexperimentsconductedwhentransmittingSkypevoiceandvideopacketsovertheInternetfromUniversityAtoUniversityB. Figure54. Powerspectraldensityoftwosequences/tracesoftimevaryinginterarrivaltimesforvoicetrac 58 ],i.e.,Intraframes(Iframe)andPredictedframes(Pframe).AnIframeisaframecodedwithoutreferencetoanyframeexceptitself.Itservesasthestartingpointforadecodertoreconstructthevideostream.APframemaycontainbothimagedataandmotionvectordisplacements 99
PAGE 100
Powerspectraldensityoftwosequencesoftimevaryinginterarrivaltimesforvideotrac and/orcombinationsofthetwo.Itsdecodingneedsreferencetopreviouslydecodedframes.PacketscontainingIframesarelargerthanthosecontainingPframes.Usually,thenumberofPframesbetweentwoconsecutiveIframesisconstant.Hence,onecanobserveastrongperiodicvariationofpacketsizeduetotheinterleavingofIframesandPframescomposingvideodatastreams.VoicestreamshavesimilarphenomenonifLinearPredictionCoding(LPC),e.g.,codeexcitedlinearprediction(CELP)voicecoder,isemployed.Asanexample,Figs. 56 and 57 showthepowerspectraldensityofvoiceandvideopacketsizes,respectively. 58 and 59 showhowtheregularitieshiddeninvoiceandvideodatastreamscanbeampliedwhencombiningthetwofeaturestogetherinonesinglestochasticprocessthatwillbedescribedlaterinthepaper.NotehowthetwoimportantfrequenciesareampliedandclearlyvisibleinthePSDplots.ThereasonwhythereisapeakinthePSDforvoice(seeFigure 58 )isthatvoiceapplicationsusuallyproduceclosetoconstantpacketrateduetoconstantinterpacketdelayofthewidelyusedspeechcodecslistedinTable 51 ;e.g.,thepeakof33HzinFigure 58 correspondsto30msinterpacket 100
PAGE 101
Powerspectraldensityoftwosequencesofdiscretetimepacketsizesforvoicetrac Figure57. Powerspectraldensityoftwosequencesofdiscretetimepacketsizesforvideotrac delay.Comparedtovoice,videoapplicationshaveaatterPSD.Thereasonisasbelow.ThenumberofbitsinanIframeofvideodependsonthetextureoftheimage(e.g.,theIframeofablackboardimageproducesmuchlessbitsthanthatofacomplicatedowerimage),resultinginalargerangeinthenumberofpacketsinanIframe,e.g.,from1packettoafewhundredpacketsproducedbyanIframe.Theframerateisusuallyconstant(e.g.,30frames/sisastandardrateinUSA),i.e.,aframewillbegeneratedevery 101
PAGE 102
Figure58. Powerspectraldensityoftwosequencesofcontinuoustimepacketsizesforvoicetrac Figure59. Powerspectraldensityoftwosequencesofcontinuoustimepacketsizesforvideotrac 102
PAGE 103
5.2 )arenotabletoaccuratelydistinguishbetweenvoiceandvideoows.Byanalyzingthepropertiesofvoiceandvideodatastreams,ourintuitionistoexploitthestrongregularitiesresidinginpacketinterarrivaltimesandtheassociatedpacketsizes.Inthischapter,weanalyzefourtypesofmetricsthatexploittheregularities,1.packetinterarrivaltimeandpacketsizeintimedomain;2.packetinterarrivaltimeinfrequencydomain;3.packetsizeinfrequencydomain;4.combiningpacketinterarrivaltimeandpacketsizeinfrequencydomain.Byanalyzingpropertiesandillustratingguresofthefourtypesofmetrics,weshowthatcombiningpacketinterarrivaltimeandpacketsizeinonesinglestochasticprocessgeneratesdistinctivefeaturetoclassifyvoiceandvideostreams. 103
PAGE 104
VOVClassierSystemArchitecture Werstpresenttheoverallarchitectureofoursystem(VOVClassier,Figure 61 ),andprovideahighleveldescriptionofthefunctionalitiesofeachofitsmodules.Generallyspeaking,VOVClassierisanautomatedlearningsystemthatusespacketheadersfromrawpacketscollectedothewire,organizethemintotransportnetworkowsandprocesstheminrealtimetosearchforvoiceandvideoapplications.VOVClassierrsttrainsvoiceandvideodatastreamsseparatelybeforebeingusedinrealtimeforclassication.Duringthetrainingphase,VOVClassierextractsfeaturevectors,whichisasummary(alsoknownasastatistic)ofrawtracbitstream,andmaintaintheirstatisticsinmemory.Duringtheonlineclassicationphase,aclassiermakesdecisionbymeasuringsimilaritymetricsbetweenthefeaturevectorextractedfromontheynetworktracandthefeaturevectorsextractedfromtrainingdata.Flowswithhighvaluesofsimilaritymetricwiththevoice(orvideo)featuresareclassiedasvoice(orvideo);datastreamswithlowvaluesofsimilaritywithvoice/videoareclassiedasotherapplications.Ingeneral,VOVClassieriscomposedoffourmajormodulesthatoperateincascade:(i)FlowSummaryGenerator(FSG),(ii)FeatureExtractor(FE)viaPowerSpectral 104
PAGE 105
5.3 wehaveshownasvoiceandvideodatastreamsthatarecharacterizedbypacketsthatareverysmallinsize.Asaconsequence,thismoduleltersoutallpacketswhosesizeissmallerthanaprespeciedthresholdP.TheprocessedowistheninternallydescribedintermsofpacketsizesandinterarrivaltimesbetweenpacketswithinanygenericowFS.FS=fhPi;Aii;i=1;:::;Ig; 61 )areusedduringboththetrainingandtheclassicationphases. 105
PAGE 106
106
PAGE 107
6.2 6.3 ,and 6.4 describethecomponents,FeatureExtractor,Voice/VideoSubspaceGenerator,andVoice/VideoClassier,respectively.InSection 6.5 ,weconductexperimentsontraccollectedbetweentwouniversitiesusingSkype,MSN,andGTalk.Section 6.6 summarizesthischapter. 5.3 theextractionandprocessingofsimpletracfeaturesdoesnotsolvetheproblemofdetectingandseparatingvoiceandvideodatastreamsfromotherapplications.InthissectionwerstintroducethepreliminarystepsthatwetaketotransformeachgenericowFSobtainedfromFSGintoastochasticprocessthatcombinestheinterarrivaltimesandpacketsizes.Thenwedescribehowtousepowerspectraldensity(PSD)analysisasapowerfulmethodologytoextractsuchhiddenkeyregularitiesresidinginrealtimemultimedianetworktrac. Powerspectraldensityfeaturesextractionmodule.Cascadeofprocessingsteps. EachowFSextractedfromtheFSGisforwardedtotheFEmodulethatappliesseveralstepsincascade(Figure 62 ).First,anyFSextracted(seeEquation( 6{1 ))ismodeledasacontinuousstochasticprocessasillustratedinEquation( 6{2 ):P(t)=X2FSP(tA); 107
PAGE 108
6{2 )isrepresentedasasummationofdeltafunctions,itsspectrumspansthewholefrequencydomain.InordertocorrectlyreshapethespectrumofPh(t)toavoidaliasingwhenitissampledatintervalTs,weapplyalowpasslter(LPF)characterizedbyitsimpulseresponsehLPF(t).Ph(t)canthenbemathematicallydescribedasfollowing:Ph(t)=P(t)hLPF(t)=X 2FSPh(tA): Powerspectraldensitydenition. 108
PAGE 109
59 ])canbecomputedas:($;y)=1Xk=r(k;y)ej$k;$2[;); 6{5 )cantakeanyvalue,werestrictitsdomaintobewithin[;)because($;y)=($+2;y).AccordingtoEquations( 6{5 )and( 6{6 ),thecomputationofthePSDforadigitalsignaltheoreticallyrequirestohaveaccesstoaninnitelongtimesequence.Sinceinreality,wecannotassumetohaveinnitedigitalsequencesatourdisposal,weneedtofacetheproblemonwhichtechniquecanbeusedinourcontexttoestimatethepowerspectraldensitywithanadmissibleaccuracy.Inliterature,twodierentfamiliesofPSDestimationareavailable:parametricandnonparametric.Parametricmethodshaveshowntoperformbetterundertheassumptionthattheunderlyingmodeliscorrectandaccurate.Furthermore,thesemethodsaremoreinterestingfromacomputationalcomplexityperspectiveastheyrequiretheestimationoffewervariableswhencomparedwithnonparametricmethods.Inourresearch,weemployparametricmethodtoestimatePSD.Thedetailsarepresentedinthenextsection. 109
PAGE 110
6{7 )canberegardedasobtainingasignalbylteringwhitenoiseofpower"2throughalterwithtransferfunctionB($) (6{10)StartingfromEquation( 6{7 ),threetypesofmethodsarederived:ifp>0andq=0,onemodelsfy(i)gi2Zasanautoregressive(AR(p))signal;1.ifp=0andq>0,onemodelsfy(i)gi2Zasamovingaverage(MA(q))signal;2.otherwise,itismodeledasanautoregressivemovingaverage(ARMA(p;q))signal.BasedontheAR,MA,orARMAassumptions,onecanestimatethecoecientsinEquation( 6{7 )andhencePSD.Ingeneral,noneofthesethreemodelsoutperformstheothertwobutrathertheirperformancearestrictlyrelatedtothespecicshapeofthesignalunderconsideration.Duetothefactthatthesignalweprocessarecharacterizedbystrongregularitiesinthetimedomain,wedecidedtoadopttheARmodel.ThereasonisthattheARequationcanmodelspectrumwithnarrowpeaksbyplacingzerosofA($)closetotheunitcircle. 110
PAGE 111
6{10 )canbewrittenasy(i)+pXt=1aty(it)="(i): 6{12 )byy(ik)andtakingexpectationatbothsides,oneobtainsr(k;y)+pXt=1atr(kt;y)=Ef"(i)y(ik)g: 6{14 )canberewritteninmatrixform,i.e.,266666664r(0;y)r(1;y)r(p;y)r(1;y)r(0;y).........r(1;y)r(p;y)r(0;y)3777777752666666641a1...ap377777775=266666664"20...0377777775 6{15 )isknownastheYuleWalkermethodforARspectralestimation[ 59 ].Givendatafy(i)gIi=1,onerstestimatestheautocovariancesequence8>><>>:^r(k;y)=1 6{15 )becomesasystemofp+1linearequationswithp+1unknownvariables, 111
PAGE 112
6{17 )and( 6{18 ),isnotgoodenoughintermsoftimecomplexity.Equation( 6{17 )computestheinversionofcovariancematrix,whosetimecomplexityisO(p3)[ 60 ,page755].Inaddition,inmostapplications,thereisnoaprioriinformationaboutthetrueorderp.Tocopewiththat,theYuleWalkersystemofequations,Equation( 6{15 ),hastobesolvedforp=1uptop=pmax,wherepmaxissomeprespeciedmaximumorder.ThetimecomplexityisO(p4max).Inthispaper,weusetheLevinsonDurbinAlgorithm(LDA)[ 59 ]toreducetimecomplexity.ItestimatesARsignalcoecientsrecursivelyintheorderp.Tofacilitatefurtherdiscussionandtoemphasizetheorderp,wedenoteRp+14=266666664r(0;y)r(1;y)r(p;y)r(1;y)r(0;y)r(p+1;y).........r(p;y)r(1;y)r(0;y)377777775; 112
PAGE 113
6{15 )asRp+12641~ap375=264"2p0375 1. functionLDA(:::) 2. Argument1:data,fy(i)gIi=1. 3. Argument2:order,p. 4. Return:parametersofAR(p)model,~apand"2p. 5. 6. (6{23) 7. (6{24) 8. fort1;:::;p rt4=[r(t;y);r(t1;y);;r(1;y)]T 10. endfor Figure63. LevinsonDurbinAlgorithm. Figure 63 givestheLDAalgorithmtoestimatecoecientsofAR(p)modelgivendatafy(i)gIi=1. 113
PAGE 114
6{17 )and( 6{18 ). 1. functionPSDEstimate(:::) 2. Argument1:data,fy(i)gIi=1. 3. Argument2:order,p. 4. Return:PSD($;y). 5. (6{30) 6. Figure64. ParametricPSDEstimateusingLevinsonDurbinAlgorithm. OncetheARmodelisestimated,onecanestimatePSDofsignalfy(i)gIi=1.TheprocedureisgiveninFigure 64 6{4 ))tobesecondorderstationary.ThenitsPSDcanbeestimatedas:($;Pd)=PSDEstimatefPd(i)gIdi=1;p 62 ).Thus,onecanfurtherformulatethePSDintermsofrealfrequencyfasf(f;Pd)=2f Fs;Pd;f2Fs 114
PAGE 115
6{33 )showstherelationshipbetweentheperiodiccomponentsofastochasticprocessinthecontinuoustimedomainandtheshapeofitsPSDinthefrequencydomain.f(f;Pd)isacontinuousfunctioninfrequencydomain.Tohandleitinacomputer,weneedtodosamplinginfrequencydomain.Inotherwords,weselectaseriesoffrequencies,0f1
PAGE 116
61 ,wecollectvoiceandvideotrainingowsduringthetrainingphase.AfterprocessingtherawpacketdatathroughthefeatureextractionmoduleviaPSD,oneobtainstwosetsoffeaturevectors,(1)4=n~1(1);~1(2);:::;~1(N1)o; 61 ]andMetricMultidimensionalScaling(MDS)[ 62 ],whichidentifylinearstructure,andISOMAP[ 63 ]andLocallyLinearEmbedding(LLE)[ 64 ],whichidentifynonlinearstructure.Unfortunately,allthesemethodsassumethatdataareembeddedinonesinglelowdimensionalsubspace.Thisassumptionisnotalwaystrue.Forexample,asdierentsoftwareusesdierentvoicecoding,itismorereasonabletoassumethatthePSDfeaturevectorofvoicetracisarandomvectorgeneratedfromamixturemodelthanasinglemodel.Insuchcase,itismorelikelythatthereareseveralsubspacesinwhichthefeaturevectorsareembedded.Thesameholdsforvideofeaturevectors.Asaresult,abetterschemeistorst,clusterthetrainedfeaturevectorsintoseveralgroups,knownassubspacedecomposition;andsecond,toidentifythesubspacestructureofeachgroup,knownassubspacebasesidentication.Wedescribethetwostepsinthefollowingsections. 116
PAGE 117
65 ]proposedamethodtodecomposesubspacesaccordingtotheminimumcodinglengthcriteria.Theideaistoviewthedatasegmentationproblemfromtheperspectiveofdatacoding/compression.Supposeonewantstondacodingscheme,C,whichmapsdatain2RMNtobitsequence.Asallelementsarerealnumbers,innitelongbitsequenceisneededtodecodewithouterror.Hence,onehastospecifythetolerabledecodingerror,,toobtainamappingwithnitecodinglength,i.e.,~nC1C~n22;for8n=1;:::;N: 65 ]thatthecodinglengthisupboundedbyLC()L()=N+K N2T+K 117
PAGE 118
6{39 )),intermsofminimumcodinglengthcriteria,shouldminimizecodinglengthofthesegmenteddata,i.e.,min^LC(;)=min(KXk=1LC(k)+KXk=1jkjlog2k 6{45 )isthesummationofcodinglengthofeachgroup,andthesecondoneisthenumberofbitsneededtoencodingmembershipofeachitemofintheKgroups.Theoptimalpartitionisachievedinthefollowingway.Letthesegmentationschemeberepresentedbythemembershipmatrix,k4=diag([1k;2k;:::;Nk])2RNN; 65 ,page34]provedthatthecodinglengthisboundedasfollows.^LC(;)KXk=1tr(k)+K 6{45 )and( 6{48 ),oneachievesaminimaxcriterion^=argminhmaxC^LC(;)i=argmin^L(;): 118
PAGE 119
1. functionMCLPartition(:::) 2. Argument1:setoffeaturevectors,=n~(1);:::;~(N)o Return:partitionof,=f1;:::;K;1[:::[K=;i\j=;for8i6=j;g. 4. Initialization:=nn~(1)o;n~(2)o;:::;n~(N)o;o whiletruedo 6. (6{51) 7. ifL(1[2)L(1;2)0then 8. break 9. else 10. =(nf1;2g)[f1[2g endif 12. endwhile 13. return Figure65. Pairwisesteepestdescentmethodtoachieveminimalcodinglength. ThereisnoclosedformsolutionforEquation( 6{49 ).Hong[ 65 ,page41]proposedapairwisesteepestdescentmethodtosolveit(Figure 65 ).Itworksinabottomupway.Itstartswithapartitionschemethatassignseachelementoftoapartition.Then,ateachiteration,thealgorithmndstwosubsetsoffeaturevectorssuchthatbymergingthesetwosubsets,onecandecreasethecodinglengththemost(Equation( 6{51 )).Thisprocedurestopswhennofurtherdecreaseofcodinglengthcanbeachievedbymerginganytwosubsets.Usingtheabovemethod,weobtainapartitionofvoicefeaturevectorset(1), 119
PAGE 120
61 ]algorithmtoidentifysubspacebasesforeachsegmentation,n(i)k;k=1;:::;Ki;i=1;2o; 66 showsthealgorithm. 1. functionh~;^U;^;U;i=IdentifyBases2RMN; =h~1~;~2~;:::;~jj~i DoeigenvaluedecompositiononTsuchthatT=UUT; 5. 6. ^U=[~u1;~u2;:::;~uJ1] 7. U=[~uJ;~uJ+1;:::;~uM] 8. ^=diag21;:::;2J1 =diag2J;:::;2M endfunction Figure66. FunctionIdentifyBasesidentiesbasesofsubspace. InFigure 66 ,argumentrepresentsthefeaturevectorsetofonesegmentationandisauserdenedparameterwhichspeciesthepercentageofenergyretained,e.g.,90% 120
PAGE 121
6.4 .ApplyingthefunctionIdentifyBasesonallsegmentations,weobtainh~(i)k;^U(i)k;^(i)k;U(i)k;(i)ki=IdentifyBases((i)k) (6{56)for8k=1;:::;Ki,8i=1;2.Thesearetheoutputsofsubspaceidenticationmodule,andhencetheresultsoftrainingphase,inFigure 61 .Duringtheclassicationphase,theseoutputsareusedassystemparameters,whichwillbepresentedinthenextsection. 6.3 ,wepresentedanapproachtoidentifysubspacesspannedbyPSDfeaturevectorsoftrainingvoiceandvideoows.Specically,oneobtainsthefollowingparameters:h~(i)k;^U(i)k;^(i)k;U(i)k;(i)ki 121
PAGE 122
67 showstheprocedureofthevoice/videoclassier. 1. functiontype=VoiceVideoClassify~;A;V For8i=1;2,8k=1;:::;Kid(i)k=NormalizedDistance~;~(i)k;U(i)k;(i)k. 3. For8i=1;2,di=minkd(i)k. 4. ifd1V type=VOICE. 6. elseifd1>Aandd2
PAGE 123
61 tonetworktracclassication.Beforethat,werstdescribeexperimentsettingsinSection 6.5.1 6.5.2 ,twosetsofexperimentsareconductedontracgeneratedbySkype.InSection 6.5.3 ,othertwosetsofexperimentsareconductedontracgeneratedbySkype,MSN,andGTalk.Foreachsetoftheexperiments,weuseReceiverOperatingCharacteristics(ROC)curves[ 28 ,page107]astheperformancemetric.ROCcurveisacurveofdetectionprobability,PD,vs.falsealarmprobability,PFA,where,PDjH4=P(TheestimatedstateofnatureisHjThetruestateofnatureisH); 67 ),oneisabletogenerateROCcurve.Duringtheexperiments,wecollectednetworktracfromthreeapplications,i.e.,Skype,MSN,andGTalk.Foreachapplication,tracwascollectedintwoscenarios.Fortherstscenario,twolabcomputerslocatedinUniversityAandUniversityBrespectivelywerecommunicatingwitheachother.Therewasdirectconnectionbetweentwopeers.Forthesecondone,weusedarewalltoblockdirectconnectionbetweenthetwopeerssuchthattheapplicationwasforcedtouserelaynodes.Todoclassication,wechoserst10secondsofeachow,i.e.,Amax10seconds.WesetTs=0:5milliseconds.Hence,Id=20;000. 123
PAGE 124
68 showstheROCcurvesofclassifyingvoiceandvideoows. (b)Figure68. TheROCcurvesofsingletypedowsgeneratedbySkype,(a)VOICEand(b)VIDEO. WethenconductexperimentsonhybridSkypeows.Inotherwords,eachowmaybeoftypeVOICE,VIDEO,FILE+VOICE,FILE+VIDEO,ornoneoftheabove.Figure 69 plotstheROCcurvesofthesevetypes,respectively. 124
PAGE 125
(b) (c) (d)Figure69. TheROCcurvesofhybridowsgeneratedbySkype,(a)VOICE,(b)VIDEO,(c)FILE+VOICE,and(d)FILE+VIDEO. SimilartoSection 6.5.2 ,werstconsiderthescenariowheneachowcarriesonetypeoftrac.Bytuningthethresholds,,A,andV,wegenerateROCcurvesofclassifyingvoiceandvideoows(Figure 610 ).Wethenconductexperimentsonhybridows.Figure 611 showstheROCcurvesofclassifyingVOICE,VIDEO,FILE+VOICE,andFILE+VIDEOows. 68 69 610 ,and 611 ,weshowsometypicalvaluesofPDandPFApairsinTable 61 .OnecanseethefollowingphenomenafromTable 61 125
PAGE 126
(b)Figure610. TheROCcurvesofsingletypedowsgeneratedbySkype,MSN,andGTalk:(a)VOICEand(b)VIDEO. Table61: TypicalPDandPFAvalues. SkypeSkype+MSN+GTalkPFA(PD)SingleHybridSingleHybrid VOICE0(1)0(1)0(.995).002(.986)VIDEO0(.993)0(.965)0(.952)0(.948) 61 ,onenotesthatclassicationofVOICEtracismoreaccuratethanthatofVIDEO.Specically,wecanachieve100%accurateclassicationofSkypevoiceows.Thisisduetothefactthatvoicetrachashigherregularitythanvideodoes(Figure 58 andFigure 59 ).Onecanimmediatelytellthedominantperiodiccomponentat33Hzinthevoiceows.Thisfrequencycorrespondstothe30millisecondIPDoftheemployedvoicecoding.Ontheotherhand,VideoPSDshavepeaksat0.Itmeansthatnonperiodiccomponentdominatesinvideoows.OnecanseethatPSDsofthetwovideoowsareclosetoeachother.ThatisthereasonwhyourapproachachieveshighclassicationaccuracybyusingPSDfeatures. 61 ,onecanseethattheclassicationofsingletypedowsismoreaccuratethanthatofhybridows.Mixingmultipletypesoftractogetherislikeincreasingnoise.Hence,itisnotsurprisingthatclassicationaccuracyisreduced. 126
PAGE 127
(b) (c) (d)Figure611. TheROCcurvesofhybridowsgeneratedbySkype,MSN,andGTalk:(a)VOICE,(b)VIDEO,(c)FILE+VOICE,and(d)FILE+VIDEO. 127
PAGE 128
6.3 ,decomposesmultiplesubspacesintheoriginalhighdimensionalfeaturespace.Asaresult,PSDfeaturevectorsofSkypeandGTalkarelikelytobewithindierentsubspacesthanthoseofMSN.Therefore,wecanstillclassifytracaccurately. 128
PAGE 129
129
PAGE 130
130
PAGE 131
131
PAGE 132
132
PAGE 133
4{31 )Proof:Xu02f0;1gPi=uj(i)=u0(i)(u0)p((i)ju(i)=u0)=Xu02f0;1gPi=uj(i)=u0p((i)=u0;~Tn(i))p((i)j(i)=u0)=Xu02f0;1gp(i=u;(i)=u0;~Tni)=p(i=u;~Tni)=i(u): 4{32 )Proof:p(iji=u)Yj2(i)Xu02f0;1gP(j=u0ji=u)j(u0)=p(iji=u)Yj2(i)Xu02f0;1gP(j=u0ji=u)p(~Tjjj=u0)=p(iji=u)Yj2(i)p(~Tjji=u)=p(~Tiji=u)=i(u) (A{2) 133
PAGE 134
4{33 )Proof:i(u)i(u) Pu0pi=u0;~Tnip~Tniji=u=pi=u;~ Pu0pi=u0;~=pi=u;~ (A{3) 4{34 )Proof:i(u)(i)(u0)Pi=uj(i)=u0(i)(u0) [Pu00i(u00)i(u00)]Pu00Pi=u00j(i)=u0i(u00)=p~Tiji=up~T(i)j(i)=u0Pi=uj(i)=u0p(i)=u0;~Tn(i) hPu00pi=u00;~Tnip~Tiji=u00ihPu00Pi=u00j(i)=u0p~Tiji=u00i=pi=u;~Tij(i)=u0p(i)=u0;~Tn(i)p~T(i)j(i)=u0 hPu00pi=u00;~ihPu00pi=u00;~Tij(i)=u0i=pi=u;(i)=u0;~Ti[Tn(i)p~T(i)j(i)=u0 (A{4) 134
PAGE 135
[1] P.Mockapetris,\Domainnamesconceptsandfacilities,"RFC1034. [2] P.Mockapetris,\Domainnamesimplementationandspecication,"RFC1035. [3] \Videoondemand,"Wikipedia.[Online].Available: http://en.wikipedia.org/wiki/Video on demand [4] D.Wu,Y.T.Hou,W.Zhu,Y.Q.Zhang,andJ.M.Peha,\Streamingvideoovertheinternet:Approachesanddirections,"IEEETrans.CircuitsSyst.VideoTechnol.,vol.11,pp.282{300,Mar.2001. [5] K.Nichols,S.Blake,F.Baker,andD.Black,\Denitionofthedierentiatedserviceseld(dseld)intheipv4andipv6headers,"RFC2474. [6] S.Blake,D.Black,M.Carlson,E.Davies,Z.Wang,andW.Weiss,\Anarchitecturefordierentiatedservices,"RFC2475. [7] P.Almquist,\Typeofserviceintheinternetprotocolsuite,"RFC1349. [8] S.S.Kim,A.L.N.Reddy,andM.Vannucci,\Detectingtracanomaliesusingdiscretewavelettransform,"inProceedingsofInternationalConferenceonInformationNetworking(ICOIN),vol.III,Busan,Korea,Feb.2004,pp.1375{1384. [9] C.M.Cheng,H.T.Kung,andK.S.Tan,\Useofspectralanalysisindefenseagainstdosattacks,"inProceedingsofIEEEGlobecom2002,vol.3,Taipei,Taiwan,Nov.2002,pp.2143{2148. [10] A.Hussain,J.Heidemann,andC.Papadopoulos,\Aframeworkforclassifyingdenialofserviceattacks,"inProceedingsofACMSIGCOMM,Karlsruhe,Germany,Aug.2003. [11] H.Wang,D.Zhang,andK.G.Shin,\DetectingSYNoodingattacks,"inProc.IEEEINFOCOM'02,NewYorkCity,NY,June2002,pp.1530{1539. [12] T.Peng,C.Leckie,andK.Ramamohanarao,\DetectingdistributeddenialofserviceattacksusingsourceIPaddressmonitoring,"DepartmentofComputerScienceandSoftwareEngineering,TheUniversityofMelbourne,Tech.Rep.,2002.[Online].Available: http://www.cs.mu.oz.au/tpeng [13] R.B.Blazek,H.Kim,B.Rozovskii,andA.Tartakovsky,\Anovelapproachtodetectionof\denialofservice"attacksviaadaptivesequentialandbatchsequentialchangepointdetectionmethods,"inProc.IEEEWorkshoponInformationAssuranceandSecurity,WestPoint,NY,June2001,pp.220{226. [14] S.MukkamalaandA.H.Sung,\Detectingdenialofserviceattacksusingsupportvectormachines,"inProceedingsofIEEEInternationalConferenceonFuzzySystems,May2003. 135
PAGE 136
[15] S.Savage,D.Wetherall,A.Karlin,andT.Anderson,\Practicalnetworksupportforiptraceback,"inProc.ofACMSIGCOMM'2000,Aug.2000. [16] A.Lakhina,M.Crovella,andC.Diot,\Characterizationofnetworkwideanomaliesintracows,"inProc.ACMSIGCOMMConferenceonInternetMeasurement'04,Oct.2004. [17] H.Wang,D.Zhang,andK.G.Shin,\Changepointmonitoringforthedetectionofdosattacks,"IEEETransactionsonDependableandSecureComputing,no.4,pp.193{208,Oct.2004. [18] J.MirkovicandP.Reiher,\Ataxonomyofddosattacksandddosdefensemechanisms,"inProc.ACMSIGCOMMComputerCommunicationsReview'04,vol.34,Apr.2004,pp.39{53. [19] J.B.PostelandJ.Reynolds,\Filetransferprotocol,"RFC959,Oct.1985.[Online].Available: http://www.faqs.org/rfcs/rfc959.html [20] K.Lu,J.Fan,J.Greco,D.Wu,S.Todorovic,andA.Nucci,\Anovelantiddossystemforlargescaleinternet,"inACMSIGCOMM2005,Philadelphia,PA,Aug.2005. [21] B.H.Bloom,\Space/timetradeosinhashcodingwithallowableerrors,"Commun.ACM,vol.13,no.7,pp.422{426,July1970. [22] L.Fan,P.Cao,J.Almeida,andA.Z.Broder,\Summarycache:Ascalablewideareawebcachesharingprotocol."IEEE/ACMTrans.Netw.,vol.8,no.3,June2000. [23] F.Chang,W.changFeng,andK.Li,\Approximatecachesforpacketclassication,"inIEEEINFOCOM2004,vol.4,Mar.2004,pp.2196{2207. [24] R.Rivest,\Themd5messagedigestalgorithm,"RFC1321,Apr.1992.[Online].Available: http://www.faqs.org/rfcs/rfc1321.html [25] http://www.hdldh.com/pdf/hcr 7910.pdf [26] D.A.PattersonandJ.L.Hennessy,ComputerOrganizationandDesign:TheHardware/SoftwareInterface.SanFrancisco,CA:MorganKaufmann,1998,ch.5,6. [27] \AucklandIVtracedata,"2001.[Online].Available: http://wand.cs.waikato.ac.nz/wand/wits/auck/4/ [28] L.L.Scharf,Statisticalsignalprocessing:detection,estimation,andtimeseriesanalysis.AddisonWesley,1991. [29] R.O.Duda,P.E.Hart,andD.G.Stork,PatternClassication,2nded.WileyInterscience,Oct.2000. [30] G.CasellaandR.L.Berger,StatisticalInference,2nded.DuxburyPress,June2001.
PAGE 137
[31] B.J.Frey,GraphicalModelsforMachineLearningandDigitalCommunication.Cambridge,MA:MITPress,1998. [32] M.C.Nechyba,\Maximumlikelihoodestimationformixturemodels:theemalgorithm,,"2003,coursenote.[Online].Available: http://mil.u.edu/nechyba/ eel6825.f2003/course materials/t4.em theory/em notes.pdf [33] Y.Weiss,\Correctnessoflocalprobabilitypropagationingraphicalmodelswithloops,"NeuralComputation,vol.12,pp.1{4,2000. [34] J.S.Yedidia,W.T.Freeman,andY.Weiss,\Generalizedbeliefpropagation,"AdvancesinNeuralInformationProcessingSystems,vol.13,pp.689{695,Dec.2000. [35] J.Pearl,ProbabilisticReasoninginIntelligentSystems:NetworksofPlausibleInference.MorganKaufmann,Sept.1988. [36] S.M.AjiandR.J.McEliece,\Thegeneralizeddistributivelaw,"IEEETrans.Inform.Theory,vol.46,pp.325{343,Mar.2000. [37] T.Richardson,\Thegeometryofturbodecodingdynamics,"IEEETrans.Inform.Theory,vol.46,pp.9{23,Jan.2000. [38] R.J.McEliece,D.J.C.McKay,andJ.F.Cheng,\Turbodecodingasaninstanceofpearlsbeliefpropagationalgorithm,"IEEEJ.Select.AreasCommun.,vol.16,pp.140{52,Feb.1998. [39] F.KschischangandB.Frey,\Iterativedecodingofcompoundcodesbyprobabilitypropagationingraphicalmodels,"IEEEJ.Select.AreasCommun.,vol.16,pp.219{230,Feb.1998. [40] A.J.Viterbi,\Errorboundsforconvolutionalcodesandanasymptoticallyoptimumdecodingalgorithm,"IEEETrans.Inform.Theory,vol.13,pp.260{269,Apr.1967. [41] J.G.DAVIDFORNEY,\Theviterbialgorithm,"inProceedingsoftheIEEE,vol.61,Mar.1973,pp.268{278. [42] L.R.RABINER,\Atutorialonhiddenmarkovmodelsandselectedapplicationsinspeechrecognition,"inProceedingsoftheIEEE,vol.77,Feb.1989,pp.257{286. [43] D.Moore,K.Keys,R.Koga,E.Lagache,andkclay,\TheCoralReefsoftwaresuiteasatoolforsystemandnetworkadministrators,"inUsenixLISA.(2001),Dec.2001.[Online].Available: citeseer.ist.psu.edu/moore01coralreef.html [44] C.Logg,\Characterizationofthetracbetweenslacandtheinternet,"July2003.[Online].Available: http://www.slac.stanford.edu/comp/net/slacnetow/html/SLACnetow.html [45] I.A.N.Authority,\Portnumbers,"Aug.2006.[Online].Available: http://www.iana.org/assignments/portnumbers
PAGE 138
[46] T.Karagiannis,A.Broido,N.Brownlee,kcclay,andM.Faloutsos,\Isp2pdyingorjusthiding?"inIEEEGlobecom2004,2004. [47] S.Sen,O.Spatscheck,andD.Wang,\Accurate,scalableinnettworkidenticationofp2ptracusingapplicationsignatures,"inWWW,2004. [48] K.Wang,G.Cretu,andS.J.Stolfo,\Anomalouspayloadbasednetworkintrusiondetection,"in7thInternationalSymposiumonRecentAdvancedinIntrusionDetection,Sept.2004,pp.201{222. [49] M.Roughan,S.Sen,O.Spatscheck,andN.Dueld,\Classofservicemappingforqos:Astatisticalsignaturebasedapproachtoiptracclassication,"inACMInternetMeasurementConference,Taormina,Italy,2004. [50] T.Karagiannis,K.Papagiannaki,andM.Faloutsos,\Blinc:multileveltracclassicationinthedark,"inSIGCOMM'05:Proceedingsofthe2005conferenceonApplications,technologies,architectures,andprotocolsforcomputercommunications.NewYork,NY,USA:ACMPress,2005,pp.229{240. [51] C.Dewes,A.Wichmann,andA.Feldmann,\Ananalysisofinternetchatsystems,"inIMC'03:Proceedingsofthe3rdACMSIGCOMMconferenceonInternetmeasurement.NewYork,NY,USA:ACMPress,2003,pp.51{64. [52] D.Wu,T.Hou,andY.Q.Zhang,\Transportingrealtimevideoovertheinternet:Challengesandapproaches,"ProceedingsoftheIEEE,vol.88,no.12,pp.1855{1875,December2000. [53] ITUT,\G.711:Pulsecodemodulation(pcm)ofvoicefrequencies,"ITUTRecommendationG.711,1989.[Online].Available: http://www.itu.int/rec/TRECG.711/e [54] ITUT,\G.726:40,32,24,16kbit/sadaptivedierentialpulsecodemodulation(adpcm),"ITUTRecommendationG.726,1990.[Online].Available: http://www.itu.int/rec/TRECG.726/e [55] ITUT,\G.728:Codingofspeechat16kbit/susinglowdelaycodeexcitedlinearprediction,"ITUTRecommendationG.728,1992.[Online].Available: http://www.itu.int/rec/TRECG.728/e [56] ITUT,\G.729:Codingofspeechat8kbit/susingconjugatestructurealgebraiccodeexcitedlinearprediction(csacelp),"ITUTRecommendationG.729,1996.[Online].Available: http://www.itu.int/rec/TRECG.729/e [57] ITUT,\G.723.1:Dualratespeechcoderformultimediacommunicationstransmittingat5.3and6.3kbit/s,"ITUTRecommendationG.723.1,2006.[Online].Available: http://www.itu.int/rec/TRECG.723.1/en [58] Y.Wang,J.Ostermann,andY.Q.Zhang,VideoProcessingandCommunications,1sted.PrenticeHall,2002.
PAGE 139
[59] P.StoicaandR.Moses,SpectralAnalysisofSignals,1sted.UpperSaddleRiver,NJ:PrenticeHall,2005. [60] T.H.Cormen,C.E.Leiserson,R.L.Rivest,andC.Stein,IntroductiontoAlgorithms,2nded.DuxburyPress,Sept.2001. [61] L.I.Smith,\Atutorialonprincipalcomponentsanalysis,"Feb.2002.[Online].Available: http://www.cs.otago.ac.nz/cosc453/student tutorials/principal components.pdf [62] K.V.DeunandL.Delbeke,\Multidimensionalscaling,"UniversityofLeuven.[Online].Available: http://www.mathpsyc.unibonn.de/doc/delbeke/delbeke.htm [63] J.B.Tenenbaum,V.deSilva,andJ.C.Langford,\Aglobalgeometricframeworkfornonlineardimensionalityreduction,"Science,vol.290,pp.2319{2323,Dec.2000. [64] S.T.RoweisandL.K.Saul,\Nonlineardimensionalityreductionbylocallylinearembedding,"Science,vol.290,pp.2323{2326,Dec.2000. [65] W.Hong,\Hybridmodelsforrepresentationofimagerydata,"Ph.D.dissertation,UniversityofIllinoisatUrbanaChampaign,Aug.2006.
PAGE 140
JieyanFanwasbornonJul26,1979inShanghai,China.Theonlychildinthefamily,hegrewupmostlyinhishometown,graduatingfromtheHighSchoolAliatedtoFudanUniversityin1997.HeearnedhisB.S.andM.S.inelectricalengineeringfromShanghaiJiaoTongUniversity,Shanghai,China,in2001and2004,respectively.HeiscurrentlyaPh.D.candidatewithelectricalandcomputerengineering,UniversityofFlorida,Gainesville,FL.Hisresearchinterestsarenetworksecurityandpatternclassication.UponcompletionofhisPh.D.program,JieyanwillbeworkinginYahoo!Inc,Sunnyvale,CA. 140

