Network Centric Traffic Analysis

Material Information

Title:
Network Centric Traffic Analysis
Creator:
Fan, Jieyan
Place of Publication:
[Gainesville, Fla.]
Publisher:
University of Florida
Publication Date:
2007
Physical Description:
1 online resource (140 p.)

Thesis/Dissertation Information

Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Electrical and Computer Engineering
Committee Chair:
Wu, Dapeng
Committee Members:
Li, Tao
Yang, Liuqing
Chen, Shigang
Graduation Date:


Subjects / Keywords:
Algorithms ( jstor )
Analyzers ( jstor )
Arithmetic mean ( jstor )
Distance functions ( jstor )
Feature extraction ( jstor )
Machine learning ( jstor )
Mathematical vectors ( jstor )
Simulations ( jstor )
Spectral energy distribution ( jstor )
Voice data ( jstor )
Electrical and Computer Engineering -- Dissertations, Academic -- UF
classification, network, security, traffic
Electronic Thesis or Dissertation
born-digital ( sobekcm )
Electrical and Computer Engineering thesis, Ph.D.


Over the past few years, the Internet infrastructure has become a critical part of the global communications fabric. The emergence of new applications and protocols (such as voice-over Internet Protocol, peer-to-peer, and video on demand) also increases the complexity of the Internet. All these trends increase the demand for more reliable and secure service. This has drawn the interest of Internet service providers (ISPs) to network centric traffic analysis. Our study considers network centric traffic analysis from the two perspectives that most interest ISPs: network centric anomaly detection, and network centric traffic classification. In the first part, we focus on network centric anomaly detection. Despite the rapid advance in networking technologies, detection of network anomalies at high-speed switches/routers is still far from mature. To push the frontier, two major technologies need to be addressed. The first is efficient feature-extraction algorithms/hardware that can match a line rate on the order of Gb/s. The second is fast and effective anomaly detection schemes. Our study addresses both issues. The novelties of our scheme are the following. First, we design an edge-router based framework that detects network anomalies as they first enter an ISP's network. Second, we propose the so-called two-way matching features, which are effective indicators of network anomalies. We also design data structures to extract the features efficiently. Our detection scheme exploits both temporal and spatial correlations among network traffic. Simulation results show that our scheme can detect network anomalies with high accuracy, even if the volume of abnormal traffic on each link is extremely small. In the second part, we focus on network centric traffic classification. Nowadays, VoIP and IPTV are becoming increasingly popular. To tap the potential profits that VoIP and IPTV offer, carrier networks must efficiently and accurately manage and track the delivery of IP services.
Yet the emergence of a bloom of new zero-day voice and video applications such as Skype, Google Talk, and MSN poses tremendous challenges for ISPs. The traditional approach of using port numbers to classify traffic is infeasible because many of these applications use dynamic port numbers. The proliferation of proprietary protocols and the usage of encryption techniques make application-level analysis infeasible. Our study focuses on a statistical pattern classification technique to identify multimedia traffic. In particular, we focus on detecting and classifying voice and video traffic. We propose a system (VOVClassifier) for voice and video traffic classification that uses the regularities residing in multimedia streams. Experimental results demonstrate the effectiveness and robustness of our approach. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis (Ph.D.)--University of Florida, 2007.
Adviser: Wu, Dapeng.
Statement of Responsibility:
by Jieyan Fan.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Fan, Jieyan. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Resource Identifier:
663881414 ( OCLC )
LD1780 2007 ( lcc )



Full Text

Figure 3-12. Space complexity vs. collision probability for fixed time complexity.

results in Section 3.6.1, we use the same 96-bit signature, i.e., SA, DA, SP, and DP, and let R = 250,000 packets/second and F = 80 seconds, which translates to 250,000 × 80 = 20,000,000 (20M) input signatures for each simulation. These signatures are preloaded into memory before the beginning of the simulations so that the I/O speed of the hard drive does not affect the execution time of the simulations.
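As a rough illustration of how such a probe-count measurement might be set up (this is not the dissertation's actual code; the linear-probing scheme and the function name are assumptions for the sketch):

```python
import random

def avg_probes(num_sigs, table_size, seed=0):
    """Toy stand-in for the hash table simulation described above:
    insert random 96-bit signatures into an open-addressing table and
    measure the average number of probes per operation, which
    approximates E[T_h] for the chosen memory size.  Linear probing is
    an illustrative assumption, not the dissertation's scheme."""
    rng = random.Random(seed)
    table = [None] * table_size
    total_probes = 0
    for _ in range(num_sigs):
        sig = rng.getrandbits(96)          # 96-bit signature, as in the text
        idx = sig % table_size             # hash: low-order bits of signature
        probes = 1
        while table[idx] is not None and table[idx] != sig:
            idx = (idx + 1) % table_size   # probe the next bucket on collision
            probes += 1
        table[idx] = sig
        total_probes += probes
    return total_probes / num_sigs
```

Averaging over many runs with different random signatures mirrors how the measured probe count converges to its expectation, and raising the load factor shows the expected growth in probes per query.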

For each simulation run of the hash table algorithm, we specify the memory size m_h and measure the algorithm performance in terms of the average number of hash function calculations per signature query request, denoted by T_h, and the execution time. By the Law of Large Numbers, T_h approaches the expected number of hash function calculations per signature query request, i.e., E[T_h] in Equation (3-6), if we run the simulation many times with the same m_h. In our simulations, we run the hash table algorithm ten times, each time with a different set of input signatures but with the same m_h.

For each simulation run of the BFA algorithm, we specify the memory size m_a and the number of hash functions K, and measure the algorithm performance in terms of the collision frequency, denoted by f_c, and the execution time. The collision frequency is defined as the ratio of the number of collision occurrences in BloomFilterSearch to the

The optimal partition (see Equation (6-39)), in terms of the minimum coding length criterion, should minimize the coding length of the segmented data, i.e.,

min_Π L_c(T; Π) = min_Π Σ_{k=1}^{K} [ L_c(T_k) + |T_k| log2(N / |T_k|) ],   (6-45)

where Π denotes the partition scheme. The first term in Equation (6-45) is the summation of the coding lengths of the groups, and the second is the number of bits needed to encode the membership of each item of T in the K groups.

The optimal partition is achieved in the following way. Let the segmentation scheme be represented by the membership matrices,

Π_k = diag([π_{1k}, π_{2k}, ..., π_{Nk}]) ∈ R^{N×N},   (6-46)

where π_{nk} denotes the probability that vector Q(n) belongs to subset k, such that

Σ_{k=1}^{K} π_{nk} = 1,  for ∀n = 1, ..., N,   (6-47)

and diag(·) denotes converting a vector to a diagonal matrix.

It has been proved [?, page 34] that the coding length is bounded as follows:

L_c(T; Π) ≤ Σ_{k=1}^{K} [ (tr(Π_k) + M)/2 · log2 det( I + M/(ε² tr(Π_k)) Q Π_k Qᵀ ) + tr(Π_k) · log2(N / tr(Π_k)) ] ≜ L̂_c(Q; Π),   (6-48)

where tr(·) denotes the trace of a matrix, det(·) denotes the matrix determinant, and Q = [Q(1), ..., Q(N)] is the data matrix.

Combining Equations (6-45) and (6-48), one arrives at a minimax criterion: minimize the worst-case (upper-bound) coding length,

Π* = arg min_Π max L_c(T; Π) = arg min_Π L̂_c(Q; Π).   (6-49)
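As a rough Python sketch of evaluating such a coding-length objective for a candidate soft partition (the symbol names and exact constants follow the reconstruction of Equation (6-48) above, so treat this as illustrative rather than the original implementation):

```python
import numpy as np

def coding_length_bound(Q, memberships, eps=0.1):
    """Illustrative evaluation of the segmentation coding-length bound.
    Q           : M x N data matrix whose columns are the vectors Q(n).
    memberships : list of length-N probability vectors pi_k, one per
                  group, summing to one across groups for each n.
    eps         : allowed distortion of the lossy coding (assumed knob)."""
    M, N = Q.shape
    total = 0.0
    for pi_k in memberships:
        nk = float(np.sum(pi_k))              # "soft" group size tr(Pi_k)
        if nk <= 0:
            continue
        Pi_k = np.diag(pi_k)                  # membership matrix, Eq. (6-46)
        _, logdet = np.linalg.slogdet(
            np.eye(M) + (M / (eps ** 2 * nk)) * Q @ Pi_k @ Q.T)
        total += (nk + M) / 2.0 * logdet / np.log(2.0)  # group coding length
        total += nk * np.log2(N / nk)                    # membership bits
    return total
```

A segmentation algorithm would then search over membership assignments for the partition minimizing this bound.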

nodes, which represent whether an edge router is in an abnormal state. Next, we explain the algorithm in detail.

1. function ViterbiDecodeHMT(...)
2. Argument 1: transition probabilities, {P(Ω_i | Ω_ρ(i)); i ∈ Ξ \ Ξ_0}.
3. Argument 2: likelihoods, {p(φ_i | Ω_i); i ∈ Ξ}.
4. Argument 3: features, φ.
5. Return: estimated node states, {û_i; i ∈ Ξ}.
6. compute {P(Ω_i | φ); i ∈ Ξ_0} and {P(Ω_i, Ω_ρ(i) | φ); i ∈ Ξ \ Ξ_0}
   from {P(Ω_i | Ω_ρ(i)); i ∈ Ξ \ Ξ_0} and {p(φ_i | Ω_i); i ∈ Ξ} by belief propagation
7. for ∀i ∈ Ξ_0
8.     û_i = arg max_u P(Ω_i = u | φ)
9. end for
10. for l = 1, ..., L − 1
11.     for ∀i ∈ Ξ_l
12.         û_i = arg max_u P(Ω_i = u | Ω_ρ(i) = û_ρ(i), φ)
13. end for

Figure 4-11. Viterbi algorithm for HMT decoding.
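As an illustrative companion to the listing, a generic max-product (Viterbi) decoder for a hidden Markov tree might look as follows; the function name, data layout, and two-pass structure are assumptions for the sketch, not the dissertation's code:

```python
import numpy as np

def viterbi_hmt(children, root, prior, trans, lik):
    """Illustrative max-product (Viterbi) decoding on a hidden Markov
    tree: an upward pass scores each subtree per parent state, then a
    downward pass follows back-pointers from the decoded root state.
    children : dict mapping node -> list of child nodes
    prior    : prior distribution over the root state, shape (S,)
    trans    : trans[u, v] = P(child state v | parent state u)
    lik      : dict mapping node -> likelihood vector p(feature | state)"""
    delta, best = {}, {}

    def upward(i):
        # delta_i(u): best score of the subtree rooted at i, given state u
        d = np.asarray(lik[i], dtype=float)
        for c in children.get(i, []):
            upward(c)
            scores = trans * delta[c]        # scores[u, v] for parent state u
            best[c] = scores.argmax(axis=1)  # best child state per parent state
            d = d * scores.max(axis=1)
        delta[i] = d

    upward(root)
    states = {root: int(np.argmax(delta[root] * prior))}
    stack = [root]
    while stack:                              # downward pass
        i = stack.pop()
        for c in children.get(i, []):
            states[c] = int(best[c][states[i]])
            stack.append(c)
    return states
```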

By Bayes' formula, the terms in Equation (4-39) can be computed by

P(Ω_i = u_i | Ω_ρ(i) = u_ρ(i), φ) = P(Ω_i = u_i, Ω_ρ(i) = u_ρ(i) | φ) / P(Ω_ρ(i) = u_ρ(i) | φ).

6.5 Experiment Results

In this section, we demonstrate the experiment results of applying the system

presented in Figure 6-1 to network traffic classification. Before that, we first describe

experiment settings in Section 6.5.1.

6.5.1 Experiment Settings

We perform four sets of experiments. In Section 6.5.2, two sets of experiments are conducted on traffic generated by Skype. In Section 6.5.3, the other two sets of experiments are conducted on traffic generated by Skype, MSN, and GTalk.

For each set of experiments, we use Receiver Operating Characteristics (ROC) [28, page 107] as the performance metric. An ROC curve plots detection probability, P_D, vs. false alarm probability, P_FA, where

P_{D|H} = P(the estimated state of nature is H | the true state of nature is H),   (6-59)

P_{FA|H} = P(the estimated state of nature is H | the true state of nature is not H),   (6-60)

where H can be voice, video, file+voice, or file+video. By tuning the threshold parameters (see Figure 6-7), one is able to generate the ROC curve.

During the experiments, we collected network traffic from three applications, i.e., Skype, MSN, and GTalk. For each application, traffic was collected in two scenarios. In the first scenario, two lab computers, located in University A and University B respectively, communicated with each other over a direct connection between the two peers. In the second, we used a firewall to block the direct connection between the two peers, such that the application was forced to use relay nodes.

To do classification, we chose the first 10 seconds of each flow, i.e., Δ_max = 10 seconds. We set T_s = 0.5 milliseconds. Hence, I_d = 20,000.


Figure 6-10. The ROC curves of single-typed flows generated by Skype, MSN, and GTalk:
(a) VOICE and (b) VIDEO.

Table 6-1: Typical PD and PFA values.
Skype Skype+MSN+GTalk
PFA (PD) Single Hybrid Single Hybrid
VOICE 0(1) 0(1) 0(.995) .002(.986)
VIDEO 0(.993) 0(.965) 0(.952) 0(.948)

Voice flows vs. video flows. From Table 6-1, one notes that classification of VOICE traffic is more accurate than that of VIDEO. Specifically, we can achieve 100% accurate classification of Skype voice flows. This is due to the fact that voice traffic has higher regularity than video traffic does (Figure 5-8 and Figure 5-9). One can immediately tell the dominant periodic component at 33 Hz in the voice flows. This frequency corresponds to the 30-millisecond IPD of the employed voice coding. On the other hand, video PSDs have peaks at 0, which means that a non-periodic component dominates in video flows. One can see that the PSDs of the two video flows are close to each other. That is the reason why our approach achieves high classification accuracy by using PSD features.

Single-typed flows vs. hybrid flows. From Table 6-1, one can see that the

classification of single-typed flows is more accurate than that of hybrid flows. Mixing

multiple types of traffic together is like increasing noise. Hence, it is not surprising that

classification accuracy is reduced.

To formulate the model, we define a random variable Ω : H → Z_2, such that

Ω(H_u) = u,  u ∈ Z_2.   (4-10)

Furthermore, denote by A the traffic observed by traffic monitors. Since the network state induces the stochastic process of the traffic, we employ the widely-used graphical model representation [31] to depict this cause-effect relationship in Figure 4-1.


Figure 4-2. Extended generative model including traffic feature vectors: (a) original model
and (b) simplified model.

Denote by φ ∈ Φ the feature extracted from traffic, where Φ is the feature space. Most importantly, in selecting the optimal features, we seek the most discriminative statistical properties of the traffic. Also note that it is possible to employ multiple features in the detection procedure, in which case φ is a vector. Since features are succinct representations of the voluminous traffic, we extend the above model in Figure 4-1 to the one illustrated in Figure 4-2(a). Once φ is extracted from A, we assume that φ represents A well. It means that we may operate only over the lower-dimensional φ, which reduces computational complexity. Therefore, we simplify the model in Figure 4-2(a) to that illustrated in Figure 4-2(b), where A is dismissed.

Since the feature is measurable, it is called an observable random vector, and is depicted by a rectangular node in Figure 4-2(b). The network state generating the traffic feature is to be estimated. We call it a hidden random variable, and depict it by a round node in Figure 4-2(b). Now, the goal becomes to estimate the hidden state Ω given the observable

Equation (6-33) shows the relationship between the periodic components of a stochastic process in the continuous-time domain and the shape of its PSD in the frequency domain.

f(f; p_d) is a continuous function in the frequency domain. To handle it in a computer, we need to sample it in the frequency domain. In other words, we select a series of frequencies

0 < f_1 < f_2 < ... < f_M ≤ f_s / 2,   (6-34)

and define the PSD feature vector as

φ = [f(f_1; p_d), f(f_2; p_d), ..., f(f_M; p_d)]ᵀ.   (6-35)

φ ∈ R^M is the feature vector we use to perform classification.
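A minimal sketch of computing such a PSD feature vector from a sampled traffic process; the periodogram estimator and the nearest-bin sampling of the requested frequencies are illustrative assumptions, not necessarily the dissertation's exact estimator:

```python
import numpy as np

def psd_feature(samples, Ts, freqs):
    """Estimate the PSD of a sampled traffic process with a simple
    periodogram and read it off at chosen frequencies f_1 < ... < f_M,
    in the spirit of Equation (6-35).
    samples : process values sampled every Ts seconds
    Ts      : sampling period in seconds
    freqs   : frequencies at which to sample the PSD"""
    x = np.asarray(samples, dtype=float)
    x = x - x.mean()                              # drop the DC component
    psd = np.abs(np.fft.rfft(x)) ** 2 / len(x)    # periodogram estimate
    bins = np.fft.rfftfreq(len(x), d=Ts)          # bin center frequencies
    idx = [int(np.argmin(np.abs(bins - f))) for f in freqs]
    return psd[idx]
```

For instance, a flow carrying a strong 33 Hz periodic component (as the voice flows discussed later do) yields a PSD feature with far more energy at 33 Hz than at off-peak frequencies.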

In the next section, we introduce a new technique that we use to translate the characteristics of these high-dimensional feature vectors into a more tractable low-dimensional space.

6.3 Subspace Decomposition and Bases Identification on PSD Features

In _r inl,, scientific and engineering problems, the data of interest can be viewed

as drawn from a mixture of geometric or statistical models instead of a single one.

Such data are often referred to in different contexts as "mixed," or "multi-modal," or

"multi-model," or "heterogeneous," or "hybrid." Subspace decomposition is a general

method for modeling and segmenting such mixed data using a collection of subspaces,

also known in mathematics as a subspace arrangement. By introducing certain new

algebraic models and techniques into data clustering, traditionally a statistical problem,

the subspace decomposition methodology offers a new spectrum of algorithms for data

modeling and clustering that are in many aspects more efficient and effective than (or

complementary to) traditional methods, e.g., principle component ii 1, -i (PCA),

Expectation Maximization (EM), and K-Means clustering.


Figure 4-6. Probability density function of the univariate Gaussian distribution N(x; 0, 1).



Figure 4-7. Histogram of the two-way matching features measured at a real network

during network anomalies.

However, multiple peaks may exist in the empirical distribution of the feature. For example, Figure 4-7 shows the histogram of the two-way matching features measured in a real network during DDoS attacks. It has two peaks. Hence, the unimodal Gaussian distribution is not suitable. In this work, we adopt the Gaussian mixture model (GMM) to model the likelihood distribution.
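A minimal EM sketch of fitting such a one-dimensional GMM to a bimodal feature sample; the percentile initialization and fixed iteration count are assumptions for the sketch, and the dissertation's fitting procedure may differ:

```python
import numpy as np

def fit_gmm_1d(x, k=2, iters=100):
    """Tiny EM fit of a 1-D Gaussian mixture, illustrating how a
    bimodal feature histogram such as Figure 4-7 can be modeled."""
    x = np.asarray(x, dtype=float)
    mu = np.percentile(x, np.linspace(10, 90, k))   # spread initial means
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) \
               / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = resp.sum(axis=0)
        w, mu = nk / len(x), (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var
```

Applied to data with two well-separated peaks, the fitted means land near the two modes, which is exactly what a single Gaussian cannot capture.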


Figure 3-16. Feature data: (a) Number of SYN packets (link 1), (b) Number of unmatched
SYN packets (link 1), (c) Number of SYN packets (link 2), and (d) Number of unmatched
SYN packets (link 2).





rather than the source addresses, are distributed. However, the paper did not resolve an important problem, i.e., how to extract features efficiently to match a high line rate on the order of Gb/s. We proposed a data structure called the Bloom filter array to address this problem.

3.2 Hierarchical Feature Extraction Architecture

Network anomaly detection is not an easy task, especially at high-speed routers.

One of the main difficulties arises from the fact that the data rate is too high to afford

complicated data processing. An anomaly detection algorithm usually works with traffic

features instead of the original traffic data itself. Traffic features can be regarded as

succinct representations of the voluminous traffic, e.g., the traffic data rate is a feature of

the traffic.

We focus on presenting our feature extraction architecture for network anomaly

detection. We also cover extraction schemes for some simple features, such as data rate

and SYN/FIN(RST) ratio. The more advanced features, the so-called two-way matching

features, are discussed later.

3.2.1 Three-Level Design


Figure 3-1. Hierarchical structure for feature extraction.

To efficiently extract features from traffic, we design a three-level hierarchical

structure (Figure 3-1), where incoming packets are processed by level-one filters, then

by level-two filters, and finally by (level-three) feature extraction modules. Level-one filters

machine learning algorithm exploits the spatial correlation among traffic on multiple links,

while the threshold-based scheme only uses the traffic on one link.


Figure 4-14. Performance of four detection algorithms

In Figure 4-14, we compare the ROC performance of four detection algorithms (the

threshold-based, the single-CUSUM, the dual-CUSUM described in Section 4.1.3, and our

machine learning algorithm) under the same feature, i.e., the number of unmatched SYN

packets. For the single-CUSUM and the dual-CUSUM algorithm, the detection delay D is

chosen from 1 to 10 slots and the parameter a_i of link i is determined by

a_i = (D̄_attack + D̄_normal) / 2,  ∀ 1 ≤ i ≤ 16,

where D̄_attack and D̄_normal are the average numbers of unmatched inbound SYN packets in attack and normal conditions, respectively.

The ROC performance (Figure 4-14) of our machine learning algorithm is the best among all the algorithms. We also see that the dual-CUSUM outperforms the simple threshold-based algorithm and that the single-CUSUM algorithm has the worst ROC performance.

4.5.3 Discussion

We would like to point out that, besides detecting network anomalies with low

volume traffic, our machine learning algorithm is also able to detect high volume anomaly,

probability, which gives the ROC curve [28]. In this work, we use the ROC curve to compare the performance of different detection algorithms.

Next, we introduce some basic classification algorithms.

4.1.2 Threshold-Based Algorithm

The idea of the threshold-based algorithm is that if the feature value exceeds a

preset threshold, declare 'abnormal'; otherwise, declare 'normal'. Note that the detection

operation is conducted in each slot. By tuning the threshold for the feature value, we

can obtain different pairs of false alarm probability and detection probability, resulting

in the ROC curve. Given the ROC curve and the desired false alarm probability, one can

determine the value of the threshold for detection operation.
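The threshold sweep described above can be sketched as follows; this is an illustrative helper, not the dissertation's code:

```python
import numpy as np

def roc_points(normal_feats, abnormal_feats, thresholds):
    """Trace (P_FA, P_D) pairs for the threshold-based algorithm by
    sweeping the detection threshold over per-slot feature values."""
    normal = np.asarray(normal_feats, dtype=float)
    abnormal = np.asarray(abnormal_feats, dtype=float)
    curve = []
    for th in thresholds:
        p_fa = float(np.mean(normal > th))    # false alarm probability
        p_d = float(np.mean(abnormal > th))   # detection probability
        curve.append((p_fa, p_d))
    return curve
```

Picking the point on the resulting curve that meets the desired false alarm probability then fixes the operating threshold.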

Although it is the simplest method, the threshold-based algorithm can only be used when the features differ significantly between normal and abnormal conditions. Therefore, the threshold-based algorithm is not suitable for detecting low volume network anomalies.

4.1.3 Change-Point Algorithm

In the literature, a simple change-point algorithm -non-parametric Cumulative

Summation (CUSUM) algorithm -has been widely used [11-13, 17]. However, existing

studies only consider the change from normal state to abnormal state, which means that

the number of false alarms can be very large after network anomalies end. To facilitate the

discussion, we define the following parameters used in CUSUM in Table 4-1:

Table 4-1: Parameters used in CUSUM
Parameter  Description
φ(t_i)     The observed traffic feature at the end of time slot i.
μ_n        The expectation of φ(t_i) in normal states.
μ_a        The expectation of φ(t_i) in abnormal states. Without losing generality, we assume that μ_n < μ_a.
φ̃(t_i)    The adjusted variable, defined as φ̃(t_i) = φ(t_i) − a, where a is a parameter such that μ_n < a < μ_a.
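Using the parameters of Table 4-1, a minimal one-sided CUSUM test can be sketched as follows; the alarm threshold h is an assumed tuning parameter, not a value taken from the dissertation:

```python
def cusum_detect(features, a, h):
    """Minimal sketch of the non-parametric CUSUM test.
    features : per-slot feature values phi(t_i)
    a        : shift parameter with mu_n < a < mu_a (Table 4-1)
    h        : alarm threshold (assumed tuning knob)
    Returns a list of per-slot decisions (True means 'abnormal')."""
    S, decisions = 0.0, []
    for phi in features:
        # accumulate only the positive drift of the adjusted variable
        S = max(0.0, S + (phi - a))
        decisions.append(S > h)
    return decisions
```

In normal slots the adjusted variable is negative on average, so the statistic stays near zero; once an anomaly raises the feature mean above a, the statistic drifts upward and eventually crosses h.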

router Y, then the so-called two-way matching features between subnet A and server B

shall be obtained at the global analyzer, which has the routing information of the ISP


The advantages of our framework design are that:

1. it is deployed on edge routers instead of the systems of end users, such that it can detect network anomalies as soon as they enter an AS;

2. it places no burden on core routers;

3. it is flexible in that detection of network anomalies can be made both locally and globally;

4. it is capable of detecting low volume network anomalies accurately by exploiting spatial correlations among edge routers.

The framework is designed to be an add-on service provided by ISP to protect end

users from network anomalies.

2.3 Summary

This chapter is concerned with the design of network anomaly detection frameworks. There are two types of frameworks, host-based and network-based; our design is of the second type. Specifically, we designed a framework deployed on edge routers. It is composed of three components: traffic monitors, local analyzers, and global analyzers. This framework is flexible in that it can detect network anomalies from both a local view and a global view of the network. By exploiting spatial correlations among edge routers, our framework is capable of detecting low volume network anomalies.

6.2.2 Power Spectral Density (PSD) Computation
6.3 Subspace Decomposition and Bases Identification on PSD Features
6.3.1 Subspace Decomposition Based on Minimum Coding Length
6.3.2 Subspace Bases Identification
6.4 Voice/Video Classifier
6.5 Experiment Results
6.5.1 Experiment Settings
6.5.2 Skype Flow Classification
6.5.3 General Flow Classification
6.5.4 Discussion
6.6 Summary

Summary of Network Centric Anomaly Detection
Summary of Network Centric Traffic Classification

A PROOFS
Equation (4-31)
Equation (4-32)
Equation (4-33)
Equation (4-34)

REFERENCES

BIOGRAPHICAL SKETCH

1.2 Introduction to Network Centric Traffic Classification

Besides network anomaly detection, the classification of normal network traffic is also of practical significance to both enterprise network administrators and ISPs. Along with the rapid emergence of new types of network applications such as VoIP, VoD, and P2P file exchange, quality of service (QoS) becomes an increasingly important issue. For example, transmission of real-time voice and video has bandwidth, delay, and loss requirements. However, there is no QoS guarantee for these real-time applications over the current best-effort network. Many schemes have been proposed to address this problem. On the other hand, enterprise network administrators may want to restrict the network bandwidth used by disallowed VoIP, VoD, or P2P applications rather than block them entirely, which might be too heavy-handed. That is, they want to limit the QoS of specific network traffic.

Wu et al.[4] summarized techniques for QoS provision for real-time streams from

the point of view of end hosts. These techniques include coding methods, protocols,

and requirements on stream servers. Another effective solution is from the point of view

of network carriers or ISPs. For example, ISPs can assign different forwarding priority

to different types of network traffic on routers. This is the motivation of differentiated

services (DiffServ)[5, 6].

DiffServ is a method designed to guarantee different levels of QoS for different classes of network traffic. It is achieved by setting the "type of service" (TOS) [7] field in the IP header, hence also called the DiffServ code point (DSCP) [5], according to the class of the network data, so that better classes get higher numbers. Unfortunately, such a design depends heavily on network protocols, especially proprietary protocols, observing DiffServ regulations. In the worst case, if all protocols set TOS to the highest number, employing DiffServ becomes counterproductive.
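For illustration, an endpoint can request a DiffServ class by setting the TOS byte on its socket; this sketch marks a UDP socket with DSCP 46 (Expedited Forwarding), assuming a Linux-like socket API, and whether routers honor the marking is entirely up to the network:

```python
import socket

# The DSCP occupies the upper six bits of the TOS byte, so DSCP 46
# (Expedited Forwarding) becomes TOS value 46 << 2 = 184.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, 46 << 2)
tos = sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)
sock.close()
```

This is exactly the kind of self-declared marking the text warns about: any application can set the highest class, which is why classification should not rely on such tags.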

For this reason, we believe a proper DiffServ scheme should be able to classify

network traffic on the fly, instead of relying on any tags in packet header. Thus, the

difficulty lies in accurate classification of network traffic in real-time.

Table 4-2: Notations for hidden Markov tree model
Notation  Description
Ω_i       The random variable representing the state of node i.
φ_i       The random variable/vector representing the feature(s) measured at node i.
φ_T       {φ_i; i ∈ T}, where T is a subtree of the HMT.
L         The number of levels of the HMT.
Ξ         The set of all nodes in the HMT.
Ξ_l       The set of nodes at level l, l ∈ Z_L, in the HMT. Specifically, Ξ_0 represents the set of root nodes and Ξ_{L−1} the set of leaf nodes.
B         The number of children of each node, except leaves. For example, B = 4 for a quad-HMT, as illustrated in Figure 4-5.
ρ(i)      The parent node of node i, where i ∉ Ξ_0.
ν(i)      The set of children of node i, where i ∉ Ξ_{L−1}.
A(i)      The set of ancestor nodes of node i, where i ∉ Ξ_0, including node i.
r(i)      The root node of the subtree containing node i, where i ∈ Ξ.
T_i       The subtree whose root is node i, where i ∈ Ξ.
T̃_i      T̃_i = T_{r(i)} \ T_i.

nodes represent clusters of edge routers. Features of nodes are defined in Equation (4-16):

φ_i = { features measured at the corresponding edge router,  i ∈ Ξ_{L−1} (i.e., leaf node)
      { (1/B) Σ_{j∈ν(i)} φ_j,                                i ∉ Ξ_{L−1} (i.e., non-leaf node)   (4-16)

One notes that only features of leaf nodes have physical meaning, i.e., features measured at the corresponding edge routers. Features of a non-leaf node are defined as the average of the features of its child nodes.

We have two assumptions for the HMT:

1. A node's state depends only on the state of its parent, if it is known, i.e.,

P(Ω_i | Ω_j, j ∈ Ξ, j ≠ i) = P(Ω_i | Ω_ρ(i)),  i ∈ Ξ \ Ξ_0   (4-17)

2. Features measured at a node depend only on the state of that node, if it is known, i.e.,

p(φ_i | Ω_j, j ∈ Ξ) = p(φ_i | Ω_i),  i ∈ Ξ   (4-18)

to sell and deliver IP services to their customers. Unfortunately, the emergence of a bloom of new zero-day voice and video applications over IP, like Skype, Google Talk (GTalk), and MSN, the proliferation of new peer-to-peer protocols that now allow the usage of voice and video among other applications, and the continuing growth of the usage of encryption techniques to protect data confidentiality lead to tremendous revenue leakage for ISPs due to their inefficiency in detecting these new applications and thus lack of proper actions. The resulting unmanaged commercial traffic adds up to losses of hundreds of millions of dollars annually and poses a solid roadblock to the profitability of ISPs' VoIP and IPTV services.

As a consequence, it is imperative for ISPs to identify robust solutions for detecting

voice and video over IP data-streams. The most common approach for identifying

applications on an IP network is to associate the observed traffic with an application

based on TCP or UDP port numbers [43, 44]. In principle the TCP and UDP server

port numbers can be used to identify the higher-layer application, by simply identifying

the server port and mapping this port to an application using the Internet Assigned

Numbers Authority (IANA) list of registered ports [45]. However, port-based application

classification has limitations due to the emergence of new applications that no longer

use fixed, predictable port numbers. For example, non-privileged users often have to use

ports above 1024 to circumvent operating system access control restrictions; or common

applications like FTP allow the negotiation of unknown and unpredictable server ports to

be used for the data transfer; or proprietary applications may deliberately try to hide their

existence or bypass port-based filters by using standard ports. For example, server port 80

is being used by a large variety of non-web applications to circumvent firewalls which do

not filter port-80 traffic; others (e.g., Skype) tend to use dynamic ports.

A more reliable technique involves stateful reconstruction of session and application

information from packet contents [46-48]. Although this avoids reliance on fixed port

numbers, it imposes significant complexity and processing load on the classification device,

Assume the length of a slot is τ. Then, we have F = w × τ. The data structure of the BFA is as follows:

* An array of bit vectors {IV_j} (j ∈ Z+), where IV_j is the jth insertion vector, holding inbound signatures in slot [τ_j, τ_{j+1}), where τ_{j+1} = τ_j + τ.

* An array of bit vectors {RV_j} (j ∈ Z+), where RV_j is the jth removal vector, holding outbound signatures in slot [τ_j, τ_{j+1}).

* An array of counters {C_j} (j ∈ Z+), where C_j is used to count the number of unmatched inbound flows (UIFs) in slot [τ_j, τ_{j+1}).

Since the two-way flows need to be matched within a time interval of length F, we only need to keep information within a time window of length F. That is, if the current slot is [τ_j, τ_{j+1}), only {IV_{j−w+1}, ..., IV_j}, {RV_{j−w+1}, ..., RV_j}, and {C_{j−w+1}, ..., C_j} are kept in memory.

3.5.2 Algorithm

Our algorithm for the BFA (Figure 3-8) consists of three functions, namely, ProcInbound, ProcOutbound, and Sample, which are described below.

Function ProcInbound processes inbound packets. It works as follows. When an inbound packet arrives during [τ_j, τ_{j+1}), we increase C_j by one and insert its inbound signature s into IV_j if neither of the following conditions is satisfied:

1. s is stored in at least one RV_{j'}, where j − w + 1 ≤ j' ≤ j;

2. s is stored in IV_j.

Condition 1 being true means that the corresponding outbound flow of this inbound packet has been observed previously, so we should not count it as an unmatched inbound packet. Condition 2 being true means that the inbound flow to which this inbound packet belongs has already been observed during the current slot j, so we should not count the same inbound flow again. If both conditions are false, we increase C_j by one to indicate a new potential UIF (lines 7 to 10).

1. function ProcInbound(s)
2.   a ← false, b ← false
3.   if ∃j', j − w + 1 ≤ j' ≤ j, such that BloomFilterSearch(RV_{j'}, s) returns true then
4.     a ← true
5.   if BloomFilterSearch(IV_j, s) returns true then
6.     b ← true
7.   if a and b are both false
8.     C_j ← C_j + 1
9.     BloomFilterInsert(IV_j, s)
10.  end if
11. end function
12. function ProcOutbound(s')
13.   for j' = j to j − w + 1
14.     if BloomFilterSearch(RV_{j'}, s') returns true
15.       break
16.     if BloomFilterSearch(IV_{j'}, s') returns true
17.       C_{j'} ← C_{j'} − 1
18.   end for
19.   BloomFilterInsert(RV_j, s')
20. end function
21. function Sample(j)
22.   return C_{j−w+1}
23. end function
Figure 3-8. Bloom Filter Array Algorithm.
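As a rough executable sketch of the Figure 3-8 logic, the following uses exact Python sets in place of the Bloom filter bit vectors and a single canonical flow key for both directions; both are simplifying assumptions (real Bloom filters trade occasional false positives for much less memory):

```python
from collections import deque

class BloomFilterArray:
    """Illustrative sliding-window matcher in the spirit of Figure 3-8."""

    def __init__(self, w):
        self.w = w                    # window length in slots
        self.IV = deque(maxlen=w)     # insertion vectors (inbound signatures)
        self.RV = deque(maxlen=w)     # removal vectors (outbound signatures)
        self.C = deque(maxlen=w)      # per-slot potential-UIF counters
        self.new_slot()

    def new_slot(self):
        # open slot j; slots older than j - w + 1 fall out automatically
        self.IV.append(set())
        self.RV.append(set())
        self.C.append(0)

    def proc_inbound(self, s):
        # count s only if no matching outbound flow was seen in the window
        # (condition 1) and s was not already seen this slot (condition 2)
        if not any(s in rv for rv in self.RV) and s not in self.IV[-1]:
            self.C[-1] += 1
            self.IV[-1].add(s)

    def proc_outbound(self, s):
        # walk back through the window, undoing counts for the matched
        # inbound flow, stopping at a slot that already saw this outbound flow
        for k in range(len(self.IV) - 1, -1, -1):
            if s in self.RV[k]:
                break
            if s in self.IV[k]:
                self.C[k] -= 1
        self.RV[-1].add(s)

    def sample(self):
        # report the oldest (fully matched) slot's counter,
        # mirroring Sample(j) returning C_{j-w+1}
        return self.C[0]
```

An unanswered inbound flow leaves its counter positive, while a matched two-way exchange cancels out; that residual count is what makes the two-way matching feature sensitive to one-sided attack traffic.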

Function ProcOutbound processes outbound packets. It works as follows. When an outbound packet arrives during [τ_j, τ_{j+1}), we check whether we need to update counter C_{j'} for each j' (j − w + 1 ≤ j' ≤ j). Specifically, for each such j', we decrease C_{j'} by one if the outbound signature s' satisfies both of the following conditions:

1. s' is not contained in RV_{j'};

2. s' is contained in IV_{j'}.

Condition 1 being true means that no packet from the outbound flow to which this outbound packet belongs arrived during the j'th time slot. Condition 2 being true means that the matched inbound flow of this outbound packet was observed in the j'th slot. Satisfying both conditions means that its matched inbound flow has been counted as a potential UIF; hence, upon the arrival of the outbound packet, we need to decrease C_{j'} by one.


First of all, I thank my advisor, Professor Dapeng Wu, for his great inspiration, excellent guidance, deep thoughts, and friendship. I also thank my supervisory committee members, Professors Shigang Chen, Liuqing Yang, and Tao Li, for their interest in my work.

I also express my appreciation to all of the faculty, staff, and my fellow students in the Department of Electrical and Computer Engineering. In particular, I extend my thanks to Dr. Kejie Lu for his helpful discussions.


4.1 Introduction

The third issue in network anomaly detection is the classification algorithm. In this section, we introduce three basic detection algorithms: the threshold-based algorithm, the change-point algorithm, and Bayesian decision theory. Our machine learning algorithm derives from Bayesian decision theory.

This section is organized as follows. Section 4.1.1 introduces the Receiver Operating Characteristics curve. It is used in Section 4.5 as the metric to compare the performance of different classification methods. We describe the threshold-based algorithm, the change-point algorithm, and Bayesian decision theory in Sections 4.1.2, 4.1.3, and 4.1.4, respectively.

4.1.1 Receiver Operating Characteristics Curve

The Receiver Operating Characteristics (ROC) curve [28] is a typical method to quantify the performance of detection algorithms. It is a plot of detection probability vs. false alarm probability. In practice, we estimate the detection probability and the false alarm probability by the fraction of true positives and the fraction of false positives, respectively. Hence, to obtain an ROC curve, one needs to measure the following quantities:

* Af: the number of false alarms, i.e., the number of slots in which the detection
algorithm declares 'abnormal' given that no anomaly actually happens in these slots;

* An: the number of slots in which no anomaly happens;

* Ad: the number of slots in which the detection algorithm declares 'abnormal' given
that network anomalies actually happen in these slots;

* Aa: the number of slots in which network anomalies happen.

The false alarm probability and the detection probability of the detection algorithm can be estimated by Af/An and Ad/Aa, respectively. By varying parameters of the detection algorithms, we can obtain different pairs of false alarm probability and detection probability.
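The estimation described above can be sketched directly; the function below (names are ours, not from the text) computes one ROC operating point from per-slot ground truth and per-slot detector decisions.

```python
# Sketch: estimating one ROC operating point from per-slot labels, following
# the estimates above (false alarm prob. = Af/An, detection prob. = Ad/Aa).

def roc_point(truth, declared):
    """truth[i]: slot i contains an anomaly; declared[i]: slot i is flagged."""
    A_n = sum(1 for t in truth if not t)                  # slots with no anomaly
    A_a = sum(1 for t in truth if t)                      # slots with an anomaly
    A_f = sum(1 for t, d in zip(truth, declared) if d and not t)  # false alarms
    A_d = sum(1 for t, d in zip(truth, declared) if d and t)      # detections
    return A_f / A_n, A_d / A_a
```

Sweeping a detector parameter (e.g. a threshold) and plotting the resulting points yields the ROC curve.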

For the same scenario, in which one needs to estimate the AR model from order 1 up to p_max, the time complexity of LDA is O(p_max^2), much better than the direct solution given by Equations (6-17) and (6-18).

1. function PSDEstimate({y(i)}_{i=1}^{I}, p)
2.     Argument 1: data, {y(i)}_{i=1}^{I}.
3.     Argument 2: order, p.
4.     Return: PSD Φ(ω; y).

    [a_p, σ_p²] ← LDA({y(i)}_{i=1}^{I}, p)   (6-30)
    Φ(ω; y) = σ_p² / |1 + Σ_{t=1}^{p} a_t e^{−jωt}|²   (6-31)

Figure 6-4. Parametric PSD estimate using the Levinson-Durbin algorithm.

Once the AR model is estimated, one can estimate the PSD of the signal {y(i)}_{i=1}^{I}. The procedure is given in Figure 6-4.

PSD feature vector.

According to the above discussion, we now define the PSD feature vector of a flow as follows. Let us assume {P_d(i)}_{i=1}^{I_d} (see Equation (6-4)) to be second-order stationary. Then its PSD can be estimated as

Φ(ω; P_d) = PSDEstimate({P_d(i)}_{i=1}^{I_d}, p),   (6-32)

where ω ∈ [−π, π) and p is the pre-specified order.

Recall that {P_d(i)}_{i=1}^{I_d} are obtained by sampling a continuous-time signal P_h(t) at time interval T_s (see Figure 6-2). Thus, one can further formulate the PSD in terms of the real frequency f as

Φ_f(f; P_d) = (1/F_s) Φ(2πf/F_s; P_d),  f ∈ [−F_s/2, F_s/2).   (6-33)
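As a hedged illustration of the estimate in Figure 6-4, the function below evaluates Φ(ω; y) = σ_p² / |1 + Σ a_t e^{−jωt}|² for given AR coefficients; the coefficients and innovation variance themselves would come from the Levinson-Durbin algorithm.

```python
import cmath

# Evaluate the parametric (AR-model) PSD of Figure 6-4 at one frequency.
# a: AR coefficients a_1..a_p; sigma2: innovation variance; omega in [-pi, pi).

def ar_psd(a, sigma2, omega):
    denom = 1 + sum(a_t * cmath.exp(-1j * omega * t)
                    for t, a_t in enumerate(a, start=1))
    return sigma2 / abs(denom) ** 2
```

For white noise (p = 0) this reduces to a flat spectrum equal to the variance, which is a quick sanity check of the formula.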

A.3 Equation (4-33)

P(Q_i = u | ψ) = T_i(u) ν_i(u) / Σ_{u'} T_i(u') ν_i(u'),   (A-3)

i.e., the posterior marginal of node i is the product of its two belief propagation messages, T_i(u) and ν_i(u), normalized over all states u'.

A.4 Equation (4-34)

The pairwise posterior is obtained analogously: expanding the joint probability and canceling common factors yields

P(Q_i = u, Q_p(i) = u' | ψ) ∝ T_i(u) P(Q_i = u | Q_p(i) = u') ν_p(i)(u'),   (A-4)

normalized so that the probabilities sum to one over all pairs (u, u').

goals, we propose to use a hierarchical model, the hidden Markov tree (HMT) model, as

depicted in Figure 4-5.



Figure 4-5. Hidden Markov tree model. For a node i, p(i) denotes its parent node and ν(i) denotes the set of its children nodes.

The motivation for applying the HMT model is that we assume edge routers are not equally correlated. Instead, edge routers topologically close to each other have high mutual correlations. Based on this assumption, we cluster edge routers according to the topology of the AS and form a tree structure, as depicted in Figure 4-5. Without loss of generality, Figure 4-5 plots a quad-tree structure, i.e., each node, except leaf nodes, has four children. To facilitate further discussion, each node in the HMT is assigned an integer number, beginning with 0, from top to bottom. That is, node 0 is always a root node.² Table 4-2 lists the notations used in the rest of the paper for the HMT.

In the HMT, each leaf node stands for an edge router. Zero-padding virtual edge routers are introduced when the number of edge routers is not a power of B. The states of these zero-padding virtual nodes are always normal and their features are always 0. Non-leaf

² An HMT might have multiple roots, depending on the number of edge routers and the number of levels.
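A minimal sketch of this node numbering and zero-padding, assuming a single-rooted full B-ary tree (the footnote notes that multiple roots are possible; that case is not handled here, and the helper names are ours):

```python
import math

# Level-order numbering of a full B-ary HMT: node 0 is the root, then each
# level is numbered left to right. Leaves are padded with zero-feature
# virtual routers up to the next power of B.

def hmt_layout(num_routers, B=4):
    levels = max(1, math.ceil(math.log(num_routers, B)))
    leaves = B ** levels                               # zero-padded leaf count
    total = sum(B ** l for l in range(levels + 1))     # nodes in the full tree
    first_leaf = total - leaves                        # index of the first leaf
    return leaves, total, first_leaf

def parent(i, B=4):
    # parent index under level-order numbering (node 0 has no parent)
    return (i - 1) // B
```

For example, 10 edge routers under a quad-tree are padded to 16 leaves, giving a 21-node tree whose leaves start at index 5.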

i.e., σ², a_1, ..., a_p. Its solution is

a_p = −R_p^{−1} [r(1; y), ..., r(p; y)]^T,   (6-17)

where R_p is the p × p matrix whose (k, l) entry is r(k − l; y), and

σ² = r(0; y) + Σ_{t=1}^{p} r(−t; y) a_t.   (6-18)

Levinson-Durbin algorithm (LDA). The direct solution of the Yule-Walker method, i.e., Equations (6-17) and (6-18), is not good enough in terms of time complexity. Equation (6-17) requires inverting the covariance matrix, whose time complexity is O(p³) [60, page 755]. In addition, in most applications, there is no a priori information about the true order p. To cope with that, the Yule-Walker system of equations, Equation (6-15), has to be solved for p = 1 up to p = p_max, where p_max is some prespecified maximum order. The time complexity is then O(p_max^4).

In this paper, we use the Levinson-Durbin algorithm (LDA) [59] to reduce the time complexity. It estimates the AR signal coefficients recursively in the order p. To facilitate further discussion and to emphasize the order p, we denote

R_{p+1} = [ r(0; y)    r(−1; y)   ···  r(−p; y)
            r(1; y)    r(0; y)    ···  r(−p+1; y)
            ···
            r(p; y)    r(p−1; y)  ···  r(0; y) ],   (6-19)

a_p = [a_1, a_2, ..., a_p]^T.   (6-20)
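A hedged sketch of the Levinson-Durbin recursion follows, using the convention y(i) + Σ_t a_t y(i − t) = e(i), consistent with the PSD formula of Figure 6-4; the input is the (real) autocorrelation sequence r(0; y), ..., r(p; y). This is an illustration of the standard recursion, not the dissertation's own code.

```python
# Levinson-Durbin recursion: solve the Yule-Walker equations for all AR
# orders 1..p in O(p^2) total, instead of O(p^3) per order by direct matrix
# inversion. Returns the order-p coefficients a_1..a_p and the innovation
# variance E_p.

def levinson_durbin(r, p):
    a = []                       # coefficients a_1..a_m at the current order m
    E = r[0]                     # prediction-error (innovation) variance
    for m in range(1, p + 1):
        # reflection coefficient k_m
        acc = r[m] + sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        k = -acc / E
        # order update: a_i^(m) = a_i^(m-1) + k_m * a_{m-i}^(m-1)
        a = [a[i] + k * a[m - 2 - i] for i in range(m - 1)] + [k]
        E *= (1 - k * k)
    return a, E
```

Because the recursion passes through every intermediate order, the coefficients for orders 1 through p_max all come out of a single O(p_max^2) run, which is the complexity advantage cited above.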


We show the features extracted from the two traces, specifically, the number of SYN packet arrivals and the number of unmatched inbound SYN packet arrivals during a slot (Figure 3-16).

From Figure 3-16, it can be observed that the features are rather noisy, especially the number of SYN packets. From Figs. 3-16(a) and 3-16(c), we can hardly distinguish the slots under the low volume synchronized attacks from the slots without attacks (by visual inspection). In comparison, it is much easier to identify the slots under the synchronized attacks (by visual inspection) when the number of unmatched SYN packets is used as the feature (see Slots 1400-1600 and Slots 2800-3200 in Figs. 3-16(b) and 3-16(d)).

3.8 Summary

This chapter is concerned with the design of data structures and algorithms for network anomaly detection, more specifically, feature extraction for network anomaly detection. Our objective is to design efficient data structures and algorithms for feature extraction that can cope with a link with a line rate in the order of Gbps. We proposed a novel data structure, namely the Bloom filter array, to extract the so-called two-way matching features, which are shown to be effective indicators of network anomalies. Our key technique is to use a Bloom filter array to trade off a small amount of accuracy in feature extraction for much less space and time complexity. Different from the existing work, our data structure has the following properties: 1) an array of Bloom filters, 2) combination of a sliding window with the Bloom filter, and 3) use of an insertion-removal pair to enhance the Bloom filter with a removal operation. Our analysis and simulation demonstrate that the proposed data structure has a better space/time trade-off than conventional

Next, we discuss classification algorithm based on extracted features.

Equation (4-4) is equivalent to

H* = arg min_{H*} Σ_H λ(H*, H) p(ψ | H) P(H).   (4-6)

Equation (4-6) gives the Bayesian criterion for pattern classification.

A simple loss function is defined to be

λ(H_{u*}, H_u) = −ξ(u) I(u* = u), for ∀u ∈ Z_U,   (4-7)

where ξ(u) is the gain factor, typically positive, representing the gain obtained by correctly detecting H_u, and I(·) is the indicator function such that

I(x) = 1 if x is true, 0 if x is false.   (4-8)

Equation (4-7) specifies that misclassification induces zero loss and correct classification induces negative loss, which actually achieves gain. Applying Equation (4-7) to Equation (4-6), the Bayesian criterion is simplified to

u* = arg max_u ξ(u) p(ψ | H_u) P(H_u).   (4-9)

We call Equation (4-9) the maximum gain criterion. Further note that scaling all gain factors by the same factor does not change the criterion specified by Equation (4-9). Hence, we can always set ξ(0) = 1. By tuning the other gain factors, we can generate the ROC curve for Bayesian decision theory.
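The maximum gain criterion amounts to a one-line decision rule; a hedged sketch (argument names are ours, not the dissertation's):

```python
# Maximum gain criterion: pick the hypothesis u that maximizes
# gain(u) * p(psi | H_u) * P(H_u). Raising the gain of the 'abnormal'
# class while keeping the 'normal' gain fixed at 1 traces out an ROC curve.

def max_gain_decision(likelihoods, priors, gains):
    """All arguments: dict mapping class label u -> value."""
    return max(likelihoods, key=lambda u: gains[u] * likelihoods[u] * priors[u])
```

With equal gains this is just the maximum a posteriori rule; a larger abnormal-class gain biases decisions toward declaring an anomaly, raising both detection and false alarm probabilities.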

In this chapter, we extend Bayesian decision theory for network anomaly detection. The remainder of this chapter is organized as follows. In Section 4.2, we establish Bayesian models for network anomaly detection. Sections 4.3 and 4.4 solve the two fundamental problems of the Bayesian model, i.e., the training problem and the classification problem, respectively. Section 4.5 shows our simulation results and Section 4.6 concludes this chapter.

In our experiment, we use two 24-hour traces as the background traffic. We simulate network anomalies caused by TCP SYN flood attacks [18] by randomly inserting TCP SYN packets with random source IP addresses into the background trace during specified time periods. Specifically, synchronized attacks are simulated during 14000-16000 seconds and 28000-32000 seconds in both traces. In addition, asynchronous attacks are launched during 50000-52000 seconds in trace 1 and during 57000-59000 seconds in trace 2. The average attack rate is 1% of the packet rate of the background traffic during the same period.

To detect TCP SYN flood attacks, we choose <SA, DA, SP, DP> as the signature of inbound packets and <DA, SA, DP, SP> for outbound ones. Thus the two-way matching feature is the number of unmatched inbound TCP SYN packets in one time slot. The average flow rate is 2480 flows/second. Therefore, we set R = 2480. We further set η = 0.1, K = 8, w = 8, and T = 10 seconds. Then, by solving Equation (3-18) for m and requiring m to be a power of 2, we obtain m = 2^15 bits. The computer used for our experiments has one 2.4 GHz CPU and 1 GB of memory. For comparison, we also extract the number of inbound SYN packets in a slot.


The average processing rate is measured to be 265,000 packets/second. Hence, the algorithm can deal with a line rate of 1 Gbps, since the average Internet packet size is about 500 bytes. Note that our test is offline and data is read from the hard disk, whose access speed is much lower than that of memory. In a real implementation, data is captured by a high-speed network interface and maintained in memory, so the processing speed can be increased. Furthermore, in our test, the hash functions are implemented in software, which is also much slower than dedicated hardware. Therefore, it is reasonable to anticipate a higher processing rate if dedicated hardware is used.
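A quick arithmetic check of this claim:

```python
# Sanity check: at ~500-byte average packets, a processing rate of 265,000
# packets/s corresponds to roughly 1 Gb/s of line rate.

pkts_per_sec = 265_000
avg_pkt_bytes = 500
line_rate_gbps = pkts_per_sec * avg_pkt_bytes * 8 / 1e9   # ~1.06 Gb/s
assert line_rate_gbps > 1.0
```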

[31] B. J. Frey, Graphical Models for Machine Learning and Digital Communication. Cambridge, MA: MIT Press, 1998.

[32] M. C. Nechyba, "Maximum-likelihood estimation for mixture models: the EM algorithm," 2003, course note. [Online]. Available: http:

[33] Y. Weiss, "Correctness of local probability propagation in graphical models with loops," Neural Computation, vol. 12, pp. 1-4, 2000.

[34] J. S. Yedidia, W. T. Freeman, and Y. Weiss, "Generalized belief propagation," Advances in Neural Information Processing Systems, vol. 13, pp. 689-695, Dec. 2000.

[35] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, Sept. 1988.

[36] S. M. Aji and R. J. McEliece, "The generalized distributive law," IEEE Trans. Inform. Theory, vol. 46, pp. 325-343, Mar. 2000.

[37] T. Richardson, "The geometry of turbo-decoding dynamics," IEEE Trans. Inform. Theory, vol. 46, pp. 9-23, Jan. 2000.

[38] R. J. McEliece, D. J. C. MacKay, and J. F. Cheng, "Turbo decoding as an instance of Pearl's belief propagation algorithm," IEEE J. Select. Areas Commun., vol. 16, pp. 140-152, Feb. 1998.

[39] F. Kschischang and B. Frey, "Iterative decoding of compound codes by probability propagation in graphical models," IEEE J. Select. Areas Commun., vol. 16, pp. 219-230, Feb. 1998.

[40] A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Trans. Inform. Theory, vol. 13, pp. 260-269, Apr. 1967.

[41] G. D. Forney, Jr., "The Viterbi algorithm," in Proceedings of the IEEE, vol. 61, Mar. 1973, pp. 268-278.

[42] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," in Proceedings of the IEEE, vol. 77, Feb. 1989, pp. 257-286.

[43] D. Moore, K. Keys, R. Koga, E. Lagache, and k claffy, "The CoralReef software suite as a tool for system and network administrators," in Usenix LISA, Dec. 2001. [Online]. Available:

[44] C. Logg, "Characterization of the traffic between SLAC and the Internet," July 2003. [Online]. Available:

[45] Internet Assigned Numbers Authority, "Port numbers," Aug. 2006. [Online]. Available:

Usually it is difficult to estimate prior probabilities. In this paper, we simply assume P(Q_i = 0) = P(Q_i = 1) = 1/2, ∀i ∈ E_0. That is, the states of root nodes are equally likely to be normal or abnormal. Other parameters, such as likelihood and transition probabilities, are estimated from training data. This is covered in Section 4.3. After that, classification using the maximum gain criterion is described in Section 4.4.
4.3 Estimation of HMT Parameters

In this section, we describe the estimation of the HMT parameters. It is organized as follows. In Section 4.3.1, we describe estimation of the likelihood p(ψ_i | Q_i), i ∈ E. Estimation of the transition probabilities P(Q_i | Q_p(i)), i ∈ E \ E_0, is presented in Section 4.3.2.
4.3.1 Likelihood Estimation

For the purpose of likelihood estimation, we collect two sets of training data.

* The set of features sampled in normal states, {ψ_0^(k); k ∈ {1, ..., K_0}}, where K_0 is the number of normal samples and ψ_0^(k) = {ψ_{0,i}^(k); i ∈ E} denotes the kth feature vector, with ψ_{0,i}^(k) the feature measured at node i;

* The set of features sampled in abnormal states, {ψ_1^(k); k ∈ {1, ..., K_1}}, where K_1 is the number of abnormal samples and ψ_1^(k) = {ψ_{1,i}^(k); i ∈ E} denotes the kth feature vector measured at the nodes.

Gaussian mixture model. In order to effectively estimate the likelihood, we assume that the random variables/vectors ψ_i | Q_i, i ∈ E, follow a statistical distribution model. Then, likelihood estimation translates to model parameter estimation. We establish the statistical model in the following.

Because of its good properties, the Gaussian (normal) distribution [30] is widely employed in many applications. The pdf of a d-dimensional multivariate Gaussian distribution with mean vector μ and covariance matrix Σ is

N(x; μ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} exp(−(1/2)(x − μ)^T Σ^{−1} (x − μ)).   (4-23)

Figure 4-6 plots the pdf of the univariate Gaussian distribution N(x; 0, 1). It is observed that the Gaussian distribution is a unimodal distribution [30], i.e., its pdf has only one peak.
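As a small illustration of Equation (4-23), the function below evaluates the multivariate Gaussian pdf for a diagonal covariance matrix (a simplifying assumption of ours so that no linear algebra library is needed; the dissertation's model is not restricted to diagonal covariances).

```python
import math

# Multivariate Gaussian pdf of Equation (4-23), restricted to a diagonal
# covariance: |Sigma| is the product of the variances, and the quadratic
# form reduces to a weighted sum of squared deviations.

def gaussian_pdf_diag(x, mu, var):
    d = len(x)
    det = math.prod(var)                       # |Sigma| for diagonal Sigma
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var))
    return (2 * math.pi) ** (-d / 2) * det ** -0.5 * math.exp(-0.5 * quad)
```

At d = 1 with zero mean and unit variance this reproduces the familiar N(x; 0, 1) peak value 1/√(2π) ≈ 0.399 at x = 0.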

processes, we transform P(t) to a discrete-time sequence by sampling at frequency F_s = 1/T_s.

Because the signal defined in Equation (6-2) is represented as a summation of delta functions, its spectrum spans the whole frequency domain. In order to correctly reshape the spectrum of P_h(t) to avoid aliasing when it is sampled at interval T_s, we apply a low pass filter (LPF) characterized by its impulse response h_LPF(t). P_h(t) can then be mathematically described as follows:

P_h(t) = P(t) * h_LPF(t) = Σ_{<P,Δ>∈F_s} P · h_LPF(t − Δ).   (6-3)

After sampling at interval T_s, we obtain the following discrete-time sequence:

P_d(i) = P_h(iT_s) = Σ_{<P,Δ>∈F_s} P · h_LPF(iT_s − Δ),   (6-4)

where i = 1, ..., I_d, I_d = ⌈Δ_M/T_s⌉ + 1, and Δ_M is the arrival time of the last packet in the flow.
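Equations (6-3) and (6-4) can be sketched as follows. The truncated-sinc low pass filter is our stand-in choice, since h_LPF is not specified at this point in the text; the function names are ours.

```python
import math

# Turn a flow, a list of (packet_size, arrival_time) pairs, into the
# discrete-time sequence Pd of Equation (6-4): low-pass filter the impulse
# train of Equation (6-2) and sample every Ts seconds.

def lpf(t, Ts):
    # truncated sinc with cutoff at the Nyquist frequency 1/(2*Ts)
    if abs(t) > 4 * Ts:
        return 0.0
    return 1.0 if t == 0 else math.sin(math.pi * t / Ts) / (math.pi * t / Ts)

def sample_flow(flow, Ts):
    last = max(arrival for _, arrival in flow)   # Delta_M, last arrival time
    n = int(last / Ts) + 1                       # number of samples I_d
    return [sum(P * lpf(i * Ts - d, Ts) for P, d in flow) for i in range(n)]
```

Each sample blends nearby packet sizes weighted by the filter, so both inter-arrival times and sizes are encoded in one sequence.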

We note that the sampling interval T_s cannot be arbitrarily chosen. If T_s is too large, then the spectrum of the flow F_s contains information only related to low frequencies and thus lacks information related to the high frequency spectrum. On the other hand, if T_s is too small, then the length I_d of the resulting discrete-time sequence will be very large, resulting in very high complexity in computing the PSD of F_s. After an extensive analysis of widely-used voice and video applications such as Skype, MSN, and GTalk, we observed that choosing T_s = 0.5 milliseconds is sufficient to extract all useful information for our


Next, we provide a methodology to extract the regularities residing in the signal P(t). We achieve this by studying the extracted digital signal P_d(i) in the frequency domain, applying power spectral density analysis.

6.2.2 Power Spectral Density (PSD) Computation

Power spectral density definition.


To detect distributed TCP SYN attacks, we use the two-way matching features described in Chapter 3, i.e., the number of unmatched inbound SYN packets in one time slot. The parameter setting of the two-way matching feature extraction is the same as in Section 3.7.2. For convenience, we summarize the parameters in Table 4-3.

Table 4-3: Parameter setting of feature extraction for network anomaly detection
Notation   Value
R          2480 flows/second
η          0.1
K          8
w          8
T          10 seconds
m          2^15 bits

For comparison purposes, we also measure the number of SYN packets and the SYN/FIN ratio [11] in a slot.

4.5.2 Performance Comparison

Table 4-4: Performance of different schemes.
Feature Detection algorithm Detection probability False alarm probability
SYN/FIN ratio CUSUM 0.174 0.129
SYN CUSUM 0.52 0.129
SYN Machine learning 0.656 0.123
Unmatched SYN CUSUM 0.690 0.130
Unmatched SYN Machine learning 0.973 0.115

Table 4-4 compares the performance of different schemes, where the benchmark is

the scheme in [11], i.e., the CUSUM scheme with SYN/FIN ratio as the feature; for the

benchmark scheme, we use the same parameter setting as that in [11]; we compare the

benchmark with CUSUM and our machine learning algorithm under different features.

To make a fair comparison, we make the false alarm probability of each scheme almost the same and compare the detection probability. From Table 4-4, it can be seen that the benchmark scheme ('SYN/FIN ratio' + CUSUM) performs very poorly in detecting low volume DDoS attacks. In contrast, a CUSUM algorithm with the number of SYN

achieve 100% detection rate for both voice and video in the case of single-typed flows, e.g., one application per 5-tuple, and 98% and 94%, respectively, for voice and video when dealing with the more complex scenario of hybrid flows, e.g., voice, video, and file transfer bundled together in the same 5-tuple flow.

The rest of the chapter is organized as follows. In Section 5.2, we introduce the

related work in the area of pattern classification methodologies. Section 5.3 describes

the weaknesses of metrics previously used by other works when applied in our context

and highlights the new traffic features that constitute the foundation of our approach.

Section 5.4 summarizes this chapter.

5.2 Related Work

Existing work on traffic classification uses discriminating criteria such as the packet size distribution per flow, the inter-arrival times between packets within the same flow, and other statistics captured across multiple flows. For example, in Ref. [49], the authors proposed the combination of the average packet size within a flow and the inter-arrival variability metric, defined as the ratio of the variance to the average of the inter-arrival times of packets within a flow, as a powerful metric to define fairly distinct boundaries for three groups of applications: (i) bulk data transfer like FTP, (ii) interactive like HTTP, and (iii) streaming like voice and video. Several classification techniques, like nearest-neighbor and K-nearest-neighbor, were then tested using the above traffic features. Although this preliminary study has proved that the approach of pattern classification has great potential for proper application classification, it also shows that much more work still remains, e.g., exploring alternative traffic features and classification techniques. Moreover, although the features extracted are simple and feasible to implement on-the-fly, the learning algorithm is complex, and the resulting boundaries among the three families of applications are heavily non-linear and time-dependent.

Similar to Ref. [49], Karagiannis et al. [50] proposed a novel approach, called BLINC, that exploits network-related properties and characteristics. The novelty of this approach

[1] P. Mockapetris, "Domain names - concepts and facilities," RFC 1034.

[2] P. Mockapetris, "Domain names - implementation and specification," RFC 1035.

[3] "Video on demand," Wikipedia. [Online]. Available:

[4] D. Wu, Y. T. Hou, W. Zhu, Y.-Q. Zhang, and J. M. Peha, "Streaming video over the Internet: Approaches and directions," IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 282-300, Mar. 2001.

[5] K. Nichols, S. Blake, F. Baker, and D. Black, "Definition of the differentiated services field (DS field) in the IPv4 and IPv6 headers," RFC 2474.

[6] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss, "An architecture for differentiated services," RFC 2475.

[7] P. Almquist, "Type of service in the internet protocol suite," RFC 1349.

[8] S. S. Kim, A. L. N. Reddy, and M. Vannucci, "Detecting traffic anomalies using discrete wavelet transform," in Proceedings of International Conference on Information Networking (ICOIN), vol. III, Busan, Korea, Feb. 2004, pp. 1375-1384.

[9] C.-M. Cheng, H. T. Kung, and K.-S. Tan, "Use of spectral analysis in defense against DoS attacks," in Proceedings of IEEE Globecom 2002, vol. 3, Taipei, Taiwan, Nov. 2002, pp. 2143-2148.

[10] A. Hussain, J. Heidemann, and C. Papadopoulos, "A framework for classifying denial of service attacks," in Proceedings of ACM SIGCOMM, Karlsruhe, Germany, Aug.

[11] H. Wang, D. Zhang, and K. G. Shin, "Detecting SYN flooding attacks," in Proc. IEEE INFOCOM'02, New York City, NY, June 2002, pp. 1530-1539.

[12] T. Peng, C. Leckie, and K. Ramamohanarao, "Detecting distributed denial of service attacks using source IP address monitoring," Department of Computer Science and Software Engineering, The University of Melbourne, Tech. Rep., 2002. [Online].

[13] R. B. Blazek, H. Kim, B. Rozovskii, and A. Tartakovsky, "A novel approach to detection of "denial-of-service" attacks via adaptive sequential and batch-sequential change-point detection methods," in Proc. IEEE Workshop on Information Assurance and Security, West Point, NY, June 2001, pp. 220-226.

[14] S. Mukkamala and A. H. Sung, "Detecting denial of service attacks using support vector machines," in Proceedings of IEEE International Conference on Fuzzy Systems, May 2003.

Yet, the emergence of a bloom of new zero-day voice and video applications such as Skype, Google Talk, and MSN poses tremendous challenges for ISPs. The traditional approach of using port numbers to classify traffic is infeasible due to the usage of dynamic port numbers. In the second part of our research, we focus on a statistical

pattern classification technique to identify multimedia traffic. Based on the intuitions that

voice and video data streams show strong regularities in the packet inter-arrival times

and the associated packet sizes when combined together in one single stochastic process,

we propose a system, called VOVClassifier, for voice and video traffic classification.

VOVClassifier is an automated self-learning system that classifies traffic data by extracting

features from frequency domain using Power Spectral Density analysis and grouping

features using Subspace Decomposition. We applied VOVClassifier to real packet traces

collected from different network scenarios. Results demonstrate the effectiveness and

robustness of our approach.

Figure 3-15. Comparison of numerical and simulation results (simulation vs. numerical curves of the average number of hash function calculations per query). (a) Hash table algorithm. (b) BFA algorithm.

equally likely to be accessed. Hence, Equation (3-2) does not hold perfectly, and neither does Equation (3-3). As a result, the average number of hash function calculations per query in simulation is larger than that predicted by Equation (3-6).

Figure 3-15(b) shows that the numerical result agrees well with the simulation result

for all the values of the average number of hash function calculations per query under our


3.7.2 Experiment on Feature Extraction System

In this section, we show the performance of the complete feature extraction system

implemented on traffic monitors and local analyzers, which uses the BFA algorithm.

The reason for conducting this experiment is that we would like to know the performance of the whole hierarchical feature extraction architecture presented in Section 3.2. In contrast, the experiment in Section 3.7.1 does not involve the interaction among the three levels.

Experiment settings. We use the trace data provided by Auckland University [27]

as the background traffic. This data set consists of packet header information of traffic

between the Internet and Auckland University. The connection is OC-3 (155 Mb/s) for

both directions.


Jieyan Fan was born on July 26, 1979, in Shanghai, China. The only child in the family, he grew up mostly in his home town, graduating from the High School Affiliated to Fudan University in 1997. He earned his B.S. and M.S. in electrical engineering from Shanghai Jiao Tong University, Shanghai, China, in 2001 and 2004, respectively. He is currently a Ph.D. candidate in electrical and computer engineering at the University of Florida, Gainesville, FL. His research interests are network security and pattern classification.

Upon completion of his Ph.D. program, Jieyan will be working at Yahoo! Inc., Sunnyvale, CA.

The rest of this chapter is organized as follows. Sections 6.2, 6.3, and 6.4 describe the components Feature Extractor, Voice/Video Subspace Generator, and Voice/Video Classifier, respectively. In Section 6.5, we conduct experiments on traffic collected between two universities using Skype, MSN, and GTalk. Section 6.6 summarizes this

6.2 Feature Extractor (FE) Module via Power Spectral Density (PSD)

As explained in Section 5.3, the extraction and processing of simple traffic features does not solve the problem of detecting and separating voice and video data streams from other applications. In this section, we first introduce the preliminary steps that we take to transform each generic flow F_s obtained from the FSG into a stochastic process that combines the inter-arrival times and packet sizes. Then we describe how to use power spectral density (PSD) analysis as a powerful methodology to extract the hidden key regularities residing in real-time multimedia network traffic.

6.2.1 Modeling the network flow as a stochastic digital process


Figure 6-2. Power spectral density features extraction module: a cascade of processing blocks (stochastic model → LPF → sampling → PSD estimation).

Each flow F_s extracted from the FSG is forwarded to the FE module, which applies several steps in cascade (Figure 6-2). First, any extracted F_s (see Equation (6-1)) is modeled as a continuous stochastic process as illustrated in Equation (6-2):

P(t) = Σ_{<P,Δ>∈F_s} P · δ(t − Δ),   (6-2)

where δ(·) denotes the delta function. As the reader can notice, our model combines packet arrival times and packet sizes together as a single stochastic process. Because digital computers are more suitable for dealing with discrete-time sequences than continuous-time
4. combining packet inter-arrival time and packet size in frequency domain.

It turns out that the last one is the most distinctive feature for classifying voice and

video traffic.

We then presented the VOVClassifier system to classify voice and video traffic. VOVClassifier is composed of four major modules that operate in cascade: the flow summary generator, the feature extractor, the voice/video subspace generator, and the voice/video classifier. The novelty of VOVClassifier is that

* we combine packet inter-arrival times and packet sizes of a network flow and model it
by a stochastic process;

* we estimate the PSD feature vector to extract regularities residing in voice and video flows;

* we use minimum coding length to decompose subspaces from the training feature
vectors and principal component analysis to identify bases of each subspace;

* we use minimum distance to subspaces as the similarity metric to perform classification.
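The classification rule in the last bullet, minimum distance to subspaces, can be sketched as follows; the orthonormal bases are assumed to be given (e.g. produced by the PCA step in the previous bullet), and the function names are ours.

```python
import math

# Minimum-distance-to-subspace classification: assign a feature vector to
# the class whose subspace it is closest to, where the distance is the norm
# of the residual left after projecting onto the subspace's orthonormal basis.

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dist_to_subspace(x, basis):
    resid = list(x)
    for b in basis:                  # subtract the projection onto each basis vector
        c = _dot(x, b)
        resid = [r - c * bi for r, bi in zip(resid, b)]
    return math.sqrt(_dot(resid, resid))

def classify(x, subspaces):
    """subspaces: dict label -> list of orthonormal basis vectors."""
    return min(subspaces, key=lambda lbl: dist_to_subspace(x, subspaces[lbl]))
```

A vector lying almost entirely within the voice subspace leaves a small residual there and a large one elsewhere, so the rule picks "voice".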

The experiment results demonstrate the effectiveness and robustness of our approach.

the results of which are not shown here due to the space limit. The machine learning

algorithm is shown to be robust under realistic time-varying traffic patterns such as

the Auckland data traffic [27]. We tested our machine learning algorithms for a large

IP address space, i.e., the IP address space can be the whole IP address space for the


4.6 Summary

In this chapter, we propose a novel machine learning detection algorithm, based on Bayesian decision theory and the hidden Markov tree model. The key idea of our algorithm is to exploit the spatial correlation of network anomalies. Our detection scheme has the following nice properties:

* In addition to detecting network anomalies having high-data-rate on a link, our
scheme is also capable of accurately detecting attacks having low-data-rate on
multiple links. This is due to exploitation of spatial correlation of network anomalies.

* Our scheme is robust against time-varying traffic patterns, owing to powerful machine
learning techniques.

* Our scheme can be deployed in large-scale high-speed networks, thanks to the use of a Bloom filter array to efficiently extract features.

With the proposed techniques, our scheme can effectively detect network anomalies without modifying existing IP forwarding mechanisms at routers. Our simulation results show that the proposed framework can detect DDoS attacks even if the volume of attack traffic on each link is extremely small (i.e., 1%). In particular, for the same false alarm probability, our scheme has a detection probability of 0.97, whereas the existing scheme has a detection probability of 0.17, which demonstrates the superior performance of our

Similar to Sections 4.2.1 and 4.2.2, we employ the maximum gain criterion (see Equation (4-9)) to estimate the node states, i.e.,

{û_i; i ∈ E} = arg max Π_{i∈E_0} ξ(u_i) P(Q_i = u_i | ψ) Π_{i∈E\E_0} ξ(u_i) P(Q_i = u_i | Q_p(i) = u_p(i), ψ).   (4-19)

Applying the Viterbi algorithm to solve Equation (4-19), we reduce the computation complexity from O(2^n) (see Section 4.2.2) to O(Bn). This is the major advantage of introducing the HMT model. The details are given in Section 4.4.

Solving Equation (4-19) requires knowledge of P(Q_i = u_i | Q_p(i) = u_p(i), ψ), for ∀i ∈ E \ E_0, and P(Q_i = u_i | ψ), for ∀i ∈ E_0. By the Bayesian formula [30],

P(Q_i = u_i | Q_p(i) = u_p(i), ψ) = P(Q_i = u_i, Q_p(i) = u_p(i) | ψ) / P(Q_p(i) = u_p(i) | ψ)   (4-20)

for ∀i ∈ E \ E_0. Therefore, solving Equation (4-19) translates to estimating

P(Q_i, Q_p(i) | ψ), ∀i ∈ E \ E_0,   (4-21)

P(Q_i | ψ), ∀i ∈ E_0.   (4-22)


Estimating Equations (4-21) and (4-22) in closed form is difficult. We propose a belief propagation (BP) algorithm, described in Section 4.3, to estimate them efficiently given knowledge of

* prior probabilities: P(Ω_i), i ∈ Σ_0;

* likelihoods: p(θ_i | Ω_i), i ∈ Σ;

* transition probabilities: P(Ω_i | Ω_ρ(i)), i ∈ Σ \ Σ_0.
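As a small concrete illustration of the Bayes step in Equation (4-20): given the joint distribution of a child state and its parent state (conditioned on the observations) together with the parent marginal, the conditional follows by division. The binary states and the numbers below are hypothetical, not taken from the dissertation's data.

```python
def conditional_from_joint(p_joint, p_parent):
    """Equation (4-20) as code: P(child = u | parent = v, theta) =
    P(child = u, parent = v | theta) / P(parent = v | theta).
    p_joint maps (u, v) pairs to joint probabilities; p_parent maps v
    to its marginal probability."""
    return {(u, v): p_joint[(u, v)] / p_parent[v] for (u, v) in p_joint}

# Hypothetical binary states: 0 = normal, 1 = abnormal.
joint = {(0, 0): 0.4, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.3}
parent = {0: 0.6, 1: 0.4}
cond = conditional_from_joint(joint, parent)   # cond[(0, 0)] = 0.4 / 0.6
```

For each fixed parent state, the resulting conditionals sum to one, which is a useful sanity check on the estimated joints produced by the BP algorithm.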

Because only global analyzers have global information of the whole AS, a detection approach employing spatial correlation can only be deployed in global analyzers.

When a network anomaly happens, usually more than one edge router exhibits

abnormal symptoms. For example, when DDoS attacks are launched toward a victim

in an AS, the attack traffic enters the AS from multiple edge routers as the attack sources

are distributed. At each of those edge routers, the monitored traffic volume may be

low. That is, each traffic monitor or local analyzer observes a small deviation of traffic

features from the normal distribution. However, the global analyzer, upon obtaining reports

from local analyzers, will observe small deviations of features from multiple edge routers

simultaneously. Employing spatial correlation thus contributes to low-volume network anomaly detection.



Figure 4-4. Generative dependent model that describes dependencies among edge routers.

Introducing the spatial correlation into the independent model in Figure 4-3 results

in the dependent model as illustrated in Figure 4-4. The difference between the two models is that, from the viewpoint of a global analyzer, edge routers are no longer independent.

As a result, statistical dependence among states of edge routers is represented by the

non-directional connections. Note that the independent model can be regarded as a special

case of the dependent one. Also note that we still assume that features extracted from one

edge router are independent of the states of other edge routers.



Figure 5-3. Packet size distribution for voice and video traffic

While traveling, packets might experience random delays due to congestion at routers' interfaces.

As a consequence, the inter-arrival times between packets at the receiver might be severely

affected by random noise, e.g. jitter, and thus this metric might not represent a reliable

candidate feature for a robust classification methodology. Although this problem does

exist, we note that the inter-arrival times between packets within the same flow still show a strong regularity when studied in the frequency domain at the receiver side. As an

example, in Figure 5-2, we show the distributions of the inter-arrival packet times at the

receiver side when using Skype to transmit respectively voice only and video only between

two hosts, one located at University A on the east coast and the other at University B on the west coast of the USA. As we can see, the distributions for both video and voice are centered

around 0.03 second. On the other hand, Figure 5-3 shows the distributions of the packet sizes for both voice and video. As we can see, both voice and video are characterized by similar distributions for packet sizes less than 200 bytes. Although video traffic also generates packets larger than 200 bytes, these larger packets cannot be reliably used to separate

video from voice since other applications such as chat or file transfer might also generate

these larger packets. As a consequence, packet inter-arrival time or packet size is a weak

feature when considered in the temporal domain.

* extracting complicated features from traffic information obtained at a single edge router;

* making decisions based on local traffic information (i.e., one edge router);

* reporting decisions, feature data, and summary of traffic information (if necessary) to
a global analyzer.

The local analyzer can utilize temporal correlation of traffic to generate feature data.

2.2.3 Global Analyzer

A global analyzer is responsible for:

* extracting complicated features that require global information, such as routing
information, from traffic;

* analyzing feature data obtained from multiple local analyzers; and

* making decisions with global information obtained from multiple edge routers.



Figure 2-4. Example of asymmetric traffic whose feature extraction is done by the global analyzer.

The global analyzer has a global view of the whole network. Hence, it exploits

both temporal correlation and spatial correlation of traffic. Here it is important to note

that some feature data must be obtained at the global analyzer if global information

is required. For example, in Figure 2-4, if the traffic from subnet A to server B passes

through edge router X, and the traffic from server B to subnet A passes through edge

As illustrated in Figure 6-1, we collect voice and video training flows during the

training phase. After processing the raw packet data through the feature extraction

module via PSD, one obtains two sets of feature vectors,

Ψ^(1) = {ψ(1), ψ(2), ..., ψ(N_1)},   (6-36)

which is obtained from the voice training data, where N_1 is the number of voice flows; and

Ψ^(2) = {ψ(1), ψ(2), ..., ψ(N_2)},   (6-37)

which is obtained from the video training data, where N_2 is the number of video flows.

To facilitate further discussion, let us also regard Ψ^(i) as an M × N_i matrix, for i = 1, 2, where each column is a feature vector. In other words, Ψ^(i) ∈ R^(M×N_i).

In this section, we present techniques to identify the low-dimensional subspaces embedded in R^M, for both Ψ^(1) and Ψ^(2).

There are many low-dimensional subspace identification schemes, such as Principal Component Analysis (PCA) [61] and Metric Multidimensional Scaling (MDS) [62], which identify linear structure, and ISOMAP [63] and Locally Linear Embedding (LLE) [64], which identify non-linear structure.

Unfortunately, all these methods assume that data are embedded in one single

low-dimensional subspace. This assumption is not always true. For example, as different

software uses different voice coding, it is more reasonable to assume that the PSD feature

vector of voice traffic is a random vector generated from a mixture model than a single

model. In such a case, it is more likely that there are several subspaces in which the feature

vectors are embedded. The same holds for video feature vectors.

As a result, a better scheme is to first cluster the training feature vectors into several groups, known as subspace decomposition; and second, to identify the subspace structure

of each group, known as subspace bases identification. We describe the two steps in the

following sections.

33 milliseconds for 30 frames/s; hence, the inter-arrival time between two packets in an I-frame

may span a large range, resulting in a flat PSD.


Figure 5-8. Power spectral density of two sequences of continuous-time packet sizes for
voice traffic


Figure 5-9. Power spectral density of two sequences of continuous-time packet sizes for
video traffic

5.4 Summary

In this chapter, we motivated the importance of, and presented the challenges faced by, network traffic classification, specifically, detecting and classifying voice and video traffic.


Then, for ψ(n) ∈ Ψ, ψ(n) belongs to Ψ_k after segmentation if and only if

k = arg max_k l_{nk}.   (6-50)

1. function MCLPartition(...)
2. Argument 1: set of feature vectors, Ψ = {ψ(1), ..., ψ(N)}
3. Return: partition of Ψ,
   Π = {Π_1, ..., Π_K}; Π_1 ∪ ... ∪ Π_K = Ψ, Π_i ∩ Π_j = ∅ for ∀i ≠ j
4. Initialization:
   Π = {{ψ(1)}, {ψ(2)}, ..., {ψ(N)}}
5. while true do
6.    (π_1*, π_2*) = arg min_{π_1, π_2 ∈ Π, π_1 ≠ π_2} L(π_1 ∪ π_2) − L(π_1) − L(π_2)   (6-51)
7.    if L(π_1* ∪ π_2*) − L(π_1*) − L(π_2*) ≥ 0 then
8.        break
9.    else
10.       Π ← (Π \ {π_1*, π_2*}) ∪ {π_1* ∪ π_2*}
11.   end if
12. end while
13. return Π
Figure 6-5. Pairwise steepest descent method to achieve minimal coding length.

There is no closed-form solution for Equation (6-49). The authors of [65, page 41] proposed a pairwise steepest descent method to solve it (Figure 6-5). It works in a bottom-up way.

It starts with a partition scheme that assigns each element of T to a partition. Then, at

each iteration, the algorithm finds two subsets of feature vectors such that by merging

these two subsets, one can decrease the coding length the most (Equation (6-51)). This

procedure stops when no further decrease of coding length can be achieved by merging any

two subsets.
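The bottom-up merging described above can be sketched in a few lines. Here `length_fn` is a stand-in for the coding length L(·), and the feature vectors are scalars for brevity; both are simplifying assumptions, not the dissertation's implementation:

```python
def mcl_partition(vectors, length_fn):
    """Pairwise steepest descent (Figure 6-5): start from singletons and
    greedily merge the pair of subsets that decreases the total coding
    length the most; stop when no merge decreases it."""
    parts = [(v,) for v in vectors]
    while len(parts) > 1:
        best = None
        for i in range(len(parts)):
            for j in range(i + 1, len(parts)):
                # Change in total coding length if parts i and j are merged.
                delta = (length_fn(parts[i] + parts[j])
                         - length_fn(parts[i]) - length_fn(parts[j]))
                if best is None or delta < best[0]:
                    best = (delta, i, j)
        if best[0] >= 0:   # no merge decreases the coding length
            break
        _, i, j = best
        merged = parts[i] + parts[j]
        parts = [p for k, p in enumerate(parts) if k not in (i, j)] + [merged]
    return parts

# Toy coding length: a constant overhead plus the spread of the group.
clusters = mcl_partition([0.0, 0.1, 5.0, 5.1],
                         lambda g: 2.0 + (max(g) - min(g)))
# clusters -> two groups, {0.0, 0.1} and {5.0, 5.1}
```

With this toy cost, merging nearby values shrinks the total length while merging distant values grows it, so the procedure stops with two well-separated groups, mirroring the behavior intended for feature-vector clusters.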

Using the above method, we obtain a partition of the voice feature vector set Ψ^(1),

[15] S. Savage, D. Wetherall, A. Karlin, and T. Anderson, "Practical network support for IP traceback," in Proc. of ACM SIGCOMM 2000, Aug. 2000.

[16] A. Lakhina, M. Crovella, and C. Diot, "Characterization of network-wide anomalies in traffic flows," in Proc. ACM SIGCOMM Conference on Internet Measurement '04, Oct. 2004.

[17] H. Wang, D. Zhang, and K. G. Shin, "Change-point monitoring for the detection of DoS attacks," IEEE Transactions on Dependable and Secure Computing, no. 4, pp. 193-208, Oct. 2004.

[18] J. Mirkovic and P. Reiher, "A taxonomy of DDoS attacks and DDoS defense mechanisms," in Proc. ACM SIGCOMM Computer Communications Review '04, vol. 34, Apr. 2004, pp. 39-53.

[19] J. B. Postel and J. Reynolds, "File transfer protocol," RFC 959, Oct. 1985. [Online].

[20] K. Lu, J. Fan, J. Greco, D. Wu, S. Todorovic, and A. Nucci, "A novel anti-DDoS system for large-scale internet," in ACM SIGCOMM 2005, Philadelphia, PA, Aug. 2005.

[21] B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors," Commun. ACM, vol. 13, no. 7, pp. 422-426, July 1970.

[22] L. Fan, P. Cao, J. Almeida, and A. Z. Broder, "Summary cache: A scalable wide-area web cache sharing protocol," IEEE/ACM Trans. Netw., vol. 8, no. 3, June 2000.

[23] F. Chang, W.-c. Feng, and K. Li, "Approximate caches for packet classification," in IEEE INFOCOM 2004, vol. 4, Mar. 2004, pp. 2196-2207.

[24] R. Rivest, "The MD5 message-digest algorithm," RFC 1321, Apr. 1992. [Online].

[25] MD5 CRYPTO CORE FAMILY, HDL Design House, 2002. [Online].

[26] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface. San Francisco, CA: Morgan Kaufmann, 1998, ch. 5, 6.

[27] "Auckland-IV trace data," 2001. [Online].

[28] L. L. Scharf, Statistical Signal Processing: Detection, Estimation, and Time Series Analysis. Addison Wesley, 1991.

[29] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Wiley-Interscience, Oct. 2000.

[30] G. Casella and R. L. Berger, Statistical Inference, 2nd ed. Duxbury Press, June 2001.


7.1 Summary of Network Centric Anomaly Detection

In the first part of our study, we presented our achievement on network centric

anomaly detection. We first proposed a novel edge-router based framework to robustly

and efficiently detect network anomalies in the first place they happen. The key idea is

to exploit spatial and temporal correlation of abnormal traffic. The motivation of our

framework design is to use both spacial and temporal correlation among edge routers. The

framework consists of three types of components (i.e., traffic monitors, local analyzers,

and a global analyzer). Traffic monitors summarize traffic information on each single link between an edge router and a user subnet. Local analyzers collect information on edge routers, which is provided by traffic monitors, and report to the global analyzer. The global analyzer has a global view of the whole autonomous system and makes the final decision. The

advantages of our framework design are the following.

1. It is deployed on edge routers instead of end-user systems, so that it can detect network anomalies as soon as they enter an AS.

2. It has no burden on core routers;

3. It is flexible in that detection of network anomalies can be made both locally and globally;

4. It is capable of detecting low volume network anomalies accurately by exploiting
spatial correlations among edge routers.

We then presented feature extraction for network anomaly detection. Based on

the framework, we designed the hierarchical feature extraction architecture. Different

components extract different features. For example, traffic monitors can extract features

such as packet rate, data rate, and SYN/FIN ratio. Local analyzers are able to extract

features such as SYN/SYN-ACK ratio, round-trip time, and two-way matching features

on one edge router. The global analyzer can extract two-way matching features from

the whole autonomous system. Specifically, we focus on the novel type of features, i.e., the two-way matching features.

6.5.2 Skype Flow Classification

In this section, we conduct experiments on Skype traffic. We first consider the

scenario when each Skype flow carries one type of traffic. In other words, in this set of

experiments, one flow is of type VOICE, VIDEO, or none of the above.

Figure 6-8 shows the ROC curves of classifying voice and video flows.


Figure 6-8. The ROC curves of single-typed flows generated by Skype, (a) VOICE and (b) VIDEO.

We then conduct experiments on hybrid Skype flows. In other words, each flow

may be of type VOICE, VIDEO, FILE+VOICE, FILE+VIDEO, or none of the above.

Figure 6-9 plots the ROC curves of these five types, respectively.

6.5.3 General Flow Classification

Now, let us do the same experiments on network traffic generated by Skype, MSN,

and GTalk, as these are very common applications at present. In other words, a voice flow

now can be a VoIP flow generated by Skype, MSN, or GTalk. So are video flows. Note that GTalk does not support video conferencing. Similarly, two sets of experiments are

conducted, one on single-typed flows and the other on hybrid flows.

which must be kept powerful enough to perform concurrent analysis of a large number

of flows while applying techniques to search very complex protocol signatures that might

require processing of a large chunk of packet payload. The proliferation of proprietary

protocols, coupled with the growing trend in the usage of encryption techniques to ensure

data confidentiality, makes this approach infeasible. For example, Skype does not run on

any standard port, but randomly selects ports for its communication, and uses either TCP, UDP, or both for the data transfer. Furthermore, its use of a 256-bit encryption algorithm, with no visibility into either the algorithm or its keys, makes its detection

even harder. All the above makes the general problem of the detection of VoIP and video

data-streams over IP challenging and yet it is of huge business interest.

A new emerging tendency in the research community to approach this problem is

to rely on pattern classification techniques. This new family of techniques formulates the application detection problem as a statistical problem and develops discriminating criteria based on statistical observations and distributions of various flow properties in the packet traces. A few papers [49, 50] have taken this statistical approach to classify traffic into P2P, multimedia streaming, interactive applications, and bulk transfer applications.

Unfortunately, although these papers addressed the problem of distinguishing multimedia

traffic from other applications, they have not addressed the problem of distinguishing

voice traffic from video traffic. One problem is to separate streaming traffic from other

applications, and a different problem is to detect and correctly classify voice and video and

clearly separate the two applications from each other. In the extreme case, voice and/or

video data streams might even be bundled together in the same exact flow with other

applications. These problems are common for many applications like Skype, Gtalk and

MSN that allow users to mix voice and/or video streams with chat and/or file transfer

traffic in the same exact 5-tuple flow, defined as source IP address, destination IP address,

source port, destination port, and protocol numbers. In such cases, one flow may carry

extracted by global analyzers when an AS is not symmetric. However, the feature

extraction approaches used by local analyzers and global analyzers are the same.

3.4 Basic Algorithms

This section presents two basic algorithms to process and store the two-way matching

features, namely, the hash table algorithm and the Bloom filter algorithm.

3.4.1 Hash Table Algorithm

The general procedure to extract the two-way matching features from traffic at a local analyzer is:

1. The local analyzer maintains a buffer in memory;

2. When the traffic monitor captures an inbound packet, if its inbound signature is not in the buffer, the local analyzer creates one entry for its signature and sets the state of that entry to "UNMATCHED";

3. When the traffic monitor captures an outbound packet, if its outbound signature is in the buffer, the local analyzer sets the state of that entry to "MATCHED";

4. At time t_{i+1}, the local analyzer assigns the number of entries with state "UNMATCHED" to D(t_i).
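The four steps above amount to the following sketch, in which a plain dictionary stands in for the buffer and signatures are assumed to be extracted upstream (the hash table and Bloom filter of the following sections are space-efficient replacements for this dictionary):

```python
def count_unmatched(inbound_sigs, outbound_sigs):
    """Two-way matching feature D(t_i): the number of inbound signatures
    for which no matching outbound signature was observed."""
    buffer = {}
    for s in inbound_sigs:                 # step 2: record inbound signatures
        buffer.setdefault(s, "UNMATCHED")
    for s in outbound_sigs:                # step 3: mark matched entries
        if s in buffer:
            buffer[s] = "MATCHED"
    # step 4: D(t_i) = number of entries still UNMATCHED
    return sum(1 for state in buffer.values() if state == "UNMATCHED")

count_unmatched(["a", "b", "c"], ["b"])    # -> 2
```

During a flooding attack with spoofed sources, many inbound signatures never see a matching outbound packet, so D(t_i) grows; that is the intuition the feature captures.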

So typically we need three operations: insertion, search, and removal.4

A basic algorithm to do this is to use a hash table. Suppose the signature extracted from a packet is b bits long. We organize the buffer into a table, V, with ℓ cells of b + 1 bits each. The extra bit is the state bit. We also have K hash functions h_i : S → Z_ℓ, where i ∈ Z_K = {0, 1, ..., K − 1}, and S is the data set of interest, e.g., the signature domain. The symbol Z_n stands for the set {0, ..., n − 1}, where n is an integer.

The operations of the hash table algorithm are listed in Figure 3-5, where the argument s

is the signature extracted from a packet.

4 Setting the state to "MATCHED" is actually the removal operation.


Over the past few years, the Internet infrastructure has become a critical part of the

global communications fabric. A survey by the Internet Systems Consortium (ISC) shows

that the number of hosts advertised in the domain name system (DNS) [1, 2] has risen from

approximately 9,472,000 in January 1996 to 394,991,609 in January 2006. In addition, the

emergence of new applications and protocols, such as voice over Internet Protocol (VoIP), peer-to-peer (P2P), and video on demand (VoD) [3], also increases the complexity of the

Internet. Accompanying this trend is an increasing demand for more reliable and secure

service. A major challenge for Internet service providers (ISP) is to better understand the

network state by analyzing network traffic in real time. Thus ISPs are very interested in

the problem of network centric traffic analysis.

We consider the network centric traffic analysis problem from two perspectives: 1)

network anomaly detection and 2) network centric traffic classification. We introduce the

two perspectives in the next two sections.

1.1 Introduction to Network Anomaly Detection

With the rapid growth of the Internet, detection of network anomalies becomes a major

concern in both industry and academia since it is critical to maintain availability of

network services. Abnormal network behavior is usually the symptom of potential

unavailability in that:

* Network anomaly is usually caused by malicious behavior, such as denial-of-service
(DoS) attacks, distributed denial-of-service (DDoS) attacks, worm propagation,
network scans, or email spams;

* Even if it is caused by unintentional reasons, network anomaly is often accompanied
with network congestion or router failures.

However, detecting network anomalies is not an easy task, especially at high-speed

routers. One of the main difficulties arises from the fact that the data rate is too high to

afford complicated data processing. An anomaly detection algorithm usually works with

and σ² the power of the noise of the AR(p) signal. Thus, one can rewrite Equation (6-15) accordingly.


function LDA(...)
Argument 1: data, {y(i)}_{i=1}^{N}
Argument 2: order, p
Return: parameters of the AR(p) model, a and σ²

1. r(k; y) ← (1/N) Σ_i y(i) y*(i − k), for ∀k = 0, 1, ..., p
2. a_1 ← −r(1; y) / r(0; y)
3. σ²_1 ← r(0; y) − |r(1; y)|² / r(0; y)
4. for t ← 1, ..., p − 1
5.     r_t ← [r*(t; y), r*(t − 1; y), ..., r*(1; y)]^T
6.     ã_t ← [a_t, a_{t−1}, ..., a_1]^T
7.     k_{t+1} ← −(r(t + 1; y) + r_t^T a_t) / σ²_t
8.     a_{t+1} ← [a_t + k_{t+1} ã_t ; k_{t+1}]
9.     σ²_{t+1} ← σ²_t (1 − |k_{t+1}|²)
10. end for

Figure 6-3. Levinson-Durbin Algorithm.

Figure 6-3 gives the LDA algorithm to estimate the coefficients of the AR(p) model given data {y(i)}_{i=1}^{N}.
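A minimal NumPy rendering of the recursion in Figure 6-3, restricted to real-valued data and using the biased sample autocovariance (both simplifying assumptions):

```python
import numpy as np

def levinson_durbin(y, p):
    """Estimate the AR(p) coefficients a and noise power sigma2 of data y
    by the Levinson-Durbin recursion."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    # Biased sample autocovariance r(k; y), k = 0, ..., p.
    r = np.array([np.dot(y[k:], y[:N - k]) / N for k in range(p + 1)])
    a = np.array([-r[1] / r[0]])                 # order-1 solution
    sigma2 = r[0] - r[1] ** 2 / r[0]
    for t in range(1, p):
        # Reflection coefficient k_{t+1}.
        k = -(r[t + 1] + r[1:t + 1][::-1] @ a) / sigma2
        a = np.append(a + k * a[::-1], k)        # order update of coefficients
        sigma2 *= (1.0 - k ** 2)                 # noise power update
    return a, sigma2
```

Under the sign convention y(n) + Σ_m a_m y(n − m) = e(n), data generated by y(n) = 0.5 y(n − 1) + e(n) yields a first coefficient near −0.5 and a noise power near that of e.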


Now define the variable S(t_i) by

S(t_i) = 0 for i = 0; S(t_i) = max(0, S(t_{i−1}) + X̃(t_i)) for i > 0,   (4-1)

where X̃(t_i) denotes the drift-corrected feature sample. In the CUSUM algorithm, if S(t_i) is smaller than a threshold H_CUSUM, declare that the network state is normal; otherwise, declare that the state is abnormal.
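The recursion of Equation (4-1) plus the threshold test fits in a few lines; here the increment added at each step is taken to be the raw feature sample minus the parameter a, which is an assumption about the exact drift form:

```python
def cusum_states(x, a, h):
    """One-sided CUSUM (Equation (4-1)): S accumulates positive drift of
    the feature samples x above a; the state is declared abnormal once S
    reaches the threshold h (H_CUSUM)."""
    s, states = 0.0, []
    for xi in x:
        s = max(0.0, s + (xi - a))    # S(t_i) = max(0, S(t_{i-1}) + x_i - a)
        states.append("normal" if s < h else "abnormal")
    return states

cusum_states([0.0] * 10 + [2.0] * 10, a=1.0, h=5.0)
# the quiet prefix stays "normal"; the sustained shift is flagged "abnormal"
```

The max(0, ·) clamp is what makes the statistic insensitive to long quiet periods while still accumulating a small persistent shift, which is why CUSUM can catch low-rate anomalies.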

From the discussion above, we note that two parameters, i.e., a and H_CUSUM, need

to be determined. However, we cannot uniquely determine these two parameters. To

overcome this problem, we shall introduce another parameter, i.e., the detection delay,

denoted as V. According to the change-point theory, we have

V = H_CUSUM / (α_a − a).   (4-2)

From Equation (4-2), we can obtain

H_CUSUM = V × (α_a − a).   (4-3)

Hence, once V and a are given, we can determine H_CUSUM through Equation (4-3).1 Given a and H_CUSUM, we can use the CUSUM algorithm to detect network anomalies.

We notice that the existing CUSUM algorithms [11-13] only consider one change,

i.e., from the normal state to the abnormal state. In practice, this approach may lead to

a large number of false alarms after the end of attacks. To mitigate the high false alarm

issue of the existing algorithms, which we call single-CUSUM algorithms, we develop a

dual-CUSUM algorithm. In this algorithm, one CUSUM will be used to detect the change

from the normal to the abnormal state, while another CUSUM is responsible for detecting

the change from the abnormal to the normal state. The method of setting parameters for

1 In Refs. [11] and [12], a = α_a / 2; thus only the detection delay is needed.

Although it does not guarantee to find the globally optimal solution to Equation (4-39), the Viterbi algorithm is efficient and has good performance empirically. The computation complexity of the Viterbi algorithm is O(Bα), much better than that of the dependent model, as illustrated in Figure 4-4, whose complexity is O(2^α). The

performance improvement results from the fact that the Viterbi algorithm does not exhaustively test all possible node state combinations. Instead, we decompose the decoding problem into multiple stages, each of which decodes the node states at one level in the HMT. At level l, we estimate the node states of level l, i.e., {Ω̂_i; i ∈ Σ_l}, based on results obtained during previous stages, i.e., {Ω̂_i; i ∈ Σ_0 ∪ ... ∪ Σ_{l−1}}. In such a procedure, each node is accessed only twice. Hence the complexity is linear in the number of nodes, which is

Σ_{l=0}^{L−1} α B^{−l} = α (1 − B^{−L}) / (1 − B^{−1}) < (B / (B − 1)) α.

So far, we have described the HMT model, including its parameter training and classification approaches. Next, we show the simulation results of applying the HMT

model to network anomaly detection.

4.5 Simulation Results

In this section, we evaluate the performance of the proposed schemes through simulations.

4.5.1 Experiment Setting

In our study, we develop a testbed to 1) extract various feature information, and 2)

analyze feature data by our machine learning algorithm and CUSUM algorithms. Next,

we describe the setting for networks, traffic traces, and feature extraction used in our experiments.

In our experiment, we assume that the ISP network consists of a core AS, a victim

subnet, and 16 edge routers that connect to 16 subnets, as illustrated in Figure 4-12. At

each edge router, two monitors are placed to measure the inbound and outbound traffic

6.3.1 Subspace Decomposition Based on Minimum Coding Length

The purpose of subspace decomposition is to partition the data set

Ψ = {ψ(1), ψ(2), ..., ψ(N)}   (6-38)

into K non-overlapping subsets such that

Ψ = Ψ_1 ∪ Ψ_2 ∪ ... ∪ Ψ_K.   (6-39)

The authors of [65] proposed a method to decompose subspaces according to the minimum coding length criterion. The idea is to view the data segmentation problem from the perspective of

data coding/compression.

Suppose one wants to find a coding scheme, C, which maps the data Ψ ∈ R^(M×N) to a bit sequence. As all elements are real numbers, an infinitely long bit sequence would be needed to decode without error. Hence, one has to specify the tolerable decoding error, ε, to obtain a mapping with finite coding length, i.e.,

‖ψ(n) − C^{−1}(C(ψ(n)))‖² ≤ ε², for ∀n = 1, ..., N.   (6-40)

Then the coding length of the coding scheme C is a function

L_C : R^(M×N) → Z_+.   (6-41)

It is proven [65] that the coding length is upper bounded by

L_C(Ψ) ≤ ((M + N)/2) log_2 det(I + (M/(ε²N)) Ψ̄ Ψ̄^T) + (M/2) log_2(1 + μ^T μ / ε²),   (6-42)

where

μ = (1/N) Σ_{n=1}^{N} ψ(n),   (6-43)

Ψ̄ = [ψ(1) − μ, ..., ψ(N) − μ].   (6-44)
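The bound is straightforward to evaluate numerically; the sketch below follows Equation (6-42) as reconstructed here (treat the exact constants as an assumption and see [65] for the authoritative form):

```python
import numpy as np

def coding_length_bound(Psi, eps):
    """Upper bound of Equation (6-42) for an M x N data matrix Psi and
    tolerable decoding error eps."""
    M, N = Psi.shape
    mu = Psi.mean(axis=1)                      # mean vector, Equation (6-43)
    Pbar = Psi - mu[:, None]                   # centered data, Equation (6-44)
    term1 = (M + N) / 2.0 * np.log2(
        np.linalg.det(np.eye(M) + (M / (eps ** 2 * N)) * (Pbar @ Pbar.T)))
    term2 = M / 2.0 * np.log2(1.0 + (mu @ mu) / eps ** 2)
    return term1 + term2
```

A looser tolerance ε shortens the code, which is the trade-off the segmentation criterion exploits when deciding whether merging two groups is worthwhile.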

Applying Equation (3-5) to Equation (3-4), we obtain the expectation of T_h, given in Equation (3-6).

Since the time to insert a signature into or remove a signature from a given entry is much shorter than the time to find the proper entry, the time complexities of the insertion and removal operations are almost the same as that of the search operation. Equation (3-6) gives the space/time trade-off (i.e., ℓ vs. T_h) of the hash table method.

Analysis for Bloom filter. First of all, we consider the space complexity of the Bloom filter. Denote by ℓ_BF the length of the vector V used by the Bloom filter (see Section 3.4.2). The choice of ℓ_BF will affect the accuracy of the search function, BloomFilterSearch (see Figure 3-6). The reason is the following.

When the signatures of N flows are stored in V, θ, denoting the percentage of entries of V with value 0, is

θ = (1 − 1/ℓ_BF)^{KN},   (3-7)

where K is the number of hash functions. Assuming ℓ_BF ≫ 1, as is certainly the case, we can approximate θ as

θ ≈ exp(−KN/ℓ_BF).   (3-8)

Function BloomFilterSearch(V, s) falsely identifies s as stored in V if and only if the results of all K hash functions point to bits with value 1, which is known as a collision. Denote by η_N the collision probability under the condition that N flows have been recorded. Then

η_N = (1 − θ)^K ≈ (1 − exp(−KN/ℓ_BF))^K.   (3-9)
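Equations (3-8) and (3-9) can be checked numerically; a minimal sketch:

```python
import math

def bloom_collision_prob(K, N, ell_bf):
    """Approximate collision (false positive) probability of Equation (3-9)
    for K hash functions, N recorded flows, and a bit vector of length ell_bf."""
    theta = math.exp(-K * N / ell_bf)   # fraction of zero bits, Equation (3-8)
    return (1.0 - theta) ** K

bloom_collision_prob(4, 100, 1000)      # larger ell_bf -> fewer collisions
```

Evaluating the function for a range of ℓ_BF values gives the space/accuracy trade-off directly: doubling the vector length sharply reduces the collision probability for a fixed number of flows.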


6.1 System Architecture

Training Phase:
Voice Training Flows → Flow Summary Generator → Feature Extractor (PSD) → Voice Subspace Generator
Video Training Flows → Flow Summary Generator → Feature Extractor (PSD) → Video Subspace Generator

Classification Phase:
Raw Packets → Flow Summary Generator → Feature Extractor (PSD) → Voice/Video Classifier


Figure 6-1. VOVClassifier System Architecture

We first present the overall architecture of our system (VOVClassifier, Figure 6-1), and provide a high-level description of the functionalities of each of its modules. Generally speaking, VOVClassifier is an automated learning system that uses packet headers from raw packets collected off the wire, organizes them into transport network flows, and processes them in real time to search for voice and video applications. VOVClassifier first trains on voice and video data streams separately before being used in real time for classification. During

the training phase, VOVClassifier extracts feature vectors, each of which is a summary (also known as a statistic) of the raw traffic bit stream, and maintains their statistics in memory.

During the online classification phase, a classifier makes decisions by measuring similarity

metrics between the feature vector extracted from on-the-fly network traffic and the

feature vectors extracted from training data. Flows with high values of similarity metric

with the voice (or video) features are classified as voice (or video); data streams with low

values of similarity with voice/video are classified as other applications.

In general, VOVClassifier is composed of four major modules that operate in cascade: (i) Flow Summary Generator (FSG), (ii) Feature Extractor (FE) via Power Spectral

is twofold. First, the authors shift the focus from classifying individual flows to

associating Internet hosts with applications, and then classifying their flows accordingly.

Second, BLINC follows a different philosophy from previous methods attempting to

capture the inherent behavior of a host at three levels: (i) social level, e.g. how each host

interacts with other hosts, (ii) functional level, e.g. role played by each host in the network

as a provider of an application or a consumer of the application, and finally (iii) the

application level, e.g. ports used by each host during its communication with other hosts.

Although the approach proposed in [50] is interesting from a conceptual perspective and

proved to perform reasonably well for a variety of different applications, it is still prone

to large estimation errors for streaming applications. Moreover, its high complexity and large memory consumption remain open issues for high-speed application classification.

Other papers using pattern classification appeared lately in the literature but focused more on specific application detection, like peer-to-peer [46] and chat [51]. More importantly, to

the best of our knowledge, none of the existing work has been able to separate voice traffic

from video traffic or to indicate the presence of voice traffic or video traffic in a hybrid

flow that contains traffic from both voice/video and other applications such as file transfer.

5.3 Intuitions Behind a Proper Detection of Voice and Video Streams

Generally speaking, the problem of voice and video detection can be formulated as a complex pattern classification problem that has to deal with the curse of dimensionality,

e.g. discrimination of voice and video data streams when dealing with hidden traffic

patterns and too many interrelated features. A critical step toward the solution is to

identify traffic features that correctly represent the characteristics of the data streams of

interest and uniquely isolate them from other applications. In order to achieve this, in this

section we start by showing how simple metrics presented in the past are not applicable

in our context and we conclude with some observations that constitute the essence of

our approach. In Figure 5-1 we show the results obtained when using the combination

of average packet size and the inter-arrival variability metric proposed by Roughan et

6.1.3 Voice/Video Classifier (CL)

During the classification phase, the data are processed by one extra module, named Voice/Video Classifier, that compares the feature vectors extracted from current data

streams entering the system to the voice and video subspaces generated during training

in order to classify the stream as voice, video or other. The problem of data stream

classification requires the implementation of a similarity metric. In the literature, there are many similarity metrics. For example, the Bayes classifier uses a cost function, and nearest-neighbor (1-NN) and K-nearest-neighbor (K-NN) classifiers use the Euclidean distance. In

general, no similarity metric is guaranteed to be the best for all applications. For example,

the Bayes classifier is applicable only when the likelihood probabilities are well estimated, which requires the number of training samples to be much larger than the number of feature dimensions. As a consequence, it is not suitable for classification based on a high-dimensional feature vector, such as the PSD feature vector. Furthermore, both 1-NN and

K-NN are proved to be optimal only under the assumption that data of the same category

are clustered together. Unfortunately, this is not always the case. We overcome the above

problem by employing a similarity metric based on the normalized distance from feature

vector representing the ongoing flow to the two subspaces obtained during training phase.

The subspace at minimum distance will be elected as candidate only if the distance is

below specific thresholds.
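The idea can be sketched as follows (a minimal numpy illustration; the helper names, the toy threshold, and the (mu, U) subspace representation are our assumptions rather than the exact VOVClassifier interface):

```python
import numpy as np

def subspace_distance(phi, mu, U):
    """Normalized distance from feature vector phi to the affine subspace
    with origin mu and orthonormal basis columns U (residual-energy ratio)."""
    d = phi - mu
    resid = d - U @ (U.T @ d)          # component of d outside the subspace
    return np.linalg.norm(resid) / (np.linalg.norm(d) + 1e-12)

def classify(phi, voice, video, threshold=0.5):
    """Pick the closer of the two subspaces; accept the label only if the
    normalized distance is below the threshold, otherwise return 'other'."""
    dists = {"voice": subspace_distance(phi, *voice),
             "video": subspace_distance(phi, *video)}
    label = min(dists, key=dists.get)
    return label if dists[label] < threshold else "other"
```

Here `voice` and `video` stand for (mu, U) pairs produced by a training step; a feature vector far from both subspaces falls through to "other", mirroring the thresholding described above.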

We conclude this section by highlighting one minor limitation of our approach. Our system is unable to distinguish a flow containing video only from a flow containing video packets piggybacked by voice data (when video and voice applications are simultaneously launched in Skype, voice data is piggybacked on video packets). This is because the feature for video packets piggybacked by voice data is very similar to that for video only. Hence, our traffic classifier will declare a flow containing video packets piggybacked by voice data as "video".







The power spectral density of a digital signal represents its energy distribution in the frequency domain. Regularities in the time domain translate into dominant periodic components in its autocorrelation function and finally to peaks in its power spectral density.

For a general second-order stationary sequence {y(i); i ∈ Z}, the power spectral density (defined in [59]) can be computed as

φ(ω; y) = Σ_{k=-∞}^{∞} r(k; y) e^{-jωk}, (6-5)

where {r(k; y); k ∈ Z} represents the autocovariance sequence of the signal {y(i); i ∈ Z}, i.e.,

r(k; y) = E[y(i) y*(i − k)]. (6-6)

Although ω in Equation (6-5) can take any value, we restrict its domain to [−π, π) because φ(ω; y) = φ(ω + 2π; y).
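A minimal numerical sketch of Equations (6-5) and (6-6), with the infinite sum truncated to |k| ≤ max_lag and the expectation replaced by a biased sample autocovariance (function names are ours):

```python
import numpy as np

def autocov(y, max_lag):
    """Biased sample autocovariance r(k; y) for k = 0..max_lag."""
    y = np.asarray(y, dtype=complex)
    n = len(y)
    return np.array([np.sum(y[k:] * np.conj(y[:n - k])) / n
                     for k in range(max_lag + 1)])

def psd(y, omegas, max_lag):
    """Truncated version of Eq. (6-5): DTFT of the autocovariance sequence."""
    r = autocov(y, max_lag)
    # r(-k) = conj(r(k)) for a second-order stationary sequence
    ks = np.arange(-max_lag, max_lag + 1)
    r_full = np.concatenate([np.conj(r[:0:-1]), r])
    return np.array([np.real(np.sum(r_full * np.exp(-1j * w * ks)))
                     for w in omegas])
```

On a pure cosine the estimate peaks at the tone frequency, and the 2π-periodicity noted above holds by construction of the sum over integer lags.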

According to Equations (6-5) and (6-6), computing the PSD of a digital signal theoretically requires access to an infinitely long time sequence. Since in reality we cannot assume to have infinite digital sequences at our disposal, we must determine which technique can be used in our context to estimate the power spectral density with admissible accuracy. In the literature, two families of PSD estimation methods are available: parametric and non-parametric. Parametric methods have been shown to perform better under the assumption that the underlying model is correct and accurate. Furthermore, these methods are more interesting from a computational complexity perspective, as they require the estimation of fewer variables than non-parametric methods.

In our research, we employ a parametric method to estimate the PSD. The details are presented in the next section.

PSD estimation based on a parametric method.

The most commonly used approach to estimate model parameters is the maximum-likelihood (ML) method. Given the training features at node i with network state u, {φ_i(1), ..., φ_i(K)}, the ML method chooses the parameters to maximize

∏_{k=1}^{K} p(φ_i(k) | Ω_i = u), (4-28)

where p(φ_i | Ω_i = u) is given in Equation (4-26). Unfortunately, Nechyba [32] showed that the ML method for a G-state GMM with G > 1 has no closed-form solution. In addition, a G-state GMM has a 3G-dimensional continuous parameter space. Exhaustively searching for a numerical ML estimate in such a parameter space is computationally prohibitive.

1. Input: features φ_i(k), k ∈ {1, ..., K}; initial estimates π^(0)(g), μ^(0)(g), and Σ^(0)(g), g ∈ {1, ..., G}.
2. Output: π_{i,u}(g), μ_{i,u}(g), and Σ_{i,u}(g), g ∈ {1, ..., G}.
3. j = 0.
4. repeat
5.-8.   [E-step: compute component responsibilities with the iteration-j parameters; M-step: update π^(j+1)(g), μ^(j+1)(g), and Σ^(j+1)(g) from those responsibilities.]
9.   j ← j + 1
10. until converge
11. π_{i,u}(g) = π^(j)(g), μ_{i,u}(g) = μ^(j)(g), Σ_{i,u}(g) = Σ^(j)(g), ∀ g ∈ {1, ..., G}.

Figure 4-8. The EM algorithm for estimating p(φ_i | Ω_i = u), i ∈ E, u ∈ {0, 1}.
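As a concrete, if simplified, illustration of the loop in Figure 4-8, the following 1-D, two-component EM sketch alternates responsibilities (E-step) and parameter updates (M-step); the names and the scalar-variance setting are ours, not the thesis's implementation:

```python
import numpy as np

def gmm_em(x, pi, mu, var, n_iter=50):
    """EM for a 1-D Gaussian mixture: iteratively refine weights pi,
    means mu, and variances var from initial guesses."""
    x = np.asarray(x, float)[:, None]            # shape (K, 1)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sample
        dens = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x).sum(axis=0) / nk
        var = (resp * (x - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var
```

Starting from rough initial means, the estimates converge to the true mixture components without ever needing a closed-form ML solution, which is exactly the role EM plays here.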

A practical solution to this issue is the expectation-maximization (EM) algorithm [29, 30]. Nechyba [32] derived the EM algorithm for GMMs in detail. Figure 4-8 illustrates the algorithm.

The EM algorithm requires initial values for the parameters, denoted by π^(0)(g), μ^(0)(g), and Σ^(0)(g) in Figure 4-8. At each iteration j, the EM algorithm uses the parameters estimated at iteration j − 1 to calculate new estimates. Although both EM and ML

one to uncount it. In Function ProcOutbound, Line 13 starts a loop to iterate j′ from j to j − w + 1. Condition 1 is checked in lines 14 to 15 and Condition 2 is checked in lines 16 to 17. Note that the loop exits (line 15) if RV_{j′} contains s′; this is because an outbound packet of the same flow arrived in that j′-th slot and hence the buffer of the j-th slot (for each j < j′) has already been checked.

Function Sample extracts the two-way matching features. When we execute Function Sample at the end of the j-th slot (i.e., at time τ_{j+1}), the output is D(τ_{j−w+1}) instead of D(τ_j), since a time lag of Γ (w slots) is needed for two-way matching.

3.5.3 Round Robin Sliding Window

1. function ProcInbound(s)
2.   a ← false, b ← false
3.   if ∃ j′ ∈ {(I − w + 1) mod w, (I − w + 2) mod w, ..., I mod w} such that BloomFilterSearch(RV_{j′}, s) returns true then
4.     a ← true
5.   if BloomFilterSearch(IV_I, s) returns true then
6.     b ← true
7.   if a and b are both false then
8.     C_I ← C_I + 1
9.     BloomFilterInsert(IV_I, s)
10.  end if
11. end function
12. function ProcOutbound(s′)
13.   for j′ ← I downto (I − w + 1) mod w
14.     if BloomFilterSearch(RV_{j′}, s′) returns true then
15.       break
16.     if BloomFilterSearch(IV_{j′}, s′) returns true then
17.       C_{j′} ← C_{j′} − 1
18.   end for
19.   BloomFilterInsert(RV_I, s′)
20. end function
21. function Sample()
22.   I ← (I + 1) mod w
23.   return C_I
24. end function

Figure 3-9. Bloom Filter Array Algorithm using sliding window
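A simplified Python sketch of this sliding-window Bloom filter array (the class names, slot layout, default sizes, and the MD5-based hashes are our assumptions; the real design operates on flow signatures at line rate):

```python
import hashlib

class Bloom:
    """Plain Bloom filter: k salted MD5-based hashes over an m-cell bit vector."""
    def __init__(self, m=1024, k=4):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _indices(self, s):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{s}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, s):
        for j in self._indices(s):
            self.bits[j] = 1

    def search(self, s):
        return all(self.bits[j] for j in self._indices(s))

class BloomFilterArray:
    """Round-robin sliding window of w slots; each slot holds an inbound
    filter IV, an outbound (reply) filter RV, and an unmatched counter C."""
    def __init__(self, w=4):
        self.w = w
        self.slots = [{"IV": Bloom(), "RV": Bloom(), "C": 0} for _ in range(w)]
        self.I = 0  # index of the current slot

    def proc_inbound(self, s):
        # count s only if no recent outbound matched it (a) and it is new (b)
        a = any(slot["RV"].search(s) for slot in self.slots)
        b = self.slots[self.I]["IV"].search(s)
        if not a and not b:
            self.slots[self.I]["C"] += 1
            self.slots[self.I]["IV"].insert(s)

    def proc_outbound(self, s):
        # walk back from the newest slot; stop once an earlier reply is found
        for d in range(self.w):
            j = (self.I - d) % self.w
            if self.slots[j]["RV"].search(s):
                break
            if self.slots[j]["IV"].search(s):
                self.slots[j]["C"] -= 1  # uncount the now-matched inbound
        self.slots[self.I]["RV"].insert(s)

    def sample(self):
        """Advance the window; return and recycle the oldest slot's count."""
        self.I = (self.I + 1) % self.w
        count = self.slots[self.I]["C"]
        self.slots[self.I] = {"IV": Bloom(), "RV": Bloom(), "C": 0}
        return count
```

An inbound packet with no outbound reply within the window keeps its slot's counter raised, while a matched flow is uncounted, so the sampled counts approximate the unmatched (two-way-matching) feature.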

The algorithm presented in Section 3.5.2 has a drawback in memory allocation.

Specifically, at epoch τ_{j+1}, we sample D(τ_{j−w+1}), and then we need to throw away the

3.6.1 Space/Time Trade-off

Space/time trade-off for both the hash table and Bloom filter algorithms was analyzed by Bloom [21]. However, the analysis by Bloom [21] is not directly applicable to our setting

due to the following reasons:

1. A static data set was assumed by Bloom[21]. However, our feature extraction deals
with a dynamic data set, i.e., the number of elements in the data set changes over
time. Hence, new analysis for a dynamic data set is needed. In addition, Bloom[21]
only considered the search operation due to the assumption of static data sets. Our
feature extraction, on the other hand, requires three operations, i.e., insertion, search,
and removal, for dynamic data sets.

2. Bloom[21] assumed bit-comparison hardware in time complexity analysis. However,
current computers usually use word (or multiple-bit) comparison, which is more
efficient than bit-comparison hardware. Hence, it is necessary to analyze the
complexity based on word comparison.

3. The time complexity obtained by Bloom[21] did not include hash function calculations.
However, hash function calculation dominates the overall time complexity, e.g.,
calculating one hash function based on MD5 takes 64 clock cycles [25], while one
word-comparison usually takes less than 8 clock cycles [26].

For the above reasons, we develop new analyses for the hash table and Bloom filter,

respectively. In addition, we analyze the performance of BFA and use numerical results to

compare the three algorithms. Table 3-2 lists the notations used in the analysis.

Table 3-2: Notations for complexity analysis
Notation  Description
N         Random variable representing the number of different flows recorded.
β         Empty ratio.
η         Collision probability, i.e., the probability that an item is falsely identified to be in the buffer.
R         Flow arrival rate, which is assumed to be constant.

Analysis for hash table. Denote by i the size of a hash table in bits (i.e., space

complexity) and by Th the random variable representing the number of hash function

calculations for an unsuccessful search (i.e., time complexity).

Let us consider the search operation first. Upon the arrival of an inbound packet, HashTableSearch (see Figure 3-5) checks whether its inbound signature s is in the table. Because


Figure

2-1 An ISP network architecture.
2-2 Network anomaly detection framework.
2-3 Responsibilities of and interactions among the traffic monitor, local analyzer, and global analyzer.
2-4 Example of asymmetric traffic whose feature extraction is done by the global analyzer.
3-1 Hierarchical structure for feature extraction.
3-2 Network in normal condition.
3-3 Source-address-spoofed packets.
3-4 Reroute.
3-5 Hash Table Algorithm.
3-6 Bloom Filter Operations.
3-7 Scenarios of the problems caused by Bloom filter. (a) Boundary problem. (b) An outbound packet arrives before its matched inbound packet with t₂ − t₁ < Γ.
3-8 Bloom Filter Array Algorithm.
3-9 Bloom Filter Array Algorithm using sliding window.
3-10 Space/time trade-off for the hash table, BFA with η = 0.1, and BFA with η = 1.
3-11 Relation among space complexity, time complexity, and collision probability. (a) β* vs. η. (b) E[T_h]* vs. η.
3-12 Space complexity vs. collision probability for fixed time complexity.
3-13 Memory size (in bits) vs. average processing time per query (in μs).
3-14 Average processing time per query (in μs) vs. average number of hash function calculations per query.
3-15 Comparison of numerical and simulation results. (a) Hash table algorithm. (b) BFA algorithm with η = 1.
3-16 Feature data: (a) Number of SYN packets (link 1), (b) Number of unmatched SYN packets (link 1), (c) Number of SYN packets (link 2), and (d) Number of unmatched SYN packets (link 2).

1. function TransProbEstimate(·, ·)
2. Argument 1: likelihood, {p(φ_i | Ω_i); i ∈ E}.
3. Argument 2: training data, {φ(k); k ∈ {1, ..., K}}.
4. Return: transition probability estimate P(Ω_i | Ω_{ρ(i)}), i ∈ E \ E₀.
5. P^(0)(Ω_i = u | Ω_{ρ(i)} = u′) = 1/2 for ∀ i ∈ E \ E₀, ∀ u, u′ ∈ {0, 1}.
6. j = 0.
7. repeat
8.   for k ← 1 to K
9.     {P^(j+1)(Ω_i | φ(k)), P^(j+1)(Ω_i, Ω_{ρ(i)} | φ(k)); i ∈ E \ E₀}
         ← BP({P^(j)(Ω_i | Ω_{ρ(i)}); i ∈ E \ E₀}, {p(φ_i | Ω_i); i ∈ E}, φ(k))   (4-29)
10.  end for
11.  for ∀ i ∈ E \ E₀
12.    P^(j+1)(Ω_i | Ω_{ρ(i)}) = [ Σ_{k=1}^{K} P^(j+1)(Ω_i, Ω_{ρ(i)} | φ(k)) ] / [ Σ_{k=1}^{K} P^(j+1)(Ω_{ρ(i)} | φ(k)) ]   (4-30)
13.  end for
14.  j ← j + 1
15. until converge
16. P(Ω_i | Ω_{ρ(i)}) = P^(j)(Ω_i | Ω_{ρ(i)}).

Figure 4-9. Iteratively estimate transition probabilities.

equally likely. Then, at each iteration, we update the estimate of the transition probabilities until it converges. The update procedure is the following.

First, we iterate over the training feature set. For each feature, we use the BP algorithm (see Figure 4-10) to estimate the posterior probabilities given that feature. The details of the BP algorithm are discussed in the next section.

Three sets of arguments are passed to the BP algorithm:

1. the estimate of the transition probabilities obtained at the previous iteration, {P^(j)(Ω_i | Ω_{ρ(i)}); i ∈ E \ E₀};

2. the likelihood, {p(φ_i | Ω_i); i ∈ E}, which is the argument passed to function TransProbEstimate;

In order to overcome the above problem, in this section we exploit different metrics that have great potential to serve our purpose: the strong regularities of inter-arrival times between packets within the same flow, and of packet sizes, residing in voice and video data streams. Specifically, we consider four types of metrics, i.e.,

1. packet inter-arrival time and packet size in time domain;

2. packet inter-arrival time in frequency domain;

3. packet size in frequency domain;

4. combining packet inter-arrival time and packet size in frequency domain.

These metrics are discussed later.
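As a rough illustration of these four metric families, the sketch below computes time-domain IAT/size statistics and magnitude spectra with numpy; the function name, the FFT length, and in particular the way the two series are "combined" are our own placeholders, not the single-stochastic-process construction used later in this chapter.

```python
import numpy as np

def traffic_metrics(ts, sizes, n_fft=256):
    """Illustrative per-flow metrics: time-domain IAT/size statistics plus
    frequency-domain magnitude spectra of the mean-removed series."""
    ts, sizes = np.asarray(ts, float), np.asarray(sizes, float)
    iat = np.diff(ts)                                  # 1) time domain
    time_domain = {"iat_mean": iat.mean(), "iat_std": iat.std(),
                   "size_mean": sizes.mean(), "size_std": sizes.std()}
    # 2)-3) frequency-domain views of each series separately
    freq_iat = np.abs(np.fft.rfft(iat - iat.mean(), n_fft))
    freq_size = np.abs(np.fft.rfft(sizes - sizes.mean(), n_fft))
    # 4) one illustrative (hypothetical) way to combine the two series
    comb = (iat - iat.mean()) * (sizes[1:] - sizes.mean())
    freq_comb = np.abs(np.fft.rfft(comb, n_fft))
    return {"time_domain": time_domain, "freq_iat": freq_iat,
            "freq_size": freq_size, "freq_comb": freq_comb}
```

For a voice-like flow with constant 30 ms spacing, the IAT variability collapses to zero, while a periodic I-frame/P-frame size pattern produces a visible spectral peak in the size series.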



Figure 5-2. Inter-arrival time distribution for voice and video traffic (x-axis: IAT in seconds)

5.3.1 Packet Inter-Arrival Time and Packet Size in Time Domain

The intuitions behind such metrics reside in the observation that any protocol used

for voice and video applications specifies a constant time between two consecutive packets

at the transmitter side, also known as the Inter-Packet Delay (IPD). For example, Table 5-1

lists some speech codec standards and the associated IPDs that are required for a correct

implementation of those protocols. Packets leaving the transmitter might traverse a

large number of links in the Internet before reaching the proper destination. Along this

the two-way matching features, proposed by us. This type of feature uses both the temporal and spatial information carried in network traffic. It is a very effective indicator of network anomalies associated with spoofed source IP addresses. We designed a novel data structure, referred to as the Bloom filter array, to efficiently extract two-way matching features. Different from existing works, our data structure has the following properties: 1) using an array of Bloom filters, 2) combining a sliding window with the Bloom filter, and 3) using insertion-removal pairs to enhance the Bloom filter with a removal operation. Our analysis and simulation demonstrate that the proposed data structure has a better space/time trade-off than conventional algorithms.

Finally, we applied machine learning techniques to network anomaly detection. Specifically, we used a Bayesian model to determine the state of each edge router, normal or abnormal. Traditionally, edge routers are regarded as independent; such an independent model is incapable of detecting low-traffic network anomalies. A straightforward improvement over this independent model is to regard the edge routers as dependent on each other. However, this method has an exponential time complexity for determining the edge router states, i.e., O(2^r), where r is the number of edge routers. We proposed the hidden Markov tree (HMT) to model the correlations among edge routers. It takes advantage of the dependence among edge routers while having almost linear time complexity, i.e., O(Br), where B is the number of child nodes of each non-leaf node in the HMT. Our machine learning scheme has the following nice properties:

following nice properties:

* In addition to detecting network anomalies having high-data-rate on a link, our
scheme is also capable of accurately detecting attacks having low-data-rate on
multiple links. This is due to exploitation of spatial correlation of network anomalies.

* Our scheme is robust against time-varying traffic patterns, owing to powerful machine
learning techniques.

* Our scheme can be deployed in large-scale high-speed networks, thanks to the use of the Bloom filter array to efficiently extract features.

and level-two filters are placed in traffic monitors. A feature extraction module can be

placed in either a traffic monitor or a local analyzer, depending on the type of the feature.

Level-one filters select a packet based on its source-destination pair, which is defined by the source IP address (SA), the source network mask (SNM), the destination IP address (DA), and the destination network mask (DNM). For example, if we are interested in packets from one specific host to another, we can choose 255.255.255.255 as both the SNM and the DNM; if we are interested in packets from 172.10.x.x to 208.33.1.x, we can use 255.255.0.0 as the SNM and 255.255.255.0 as the DNM. In this way, we selectively monitor an end-host or a subnet, giving much flexibility in framework configuration. The output of a level-one filter is packets with the same source-destination pair, which are conveyed to level-two filters.

A level-two filter classifies the packets coming from level-one filters, based on

the upper-layer¹ data fields, e.g., TCP SYN or FIN. The packets of interest will be

forwarded to one or multiple feature extraction modules. For example, the number of

TCP SYN packets can be used to generate both the TCP SYN rate feature and the TCP

SYN/FIN(RST) ratio feature; hence, TCP SYN packets are conveyed to both the TCP

SYN rate module and the TCP SYN/FIN(RST) ratio module (Figure 3-1). On the other

hand, a feature module may need packets from multiple level-two filters. For example, the

SYN/FIN(RST) ratio feature extraction requires packets from three filters (Figure 3-1).

Compared to the packet classification schemes developed by Wang et al. [11] and

Peng et al.[12], our hierarchical structure for feature extraction is more general and


Next, we describe the most important module in the three-level hierarchical structure,

the feature extraction module.

1 Here, the upper layer can be either Layer 4 or Layer 7.

When these flows are mixed together, the classification accuracy is reduced, but the reduction is acceptable. Specifically, for hybrid voice traffic at P_FA ≈ 0, P_D is reduced from 1 to 0.986; for hybrid video traffic, it is reduced from 0.965 to 0.948.

This shows that our approach is robust. The robustness results from the fact that the

subspace identification module, as presented in Section 6.3, decomposes multiple subspaces

in the original high-dimensional feature space. As a result, PSD feature vectors of Skype

and GTalk are likely to be within different subspaces than those of MSN. Therefore, we

can still classify traffic accurately.

6.6 Summary

In this chapter, we described the VOVClassifier system for network traffic classification. VOVClassifier is composed of four components: the feature summary generator, the feature extractor, the subspace generator, and the voice/video classifier. The novelty of VOVClassifier is

* modeling a network flow by a stochastic process;

* estimating the PSD feature vector to extract the regularities residing in voice and video data streams;
* decomposing subspaces of training feature vectors followed by bases identification;

* using minimum distance to subspace as the similarity metric to perform classification.

Experiment results demonstrate the effectiveness and robustness of our approach.

Specifically, we show that classification of voice traffic is more accurate than that of video

traffic; classification of single-typed flows is more accurate than that of hybrid flows;

and classification of pure Skype flows is more accurate than that of flows generated from

multiple applications (e.g., Skype, MSN, and GTalk).

A Bloom filter stores data in a vector V of M elements, each of which consists of one bit. The Bloom filter also uses K hash functions h_i : S → Z_M, where i ∈ Z_K. Figure 3-6 describes the insertion and search operations of the Bloom filter.

1. function BloomFilterInsert(V, s)
2.   for ∀ i ∈ Z_K do
3.     V[h_i(s)] ← 1
4. end function
5. function BloomFilterSearch(V, s)
6.   for ∀ i ∈ Z_K do
7.     if V[h_i(s)] ≠ 1 then
8.       return false
9.   end for
10.  return true
11. end function

Figure 3-6. Bloom Filter Operations

Figure 3-7. Scenarios of the problems caused by Bloom filter. (a) Boundary problem. (b) An outbound packet arrives before its matched inbound packet with t₂ − t₁ < Γ.

Although Bloom filter has better performance in the sense of space/time trade-off, it

cannot be directly applied to our application because of the following problems:

1. The Bloom filter does not provide removal functionality. Since one bit in the vector may be mapped by more than one item, it is incorrect to remove an item by setting all the bits indexed by its hash results to 0.

2. The Bloom filter does not have counting functionality. Although the counting Bloom filter [22] can be used for counting, it replaces each bit with a counter, which significantly increases the space complexity.
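For contrast, a counting-Bloom-filter sketch in Python (names and parameters are ours) makes the trade-off concrete: every one-bit cell becomes an integer counter, which buys a removal operation at a clear cost in space:

```python
import hashlib

class CountingBloom:
    """Counting-Bloom-filter sketch: each cell is a counter rather than a bit,
    enabling removal at the cost of extra space per cell."""
    def __init__(self, m=1024, k=4):
        self.m, self.k, self.cells = m, k, [0] * m

    def _indices(self, s):
        return [int.from_bytes(hashlib.md5(f"{i}:{s}".encode()).digest()[:8],
                               "big") % self.m for i in range(self.k)]

    def insert(self, s):
        for j in self._indices(s):
            self.cells[j] += 1

    def remove(self, s):
        if self.search(s):                 # only decrement plausible members
            for j in self._indices(s):
                self.cells[j] -= 1

    def search(self, s):
        return all(self.cells[j] > 0 for j in self._indices(s))
```

Shared cells keep their counts positive as long as any inserted item still maps to them, which is what a plain bit vector cannot express.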

[59] P. Stoica and R. Moses, Spectral Analysis of Signals, 1st ed. Upper Saddle River, NJ: Prentice Hall, 2005.

[60] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. MIT Press, Sept. 2001.

[61] L. I. Smith, "A tutorial on principal components analysis," Feb. 2002. [Online]. Available:

[62] K. V. Deun and L. Delbeke, "Multidimensional scaling," University of Leuven. [Online]. Available:

[63] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, pp. 2319-2323, Dec. 2000.

[64] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, pp. 2323-2326, Dec. 2000.

[65] W. Hong, "Hybrid models for representation of imagery data," Ph.D. dissertation, University of Illinois at Urbana-Champaign, Aug. 2006.

Figure 5-5. Power spectral density of two sequences of time-varying inter-arrival times for video traffic (x-axis: normalized frequency, ×π rad/sample)

and/or combinations of the two. Its decoding needs reference to previously decoded frames. Packets containing I-frames are larger than those containing P-frames. Usually, the number of P-frames between two consecutive I-frames is constant. Hence, one can observe a strong periodic variation of packet size due to the interleaving of the I-frames and P-frames composing video data streams. Voice streams exhibit a similar phenomenon if Linear Prediction Coding (LPC), e.g., a code-excited linear prediction (CELP) voice coder, is employed. As an example, Figures 5-6 and 5-7 show the power spectral density of voice and video packet sizes, respectively.

5.3.4 Combining Packet Inter-Arrival Time and Packet Size in Frequency Domain

Figures 5-8 and 5-9 show how the regularities hidden in voice and video data streams can be amplified by combining the two features together in one single stochastic process, which will be described later. Note how the two important frequencies are amplified and clearly visible in the PSD plots. The reason why there is a peak in the PSD for voice (see Figure 5-8) is that voice applications usually produce a close-to-constant packet rate due to the constant inter-packet delay of the widely used speech codecs listed in Table 5-1; e.g., the peak at 33 Hz in Figure 5-8 corresponds to the 30 ms inter-packet delay.


2.1 Introduction

The first issue of network anomaly detection is to design a framework. There are

two types of network anomaly detection frameworks, i.e., host-based frameworks and

network-based frameworks. Host-based frameworks are deployed on end-hosts. These

frameworks typically use firewall and intrusion detection systems (IDS), and/or balance

the load among multiple (geographically dispersed) servers to defend against network

anomalies. The host-based approaches can help protect the server system, but they may not be able to protect legitimate access to the server, because high-volume abnormal traffic may congest the incoming link to the server.

On the other hand, network-based frameworks are deployed inside networks, e.g.,

on routers. These frameworks are responsible for detecting network anomalies and

identifying abnormal packets/flows or anomaly sources. To detect network anomalies,

signal processing techniques (e.g., wavelet [8], spectral analysis [9, 10], statistical methods

[11-13]), and machine learning techniques [14] can be used. To identify network anomaly

sources, IP traceback [15] is typically used. The IP traceback techniques can help contain the attack sources, but they require large-scale deployment of the same IP traceback technique and need modification of existing IP forwarding mechanisms (e.g., IP header marking).


This chapter presents our network anomaly detection framework, which is of the

network-based category. We present our framework design in Section 2.2 and summarize

this chapter in Section 2.3.

2.2 Edge-Router Based Network Anomaly Detection Framework

To detect network anomalies in an ISP network, we designed an edge-router based

network anomaly detection framework. The motivation results from an ISP network

architecture (Figure 2-1). It consists of two types of IP routers, i.e., core routers and edge

Yule-Walker method. Now, we describe methods to estimate the coefficients of an AR signal. Given that q = 0, Equation (6-10) can be written as

y(i) + Σ_{t=1}^{p} a_t y(i − t) = ε(i). (6-11)

Multiplying Equation (6-11) by y*(i − k) and taking the expectation of both sides, one obtains

r(k; y) + Σ_{t=1}^{p} a_t r(k − t; y) = E{ε(i) y*(i − k)}. (6-12)

Noting that

E{ε(i) y*(i − k)} = σ² if k = 0, and 0 if k > 0, (6-13)

one obtains the equation system

r(0; y) + Σ_{t=1}^{p} a_t r(−t; y) = σ²,
r(k; y) + Σ_{t=1}^{p} a_t r(k − t; y) = 0, k = 1, ..., p. (6-14)

Equation (6-14) can be rewritten in matrix form, i.e.,

[ r(0; y)   r(−1; y)   ...  r(−p; y)  ] [ 1   ]   [ σ² ]
[ r(1; y)   r(0; y)    ...  r(1−p; y) ] [ a_1 ] = [ 0  ]
[   ...       ...      ...    ...     ] [ ... ]   [ ...]
[ r(p; y)   r(p−1; y)  ...  r(0; y)   ] [ a_p ]   [ 0  ]   (6-15)

Equation (6-15) is known as the Yule-Walker method for AR spectral estimation [59]. Given data {y(i)}_{i=1}^{N}, one first estimates the autocovariance sequence

r̂(k; y) = (1/N) Σ_{i=k+1}^{N} y(i) y*(i − k), k = 0, ..., p,
r̂(−k; y) = r̂*(k; y). (6-16)
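Under the sign convention of Equation (6-11), the lower p rows of (6-15) can be solved for a_1, ..., a_p first, and σ² then recovered from the first row; a numpy sketch for real-valued data (the function name is ours):

```python
import numpy as np

def yule_walker_ar(y, p):
    """Estimate AR(p) coefficients a_1..a_p and noise variance sigma^2 by
    solving the Yule-Walker equations with the biased autocovariance."""
    y = np.asarray(y, float)
    n = len(y)
    # biased sample autocovariance r(0)..r(p); for real y, r(-k) = r(k)
    r = np.array([np.dot(y[k:], y[:n - k]) / n for k in range(p + 1)])
    # Toeplitz system from rows k = 1..p:  sum_t a_t r(k - t) = -r(k)
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, -r[1:])
    # first row of (6-15): sigma^2 = r(0) + sum_t a_t r(t)
    sigma2 = r[0] + np.dot(a, r[1:])
    return a, sigma2
```

For a simulated AR(1) process y(i) = 0.8 y(i−1) + ε(i), this recovers a_1 ≈ −0.8 (the sign follows the convention of Equation (6-11)) and σ² ≈ 1.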

When {r(k; y); k = −p, ..., p} is replaced by its estimate {r̂(k; y); k = −p, ..., p}, Equation (6-15) becomes a system of p + 1 linear equations in p + 1 unknown variables,

Finally, we obtain the posterior probabilities by Equations (4-33) and (4-34). These

two equations are proven in Appendix A.3 and A.4, respectively. The estimated posterior

probabilities are used in Function TransProbEstimate to update estimates of transition

probabilities (see Figure 4-9).

So far, we have established the HMT model and described approaches to estimate

its model parameters from training data. Next, we present network anomaly detection

approaches using the fully determined HMT model.

4.4 Network Anomaly Detection Using HMT

In this section, we present network anomaly detection using the HMT model. This is equivalent to a decoding problem in terms of pattern classification. That is, given an observation sequence, i.e., the extracted features φ = {φ_i; i ∈ E}, and an HMT model defined by

* prior probabilities: P(Ω_i), i ∈ E₀;

* likelihood: p(φ_i | Ω_i), i ∈ E;

* transition probabilities: P(Ω_i | Ω_{ρ(i)}), i ∈ E \ E₀,

we need to compute the "best" state combination Ω̂ = {Ω̂_i; i ∈ E}. Here, the word "best" is in terms of the maximum gain criterion, as illustrated in Equation (4-19). We rewrite the criterion in Equation (4-39),

Ω̂_i = arg max_{u_i} f(u_i) P(Ω_i = u_i | Ω_{ρ(i)} = Ω̂_{ρ(i)}, φ), (4-39)

for ∀ i ∈ E. Function ViterbiDecodeHMT, as illustrated in Figure 4-11, shows the pseudo-code for the classification algorithm.
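The top-down decoding implied by Equation (4-39) can be sketched as follows (a hypothetical numpy illustration with our own data layout: post[i] is a 2x2 table of P(Ω_i = u | parent state, φ), with both rows equal to the root's marginal posterior for the root node):

```python
import numpy as np

def decode_tree(children, root, post, f=np.array([1.0, 1.0])):
    """Choose each node's state u maximizing f(u) * P(state = u | decoded
    parent state, features), walking the tree from the root downward."""
    state = {root: int(np.argmax(f * post[root][0]))}
    stack = [root]
    while stack:
        i = stack.pop()
        for c in children.get(i, []):
            # condition on the parent's already-decoded state
            state[c] = int(np.argmax(f * post[c][state[i]]))
            stack.append(c)
    return state
```

The gain function f can be skewed toward the abnormal state to trade false alarms for detection probability, which is the role the maximum gain criterion plays in Equation (4-39).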

Function ViterbiDecodeHMT takes three arguments. The first two, the transition probabilities and the likelihood, are estimated during the training phase. The last one is the extracted features, based on which we perform anomaly detection. It returns the estimates of the node states. Among them, we are only interested in the states of leaf

2. current training feature, φ(k).

It returns the estimate of the posterior probabilities

{P^(j+1)(Ω_i | φ(k)), P^(j+1)(Ω_i, Ω_{ρ(i)} | φ(k)); i ∈ E \ E₀}.

With the posterior probabilities obtained through the BP algorithm, we update the estimates of the transition probabilities by Equation (4-30) for ∀ i ∈ E \ E₀ and step to the next iteration. When the estimates converge, the iteration stops and Function TransProbEstimate returns the estimates obtained at the last iteration.

The validity of Equation (4-30) is shown in the following. For ∀ i ∈ E \ E₀,

P(Ω_i | Ω_{ρ(i)}) = ∫ P(Ω_i | Ω_{ρ(i)}, φ = b) p(b) db = E_φ[P(Ω_i | Ω_{ρ(i)}, φ)], (4-35)

where E[·] represents statistical expectation and the subscript stands for the random variable over which the expectation is taken. Because the sample average is always the best unbiased estimate of a statistical expectation [30], we estimate

E_φ[P(Ω_i | Ω_{ρ(i)}, φ)] ≈ (1/K) Σ_{k=1}^{K} P(Ω_i | Ω_{ρ(i)}, φ(k)). (4-36)

Combining Equations (4-35) and (4-36), we obtain Equation (4-30).

Next, we discuss the belief propagation algorithm, which is called by function TransProbEstimate.

Belief propagation algorithm. The BP algorithm [33-35], also known as the sum-product algorithm, e.g., [36-39], is an important method for computing approximate marginal distributions. Here, we apply the BP algorithm to estimating the posterior probabilities (Figure 4-10).


Figure 5-6. Power spectral density of two sequences of discrete-time packet sizes for voice traffic (x-axis: normalized frequency, ×π rad/sample)

Figure 5-7. Power spectral density of two sequences of discrete-time packet sizes for video traffic (legend: Trace 1 and Trace 2; x-axis: normalized frequency, ×π rad/sample)

Compared to voice, video applications have a flatter PSD. The reason is as follows. The number of bits in an I-frame of video depends on the texture of the image (e.g., the I-frame of a blackboard image produces far fewer bits than that of a complicated flower image), resulting in a large range in the number of packets in an I-frame, e.g., from 1 packet to a few hundred packets per I-frame. The frame rate is usually constant (e.g., 30 frames/s is a standard rate in the USA), i.e., a frame will be generated every 33 ms.

Figure 3-14. Average processing time per query (in μs) vs. average number of hash function calculations per query.

processing time per query), in Section 3.6.1, we used the average number of hash function

calculations per query to represent the time complexity of the hash table algorithm and

the BFA algorithm.

Performance comparison between numerical and simulation results.

Figure 3-15 compares the simulation results and the numerical results obtained from the analysis in Section 3.6, for both the hash table algorithm and the BFA algorithm, in terms of space complexity vs. time complexity.

In Figure 3-15(a), the numerical result agrees well with the simulation result, except when the average number of hash function calculations per query is close to 1. From Equation (3-6), if the expected number of hash function calculations approaches 1, the required memory size approaches infinity; in contrast, simulations with a very large memory size may not give accurate results, due to the limited memory of a computer. This causes the big discrepancy between the numerical result and the simulation result when the average number of hash function calculations per query is close to 1. When the average number of hash function calculations per query is greater than or equal to two, the simulation always requires more memory than the numerical result. This is due to the fact that a practical hash function is not perfect; that is, entries in the hash table are not uniformly distributed.

or 95%. The algorithm returns 5 variables. μ represents the sample mean of all feature vectors; it is the origin of the identified subspace. The columns of U are the bases with dominant energy (i.e., variance), whose corresponding variances are denoted by Σ. These bases determine the identified low-dimensional subspace spanned by U. The columns of Ū compose the null space of the previous subspace, whose corresponding variances are Σ̄. The last two outputs are required to calculate the distance of an ongoing feature vector to the subspace, which will be described in Section 6.4.

Applying the function IdentifyBases to all segmentations, we obtain

[μ_k^(i), U_k^(i), Σ_k^(i), Ū_k^(i), Σ̄_k^(i)] = IdentifyBases(T_k^(i)) (6-56)

for ∀ k = 1, ..., K, i = 1, 2. These are the outputs of the subspace identification module, and hence the results of the training phase, in Figure 6-1.

During the classification phase, these outputs are used as system parameters, which will be presented in the next section.

6.4 Voice/Video Classifier

In Section 6.3, we presented an approach to identify the subspaces spanned by the PSD feature vectors of training voice and video flows. Specifically, one obtains the following parameters:

[μ_k^(i), U_k^(i), Σ_k^(i), Ū_k^(i), Σ̄_k^(i)] (6-57)

for ∀ k = 1, ..., K, i = 1, 2. In this section, we use these parameters to perform classification.

During the classification phase, for each ongoing flow F, one composes a sub-flow, F_s, by extracting small packets, i.e., packets smaller than θ_p, and passes it through the PSD feature extraction module to generate the PSD feature vector φ. This is the input to the voice/video classifier.

The voice/video classifier works in the following way. It first calculates the normalized distances between φ and all subspaces of both categories. Then it chooses the minimum

Figure 2-1. An ISP network architecture (core routers forming the autonomous system, with edge routers connecting subnets).

routers. Core routers interconnect with one another to form a high-speed autonomous

system (AS). In contrast, edge routers are responsible for connecting subnets (i.e.,

customer networks or other ISP networks) with the AS. In this paper, a subnet can be

either a customer network or an ISP network.

Figure 2-2. Network anomaly detection framework (global analyzer, local analyzers, and traffic monitors on the directional links between the autonomous system and its subnets).


3.1 Introduction

Given the network anomaly detection framework we have established, the second

issue of network anomaly detection is feature extraction. Features for network anomaly

detection have been studied extensively in recent years. For example, Peng et al. [12]

proposed using the number of new source IP addresses to detect DDoS attacks, under the

assumption that the source addresses of IP packets observed at an edge router are relatively

static under normal conditions compared with those during DDoS attacks. Peng further pointed out

that the feature can differentiate DDoS attacks from a flash crowd, which represents

the situation when many legitimate users start to access one service at the same time,

for example, when many people watch a live sports broadcast over the Internet

simultaneously. In both cases (DDoS attacks and the flash crowd), the traffic rate is high. But

during DDoS attacks, the edge routers will observe many new source IP addresses because

attackers usually spoof source IP addresses of attacking packets to hide their identities.

Therefore, this feature improves those DDoS detection schemes that rely on traffic rate

only. However, Peng et al. [12] focused on the detection of DDoS attacks and did not address

other types of network anomalies. For example, when malicious users are scanning the

network, we can also observe a high traffic rate but few new source IP addresses. It is

very important to differentiate network scanning from a flash crowd because the former is

malicious but the latter is not. The two-way matching features on different network layers

(Section 3.3.1) can indicate not only the presence of network anomalies but also their cause.

Lakhina et al.[16] summarized the characteristics of network anomalies under different

causes. Its contribution is to help identify causes of network anomalies. For example,

during DDoS attacks, we can observe high bit rate, high packet rate, and high flow rate.

The source addresses are distributed over the whole IP address space. On the other hand,

during network scanning, all three rates are high, but the destination addresses,
packets or the number of unmatched SYN packets as the feature can achieve a much higher

detection probability. More importantly, our machine learning algorithm can significantly

outperform CUSUM given the same feature data, whether the feature is the

number of SYN packets or the number of unmatched SYN packets.


Figure 4-13. Performance of threshold-based and machine learning algorithms with different feature data (ROC curves, detection probability vs. false alarm probability: Threshold (SYN), Machine Learning (SYN), Threshold (UM-SYN), Machine Learning (UM-SYN)).

Figure 4-13 compares the ROC curves of the threshold-based scheme described in

Section 4.1.2 and our machine learning algorithm under two different features, i.e., the

number of SYN packets (denoted by 'SYN') and the number of unmatched SYN packets

(denoted by 'UM-SYN'). We observe that, for the same detection algorithm, using the

number of unmatched SYN packets significantly improves the ROC performance

compared to using the number of SYN packets. In other words, given the same false alarm

probability, the detection probability is much higher when using the number of unmatched

SYN packets as the feature.
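An ROC curve like Figure 4-13 can be traced for any detector that outputs per-interval anomaly scores, by sweeping the decision threshold. A minimal sketch with hypothetical scores and ground-truth labels (not the dissertation's data):

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep a decision threshold over anomaly scores and return
    (false-alarm probability, detection probability) pairs."""
    pts = []
    for th in sorted(set(scores), reverse=True):
        flag = scores >= th                      # declare anomaly above threshold
        pd = float(np.mean(flag[labels == 1]))   # detection probability
        pfa = float(np.mean(flag[labels == 0]))  # false alarm probability
        pts.append((pfa, pd))
    return pts
```

Plotting the resulting points (P_FA on the x-axis, P_D on the y-axis) reproduces the shape of the curves compared above.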

Another important observation from Figure 4-13 is that given the same feature data,

our machine learning algorithm can (significantly) improve the ROC, compared to the

threshold-based scheme; e.g., for the same false alarm probability of 0.05, our machine

learning algorithm achieves a detection probability of 0.93, while the threshold-based

scheme only achieves a detection probability of 0.72. This is due to the fact that our

1. function HashTableInsert(V, s)
2.   for i ← 0 to C − 1
3.     if V[h_i(s)] is empty
4.       insert s into V[h_i(s)]; set state bit of V[h_i(s)] to "UNMATCHED"
5.       return
6.     end if
7.   end for
8.   report insertion operation error
9. end function
10. function HashTableSearch(V, s)
11.   for i ← 0 to C − 1
12.     if V[h_i(s)] is empty
13.       return false
14.     if V[h_i(s)] holds s
15.       return true
16.   end for
17.   return false
18. end function
19. function HashTableRemove(V, s)
20.   for i ← 0 to C − 1
21.     if V[h_i(s)] is empty
22.       return
23.     if V[h_i(s)] holds s
24.       set state bit of V[h_i(s)] to "MATCHED"
25.       return
26.     end if
27.   end for
28. end function
Figure 3-5. Hash Table Algorithm
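A minimal executable rendering of Figure 3-5, assuming C salted hash functions probed in order and one (signature, state-bit) pair per bucket; the bucket count, constants, and helper names are illustrative:

```python
import hashlib

C = 4          # number of hash functions / probe candidates (illustrative)
SIZE = 1024    # number of buckets (illustrative)

def _h(i, s):
    # i-th hash function: salt the signature with the probe index
    return int(hashlib.sha256(f"{i}:{s}".encode()).hexdigest(), 16) % SIZE

def insert(table, s):
    for i in range(C):
        b = _h(i, s)
        if table.get(b) is None:
            table[b] = (s, "UNMATCHED")   # store signature with its state bit
            return True
    return False                           # all C candidate buckets occupied

def search(table, s):
    for i in range(C):
        b = table.get(_h(i, s))
        if b is None:
            return False                   # empty probe terminates the search
        if b[0] == s:
            return True
    return False

def remove(table, s):
    for i in range(C):
        b = table.get(_h(i, s))
        if b is None:
            return
        if b[0] == s:
            table[_h(i, s)] = (s, "MATCHED")  # mark matched, as in Figure 3-5
            return
```

Note that "remove" does not free the bucket; it flips the state bit, so unmatched signatures can later be counted as a feature.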

3.4.2 Bloom Filter

The hash table algorithm can be used for offline traffic analysis or analysis of low

data-rate traffic, but it cannot keep up with the high data rate at edge routers. To address

this limitation, one can use the Bloom filter algorithm [21]. Compared to the hash table

algorithm, the Bloom filter algorithm reduces space/time complexity by allowing a small degree

of inaccuracy in membership representation, i.e., a packet signature that has not

appeared before may be falsely identified as present.
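The trade-off can be seen with a textbook Bloom filter (the classic structure from [21], not yet the BFA variant developed later); the bit-vector size and hash count below are illustrative:

```python
import hashlib

M = 256   # bit-vector size (illustrative)
K = 3     # number of hash functions (illustrative)

def _positions(s):
    # K salted hash functions mapping a signature to K bit positions
    return [int(hashlib.sha256(f"{i}:{s}".encode()).hexdigest(), 16) % M
            for i in range(K)]

def bf_insert(bits, s):
    for p in _positions(s):
        bits[p] = 1

def bf_query(bits, s):
    # May return True for a signature never inserted (false positive),
    # but never returns False for one that was inserted.
    return all(bits[p] for p in _positions(s))
```

Membership costs K bit probes and M bits of storage regardless of how many signatures are held, which is what lets the structure keep up with high line rates.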

methods scan the parameter space, EM works in a better way. It is proven that after each

iteration, the EM algorithm is guaranteed to generate parameter estimates that increase

the likelihood in Equation (4-28). As a result, the EM algorithm converges much faster than numerical ML estimation.


However, the disadvantage of the EM algorithm is that it converges to a local maximum

rather than the global one. Specifically, the initial values of the parameters determine the local

maximum to which the EM algorithm converges. In practice, we have prior knowledge of

network features, which helps to choose initial values of the parameters.
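Both properties noted above, the monotone increase of the likelihood and the dependence on initial values, can be observed on a toy two-component Gaussian mixture with known, shared variance (an illustrative stand-in for the feature model, not Equation (4-28) itself):

```python
import math, random

def em_step(x, pi, mu0, mu1, var=1.0):
    """One EM iteration for a 2-component Gaussian mixture with known,
    shared variance. Returns updated (pi, mu0, mu1)."""
    def pdf(v, m):
        return math.exp(-(v - m) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    # E-step: responsibility of component 1 for each sample
    g = [pi * pdf(v, mu1) / (pi * pdf(v, mu1) + (1 - pi) * pdf(v, mu0)) for v in x]
    # M-step: re-estimate mixing weight and component means
    n1 = sum(g)
    pi = n1 / len(x)
    mu1 = sum(gi * v for gi, v in zip(g, x)) / n1
    mu0 = sum((1 - gi) * v for gi, v in zip(g, x)) / (len(x) - n1)
    return pi, mu0, mu1

def loglik(x, pi, mu0, mu1, var=1.0):
    def pdf(v, m):
        return math.exp(-(v - m) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return sum(math.log(pi * pdf(v, mu1) + (1 - pi) * pdf(v, mu0)) for v in x)
```

Iterating em_step never decreases loglik; starting mu0 and mu1 near different clusters steers which local maximum is reached, mirroring the role of prior knowledge in initialization.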

So far, we have presented schemes to estimate the likelihood of the HMT. In the next section, we

estimate transition probabilities.

4.3.2 Transition Probability Estimation

In this section, we estimate the transition probabilities P(U_i | U_ρ(i)), i ∈ Ξ \ Ξ₀, where ρ(i)

denotes the parent of node i. Since a closed-form representation of the transition probabilities

is not available, we also estimate them in an iterative way.

Denote by {Ω^(k); k ∈ {1, ..., K}} the set of training features for transition probability

estimation, where Ω^(k) = {ω_i^(k); i ∈ Ξ}. Figures 4-9 and 4-10 show the pseudo-code for

transition probability estimation. In the next two sections, we explain the two figures.

Iteratively estimate transition probabilities. Figure 4-9 shows the pseudo-code

to estimate transition probabilities. The function TransProbEstimate takes as input:

1. the likelihoods estimated in Section 4.3.1, {p(ω_i | U_i); i ∈ Ξ};

2. the training features, {Ω^(k); k ∈ {1, ..., K}}.

It returns the estimate of the transition probabilities, i.e.,

P(U_i | U_ρ(i)), ∀ i ∈ Ξ \ Ξ₀.

Before the iterations, we set the initial transition probabilities to 1/2 at line 5 of

Figure 4-9. This is equivalent to assuming the normal state and the abnormal state to be initially

space/time trade-off than the hash table. We also see that the curve of BFA with η = 1%

is below the curve of BFA with η = 0.1%. This shows the relationship between space/time

complexity and collision probability. Specifically, to reach a lower collision probability, i.e., more

accurate detection, we need to either calculate more hash functions or use more storage.


To see the gain of using BFA, let us look at an example. Suppose E[T] = 5, i.e., in

each slot, 5 hash function calculations are needed on average. Then, the memory required

by the hash table scheme, BFA with η = 0.1%, and BFA with η = 1% is 1.01 Gbits,

115.3 Mbits, and 62.9 Mbits, respectively. It can be seen that our BFA with η = 1% can

save storage by a factor of 16, compared to the hash table scheme.

Figure 3-10 shows that for the hash table scheme, M_h is a monotonically decreasing

function of E[T_h]. The observation matches our intuition that the larger the table, the smaller the

collision probability for hash functions, resulting in fewer hash function calculations. Further

note that M_h approaches RΓ(b + 1) when E[T_h] increases. This is the minimum space

required to tolerate up to RΓ flows.

For BFA, M_a is not a monotonic function of E[T_a], which approximately equals K, the number of hash functions.

We have the following observations.

* Case A: For a fixed storage size, the smaller K, the larger the probability that all K
hash functions of two different inputs return the same outputs, which is the collision
probability. In other words, the smaller K, the larger the storage size required to achieve
a fixed collision probability. That is, K ↓ ⇒ M_a ↑.

* Case B: Since an input to BFA may set K bits to "1" in the vector V, the larger
K, the more bits in V will be set to "1" (nonempty), which translates into a larger
collision probability. In other words, the larger K, the larger the storage size required to
achieve a fixed collision probability. That is, K ↑ ⇒ M_a ↑.

Combining Cases A and B, it can be argued that there exists a value of K, or equivalently of E[T_a],

that achieves the minimum value of M_a, given a fixed collision probability. This minimum

property can be used to guide the parameter setting for BFA, which will be addressed in

Section 3.6.2.
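The interior minimum implied by Cases A and B can be illustrated with the textbook Bloom filter approximation p = (1 − e^(−Kn/m))^K, solved for the storage m at a fixed target collision probability p; this standard formula stands in for Equation (3-25), and the numbers are illustrative.

```python
import math

def bits_required(n, p, K):
    """Bits m needed so that a Bloom filter with K hash functions holding
    n items has false-positive probability about p, using the standard
    approximation p = (1 - exp(-K*n/m))**K solved for m."""
    return -K * n / math.log(1.0 - p ** (1.0 / K))

n, p = 250_000, 0.01
m = {K: bits_required(n, p, K) for K in range(1, 16)}
best = min(m, key=m.get)   # number of hash functions minimizing storage
```

Too few hash functions (Case A) and too many (Case B) both inflate the required storage; the minimum lies in between, near K = log2(1/p).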

traffic from multiple types of applications (such as voice, video, chat, and file transfer),

referred to as a hybrid flow in the remainder of this paper.

Our research focuses on detecting and classifying voice and video traffic and further

deals with its more general formulation that considers the presence of hybrid flows.

Based on the intuition that voice and video data streams show strong regularities in

the inter-arrival times of packets within a flow and in the associated packet sizes, when

combined together into one single stochastic process and analyzed in the frequency domain,

we propose a system, called VOVClassifier, for voice and video traffic classification.

VOVClassifier is an automated self-learning system that is composed of four

major modules operating in cascade. The system is first trained with voice and video

data streams and afterwards enters the classification phase. During the training

period, all packets belonging to the same flow are extracted and used to generate a

stochastic model that captures the features of interest. Then all flows are processed in

the frequency domain using Power Spectral Density (PSD) analysis in order to extract a

high-dimensional space of frequencies that carry the majority of the energy of the signal. All

features extracted from each flow are grouped into a "feature vector". Due to the wide

usage of different codecs for voice and video, we propose a second module that clusters the

feature vectors into several groups using Subspace Decomposition (SD) and then identifies

each group's subspace structure, e.g., the bases of the subspace, using Principal Component Analysis

(PCA). These two steps are applied to all flows during the training period and produce

low-dimensional spaces, referred to in this paper as the voice subspace and the video subspace.

After training, each flow is processed by the PSD module and the associated feature

vector is compared with the voice and video subspaces obtained during training. The subspace at

minimum normalized distance from the feature vector is selected as the candidate and chosen

if and only if its distance is below a specific predetermined threshold.
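The decision rule just described can be sketched compactly, treating each trained subspace as a callable that returns the normalized distance of a feature vector to it (the names and the rejection label are illustrative):

```python
def classify(phi, subspaces, threshold):
    """Nearest-subspace decision: compute the normalized distance from
    feature vector phi to each candidate subspace, take the minimum, and
    accept that label only if the distance is below the threshold."""
    label, dist = min(((name, d(phi)) for name, d in subspaces.items()),
                      key=lambda t: t[1])
    return label if dist < threshold else "unknown"
```

The threshold lets the classifier reject flows that are neither voice nor video instead of forcing a label.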

We applied VOVClassifier to real packet traces collected in two different network

scenarios. Results demonstrate the effectiveness and robustness of our approach, able to

Figure 6-11. The ROC curves of hybrid flows generated by Skype, MSN, and GTalk: (a) VOICE, (b) VIDEO, (c) FILE+VOICE, and (d) FILE+VIDEO.

One application vs. multiple applications. One further notes that classification

of Skype flows is more accurate than that of flows generated from general applications, i.e.,

Skype, MSN, and GTalk.

Empirically, we found that Skype flows are similar to GTalk flows, but quite different

from MSN flows. For example, both Skype and GTalk voice flows have about 33-millisecond

inter-arrival times, whereas MSN voice flows have approximately 25-millisecond inter-arrival


In this section, we use the formulae derived in the above sections to compare the hash

table scheme with the BFA algorithm through numerical calculations. The setting of our

numerical study is as follows:

1. Traces captured from an ISP's edge router show that the average number of
flows during one second is around 250,000, so we let R = 250,000. To reduce the
probability of false alarms caused by normal packets with long RTT, we choose Γ
large enough that more than 99% of packets have an RTT less than Γ. For the same
traces, Γ = 80 seconds.

2. Suppose we want to detect TCP traffic anomalies. Thus the signature captured from
each packet is composed of the 32-bit SA, 32-bit DA, 16-bit SP, and 16-bit DP, so b = 96 bits.

3. In the BFA algorithm, we use 40 time slots (i.e., w = 40), each of which is 2 seconds
(i.e., τ = 2). Also suppose each counter is a 32-bit integer (i.e., L = 32).

Figure 3-10. Space/time trade-off for the hash table, BFA with η = 0.1%, and BFA with η = 1% (space complexity M vs. time complexity E[T]).

Figure 3-10 shows M vs. E[T] for the hash table scheme, BFA with collision

probability η = 1%, and BFA with collision probability η = 0.1%. In Figure 3-10, the X axis represents

the time complexity (i.e., the expected number of hash function calculations) and the Y

axis represents the space complexity (i.e., the number of bits needed for storage). From

Figure 3-10, we can see that the curves of BFA are below the curve of the hash table. It

means BFA uses less space for a given time complexity. Therefore, BFA achieves a better

Ω = (Ω₁, Ω₂, ..., Ω_N), (4-12)

U = (U₁, U₂, ..., U_N), (4-13)

û = (û₁, û₂, ..., û_N), (4-14)

where U_i is the random variable representing the state of edge router i, i ∈ {1, ..., N}, which

is defined in the same way as in Equation (4-10). We further assume the gain factors are

independent of the node index, i.e., λ(u_i) = λ(u_{i'}) whenever u_i = u_{i'} (0 or 1), no matter

whether i is equal to i' or not. Then the maximum gain criterion (see Equation (4-9)) for

the dependent model is

û = argmax_u λ(u) p(Ω | U = u) P(U = u)

  = argmax_u [∏_{i=1}^{N} λ(u_i)] p(Ω | U = u) P(U = u). (4-15)

As the dependent model takes spatial correlation into consideration, it can make more

accurate detections, especially when the traffic volume is low. However, it is a computationally

intractable model: to solve Equation (4-15) directly, we need to exhaustively

compute λ(u) p(Ω | U = u) P(U = u) for each possible combination of u, which results in O(2^N)

complexity. For a large AS, this is intractable.

We introduce a hierarchical structure to reduce the computational complexity, which is the

topic of the next section.
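The O(2^N) cost is easy to see by writing the brute-force search down; the sketch below scores every joint state for a toy model with independent per-router likelihoods and priors (an illustrative simplification with hypothetical numbers, not Equation (4-15) itself):

```python
from itertools import product

def brute_force_map(lik_normal, lik_abnormal, prior_abnormal=0.1):
    """Exhaustively score every joint state u in {0,1}^N, assuming (for
    this toy example only) independent per-router likelihoods and priors."""
    N = len(lik_normal)
    best_u, best_score = None, -1.0
    for u in product([0, 1], repeat=N):         # 2^N combinations
        score = 1.0
        for i, ui in enumerate(u):
            if ui:
                score *= lik_abnormal[i] * prior_abnormal
            else:
                score *= lik_normal[i] * (1 - prior_abnormal)
        if score > best_score:
            best_u, best_score = u, score
    return best_u
```

Even this toy loop visits 2^N joint states, so N = 40 edge routers already means about 10^12 evaluations; the hierarchical model of the next section avoids the enumeration.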

4.2.3 Hidden Markov Tree (HMT) Model for Global Analyzer

The reason that the dependent model illustrated in Figure 4-4 becomes computationally

intractable is that we assume edge routers are fully dependent. A rough intuition

is that, if we break some of the dependences in Figure 4-4, we can reduce the computational

complexity. On the other hand, we would like to account for the dependencies among as

many nodes as possible to provide accurate detection. To balance these two conflicting

4.2 Bayesian Model for Network Anomaly Detection

In this section, we model the network anomaly detection issue in terms of Bayesian

decision theory. This section is organized as follows. Section 4.2.1 generalizes the

Bayesian model for network anomaly detection on traffic monitors and local analyzers.

Section 4.2.2 extends this model to the whole autonomous system. Section 4.2.3

introduces the hidden Markov tree model to decrease the computational complexity of the

general model defined in Section 4.2.2.

4.2.1 Bayesian Model for Traffic Monitors and Local Analyzers

As described earlier, both the traffic monitor and the local analyzer have local

information of one edge router. They are able to detect network anomalies through one

single edge router. That is, a traffic monitor makes an anomaly declaration when it observes

abnormal features extracted from one link of an edge router, such as a large data rate, a large

SYN/FIN(RST) ratio, and so on. Similarly, a local analyzer detects network anomalies

by observing the two-way matching features on one edge router. Next, we formulate the

detection problem in terms of the Bayesian decision theory introduced in Section 4.1.4.

Figure 4-1. Generative process in graphical representation, in which the traffic state
generates the stochastic process of traffic.

In the context of network anomaly detection, there are two states of nature of an edge

router, i.e., H = {H₀, H₁}, where

* H₀ represents the normal state, in which case no abnormal network traffic enters the AS
through that edge router;

* H₁ represents the abnormal state, in which case abnormal network traffic enters the AS
through the edge router.

where i ∈ Ξ \ Ξ₀. Therefore, it also requires calculating posterior probabilities. We employ the

BP algorithm (see Equations (4-40) and (4-41) in Figure 4-11) to solve Equation (4-43) in

a way similar to HMT transition probability estimation (see Figure 4-10).

Obtaining solutions to Equation (4-43) for all nodes, we can solve Equation (4-39). A

brute-force solution to Equation (4-39) is to exhaustively compute

[∏_{i ∈ Ξ\Ξ₀} λ(u_i)] P(U = u | Ω) (4-44)

for every possible combination u of node states and select the "best" one. An HMT with L levels

and branching factor B, modeling an AS with N edge routers, has

(B^L − 1)/(B − 1)

nodes, each of which has two states, normal and abnormal. Then the state space has

2^{(B^L − 1)/(B − 1)} possible values. The computational complexity of the brute-force method is

even worse than that of the dependent model (see Figure 4-4), whose complexity is

O(2^N).

In this paper, we applied the Viterbi algorithm [40-42] to solve Equation (4-39) in an

iterative manner, reducing the computational complexity to O(B^L).

The motivation of the Viterbi algorithm is the following. It iterates from the top level of an

HMT to the bottom level. At each iteration, it estimates the node states in that level in a way

that, when combined with the states estimated in upper levels, "best" explains the observed

features. That is, at each iteration, the Viterbi algorithm always selects the local maximum.
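The level-by-level idea can be sketched as a greedy top-down pass over a tree; this is a deliberate simplification of the full Viterbi decoder of Figure 4-11, with illustrative data structures and names:

```python
def greedy_tree_decode(children, root, obs_lik, trans, prior=(0.5, 0.5)):
    """Greedy top-down state assignment on a tree, in the spirit of the
    level-by-level iteration described above (a simplification of the
    full decoder of Figure 4-11).

    children: dict node -> list of child nodes
    obs_lik:  dict node -> (p(feature | state 0), p(feature | state 1))
    trans:    trans[parent_state][child_state] transition probability
    """
    # root: pick the state best explaining its own feature under the prior
    states = {root: max((0, 1), key=lambda s: prior[s] * obs_lik[root][s])}
    frontier = [root]
    while frontier:                        # descend one level per pass
        nxt = []
        for node in frontier:
            for c in children.get(node, []):
                ps = states[node]
                # child state best explaining its feature, given the
                # already-fixed parent state
                states[c] = max((0, 1),
                                key=lambda s: trans[ps][s] * obs_lik[c][s])
                nxt.append(c)
        frontier = nxt
    return states
```

Each node is visited once and each visit scores two states, so the cost grows linearly with the number of nodes rather than exponentially with the number of edge routers.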

3.6.2 Optimal Parameter Setting for Bloom Filter Array

This section addresses how to determine parameters of BFA under two criteria,

namely, minimum space criterion and competitive optimality criterion.

Minimum space criterion. According to Equation (3-25), the three parameters M_a,

E[T_a], and η are coupled. Since the collision probability η critically affects the detection

error rate in our network anomaly detection, a network operator may want to choose an

upper bound η̄ on the acceptable collision probability η and then minimize the storage

required, i.e.,

min_{E[T_a]} M_a subject to η ≤ η̄. (3-26)

According to Equation (3-25), the solution of (3-26) is as below:

M_a* = min_{E[T_a]} M_a = min_{E[T_a]} w [2αR(η̄, E[T_a]) + L],

E[T_a]* = argmin_{E[T_a]} M_a = argmin_{E[T_a]} αR(η̄, E[T_a]).

Figure 3-11. Relation among space complexity, time complexity, and collision probability. (a) M_a* vs. η̄. (b) E[T_a]* vs. η̄.

Figure 3-11 shows M_a* vs. η̄, and E[T_a]* vs. η̄, under the same setting as that

in Section 3.6.1. From Figure 3-11(a), it can be observed that M_a* decreases when η̄ increases.


and a partition of the video feature vector set T^(2):

T^(2) = T_1^(2) ∪ ... ∪ T_{K₂}^(2). (6-53)

Next, we describe the method to identify subspace bases in each of the segmentations.

6.3.2 Subspace Bases Identification

In this section, we use the PCA [61] algorithm to identify subspace bases for each segmentation in

{T_k^(i); k = 1, ..., K_i, i = 1, 2}, (6-54)

obtained in the previous section. The basic idea is to identify uncorrelated bases and

choose those bases with dominant energy. Figure 6-6 shows the algorithm.

1. function [μ, U, E, Ū, Ē] = IdentifyBases(T ∈ R^(M×N), δ)
2.   μ ← (1/N) Σ_{n=1}^{N} t_n, where t_n is the n-th column of T
3.   Φ ← [t_1 − μ, t_2 − μ, ..., t_N − μ]
4.   Do eigenvalue decomposition on ΦΦ^T such that

       ΦΦ^T = UΣU^T, (6-55)

     where U = [u_1, ..., u_M], Σ = diag([σ_1², ..., σ_M²]), and σ_1² ≥ σ_2² ≥ ... ≥ σ_M².
5.   j ← argmin_j { Σ_{r=1}^{j} σ_r² ≥ δ Σ_{m=1}^{M} σ_m² }
6.   U ← [u_1, u_2, ..., u_{j−1}]
7.   Ū ← [u_j, u_{j+1}, ..., u_M]
8.   E ← diag([σ_1², ..., σ_{j−1}²])
9.   Ē ← diag([σ_j², ..., σ_M²])
10. end function
Figure 6-6. Function IdentifyBases identifies bases of subspace.

In Figure 6-6, the argument T represents the feature vector set of one segmentation, and

δ is a user-defined parameter which specifies the percentage of energy retained, e.g., 90%.
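A compact NumPy rendering of Figure 6-6 under one reasonable reading of the procedure (mean-centering, eigendecomposition of the scatter matrix, energy-fraction cutoff δ); the original is ambiguous about whether the component at the δ crossing is retained, and this sketch retains it:

```python
import numpy as np

def identify_bases(T, delta=0.9):
    """PCA-based subspace identification in the spirit of Figure 6-6.
    T: M x N matrix whose columns are feature vectors.
    delta: fraction of energy (variance) to retain.
    Returns (mu, U, E, U_bar, E_bar)."""
    mu = T.mean(axis=1, keepdims=True)
    Phi = T - mu                                   # centered data
    C = Phi @ Phi.T                                # M x M scatter matrix
    w, V = np.linalg.eigh(C)                       # ascending eigenvalues
    w, V = w[::-1], V[:, ::-1]                     # sort descending
    # smallest j whose leading components carry >= delta of the energy
    j = int(np.searchsorted(np.cumsum(w) / w.sum(), delta)) + 1
    return (mu, V[:, :j], np.diag(w[:j]),          # dominant bases / energies
            V[:, j:], np.diag(w[j:]))              # null-space bases / energies
```

For feature vectors concentrated along one direction, the function returns a single dominant basis aligned with that direction and relegates the rest to the null space.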

5.3.2 Packet Inter-Arrival Time in Frequency Domain

We show how the same feature becomes a key reliable feature when observed in

the frequency domain. In this new domain, we are interested in whether there exists any

frequency component that captures the majority of the energy of

this stochastic process at the receiver side. We exploit the above by computing the power

spectral density (PSD) of the packet inter-arrival time process of two traces, each

of which is 10 seconds long, in Figures 5-4 and 5-5, respectively for voice and video. We

can see that some regularity exists for both voice and video across different traces, although

the regularity is not very strong. This result holds true for all experiments conducted

when transmitting Skype voice and video packets over the Internet from University A to

University B.
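The experiment can be mimicked with synthetic data: a voice-like inter-arrival sequence with a periodic component produces a clear spectral line in a simple periodogram. All numbers below are illustrative, not the Skype traces.

```python
import numpy as np

def periodogram(x):
    """Simple PSD estimate: squared magnitude of the DFT of the
    mean-removed sequence."""
    x = np.asarray(x, dtype=float)
    X = np.fft.rfft(x - x.mean())
    return np.abs(X) ** 2 / len(x)

# Synthetic inter-arrival sequence (ms): base 33 ms with a periodic
# component every 10 packets plus random jitter.
rng = np.random.default_rng(1)
n = np.arange(512)
x = 33 + 5 * np.cos(2 * np.pi * n / 10) + rng.normal(0, 0.5, n.size)
psd = periodogram(x)
peak = np.argmax(psd)          # index of the dominant frequency bin
```

The dominant bin falls at normalized frequency 1/10 cycles per sample (bin ≈ 512/10), i.e., the PSD exposes the packetization regularity that is hard to see in the time domain.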

Figure 5-4. Power spectral density of two sequences/traces of time-varying inter-arrival times for voice traffic (PSD in dB vs. normalized frequency, ×π rad/sample).

5.3.3 Packet Size in Frequency Domain

Somewhat stronger regularity is visible for voice and video packet sizes. Indeed,

most video coding schemes use two types of frames [58], i.e., Intra frames (I-frame)

and Predicted frames (P-frame). An I-frame is a frame coded without reference to any

frame except itself. It serves as the starting point for a decoder to reconstruct the video

stream. A P-frame may contain both image data and motion vector displacements

[46] T. Karagiannis, A. Broido, N. Brownlee, kc claffy, and M. Faloutsos, "Is p2p dying or
just hiding?" in IEEE Globecom 2004, 2004.

[47] S. Sen, O. Spatscheck, and D. Wang, "Accurate, scalable in network identification of
p2p traffic using application signatures," in WWW, 2004.

[48] K. Wang, G. Cretu, and S. J. Stolfo, "Anomalous payload-based network intrusion
detection," in 7th International Symposium on Recent Advances in Intrusion
Detection, Sept. 2004, pp. 201-222.

[49] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, "Class-of-service mapping for qos:
A statistical signature-based approach to ip traffic classification," in ACM Internet
Measurement Conference, Taormina, Italy, 2004.

[50] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, "Blinc: multilevel traffic
classification in the dark," in SIGCOMM '05: Proceedings of the 2005 conference on
Applications, technologies, architectures, and protocols for computer communications.
New York, NY, USA: ACM Press, 2005, pp. 229-240.

[51] C. Dewes, A. Wichmann, and A. Feldmann, "An analysis of internet chat
systems," in IMC '03: Proceedings of the 3rd ACM SIGCOMM conference on Internet
measurement. New York, NY, USA: ACM Press, 2003, pp. 51-64.

[52] D. Wu, T. Hou, and Y.-Q. Zhang, "Transporting real-time video over the internet:
Challenges and approaches," Proceedings of the IEEE, vol. 88, no. 12, pp. 1855-1875,
December 2000.

[53] ITU-T, "G.711: Pulse code modulation (pcm) of voice frequencies," ITU-T
Recommendation G.711, 1989. [Online]. Available:

[54] ITU-T, "G.726: 40, 32, 24, 16 kbit/s adaptive differential pulse code
modulation (adpcm)," ITU-T Recommendation G.726, 1990. [Online]. Available:

[55] ITU-T, "G.728: Coding of speech at 16 kbit/s using low-delay code excited
linear prediction," ITU-T Recommendation G.728, 1992. [Online]. Available:

[56] ITU-T, "G.729: Coding of speech at 8 kbit/s using conjugate-structure
algebraic-code-excited linear prediction (cs-acelp)," ITU-T Recommendation G.729,
1996. [Online]. Available:

[57] ITU-T, "G.723.1: Dual rate speech coder for multimedia communications transmitting
at 5.3 and 6.3 kbit/s," ITU-T Recommendation G.723.1, 2006. [Online]. Available:

[58] Y. Wang, J. Ostermann, and Y.-Q. Zhang, Video Processing and Communications,
1st ed. Prentice Hall, 2002.

First of all, I thank my advisor, Professor Dapeng Wu, for his great inspiration, excellent guidance, deep thoughts, and friendship. I also thank my supervisory committee members, Professors Shigang Chen, Liuqing Yang, and Tao Li, for their interest in my work. I also express my appreciation to all of the faculty, staff, and my fellow students in the Department of Electrical and Computer Engineering. In particular, I extend my thanks to Dr. Kejie Lu for his helpful discussions.


page ACKNOWLEDGMENTS ................................. 4 LISTOFTABLES ..................................... 8 LISTOFFIGURES .................................... 9 ABSTRACT ........................................ 12 CHAPTER 1INTRODUCTION .................................. 14 1.1IntroductiontoNetworkAnomalyDetection ................. 14 1.2IntroductiontoNetworkCentricTracClassication ............ 16 2NETWORKANOMALYDETECTIONFRAMEWORK ............. 18 2.1Introduction ................................... 18 2.2Edge-RouterBasedNetworkAnomalyDetectionFramework ........ 18 2.2.1TracMonitor ............................. 20 2.2.2LocalAnalyzer ............................. 20 2.2.3GlobalAnalyzer ............................. 21 2.3Summary .................................... 22 3FEATURESFORNETWORKANOMALYDETECTION ............ 23 3.1Introduction ................................... 23 3.2HierarchicalFeatureExtractionArchitecture ................. 24 3.2.1Three-LevelDesign ........................... 24 3.2.2FeatureExtractioninaTracMonitor ................ 26 3.2.3FeatureExtractioninaLocalAnalyzeroraGlobalAnalyzer .... 27 3.3Two-WayMatchingFeatures .......................... 27 3.3.1Motivation ................................ 27 3.3.2DenitionofTwo-WayMatchingFeatures .............. 30 3.4BasicAlgorithms ................................ 32 3.4.1HashTableAlgorithm ......................... 32 3.4.2BloomFilter ............................... 33 3.5BloomFilterArray(BFA) ........................... 35 3.5.1DataStructure ............................. 35 3.5.2Algorithm ................................ 36 3.5.3RoundRobinSlidingWindow ..................... 38 3.5.4Random-KeyedHashFunctions .................... 39 3.6ComplexityAnalysis .............................. 40 3.6.1Space/TimeTrade-o .......................... 41 3.6.2OptimalParameterSettingforBloomFilterArray .......... 50 5


............................... 51 3.7.1TheBFAAlgorithmvs.theHashTableAlgorithm .......... 51 3.7.2ExperimentonFeatureExtractionSystem .............. 55 3.8Summary .................................... 57 4MACHINELEARNINGALGORITHMFORNETWORKANOMALYDETECTION ..................................... 59 4.1Introduction ................................... 59 4.1.1ReceiverOperatingCharacteristicsCurve ............... 59 4.1.2Threshold-BasedAlgorithm ...................... 60 4.1.3Change-PointAlgorithm ........................ 60 4.1.4BayesianDecisionTheory ........................ 62 4.2BayesianModelforNetworkAnomalyDetection ............... 64 4.2.1BayesianModelforTracMonitorsandLocalAnalyzers ...... 64 4.2.2BayesianModelforGlobalAnalyzers ................. 66 4.2.3HiddenMarkovTree(HMT)ModelforGlobalAnalyzer ....... 68 4.3EstimationofHMTParameters ........................ 72 4.3.1LikelihoodEstimation .......................... 72 4.3.2TransitionProbabilityEstimation ................... 76 4.4NetworkAnomalyDetectionUsingHMT ................... 81 4.5SimulationResults ............................... 84 4.5.1ExperimentSetting ........................... 84 4.5.2PerformanceComparison ........................ 86 4.5.3Discussion ................................ 88 4.6Summary .................................... 89 5NETWORKCENTRICTRAFFICCLASSIFICATION:ANOVERVIEW .... 90 5.1Introduction ................................... 90 5.2RelatedWork .................................. 94 5.3IntuitionsBehindaProperDetectionofVoiceandVideoStreams ..... 95 5.3.1PacketInter-ArrivalTimeandPacketSizeinTimeDomain ..... 97 5.3.2PacketInter-ArrivalTimeinFrequencyDomain ........... 99 5.3.3PacketSizeinFrequencyDomain ................... 99 5.3.4CombiningPacketInter-ArrivalTimeandPacketSizeinFrequencyDomain .................................. 100 5.4Summary .................................... 102 6NETWORKCENTRICTRAFFICCLASSIFICATIONSYSTEM ........ 104 6.1SystemArchitecture .............................. 
104 6.1.1FlowSummaryGenerator(FSG) ................... 105 6.1.2FeatureExtactor(FE)andVoice/VideoSubspaceGenerator(SG) 105 6.1.3Voice/VideoCLassifer(CL) ...................... 106 6.2FeatureExtractor(FE)ModuleviaPowerSpectralDensity(PSD) ..... 107 6.2.1Modelingthenetworkowasastochasticdigitalprocess ...... 107 6


............. 108 6.3SubspaceDecompositionandBasesIdenticationonPSDFeatures .... 115 6.3.1SubspaceDecompositionBasedonMinimumCodingLength .... 117 6.3.2SubspaceBasesIdentication ...................... 120 6.4Voice/VideoClassier ............................. 121 6.5ExperimentResults ............................... 123 6.5.1ExperimentSettings ........................... 123 6.5.2SkypeFlowClassication ........................ 124 6.5.3GeneralFlowClassication ....................... 124 6.5.4Discussion ................................ 125 6.6Summary .................................... 128 7CONCLUSIONANDFUTUREWORK ...................... 129 7.1SummaryofNetworkCentricAnomalyDetection .............. 129 7.2SummaryofNetworkCentricTracClassication .............. 131 APPENDIX APROOFS ....................................... 133 A.1Equation( 4{31 ) ................................. 133 A.2Equation( 4{32 ) ................................. 133 A.3Equation( 4{33 ) ................................. 134 A.4Equation( 4{34 ) ................................. 134 REFERENCES ....................................... 135 BIOGRAPHICALSKETCH ................................ 140 7


Table page 3-1Notationsfortwo-waymatchingfeatures ...................... 31 3-2Notationsforcomplexityanalysis .......................... 41 3-3Space/timecomplexityforhashtable,Bloomlter,andBFA ........... 47 4-1ParametersusedinCUSUM ............................. 60 4-2Notationsforhiddenmarkovtreemodel ...................... 70 4-3Parametersettingoffeatureextractionfornetworkanomalydetection ...... 86 4-4Performanceofdierentschemes. .......................... 86 5-1Commonlyusedspeechcodecandtheirspecications ............... 96 6-1TypicalPDandPFAvalues. ............................. 126 8


Figure page 2-1AnISPnetworkarchitecture. ............................ 19 2-2Networkanomalydetectionframework. ....................... 19 2-3Responsibilitiesofandinteractionsamongthetracmonitor,localanalyzer,andglobalanalyzer. ................................. 20 2-4Exampleofasymmetrictracwhosefeatureextractionisdonebytheglobalanalyzer. ........................................ 21 3-1Hierarchicalstructureforfeatureextraction. .................... 24 3-2Networkinnormalcondition. ............................ 28 3-3Source-address-spoofedpackets. ........................... 29 3-4Reroute. ........................................ 29 3-5HashTableAlgorithm ................................ 33 3-6BloomFilterOperations ............................... 34 3-7ScenariosoftheproblemscausedbyBloomlter.(a)Boundaryproblem.(b)Anoutboundpacketarrivesbeforeitsmatchedinboundpacketwitht2t1<. 34 3-8BloomFilterArrayAlgorithm ............................ 37 3-9BloomFilterArrayAlgorithmusingslidingwindow ................ 38 3-10Space/timetrade-oforthehashtable,BFAwith=0:1%,andBFAwith=1% ........................................... 48 3-11Relationamongspacecomplexity,timecomplexity,andcollisionprobability.(a)Mavs..(b)E[Ta]vs.. ........................... 50 3-12Spacecomplexityvs.collisionprobabilityforxedtimecomplexity. ....... 52 3-13Memorysize(inbits)vs.averageprocessingtimeperquery(ins) ....... 53 3-14Averageprocessingtimeperquery(ins)vs.averagenumberofhashfunctioncalculationsperquery. ................................ 54 3-15Comparisonofnumericalandsimulationresults.(a)Hashtablealgorithm.(b)BFAalgorithmwith=1%. ............................. 55 3-16Featuredata:(a)NumberofSYNpackets(link1),(b)NumberofunmatchedSYNpackets(link1),(c)NumberofSYNpackets(link2),and(d)NumberofunmatchedSYNpackets(link2). .......................... 58 9


4-2  Extended generative model including traffic feature vectors: (a) original model and (b) simplified model ... 65
4-3  Generative independent model that describes dependencies among traffic states and traffic feature vectors ... 66
4-4  Generative dependent model that describes dependencies among edge routers ... 67
4-5  Hidden Markov tree model, defining for each node i its parent node and the set of its children nodes ... 69
4-6  Probability density function of the univariate Gaussian distribution N(x; 0, 1) ... 73
4-7  Histogram of the two-way matching features measured at a real network during network anomalies ... 73
4-8  The EM algorithm for estimating the conditional feature distributions p(· | s = u), u ∈ {0, 1} ... 75
4-9  Iteratively estimate transition probabilities .................. 77
4-10 Belief propagation algorithm ................................... 78
4-11 Viterbi algorithm for HMT decoding ............................. 82
4-12 Experiment network ............................................. 85
4-13 Performance of threshold-based and machine learning algorithms with different feature data ... 87
4-14 Performance of four detection algorithms ....................... 88
5-1  Average packet size versus inter-arrival variability metric for 5 applications: voice, video, file transfer, mix of file transfer with voice and video ... 96
5-2  Inter-arrival time distribution for voice and video traffic .... 97
5-3  Packet size distribution for voice and video traffic ........... 98
5-4  Power spectral density of two sequences/traces of time-varying inter-arrival times for voice traffic ... 99
5-5  Power spectral density of two sequences of time-varying inter-arrival times for video traffic ... 100
5-6  Power spectral density of two sequences of discrete-time packet sizes for voice traffic ... 101


5-8  Power spectral density of two sequences of continuous-time packet sizes for voice traffic ... 102
5-9  Power spectral density of two sequences of continuous-time packet sizes for video traffic ... 102
6-1  VOV Classifier System Architecture ............................. 104
6-2  Power spectral density feature extraction module: cascade of processing steps ... 107
6-3  Levinson-Durbin Algorithm ...................................... 113
6-4  Parametric PSD Estimate using Levinson-Durbin Algorithm ........ 114
6-5  Pairwise steepest descent method to achieve minimal coding length ... 119
6-6  Function IdentifyBases identifies bases of subspace ............ 120
6-7  Function VoiceVideoClassify determines whether a flow with a given PSD feature vector is of type voice, video, or neither, given two user-specified threshold arguments. It uses Function NormalizedDistance to calculate the normalized distance between a feature vector and a subspace ... 122
6-8  The ROC curves of single-typed flows generated by Skype: (a) VOICE and (b) VIDEO ... 124
6-9  The ROC curves of hybrid flows generated by Skype: (a) VOICE, (b) VIDEO, (c) FILE+VOICE, and (d) FILE+VIDEO ... 125
6-10 The ROC curves of single-typed flows generated by Skype, MSN, and GTalk: (a) VOICE and (b) VIDEO ... 126
6-11 The ROC curves of hybrid flows generated by Skype, MSN, and GTalk: (a) VOICE, (b) VIDEO, (c) FILE+VOICE, and (d) FILE+VIDEO ... 127






[1, 2] has risen from approximately 9,472,000 in January 1996 to 394,991,609 in January 2006. In addition, the emergence of new applications and protocols, such as voice over Internet Protocol (VoIP), peer-to-peer (P2P), and video on demand (VoD) [3], also increases the complexity of the Internet. Accompanying this trend is an increasing demand for more reliable and secure service. A major challenge for Internet service providers (ISPs) is to better understand the network state by analyzing network traffic in real time. Thus ISPs are very interested in the problem of network centric traffic analysis. We consider the network centric traffic analysis problem from two perspectives: 1) network anomaly detection and 2) network centric traffic classification. We introduce the two perspectives in the next two sections.




[4] summarized techniques for QoS provision for real-time streams from the point of view of end hosts. These techniques include coding methods, protocols, and requirements on stream servers. Another effective solution is from the point of view of network carriers or ISPs. For example, ISPs can assign different forwarding priority to different types of network traffic on routers. This is the motivation of differentiated services (DiffServ) [5, 6]. DiffServ is a method designed to guarantee different levels of QoS for different classes of network traffic. It is achieved by setting the "type of service" (TOS) [7] field, which hence is also called the DiffServ code point (DSCP) [5], in the IP header according to the class of the network data, so that the better classes get higher numbers. Unfortunately, such a design depends heavily on network protocols, especially proprietary protocols, observing DiffServ regulations. In the worst case, if all protocols set TOS to the highest number, employing DiffServ can even degrade performance. For this reason, we believe a proper DiffServ scheme should be able to classify network traffic on the fly, instead of relying on any tags in the packet header. Thus, the difficulty lies in accurate classification of network traffic in real time.




[8], spectral analysis [9, 10], statistical methods [11-13], and machine learning techniques [14] can be used. To identify network anomaly sources, IP traceback [15] is typically used. IP traceback techniques can help contain the attack sources, but they require large-scale deployment of the same IP traceback technique and need modification of existing IP forwarding mechanisms (e.g., IP header processing).

This chapter presents our network anomaly detection framework, which is of the network-based category. We present our framework design in Section 2.2 and summarize this chapter in Section 2.3.

(Figure 2-1). It consists of two types of IP routers, i.e., core routers and edge


Figure 2-1. An ISP network architecture.

routers. Core routers interconnect with one another to form a high-speed autonomous system (AS). In contrast, edge routers are responsible for connecting subnets (i.e., customer networks or other ISP networks) with the AS. In this paper, a subnet can be either a customer network or an ISP network.

Figure 2-2. Network anomaly detection framework.


Figure 2-3. Responsibilities of and interactions among the traffic monitor, local analyzer, and global analyzer.

Given such an ISP network architecture, we design a framework to detect network anomalies. Our framework (Figure 2-2) consists of three types of components: traffic monitors, local analyzers, and a global analyzer. Figure 2-3 summarizes the functionalities of each type of component and their interactions. Next, we discuss the functionalities of traffic monitors, local analyzers, and the global analyzer in Sections 2.2.1, 2.2.2, and 2.2.3, respectively.

Each traffic monitor (Figure 2-2) is responsible for: scanning partial or all packets of a single unidirectional link; summarizing traffic characteristics; extracting simple features from the traffic characteristics; making decisions (e.g., declare a network anomaly or classify the type of normal traffic) on one single unidirectional link; and reporting the summary of traffic information, simple feature data, and decisions to a local analyzer.


Figure 2-4. Example of asymmetric traffic whose feature extraction is done by the global analyzer.

The global analyzer has a global view of the whole network. Hence, it exploits both temporal correlation and spatial correlation of traffic. Here it is important to note that some feature data must be obtained at the global analyzer if global information is required. For example, in Figure 2-4, if the traffic from subnet A to server B passes through edge router X, and the traffic from server B to subnet A passes through edge




[12] proposed the number of new source IP addresses to detect DDoS attacks, under the assumption that source addresses of IP packets observed at an edge router are relatively static in normal conditions compared with those during DDoS attacks. Peng further pointed out that the feature can differentiate DDoS attacks from a flash crowd, which represents the situation when many legitimate users start to access one service at the same time, for example, when many people watch a live sports broadcast over the Internet simultaneously. In both cases (DDoS attacks and the flash crowd), the traffic rate is high. But during DDoS attacks, the edge routers will observe many new source IP addresses because attackers usually spoof the source IP addresses of attacking packets to hide their identities. Therefore, this feature improves those DDoS detection schemes that rely on traffic rate only.

However, Peng et al. [12] focused on detection of DDoS attacks and did not consider other types of network anomalies. For example, when malicious users are scanning the network, we can also observe a high traffic rate but few new source IP addresses. It is very important to differentiate network scanning from a flash crowd because the former is malicious but the latter is not. The two-way matching features on different network layers (Section 3.3.1) can tell not only the presence of network anomalies but also their cause.

Lakhina et al. [16] summarized the characteristics of network anomalies under different causes; this contribution helps identify the causes of network anomalies. For example, during DDoS attacks, we can observe high bit rate, high packet rate, and high flow rate, and the source addresses are distributed over the whole IP address space. On the other hand, during network scanning, all three rates are high, but the destination addresses,


Figure 3-1. Hierarchical structure for feature extraction.

To efficiently extract features from traffic, we design a three-level hierarchical structure (Figure 3-1), where incoming packets are processed by level-one filters, then by level-two filters, and finally by (level-three) feature extraction modules. Level-one filters


(Figure 3-1). On the other hand, a feature module may need packets from multiple level-two filters. For example, the SYN/FIN(RST) ratio feature extraction requires packets from three filters (Figure 3-1). Compared to the packet classification schemes developed by Wang et al. [11] and Peng et al. [12], our hierarchical structure for feature extraction is more general and efficient. Next, we describe the most important module in the three-level hierarchical structure, the feature extraction module.


[11, 12], we generate features in a discrete manner, i.e., our feature extraction module generates a (feature) value or a vector at the end of each time slot. Intuitively, a shorter slot duration may reduce the detection delay, which is defined as the interval from the epoch when the anomaly starts to the epoch when the anomaly is detected; but a smaller duration may increase the computational complexity, since the detection algorithm needs to analyze more feature data for the same time interval. On the other hand, if a feature is represented by a ratio, the slot duration must be sufficiently large to avoid division by zero. For example, if we want to use the SYN/FIN(RST) ratio as in Ref. [11] to detect TCP SYN flood, then the slot duration cannot be too small, because the number of FIN packets in a short period can be 0, which will result in a false alarm even if the number of SYN packets is not large. Feature extraction can be done in a traffic monitor, a local analyzer, and a global analyzer, which will be described in Sections 3.2.2 and 3.2.3, respectively.

[12]. So is data rate. Data rate: defined as the total number of bits of all packets that arrive in one time slot. SYN/FIN(RST) ratio [17].
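The slot-based feature generation described above can be sketched as follows. This is a minimal illustration, not the dissertation's implementation; the class and field names (`SlotFeatureExtractor`, `on_packet`) are hypothetical, and the division-by-zero guard mirrors the SYN/FIN hazard discussed in the text.

```python
from collections import defaultdict

class SlotFeatureExtractor:
    """Emit one feature value/vector per fixed-length time slot (a sketch)."""

    def __init__(self, slot_duration):
        self.slot_duration = slot_duration  # slot length in seconds
        self.slots = defaultdict(lambda: {"pkts": 0, "bits": 0, "syn": 0, "fin": 0})

    def on_packet(self, timestamp, nbits, is_syn=False, is_fin=False):
        j = int(timestamp // self.slot_duration)  # slot index of this packet
        s = self.slots[j]
        s["pkts"] += 1          # packet-rate counter
        s["bits"] += nbits      # data-rate counter
        s["syn"] += is_syn
        s["fin"] += is_fin

    def features(self, j):
        s = self.slots[j]
        # A short slot can see zero FIN packets: exactly the
        # division-by-zero hazard for ratio features noted above.
        ratio = s["syn"] / s["fin"] if s["fin"] else float("inf")
        return s["pkts"], s["bits"], ratio

ex = SlotFeatureExtractor(slot_duration=10.0)
ex.on_packet(1.2, 4000, is_syn=True)
ex.on_packet(3.4, 12000, is_fin=True)
ex.on_packet(12.0, 8000, is_syn=True)  # falls in the next slot
print(ex.features(0))  # → (2, 16000, 1.0)
```

A longer slot duration trades detection delay for a lower chance of the degenerate zero-denominator case.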


[17] and the percentage of new IP addresses proposed in Ref. [12]. However, the existing features such as the SYN/SYN-ACK ratio [17] and the percentage of new IP addresses [12] either do not lead to good performance of detectors, or require high storage/time complexity (Section 3.1). To address these deficiencies, we propose a new type of feature called two-way matching features, which make the distinction between normal and attack traffic more pronounced, thereby improving the accuracy of detecting attacks. Next, we discuss the two-way matching features and the extraction scheme.

3.3.1 Motivation

The motivation for using two-way matching features arises from the fact that, for most Internet applications, packets are generated by both end hosts engaged in communication. Information carried by packets in one direction shall match the corresponding information carried by packets in the other direction. By monitoring the degree of mismatch between flows of the two directions, we can detect network anomalies. To illustrate this, let us consider the behaviors of two-way traffic in three scenarios, namely, 1) normal conditions, 2) DDoS attacks, and 3) re-route.

In the first scenario, when the network of an ISP works normally, information carried on both directions of communication matches (Figure 3-2). Host a and host v are two


Figure 3-2. Network in normal condition.

ends of communication (assume that host v is within the autonomous system of the ISP while host a is not). Host a sends a packet to host v, and v responds with a packet back to host a. Both packets pass the edge router A. From the point of view of local analyzer 1 attached to edge router A, we define the first packet as an inbound packet and the second packet as an outbound packet. The source IP address (SA) and destination IP address (DA) of the inbound packet match the DA and SA of the outbound packet. If the communication is based on UDP or TCP, we can further observe that the source port (SP) and destination port (DP) of the inbound packet match the DP and SP of the outbound packet. Therefore, local analyzer 1 observes matched inbound and outbound packets in normal conditions. In the example of Figure 3-2, it is assumed that border gateway protocol (BGP) routing makes the inbound packets and the corresponding outbound packets pass through the same edge router. If BGP routing makes the inbound packets and the corresponding outbound packets go through different edge routers (Figure 2-4), the matching can still be achieved by a global analyzer (Section 2.2.3): multiple local analyzers convey the unmatched inbound packets and the corresponding outbound packets to the global analyzer, which has the routing information of the whole autonomous system.


Figure 3-3. Source-address-spoofed packets.

In the second scenario, when attackers launch spoofed-source-IP-address DDoS attacks [18], local analyzer 1 observes many unmatched inbound packets (Figure 3-3). Since the source addresses of the inbound packets are spoofed, the outbound packets are routed to the nominal destinations, i.e., b and c in Figure 3-3, and no longer pass through edge router A.

Figure 3-4. Reroute.

In the third scenario (Figure 3-4), the number of unmatched inbound packets observed by local analyzer 1 increases due to a failure of the original route and the re-routing of outbound packets to another edge router. A global analyzer can address this problem similarly to the asymmetric case in the first scenario.


In Figure 3-2, if host a is a client uploading a large file using the File Transfer Protocol (FTP) [19] to host v, there will be many more packets from a to v than from v to a. Uploading a file to an FTP server is a normal behavior, but the number of unmatched inbound packets is very high in this case. Therefore, it is more appropriate to use flow-level quantities (instead of packet-level quantities) as features for network anomaly detection. As in the above FTP case, when a TCP connection is established, all packets in one direction constitute one flow and packets in the reverse direction constitute another flow. No matter how many packets are sent in each direction, there are only one inbound flow and only one outbound flow, and they match in IP addresses and port numbers. Therefore, we call the number of unmatched inbound flows a two-way matching feature. Two-way matching features have been shown to be effective indicators of network anomalies [20].

Sections 3.4 and 3.5. Next, we define the two-way matching features.


Table 3-1: Notations for two-way matching features (columns: Notation, Description).

Based on the above definitions, we define the two-way matching features to be the number of UIF (unmatched inbound flows). Table 3-1 lists the notations used in the rest of the paper, where Z+ stands for the nonnegative integer set. In the following sections, we present algorithms to extract two-way matching features from the traffic at local analyzers. Note that two-way matching features should be


(Figure 3-5), where the argument s is the signature extracted from a packet.


function HashTableInsert(V, s)
    for i ← 0 to K − 1
        if V[h_i(s)] is empty
            insert s into V[h_i(s)], set state bit of V[h_i(s)] to "UNMATCHED"
            return
        end if
    end for
    report insertion operation error
end function

function HashTableSearch(V, s)
    for i ← 0 to K − 1
        if V[h_i(s)] is empty
            return false
        if V[h_i(s)] holds s
            return true
    end for
    return false
end function

function HashTableRemove(V, s)
    for i ← 0 to K − 1
        if V[h_i(s)] is empty
            return
        if V[h_i(s)] holds s
            set state bit of V[h_i(s)] to "MATCHED"
            return true
        end if
    end for
end function

Figure 3-5. Hash Table Algorithm
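Figure 3-5 can be realized concretely as below. This is a sketch, not the dissertation's code: it assumes K alternative probe positions per signature, one MATCHED/UNMATCHED state per cell, and a keyed-MD5 hash family (anticipating Section 3.5); all names are illustrative.

```python
import hashlib

UNMATCHED, MATCHED = 0, 1

class TwoWayHashTable:
    """K-probe hash table storing packet signatures plus a state bit (sketch)."""

    def __init__(self, ncells, nprobes):
        self.cells = [None] * ncells   # each cell: [signature, state] or None
        self.nprobes = nprobes         # K candidate positions per signature

    def _h(self, i, sig):
        # i-th hash function: key i prepended to the signature, kernel MD5,
        # digest reduced modulo the table size.
        d = hashlib.md5(i.to_bytes(4, "big") + sig).digest()
        return int.from_bytes(d[:8], "big") % len(self.cells)

    def insert(self, sig):
        for i in range(self.nprobes):
            idx = self._h(i, sig)
            if self.cells[idx] is None:
                self.cells[idx] = [sig, UNMATCHED]
                return True
        return False                   # all K candidate cells occupied

    def search(self, sig):
        for i in range(self.nprobes):
            cell = self.cells[self._h(i, sig)]
            if cell is None:
                return False           # first empty probe ends the search
            if cell[0] == sig:
                return True
        return False

    def remove(self, sig):
        for i in range(self.nprobes):
            cell = self.cells[self._h(i, sig)]
            if cell is None:
                return
            if cell[0] == sig:
                cell[1] = MATCHED      # mark matched instead of deleting
                return
```

As in the pseudocode, a search stops at the first empty probe position, which is what ties the expected probe count to the table's empty ratio in the analysis of Section 3.6.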


Figure 3-6 describes the insertion and search operations of the Bloom filter.

function BloomFilterInsert(V, s)
    for all i ∈ Z_K do
        V[h_i(s)] ← 1
end function

function BloomFilterSearch(V, s)
    for all i ∈ Z_K do
        if V[h_i(s)] ≠ 1 then
            return false
    end for
    return true
end function

Figure 3-6. Bloom Filter Operations

Figure 3-7. Scenarios of the problems caused by Bloom filter. (a) Boundary problem. (b) An outbound packet arrives before its matched inbound packet with t2 − t1 < δ.

Although the Bloom filter has better performance in the sense of the space/time trade-off, it cannot be directly applied to our application because of the following problems:

1. The Bloom filter does not provide removal functionality. Since one bit in the vector may be mapped to by more than one item, it is unsuitable to remove an item by setting all bits indexed by its hash results to 0.

2. The Bloom filter does not have counting functionality. Although the counting Bloom filter [22] can be used for counting, it replaces each bit with a counter, which significantly increases the space complexity.
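As a concrete illustration of Figure 3-6, the textbook Bloom filter takes only a few lines. This sketch is the standard structure, not the dissertation's code; the keyed-MD5 hash family is an assumption carried over from the hash-function discussion in Section 3.5.

```python
import hashlib

class BloomFilter:
    """Standard Bloom filter: K hashed bit positions per item, no removal."""

    def __init__(self, nbits, nhashes):
        self.bits = bytearray(nbits)  # one byte per logical bit, for clarity
        self.nhashes = nhashes        # K

    def _positions(self, item):
        for i in range(self.nhashes):
            d = hashlib.md5(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(d[:8], "big") % len(self.bits)

    def insert(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def search(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits[p] == 1 for p in self._positions(item))

bf = BloomFilter(nbits=1 << 15, nhashes=8)
bf.insert(b"10.0.0.1|10.0.0.2|1234|80")
print(bf.search(b"10.0.0.1|10.0.0.2|1234|80"))  # → True
```

The two limitations listed above are visible directly: there is no way to unset bits safely (no removal), and membership answers carry no counts.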


(Figure 3-7(a)). An inbound packet arrives at time t'1 ∈ [t_i, t_{i+1}) whereas its matched outbound packet arrives within the next period. The inbound packet is counted as an unmatched inbound packet even though t'2 − t'1 < δ. Therefore, the boundary effect increases the false alarm rate.

4. In the previous discussion, we did not consider the scenario that an outbound packet may arrive before its matched inbound packet (Figure 3-7(b)). When the outbound packet arrives at time t'1, its signature is not in the buffer, so we do nothing. At time t'2, its matched inbound packet arrives, whose inbound signature will be recorded. As a result, the latter is regarded as an unmatched inbound packet during the period [t_i, t_{i+1}). This early-arrival problem also increases the false alarm rate.

Next, we propose a Bloom filter array algorithm to address the above problems.

(Section 3.4.2). Our idea is to design a Bloom filter array (BFA) with the following functionalities, not available in the original Bloom filter [21, 23]:

1. Removal functionality: We implement insertion and removal operations synergistically by using insertion-removal pair vectors. The trick is that, rather than removing an outbound signature from the insertion vector, we create a removal vector and insert the outbound signature into the removal vector.

2. Counting functionality: We implement this by introducing counters in the Bloom filter array. The value of a counter is changed based on the query result from an insertion/removal operation.

3. Boundary effect abatement: We use multiple time slots and a sliding window to mitigate the boundary effect.

4. Resolving the early-arrival problem: This is achieved by storing the signatures of not only inbound packets but also outbound packets. In this way, when an inbound packet arrives and the signature of its matched outbound packet is present, we do not count this inbound packet as an unmatched one.

(Section 3.5.3).


The BFA algorithm (Figure 3-8) consists of three functions, namely, ProcInbound, ProcOutbound, and Sample, which are described below.

Function ProcInbound processes inbound packets. It works as follows. When an inbound packet arrives during [t_j, t_{j+1}), we insert its inbound signature s into IV_j and increase C_j by 1 if neither of the following conditions is satisfied:

1. s is stored in at least one RV_{j'}, where j − w + 1 ≤ j' ≤ j;

2. s is stored in IV_j.

Condition 1 being true means that the corresponding outbound flow of this inbound packet has been observed previously, so we should not count it as an unmatched inbound packet. Condition 2 being true means that the inbound flow to which this inbound packet belongs has been observed during the current slot j, so we should not count the same inbound flow again. If both conditions are false, we increase C_j by one to indicate a new potential UIF (lines 7 to 10).


 1. function ProcInbound(s)
 2.     a ← false, b ← false
 3.     if ∃ j', j − w + 1 ≤ j' ≤ j, such that BloomFilterSearch(RV_{j'}, s) returns true then
 4.         a ← true
 5.     if BloomFilterSearch(IV_j, s) returns true then
 6.         b ← true
 7.     if a and b are both false
 8.         C_j ← C_j + 1
 9.         BloomFilterInsert(IV_j, s)
10.     end if
11. end function
12. function ProcOutbound(s')
13.     for j' ← j down to j − w + 1
14.         if BloomFilterSearch(RV_{j'}, s') returns true
15.             break
16.         if BloomFilterSearch(IV_{j'}, s') returns true
17.             C_{j'} ← C_{j'} − 1
18.     end for
19.     BloomFilterInsert(RV_j, s')
20. end function
21. function Sample(j)
22.     return C_{j−w+1}
23. end function

Figure 3-8. Bloom Filter Array Algorithm

Function ProcOutbound processes outbound packets. It works as follows. When an outbound packet arrives during [t_j, t_{j+1}), we check whether we need to update the counter C_{j'} for each j' (j − w + 1 ≤ j' ≤ j). Specifically, for each such j', we decrease C_{j'} by one if its outbound signature s' satisfies both of the following conditions:

1. s' is not contained in RV_{j'};

2. s' is contained in IV_{j'}.

Condition 1 being true means that no packet from the outbound flow to which this outbound packet belongs arrived during the j'-th time slot. Condition 2 being true means that the matched inbound flow of this outbound packet has been observed in the j'-th slot. Satisfying both conditions means that its matched inbound flow has been counted as a potential UIF; hence, upon the arrival of the outbound packet, we need to decrease C_{j'} by


Line 13 starts a loop that iterates j' from j down to j − w + 1. Condition 1 is checked in lines 14 to 15 and Condition 2 is checked in lines 16 to 17. Note that the loop exits (line 15) if RV_{j'} contains s'; this is because an outbound packet of the same flow arrived in that j'-th slot and hence the buffer of the j-th slot (for each j

Figure 3-9 shows the improved version of BFA, based on the round-robin sliding window.
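Putting the pieces of Figures 3-8 and 3-9 together, the window/counter mechanics of the BFA can be sketched as follows. To keep the sliding-window logic in view, the Bloom filter internals are modeled here by plain Python sets (which have no false positives), so this is a functional model of the BFA's counting behavior, not its space-efficient realization; all names are illustrative.

```python
from collections import deque

class BloomFilterArrayModel:
    """Functional model of the BFA: w slots, each holding an insertion buffer
    (IV), a removal buffer (RV), and a counter C of potential unmatched
    inbound flows. The deque acts as the round-robin sliding window."""

    def __init__(self, window):
        self.slots = deque(
            [{"IV": set(), "RV": set(), "C": 0} for _ in range(window)],
            maxlen=window,
        )

    def new_slot(self):
        # Advance the window: the oldest slot is recycled automatically.
        self.slots.append({"IV": set(), "RV": set(), "C": 0})

    def proc_inbound(self, sig):
        cur = self.slots[-1]
        seen_outbound = any(sig in s["RV"] for s in self.slots)  # condition 1
        seen_inbound = sig in cur["IV"]                          # condition 2
        if not seen_outbound and not seen_inbound:
            cur["C"] += 1        # a new potential unmatched inbound flow
            cur["IV"].add(sig)

    def proc_outbound(self, sig):
        for s in reversed(self.slots):   # newest slot back toward the oldest
            if sig in s["RV"]:
                break                    # this outbound flow was already seen
            if sig in s["IV"]:
                s["C"] -= 1              # its inbound match is no longer a UIF
        self.slots[-1]["RV"].add(sig)

    def sample(self):
        # Report the oldest slot, whose matching window has fully elapsed.
        return self.slots[0]["C"]

bfa = BloomFilterArrayModel(window=3)
bfa.proc_inbound("flowA")
bfa.proc_inbound("flowB")
bfa.proc_outbound("flowA")   # flowA's reply arrives: its count is taken back
print(bfa.slots[-1]["C"])    # → 1 (only flowB remains unmatched)
```

Storing outbound signatures in RV is what handles the early-arrival problem: a later inbound packet of "flowA" finds its reply already recorded and is not counted.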


[24] as the hash function. Since MD5 takes any number of bits as input, we can organize key_i and x into a bit vector and apply MD5 to it. Using keyed hash functions, the first concern (varying K) can be addressed straightforwardly. Specifically, when K is changed, we simply generate a corresponding number of random keys. Applying these K keys to the same kernel hash function, we obtain K hash functions. Hence, our method has two advantages: 1) the number of hash functions can be specified on the fly; 2) the hash functions are determined on the fly, instead of being stored a priori, resulting in storage savings. The second concern (changing hash functions) can also be addressed if the keys are periodically changed. Even if the kernel hash function is disclosed, it is still very difficult, if not impossible, for an attacker to guess the changing random keys.

Note that the collision probability of the hash functions is not affected by the use of keyed hash functions. In the case of random-keyed hash functions, the collision probability of h_i(x) depends on not only the collision probability of h but also the correlation between key_i and x. Since random number generator techniques are mature enough that we can assume independence between key_i and x, the introduction of random keys has no effect on the collision probability.

In Section 3.6.1, we analyze the space/time trade-off of the three algorithms. Section 3.6.2 addresses how to optimally choose the parameters of BFA.
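The keyed-hash construction above, one kernel hash plus K random keys yielding K hash functions that can be regenerated on the fly, can be sketched as follows. MD5 as the kernel hash follows the text's reference [24]; the key length and the reduction of the digest to a table index are illustrative choices, not the dissertation's exact parameters.

```python
import hashlib
import os

def make_hash_family(num_funcs, table_size, key_len=16):
    """Derive num_funcs hash functions from one kernel hash (MD5) and
    num_funcs random keys; re-keying means regenerating the keys."""
    keys = [os.urandom(key_len) for _ in range(num_funcs)]

    def h(i, x):
        # Concatenate key_i with the input bit string and apply the kernel hash.
        digest = hashlib.md5(keys[i] + x).digest()
        return int.from_bytes(digest[:8], "big") % table_size

    return h

h = make_hash_family(num_funcs=8, table_size=1 << 15)
sig = b"10.0.0.1|10.0.0.2|1234|80"
positions = [h(i, sig) for i in range(8)]
# Deterministic for a fixed key set: the same signature always maps
# to the same K positions until the keys are rotated.
assert positions == [h(i, sig) for i in range(8)]
```

Changing K is just a matter of calling `make_hash_family` with a different `num_funcs`; rotating keys defeats an attacker who learns the kernel hash.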


[21]. However, the analysis by Bloom [21] is not directly applicable to our setting for the following reasons:

1. A static data set was assumed by Bloom [21]. However, our feature extraction deals with a dynamic data set, i.e., the number of elements in the data set changes over time. Hence, new analysis for a dynamic data set is needed. In addition, Bloom [21] only considered the search operation due to the assumption of static data sets. Our feature extraction, on the other hand, requires three operations, i.e., insertion, search, and removal, for dynamic data sets.

2. Bloom [21] assumed bit-comparison hardware in the time complexity analysis. However, current computers usually use word (or multiple-bit) comparison, which is more efficient than bit-comparison hardware. Hence, it is necessary to analyze the complexity based on word comparison.

3. The time complexity obtained by Bloom [21] did not include hash function calculations. However, hash function calculation dominates the overall time complexity; e.g., calculating one hash function based on MD5 takes 64 clock cycles [25], while one word comparison usually takes fewer than 8 clock cycles [26].

For the above reasons, we develop new analyses for the hash table and the Bloom filter, respectively. In addition, we analyze the performance of BFA and use numerical results to compare the three algorithms. Table 3-2 lists the notations used in the analysis.

Table 3-2: Notations for complexity analysis (columns: Notation, Description).

(Figure 3-5) checks if its inbound signature s is in the table. Because


in Section 3.4.1, the hash table has ℓ cells of b + 1 bits each, such that Mh = ℓ(b + 1). Given the condition that N flows have been recorded by the hash table, the empty ratio is γ = (ℓ − N)/ℓ = (Mh − N(b + 1))/Mh.


Substituting Equation (3-5) into Equation (3-4), we obtain the expectation of Th: E[Th] = 1/γ = Mh/(Mh − N(b + 1)). Equation (3-6) gives the space/time trade-off (i.e., Mh vs. Th) of the hash table method.

(Section 3.4.2). The choice of Mb affects the accuracy of the search function, BloomFilterSearch (see Figure 3-6). The reason is the following. When the signatures of N flows are stored in V, the fraction of entries of V with value 0, denoted by φ, is φ = (1 − K/Mb)^N ≈ e^{−KN/Mb}.


ξ = (1 − φ)^K ≈ (1 − e^{−KN/Mb})^K. From Equation (3-10), it can be observed that ξ decreases with Mb if K is fixed. Based on Equation (3-10), we can denote Mb as a function of ξ and K as below: Mb = R(ξ, K). Equation (3-11) gives the space complexity of the Bloom filter as a function of the collision probability and the number of hash functions.

Now, let us consider the time complexity of the Bloom filter. Denote by Tb the random variable representing the number of hash function calculations. Function BloomFilterInsert always calculates all K hash functions, that is, Tb | {BloomFilterInsert is executed} = K.

(Section 3.6.1). In general, Pr[Tb = x | N = n and BloomFilterSearch is executed] = (1 − φ)^{x−1} φ for x < K, and (1 − φ)^{K−1} for x = K.
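The formulas above can be evaluated numerically. The sketch below uses the standard approximation ξ ≈ (1 − e^{−KN/Mb})^K and a simple doubling search for the inverse R(ξ, K); both are assumptions consistent with, but not copied from, the dissertation's Equations (3-10) and (3-11), and the flow count is hypothetical.

```python
import math

def collision_probability(m_bits, k, n_flows):
    """Approximate Bloom filter collision (false-positive) probability."""
    return (1.0 - math.exp(-k * n_flows / m_bits)) ** k

def required_bits(target_xi, k, n_flows):
    """R(xi, K): smallest power-of-two vector size meeting the target."""
    m = 1
    while collision_probability(m, k, n_flows) > target_xi:
        m *= 2   # the probability is monotone decreasing in m for fixed K
    return m

n = 20_000   # hypothetical number of recorded flows
for k in (2, 4, 8):
    m = required_bits(0.001, k, n)
    print(k, m, collision_probability(m, k, n))
```

Running this for several K values illustrates the two opposing effects discussed later (Cases A and B in Section 3.6.1): too few hash functions and too many hash functions both inflate the storage needed for a fixed collision probability.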

Evaluating φ at Mb = R(ξ, K) gives φ = (1 − K/R(ξ, K))^n ≜ φ_n(ξ, K). From Equation (3-15), we get the expectation of Tb under the condition that BloomFilterSearch is executed, i.e., E[Tb | BloomFilterSearch is executed] = (1 − (1 − φ)^K)/φ. Equation (3-17) gives the time complexity of the Bloom filter in terms of the number of hash function calculations.

(Section 3.6.1) can be applied here since BFA originates from the standard Bloom filter. However, there are some differences between the two schemes. As described in Section 3.5, BFA has multiple buffers such as IV_j, RV_j, and C_j, j ∈ Z_w. Therefore, the storage size for BFA, denoted by Ma (in bits), is w(2Mv + L), where Mv is the size of each insertion or removal vector, and L is the size of each counter in bits.


By Equation (3-10) with Mb replaced by Mv, the collision probability is ξ = (1 − e^{−KN/Mv})^K. As in Equation (3-11), Mv is a function of ξ and K, and we define Mv = R(ξ, K). Equation (3-20) gives the space complexity of BFA.

Now, let us consider the time complexity of BFA. Denote by Ta the random variable representing the number of hash function calculations for BFA. Recall that BFA (Figure 3-9) defines three functions, ProcInbound, ProcOutbound, and Sample. Obviously, Ta | {Sample is executed} = 0.

(Equation (3-12)). 2. Otherwise, at least one of a and b is true; then at least one of the search operations, i.e., BloomFilterSearch(RV_{j'}, s), j' = (I−w+1)%w, (I−w+2)%w, ..., I%w, and BloomFilterSearch(IV_I, s), returns true. This also means that K hash functions have been calculated (see Equation (3-13)). Therefore, in any case, ProcInbound calculates all K hash functions. Further note that, although BloomFilterSearch executes up to w + 1 search operations, and at most one insertion operation, the total number of hash function calculations in these operations


Combining Equations (3-21), (3-22), and (3-23) and assuming τ(Rpi + Rpo) ≫ 1, which is always true in our design of BFA, we have E[Ta] = (0 · (1/τ) + K(Rpi + Rpo)) / ((1/τ) + Rpi + Rpo) ≈ K. Combining Equations (3-24) and (3-20), we obtain the relationship between Ma and Ta as below: Ma = w[2R(ξ, E[Ta]) + L].

Table 3-3: Space/time complexity for hash table, Bloom filter, and BFA

Algorithm     Space complexity                         Time complexity
Hash table    Mh (free variable)                       Equation (3-6)
Bloom filter  Equations (3-10) and (3-11)              Equations (3-15), (3-16), and (3-17)
BFA           Equations (3-18), (3-19), and (3-20)     Equation (3-24)

Table 3-3 lists the space complexity and time complexity of the hash table, Bloom filter, and BFA algorithms.


Figure 3-10. Space/time trade-off for the hash table, BFA with ξ = 0.1%, and BFA with ξ = 1%.

Figure 3-10 shows M vs. E[T] for the hash table scheme, BFA with collision probability 1%, and BFA with collision probability 0.1%. In Figure 3-10, the X axis represents the time complexity (i.e., the expected number of hash function calculations) and the Y axis represents the space complexity (i.e., the number of bits needed for storage). From Figure 3-10, we can see that the curve of BFA is below the curve of the hash table. It means BFA uses less space for a given time complexity. Therefore, BFA achieves a better


Figure 3-10 shows that for the hash table scheme, Mh is a monotonically decreasing function of E[Th]. This observation matches our intuition that the larger the table, the smaller the collision probability of the hash functions, resulting in fewer hash function calculations. Further note that Mh approaches Rτ(b + 1) as E[Th] increases. This is the minimum space required to tolerate up to Rτ flows.

For BFA, Ma is not a monotonic function of E[Ta], which approximately equals K. We have the following observations.

Case A: For fixed storage size, the smaller K, the larger the probability that all K hash functions of two different inputs return the same outputs, which is the collision probability. In other words, the smaller K, the larger the storage size required to achieve a fixed collision probability. That is, K↓ ⇒ Ma↑.

Case B: Since an input to BFA may set K bits to "1" in a vector V, the larger K, the more bits in V will be set to "1" (nonempty), which translates into a larger collision probability. In other words, the larger K, the larger the storage size required to achieve a fixed collision probability. That is, K↑ ⇒ Ma↑.

Combining Cases A and B, it can be argued that there exists a value of K or E[Ta] that achieves the minimum value of Ma, given a fixed collision probability. This minimum property can be used to guide the parameter setting of BFA, which will be addressed in Section 3.6.2.


In Equation (3-25), three parameters, Ma, E[Ta], and ξ, are coupled. Since the collision probability critically affects the detection error rate in our network anomaly detection, a network operator may want to choose an upper bound on the acceptable collision probability and then minimize the storage required, i.e.,

    min over E[Ta] of Ma, subject to ξ ≤ ξ̄.

According to Equation (3-25), the solution of (3-26) is as below: Ma* = min over E[Ta] of Ma = min over E[Ta] of w[2R(ξ̄, E[Ta]) + L].

Figure 3-11. Relation among space complexity, time complexity, and collision probability. (a) Ma vs. ξ. (b) E[Ta] vs. ξ.

Figure 3-11 shows Ma vs. ξ, and E[Ta] vs. ξ, under the same setting as that in Section 3.6.1. From Figure 3-11(a), it can be observed that Ma decreases when ξ


From Figure 3-11(b), one observes that generally, E[Ta] decreases when ξ increases. This may be because the smaller E[Ta] or K, the larger the probability that all K hash functions of two different inputs return the same outputs, which is the collision probability.

From Equation (3-18), it can be observed that ξ decreases with the increase of Mv if K is fixed; in other words, Mv decreases with the increase of ξ if K is fixed. Further, from Equations (3-19) and (3-25), it can be inferred that Ma decreases with the increase of ξ if E[Ta] is fixed (note that E[Ta] ≈ K). This is shown in Figure 3-12. From the figure, it can be observed that the two lines intersect at a value of collision probability, denoted by ξc. This value is critical for the parameter setting of BFA. If a network operator has a desired collision probability greater than ξc, then it should choose E[Ta] = 4, since this parameter setting gives both smaller time complexity and smaller space complexity. We call this property `competitive optimality' since there is no trade-off between time complexity and space complexity in this case. On the other hand, if a network operator has a desired collision probability smaller than ξc, then it needs to make a trade-off between space complexity and time complexity.

Section 3.7.1 compares the performance of the BFA algorithm with that of the hash table algorithm. In Section 3.7.2, we show the performance of the complete feature extraction system, which uses the BFA algorithm.

Simulation settings. We apply the hash table algorithm and the BFA algorithm to the time series of signatures extracted from real traffic traces, which were collected by Auckland University [27]. To make a fair comparison with respect to the numerical


Figure 3-12. Space complexity vs. collision probability for fixed time complexity.

results in Section 3.6.1, we use the same 96-bit signature, i.e., SA, DA, SP, and DP, and let R = 250,000 packets/second over a duration of 80 seconds, which translates to 250,000 × 80 = 20M input signatures for each simulation. These signatures are preloaded into memory before the beginning of the simulations so that the I/O speed of the hard drive does not affect the execution time of the simulations.

For each simulation run of the hash table algorithm, we specify the memory size Mh and measure the algorithm performance in terms of the average number of hash function calculations per signature query request, denoted by T̂h, and the execution time. Due to the Law of Large Numbers, T̂h approaches the expected number of hash function calculations per signature query request, i.e., E[Th] in Equation (3-6), if we run the simulation many times with the same Mh. In our simulations, we run the hash table algorithm ten times, each time with a different set of input signatures but with the same Mh.

For each simulation run of the BFA algorithm, we specify the memory size Ma and the number of hash functions K, and measure the algorithm performance in terms of the collision frequency, denoted by ξ̂, and the execution time. The collision frequency is defined as the ratio of the number of collision occurrences in BloomFilterSearch to the


Figure 3-13 shows average processing time per query vs. memory size for the hash table algorithm, the BFA algorithm with ξ̂ = 0.1%, and the BFA algorithm with ξ̂ = 1%.

Figure 3-13. Memory size (in bits) vs. average processing time per query (in μs)

From Figure 3-13, we observe that 1) compared to the hash table algorithm, the BFA algorithm requires less memory space for the same time complexity (average processing time per query), which was predicted in Section 3.6, and 2) the BFA algorithm with ξ̂ = 1% has a better space-complexity/time-complexity trade-off than the BFA algorithm with ξ̂ = 0.1%, but at the cost of a higher collision probability, which is predicted by the numerical results in Figure 3-10.

Figure 3-14 shows average processing time per query vs. average number of hash function calculations per query. It can be observed that the average processing time per query increases linearly with the average number of hash function calculations per query. That is, the larger the average number of hash function calculations per query, the larger the average processing time per query. For this reason, instead of running simulations to obtain the time complexity (i.e., the average


Figure 3-14. Average processing time per query (in μs) vs. average number of hash function calculations per query.

processing time per query), in Section 3.6.1 we used the average number of hash function calculations per query to represent the time complexity of the hash table algorithm and the BFA algorithm.

Figure 3-15 compares the simulation results and the numerical results obtained from the analysis in Section 3.6, for both the hash table algorithm and the BFA algorithm, in terms of space complexity vs. time complexity. In Figure 3-15(a), the numerical result agrees well with the simulation result, except when the average number of hash function calculations per query is close to 1. From Equation (3-6), if the expected number of hash function calculations approaches 1, the required memory size approaches infinity; in contrast, simulations with a large Mh may not give accurate results, due to the limited memory size of a computer. This causes the big discrepancy between the numerical result and the simulation result when the average number of hash function calculations per query is close to 1. When the average number of hash function calculations per query is greater than or equal to two, it is observed that simulation always requires more memory than the numerical result. This is due to the fact that a practical hash function is not perfect. That is, entries in the hash table are not


Figure 3-15. Comparison of numerical and simulation results. (a) Hash table algorithm. (b) BFA algorithm with ε = 1%.

equally likely to be accessed. Hence, Equation (3-2) does not hold perfectly, nor does Equation (3-3). As a result, the average number of hash function calculations per query in simulation is larger than that predicted by Equation (3-6). Figure 3-15(b) shows that the numerical result agrees well with the simulation result for all the values of the average number of hash function calculations per query under our study. … Section 3.2. In contrast, the experiment in Section 3.7.1 does not involve the interaction among the three levels. … [27] as the background traffic. This dataset consists of packet header information of traffic between the Internet and Auckland University. The connection is OC-3 (155 Mb/s) for both directions.
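To give a rough feel for the space/time tradeoff behind the Bloom-filter-based approach discussed above, the sketch below implements a generic Bloom filter: membership queries cost k hash probes against a bit array far smaller than an equivalent hash table, at the price of a tunable false positive probability. This is a minimal generic sketch, not the dissertation's exact BFA data structure; all parameter values are illustrative.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash functions derived from SHA-256."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests of the item.
        for seed in range(self.k):
            h = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def query(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

With n items in m bits and k hash functions, the false positive probability is approximately (1 − e^(−kn/m))^k, which is the tradeoff the text analyzes: shrinking m saves memory but raises the collision probability.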


… [18] by randomly inserting TCP SYN packets with random source IP addresses into the background trace during specified time periods. Specifically, synchronized attacks are simulated during the 14000–16000 second and the 28000–32000 second in both traces. In addition, asynchronous attacks are launched during the 50000–52000 second in trace 1 and during the 57000–59000 second in trace 2. The average attack rate is 1% of the packet rate of the background traffic during the same period. To detect TCP SYN flood attacks, we choose … as the signature of inbound packets and … for outbound ones. Thus the two-way matching feature is the number of unmatched inbound TCP SYN packets in one time slot. The average flow rate is 2480 flows/second; therefore, we set R = 2480. We further set ε = 0.1%, K = 8, w = 8, and T = 10. Then, by solving Equation (3-18) for Mv and requiring Mv to be a power of 2, we obtain Mv = 2^15 bits. The computer used for our experiments has one 2.4 GHz CPU and 1 GB memory. For comparison, we also extract the number of inbound SYN packets in a slot.
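The two-way matching idea above can be sketched as follows: an inbound SYN is "matched" when a reply in the reverse direction later appears for the mirrored endpoint pair, and the per-slot feature is the count of SYNs left unmatched. The packet representation and the signature tuple here are illustrative assumptions (the dissertation's exact signature tuples did not survive extraction), and matching is restricted to the same slot for simplicity.

```python
def unmatched_syn_per_slot(packets, slot_len):
    """Count unmatched inbound TCP SYN packets per time slot.

    packets: iterable of (time, direction, flags, src, dst), with
    direction in {"in", "out"}. An assumed signature of (src, dst) is
    used; a SYN/ACK in the reverse direction within the same slot
    cancels the matching inbound SYN.
    """
    pending = set()   # (slot, src, dst) signatures awaiting a reply
    counts = {}       # slot index -> number of unmatched inbound SYNs
    for t, direction, flags, src, dst in sorted(packets):
        slot = int(t // slot_len)
        if direction == "in" and flags == "SYN":
            pending.add((slot, src, dst))
        elif direction == "out" and flags == "SYNACK":
            # The reply mirrors the endpoint pair of the inbound SYN.
            pending.discard((slot, dst, src))
    for slot, _, _ in pending:
        counts[slot] = counts.get(slot, 0) + 1
    return counts
```

Spoofed SYN flood packets never receive matching replies, so slots under attack show elevated unmatched counts even when the total SYN count is noisy.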


… (see Figure 3-16). From Figure 3-16, it can be observed that the features are rather noisy, especially the number of SYN packets. From Figs. 3-16(a) and 3-16(c), we can hardly distinguish the slots under the low-volume synchronized attacks from the slots without attacks (by visual inspection). In comparison, it is much easier to identify the slots under the synchronized attacks (by visual inspection) when the number of unmatched SYN packets is used as the feature (see Slots 1400–1600 and 2800–3200 in Figs. 3-16(b) and 3-16(d)).


Figure 3-16. Feature data: (a) number of SYN packets (link 1), (b) number of unmatched SYN packets (link 1), (c) number of SYN packets (link 2), and (d) number of unmatched SYN packets (link 2).


Section 4.1.1 introduces the Receiver Operating Characteristic (ROC) curve, which is used in Section 4.5 as the metric to compare the performance of different classification methods. We describe the threshold-based algorithm, the change-point algorithm, and Bayesian decision theory in Sections 4.1.2, 4.1.3, and 4.1.4, respectively. The ROC curve [28] is a typical method to quantify the performance of detection algorithms. It is a plot of detection probability vs. false alarm probability. In practice, we estimate the detection probability and the false alarm probability by the fraction of true positives and the fraction of false positives, respectively. Hence, to obtain an ROC curve, one needs to measure the following quantities: f, the number of false alarms, i.e., the number of slots in which the detection algorithm declares 'abnormal' given that no anomaly actually happens in these slots; n, the number of slots in which no anomaly happens; d, the number of slots in which the detection algorithm declares 'abnormal' given that network anomalies actually happen in these slots; and a, the number of slots in which network anomalies happen. The false alarm probability and the detection probability of the detection algorithm can then be estimated by f/n and d/a, respectively. By varying the parameters of the detection algorithms, we can obtain different pairs of false alarm probability and detection


probability [28]. In this chapter, we use the ROC curve to compare the performance of different detection algorithms. Next, we introduce some basic classification algorithms. The CUSUM algorithm has been applied to network anomaly detection in [11–13, 17]. However, existing studies only consider the change from the normal state to the abnormal state, which means that the number of false alarms can be very large after network anomalies end. To facilitate the discussion, we define the parameters used in CUSUM in Table 4-1:

Table 4-1: Parameters used in CUSUM

Parameter | Description
θ(ti) | The observed traffic feature at the end of time slot i.
μn | The expectation of θ(ti) in normal states.
μa | The expectation of θ(ti) in abnormal states.

Without loss of generality, here we assume that μn < μa.

S(ti) = max{0, S(ti−1) + θ(ti) − α}.  (4-1)

In the CUSUM algorithm, if S(ti) is smaller than a threshold HCUSUM, we declare that the network state is normal; otherwise, we declare that the state is abnormal. From the discussion above, we note that two parameters, i.e., α and HCUSUM, need to be determined. However, we cannot uniquely determine these two parameters. To overcome this problem, we introduce another parameter, the detection delay, denoted by D. According to change-point theory, we have

D ≈ HCUSUM / (μa − α).  (4-2)

From Equation (4-2), we can obtain

HCUSUM = D (μa − α).  (4-3)

The existing CUSUM-based schemes [11–13] only consider one change, i.e., from the normal state to the abnormal state. In practice, this approach may lead to a large number of false alarms after the end of attacks. To mitigate the high-false-alarm issue of the existing algorithms, which we call single-CUSUM algorithms, we develop a dual-CUSUM algorithm. In this algorithm, one CUSUM is used to detect the change from the normal to the abnormal state, while another CUSUM is responsible for detecting the change from the abnormal to the normal state. The method of setting parameters is similar for the two CUSUMs. In [11] and [12], α = (μa + μn)/2; thus only the detection delay is needed.
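The dual-CUSUM idea above can be sketched as follows: one statistic accumulates evidence of the abnormal state, and once an alarm is raised a second statistic accumulates evidence of the return to normal, so alarms stop soon after an attack ends. This is a generic sketch, not the dissertation's exact parameterization; following the text, it uses α = (μa + μn)/2 and threshold H = D·(μa − α).

```python
def dual_cusum(xs, mu_n, mu_a, D):
    """Dual-CUSUM sketch over a sequence of per-slot features xs.

    Returns a list of booleans: the declared state (True = abnormal)
    at the end of each slot.
    """
    alpha = (mu_a + mu_n) / 2.0
    H = D * (mu_a - alpha)
    s_up = 0.0      # evidence of a normal -> abnormal change
    s_down = 0.0    # evidence of an abnormal -> normal change
    abnormal = False
    states = []
    for x in xs:
        if not abnormal:
            s_up = max(0.0, s_up + x - alpha)
            if s_up >= H:
                abnormal, s_down = True, 0.0
        else:
            s_down = max(0.0, s_down + alpha - x)
            if s_down >= H:
                abnormal, s_up = False, 0.0
        states.append(abnormal)
    return states
```

A single-CUSUM detector keeps its statistic saturated after the attack ends and so keeps alarming; the second statistic here clears the alarm within roughly D slots of the attack's end.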


… Section 4.5.2). Our machine learning algorithm is based on Bayesian decision theory, which is introduced in the next section. Bayesian decision theory is a fundamental statistical approach to pattern classification [29]. It is composed of: the feature space D, which might be a multi-dimensional Euclidean space; U states of nature, H̃ = {Hu, u ∈ ZU}; the prior probability distribution P(H), H ∈ H̃; the likelihood probabilities p(θ|H), θ ∈ D, H ∈ H̃; and the loss function λ(H*, H), H*, H ∈ H̃, which describes the loss incurred for classifying an object to be of class H* when its state of nature is class H. Note that, in this chapter, P(·) represents a probability mass function (PMF) [30] and p(·) a probability density function (pdf) [30]. Given the observed feature θ of an object, Bayesian decision theory classifies it to be of class Ĥ such that

Ĥ = argmin_{H*} Σ_{H ∈ H̃} λ(H*, H) P(H|θ).  (4-4)

By the Bayesian formula [30], P(H|θ) = p(θ|H) P(H) / p(θ).


Equation (4-4) is equivalent to

Ĥ = argmin_{H*} Σ_{H ∈ H̃} λ(H*, H) p(θ|H) P(H).  (4-6)

Equation (4-6) gives the Bayesian criterion for pattern classification. A simple loss function is defined as

λ(H_{u*}, Hu) = −λ̃(u) I(u* = u), for all u*, u ∈ ZU,  (4-7)

which specifies that misclassification induces zero loss and correct classification induces negative loss, i.e., a gain. Applying Equation (4-7) to Equation (4-6), the Bayesian criterion is simplified to

û = argmax_{u ∈ ZU} λ̃(u) p(θ|Hu) P(Hu).  (4-9)

We call Equation (4-9) the maximum gain criterion. Further note that scaling all gain factors by the same factor does not change the criterion specified by Equation (4-9). Hence, we can always set λ̃(0) = 1. By tuning the other gain factors, we can generate the ROC curve for Bayesian decision theory. In this chapter, we extend Bayesian decision theory to network anomaly detection. The remainder of this chapter is organized as follows. In Section 4.2, we establish Bayesian models for network anomaly detection. Sections 4.3 and 4.4 solve two fundamental problems of the Bayesian model, i.e., the training problem and the classification problem, respectively. Section 4.5 presents our simulation results, and Section 4.6 concludes this chapter.
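The maximum gain criterion above can be illustrated directly: pick the class u maximizing λ̃(u)·p(θ|Hu)·P(Hu). The sketch below uses hypothetical univariate Gaussian likelihoods for a two-class normal/abnormal example; all numbers are illustrative, not from the dissertation's experiments.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def max_gain_classify(theta, priors, likelihoods, gains):
    """Return the class index u maximizing gains[u] * p(theta|H_u) * P(H_u)."""
    return max(range(len(priors)),
               key=lambda u: gains[u] * likelihoods[u](theta) * priors[u])
```

Raising the gain factor of the abnormal class moves the decision boundary toward declaring "abnormal" more often, which is exactly how the text proposes sweeping out an ROC curve.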


Section 4.2.1 generalizes the Bayesian model for network anomaly detection on traffic monitors and local analyzers. In Section 4.2.2, we extend this model to the whole autonomous system. Section 4.2.3 introduces the hidden Markov tree model to decrease the computation complexity of the general model defined in Section 4.2.2.

Figure 4-1. Generative process in graphical representation, in which the traffic state generates the stochastic process of traffic.

In the context of network anomaly detection, there are two states of nature of an edge router, i.e., H̃ = {H0, H1}, where H0 represents the normal state, in which case no abnormal network traffic enters the AS through that edge router; H1 represents the abnormal state, in which case abnormal network traffic enters the AS through the edge router.


We use a graphical model [31] to depict this cause-effect relationship in Figure 4-1.

Figure 4-2. Extended generative model including traffic feature vectors: (a) original model and (b) simplified model.

Denote by θ, θ ∈ D, the feature extracted from the traffic, where D is the feature space. Most importantly, in selecting the optimal features, we seek the most discriminative statistical properties of the traffic. Also note that it is possible to employ multiple features in the detection procedure, in which case θ is a vector. Since features are succinct representations of the voluminous traffic, we extend the model in Figure 4-1 to the one illustrated in Figure 4-2(a). Once θ is extracted from the traffic, we assume that θ represents the traffic well. This means that we may operate only over the lower-dimensional θ, which reduces the computational complexity. Therefore, we simplify the model in Figure 4-2(a) to that illustrated in Figure 4-2(b), where the traffic process is dismissed. Since the feature θ is measurable, it is called an observable random vector and is depicted by a rectangular node in Figure 4-2(b). The network state generating the traffic feature is to be estimated. We call it a hidden random variable and depict it by a round node in Figure 4-2(b). Now, the goal becomes to estimate the hidden state given the observable


feature. The maximum gain criterion (Equation (4-9)) specifies the estimate û to be

û = argmax_{u ∈ Z2} λ̃(u) p(θ | σ = u) P(σ = u),

where σ denotes the hidden state (see Section 4.3).

Figure 4-3. Generative independent model that describes dependencies among traffic states and traffic feature vectors.

An AS has many edge routers, each of which has multiple links. Traffic monitors deployed on links and local analyzers on edge routers extract features and make decisions independently. Therefore, the one-link model in Figure 4-2(b) is further extended to the more general model for the AS as illustrated in Figure 4-3, where N stands for the number of edge routers. The limitation of the detection model in Figure 4-3 is that it assumes the edge routers are mutually independent. This is due to the fact that traffic monitors and local analyzers only have local information about the whole AS. Although it is suitable for detecting network anomalies accompanied by a high traffic volume on a single link, it is not suitable for low-volume network anomaly detection. We address this limitation by introducing spatial correlation in the next section.


Figure 4-4. Generative dependent model that describes dependencies among edge routers.

Introducing the spatial correlation into the independent model in Figure 4-3 results in the dependent model as illustrated in Figure 4-4. The difference between the two models is that, from the viewpoint of a global analyzer, the edge routers are no longer independent. As a result, statistical dependence among the states of the edge routers is represented by the non-directional connections. Note that the independent model can be regarded as a special case of the dependent one. Also note that we still assume that the features extracted from one edge router are independent of the states of the other edge routers.


Equation (4-10). We further assume the gain factors are independent of the node index, i.e., λ̃(ui) = λ̃(ui′) whenever ui = ui′ (0 or 1), no matter whether i is equal to i′. Then the maximum gain criterion (see Equation (4-9)) for the dependent model is

û̃ = argmax_{ũ} λ̃(ũ) p(θ̃ | σ̃ = ũ) P(σ̃ = ũ) = argmax_{ũ} [ Π_{i=1..N} λ̃(ui) p(θi | σi = ui) ] P(σ̃ = ũ).  (4-15)

To solve Equation (4-15) directly, we need to exhaustively compute p(θ̃|σ̃)P(σ̃) for each possible combination of σ̃, which results in O(2^N) complexity. For a large AS, this is intractable. We introduce a hierarchical structure to reduce the computation complexity, which is the topic of the next section. The reason that the model in Figure 4-4 becomes computationally intractable is that we assume the edge routers are fully dependent. A rough understanding is that, if we break some dependence in Figure 4-4, we can reduce the computation complexity. On the other hand, we would like to account for the dependencies among as many nodes as possible to provide accurate detection. To balance these two conflicting


goals, we adopt the hidden Markov tree (HMT) model, as illustrated in Figure 4-5.

Figure 4-5. Hidden Markov tree model. For a node i, ρ(i) denotes its parent node and ν(i) denotes the set of its children nodes.

The motivation for applying the HMT model is that we assume the edge routers are not equally correlated. Instead, edge routers topologically close to each other have high mutual correlations. Based on this assumption, we cluster the edge routers according to the topology of the AS and form a tree structure, as depicted in Figure 4-5. Without loss of generality, Figure 4-5 plots a quad-tree structure, i.e., each node, except the leaf nodes, has four children. To facilitate further discussion, each node in the HMT is assigned an integer number, beginning with 0, from top to bottom. That is, node 0 is always a root node. Table 4-2 lists the notations used in the rest of the chapter for the HMT. In the HMT, each leaf node stands for an edge router. Zero-padding virtual edge routers are introduced when the number of edge routers is not a power of B. The states of these zero-padding virtual nodes are always normal and their features are always 0. Non-leaf


Table 4-2: Notations for the hidden Markov tree model

Notation | Description
σi | The random variable representing the state of node i.
θi | The random variable/vector representing the feature(s) measured at node i.
θ̃T | {θi, i ∈ T}, where T is a subtree of the HMT.
L | The number of levels of the HMT.
Ω | The set of all nodes in the HMT.
Ωl | The set of nodes at level l, l ∈ ZL, in the HMT. Specifically, Ω0 represents the set of root nodes and ΩL−1 the leaf nodes.
B | The number of children nodes of each node, except leaves. For example, B = 4 for the quad-HMT, as illustrated in Figure 4-5.
ρ(i) | The parent node of node i, where i ∉ Ω0.
ν(i) | The set of children nodes of node i, where i ∉ ΩL−1.
T↑i | The set of ancestor nodes of node i, including node i itself.
R(i) | The root node of the subtree containing node i, where i ∈ Ω.
Ti | The subtree whose root is node i, where i ∈ Ω.
Ti\j | Ti \ Tj.
T\i | TR(i) \ Ti.

nodes represent clusters of edge routers. The features of the nodes are defined in Equation (4-16):

θi = the features measured at the corresponding edge router, if i ∈ ΩL−1 (i.e., i is a leaf node); {θj, j ∈ ν(i)}, otherwise.  (4-16)


As in Sections 4.2.1 and 4.2.2, we employ the maximum gain criterion (see Equation (4-9)) to estimate the node states, i.e.,

ûi = argmax_{ui′, i′ ∈ T↑i} [ Π_{i′ ∈ T↑i \ {R(i)}} λ̃(ui′) P(σi′ = ui′ | σρ(i′) = uρ(i′), θ̃) ] λ̃(uR(i)) P(σR(i) = uR(i) | θ̃).  (4-19)

By Equation (4-19), we reduce the computation complexity from O(2^N) (see Section 4.2.2) to O(NB). This is the major advantage of introducing the HMT model. The details are given in Section 4.4. Solving Equation (4-19) requires knowledge of P(σi | σρ(i), θ̃), for all i ∈ Ω\Ω0, and P(σi = ui | θ̃), for all i ∈ Ω0. By the Bayesian formula [30],

P(σi = ui | σρ(i) = uρ(i), θ̃) = P(σi = ui, σρ(i) = uρ(i) | θ̃) / P(σρ(i) = uρ(i) | θ̃),

so solving Equation (4-19) translates to estimating

P(σi, σρ(i) | θ̃), for all i ∈ Ω\Ω0,  (4-21)

and the corresponding root posteriors.  (4-22)

Computing Equations (4-21) and (4-22) in closed form is difficult. We propose a belief propagation (BP) algorithm, described in Section 4.3, to estimate them efficiently given knowledge of: the prior probabilities P(σi = 0), i ∈ Ω0; the likelihood p(θi | σi), i ∈ Ω; and the transition probabilities P(σi | σρ(i)), i ∈ Ω\Ω0.


We set the prior probabilities to 1/2, i ∈ Ω0. That is, the states of the root nodes are equally likely to be normal or abnormal. Other parameters, such as the likelihood and the transition probabilities, are estimated from training data. This is covered in Section 4.3. After that, classification using the maximum gain criterion is described in Section 4.4. In Section 4.3.1, we describe the estimation of the likelihood p(θi | σi), i ∈ Ω. The estimation of the transition probabilities P(σi | σρ(i)), i ∈ Ω\Ω0 is presented in Section 4.3.2. The Gaussian distribution [30] is widely employed in many applications. The pdf of a d-dimensional multivariate Gaussian distribution with mean vector μ and covariance matrix Σ is

N(x; μ, Σ) ≜ (2π)^(−d/2) |Σ|^(−1/2) exp( −(1/2)(x − μ)^T Σ^(−1) (x − μ) ).

Figure 4-6 plots the pdf of the univariate Gaussian distribution N(x; 0, 1). It is observed that the Gaussian distribution is a unimodal distribution [30], i.e., its pdf only has one peak.
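The Gaussian density above can be evaluated directly. The sketch below is a pure-Python transcription of the formula, restricted to dimensions d ≤ 2 so that the determinant and inverse of Σ can be written out by hand; it is an illustration, not the dissertation's implementation.

```python
import math

def mvn_pdf(x, mu, cov):
    """pdf of N(x; mu, Sigma) for dimension d <= 2, matching
    (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^(-1) (x-mu))."""
    d = len(x)
    if d == 1:
        det, inv = cov[0][0], [[1.0 / cov[0][0]]]
    else:  # d == 2: closed-form determinant and inverse
        det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
        inv = [[cov[1][1] / det, -cov[0][1] / det],
               [-cov[1][0] / det, cov[0][0] / det]]
    diff = [x[i] - mu[i] for i in range(d)]
    quad = sum(diff[i] * inv[i][j] * diff[j] for i in range(d) for j in range(d))
    return math.exp(-0.5 * quad) / ((2 * math.pi) ** (d / 2) * math.sqrt(det))
```

For example, the standard univariate density at the origin is 1/sqrt(2π) ≈ 0.3989, which is the peak visible in Figure 4-6.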


Figure 4-6. Probability density function of the univariate Gaussian distribution N(x; 0, 1).

Figure 4-7. Histogram of the two-way matching features measured at a real network during network anomalies.

However, multiple peaks may exist in the empirical distribution of θi given σi. For example, Figure 4-7 shows the histogram of the two-way matching features measured in a real network during DDoS attacks. It has two peaks. Hence, the unimodal Gaussian distribution is not suitable. In this chapter, we adopt the Gaussian mixture model (GMM) to model the likelihood distribution.


By Equation (4-26), the likelihood estimation translates to estimating the GMM parameters αi,u(g), μi,u(g), and Σi,u(g), for all i ∈ Ω, u ∈ {0, 1}, and g ∈ {1, …, G}, with the constraint

Σ_{g=1}^{G} αi,u(g) = 1.
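A standard way to fit these parameters is expectation-maximization. The sketch below fits a univariate G-component GMM; the initialization (quantile-spread means), fixed iteration count, and one-dimensional feature are our own simplifying choices, not the dissertation's exact procedure.

```python
import math

def _normal(x, m, v):
    """Univariate Gaussian density with mean m and variance v."""
    return math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)

def em_gmm_1d(data, G, iters=200):
    """EM for a G-component univariate Gaussian mixture.

    Returns (weights, means, variances) with the weights summing to 1.
    """
    n = len(data)
    srt = sorted(data)
    w = [1.0 / G] * G
    mu = [srt[int((g + 0.5) * n / G)] for g in range(G)]   # spread initial means
    mean_all = sum(data) / n
    overall = sum((x - mean_all) ** 2 for x in data) / n
    var = [max(overall, 1e-6)] * G
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        resp = []
        for x in data:
            dens = [w[g] * _normal(x, mu[g], var[g]) for g in range(G)]
            s = sum(dens) or 1e-300
            resp.append([d / s for d in dens])
        # M-step: re-estimate weights, means, and variances
        for g in range(G):
            ng = sum(r[g] for r in resp) or 1e-300
            w[g] = ng / n
            mu[g] = sum(r[g] * x for r, x in zip(resp, data)) / ng
            var[g] = max(1e-6, sum(r[g] * (x - mu[g]) ** 2 for r, x in zip(resp, data)) / ng)
    return w, mu, var
```

On the bimodal feature histograms of Figure 4-7, a two-component fit separates the normal-traffic mode from the attack mode, which a single Gaussian cannot represent.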


Equation (4-26). Unfortunately, Nechyba [32] showed that the ML method for a G-state GMM with G > 1 has no closed-form solution. In addition, a G-state GMM has a 3G-dimensional continuous parameter space. Exhaustively searching for a numerical ML estimate in such a parameter space is computationally intractable.

1. Input: θ(k)i,u, k ∈ {1, …, Ku}; α(0)i,u(g), μ(0)i,u(g), and Σ(0)i,u(g), g ∈ {1, …, G}.
2. Output: α̂i,u(g), μ̂i,u(g), and Σ̂i,u(g), g ∈ {1, …, G}.
3. j ← 0.
4. repeat
5. …
6. …
7. Σ(j+1)i,u(g) = Σ_{k=1}^{Ku} #(j)i,u(g, θ(k)i,u) ‖θ(k)i,u − μ(j+1)i,u(g)‖² / …
8. …
9. …
10. until convergence
11. α̂i,u(g) = α(j)i,u(g), μ̂i,u(g) = μ(j)i,u(g), Σ̂i,u(g) = Σ(j)i,u(g), for all g ∈ {1, …, G}.

Figure 4-8. The EM algorithm for estimating p(θi | σi = u), i ∈ Ω, u ∈ {0, 1}.

A practical solution to this issue is the expectation-maximization (EM) algorithm [29, 30]. Nechyba [32] derived the EM algorithm for GMM in detail. Figure 4-8 illustrates the algorithm. The EM algorithm requires initial values for the parameters, denoted by α(0)i,u(g), μ(0)i,u(g), and Σ(0)i,u(g) in Figure 4-8. At each iteration j, the EM algorithm uses the parameters estimated at iteration j−1 to calculate new estimates. Although both EM and ML


… Equation (4-28). As a result, the EM algorithm converges much faster than the numerical ML method. However, the disadvantage of the EM algorithm is that it converges to a local maximum rather than the global one. Specifically, the initial values of the parameters determine the local maximum to which the EM algorithm converges. In practice, we have prior knowledge of the network features, which helps to choose the initial values of the parameters. So far, we have presented schemes to estimate the likelihood of the HMT. In the next section, we estimate the transition probabilities. Figures 4-9 and 4-10 show the pseudo-code for transition probability estimation. In the next two sections, we explain the two figures. Figure 4-9 shows the pseudo-code to estimate the transition probabilities. The function TransProbEstimate takes two sets of arguments: 1. the likelihood estimated in Section 4.3.1, {p(θi | σi), i ∈ Ω}; 2. the training features, {θ̃(k), k ∈ {1, …, K}}. It returns the estimate of the transition probabilities, i.e., P(σi | σρ(i)), for all i ∈ Ω\Ω0. Before the iterations, we set the initial transition probabilities to 1/2 at line 5 of Figure 4-9. This is equivalent to assuming the normal state and the abnormal state to be initially


1. function TransProbEstimate(…)
2. Argument 1: likelihood, {p(θi | σi), i ∈ Ω}.
3. Argument 2: training data, {θ̃(k), k ∈ {1, …, K}}.
4. Return: transition probability estimate, P(σi = ui | σρ(i)), i ∈ Ω\Ω0.
5. P(0)(σi = u | σρ(i) = u′) ← 1/2, for all i ∈ Ω\Ω0 and u, u′ ∈ {0, 1}.
6. j ← 0.
7. repeat
8. for k ← 1 to K
9. call BP (Figure 4-10) to estimate the posterior probabilities for θ̃(k).
10. end for
11. for all i ∈ Ω\Ω0
12. update P(j+1)(σi | σρ(i)) by Equation (4-30).
13. end for
14. j ← j + 1.
15. until convergence
16. return the estimates obtained at the last iteration.

Figure 4-9. Iteratively estimate the transition probabilities.

equally likely. Then, at each iteration, we update the estimate of the transition probabilities until it converges. The update procedure is the following. First, we iterate over the training feature set. For each feature, we use the BP algorithm (see Figure 4-10) to estimate the posterior probabilities given that feature. The details of the BP algorithm are discussed in the next section. Three sets of arguments are passed to the BP algorithm: 1. the estimate of the transition probabilities obtained at the previous iteration, P(j)(σi | σρ(i)), i ∈ Ω\Ω0; 2. the likelihood, {p(θi | σi), i ∈ Ω}, which is the argument passed to the function TransProbEstimate;


1. function BP(…)
2. Argument 1: transition probabilities, P(σi | σρ(i)), i ∈ Ω\Ω0.
3. Argument 2: likelihood, {p(θi | σi), i ∈ Ω}.
4. Argument 3: training feature, θ̃.
5. Return: posterior probabilities, {P(σρ(i) | θ̃), P(σi, σρ(i) | θ̃), i ∈ Ω\Ω0}.
6. αi(0) = αi(1) = 1/2, for all i ∈ Ω0 (i.e., the roots).
7. βi(u) = p(θi | σi = u), u ∈ {0, 1}, for all i ∈ ΩL−1 (i.e., the leaves).
8. Top-down pass, i.e., from root to leaf:
9. for l ← 1, …, L−1
10. for all i ∈ Ωl and u ∈ {0, 1}, let
11. αi(u) = Σ_{u′ ∈ {0,1}} P(σi = u | σρ(i) = u′) αρ(i)(u′) p(θρ(i) | σρ(i) = u′)  (4-31)
12. end for
13. end for
14. Bottom-up pass, i.e., from leaf to root:
15. for l ← L−2, …, 0
16. for all i ∈ Ωl and u ∈ {0, 1}, let
17. βi(u) = … (updated by Equation (4-32))
18. end for
19. end for
20. compute the posterior probabilities by Equations (4-33) and (4-34):  (4-33)
21. … [Σ_{u″} αi(u″) βi(u″)] … Σ_{u″} P(σi = u″ | σρ(i) = u′) βi(u″) …  (4-34)

Figure 4-10. Belief propagation algorithm.


… Equation (4-30) for all i ∈ Ω\Ω0, and step to the next iteration. When the estimates converge, the iteration stops and the function TransProbEstimate returns the estimates obtained at the last iteration. The validity of Equation (4-30) is shown in the following. For all i ∈ Ω\Ω0,

P(σi | σρ(i)) = ∫ P(σi | σρ(i), θ̃ = ϑ̃) p(ϑ̃) dϑ̃ = E_{θ̃}[ P(σi | σρ(i), θ̃) ].  (4-35)

By the law of large numbers [30], we estimate E_{θ̃}[P(σi | σρ(i), θ̃)] by the empirical average (1/K) Σ_{k=1}^{K} P(σi | σρ(i), θ̃(k)).  (4-36)

Combining Equations (4-35) and (4-36), we obtain Equation (4-30). Next, we discuss the belief propagation algorithm, which is called by the function TransProbEstimate. The belief propagation algorithm [33–35], also known as the sum-product algorithm, e.g., [36–39], is an important method for computing approximate marginal distributions. In this chapter, we apply the BP algorithm to estimating the posterior probabilities (Figure 4-10).


Function BP initializes αi(0) = 1/2 and αi(1) = 1/2 for all root nodes i (see line 6 of Figure 4-10), because we assume the root nodes are equally likely to be in the normal or abnormal state. When i ∈ ΩL−1, i.e., for leaf nodes, θ̃Ti = θi; therefore βi(u) = p(θi | σi = u), u ∈ {0, 1} (see line 7 of Figure 4-10). Then it propagates the beliefs at the tree roots and leaves to the other nodes in the top-down pass and the bottom-up pass, respectively. During the top-down pass, function BP iterates from root to leaf. At each level l, we update the transitory variables αi(u) by Equation (4-31), which is proven in Appendix A.1. Note that αρ(i)(u′) in Equation (4-31) is obtained in the previous iteration, i.e., the iteration at level l−1. During the bottom-up pass, function BP iterates from leaf to root. At each level l, we update the transitory variables βi(u) by Equation (4-32), which is proven in Appendix A.2. Also note that βj(u′) is obtained in the previous iteration, i.e., the iteration at level l+1.
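To make concrete what the BP passes compute, the sketch below evaluates the exact posteriors P(σi | θ̃) on a tiny tree by brute-force enumeration of every state assignment; this is the exponential-cost computation that the two-pass BP algorithm avoids. All tree shapes, probabilities, and likelihood values are illustrative assumptions.

```python
from itertools import product

def tree_posteriors(parent, prior_root, trans, lik):
    """Exact per-node posteriors on a small tree by enumeration.

    parent: dict node -> parent (a root maps to None)
    prior_root: [P(root = 0), P(root = 1)]
    trans: trans[u_parent][u_child] = P(child = u_child | parent = u_parent)
    lik: dict node -> [p(feature | state = 0), p(feature | state = 1)]
    """
    nodes = list(parent)
    post = {i: [0.0, 0.0] for i in nodes}
    total = 0.0
    for assign in product([0, 1], repeat=len(nodes)):
        u = dict(zip(nodes, assign))
        p = 1.0
        for i in nodes:
            p *= prior_root[u[i]] if parent[i] is None else trans[u[parent[i]]][u[i]]
            p *= lik[i][u[i]]
        total += p
        for i in nodes:
            post[i][u[i]] += p
    return {i: [v / total for v in post[i]] for i in nodes}
```

On a tree with N nodes this costs O(2^N); BP obtains the same marginals with one top-down and one bottom-up sweep, which is the point of the two passes above.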


… Equations (4-33) and (4-34). These two equations are proven in Appendices A.3 and A.4, respectively. The estimated posterior probabilities are used in the function TransProbEstimate to update the estimates of the transition probabilities (see Figure 4-9). So far, we have established the HMT model and described approaches to estimate its model parameters from training data. Next, we present network anomaly detection approaches using the fully determined HMT model. The classification follows the maximum gain criterion in Equation (4-19). We rewrite the criterion as Equation (4-39),

ûi = argmax_{ui′, i′ ∈ T↑i} [ Π_{i′ ∈ T↑i \ {R(i)}} λ̃(ui′) P(σi′ = ui′ | σρ(i′) = uρ(i′), θ̃) ] λ̃(uR(i)) P(σR(i) = uR(i) | θ̃).  (4-39)

Figure 4-11 shows the pseudo-code for the classification algorithm. The function ViterbiDecodeHMT takes three arguments. The first two, the transition probabilities and the likelihood, are estimated during the training phase. The last one is the extracted features, based on which we perform anomaly detection. It returns the estimates of the node states. Among them, we are only interested in the states of the leaf


1. function ViterbiDecodeHMT(…)
2. Argument 1: transition probabilities, P(σi | σρ(i)), i ∈ Ω\Ω0.
3. Argument 2: likelihood, {p(θi | σi), i ∈ Ω}.
4. Argument 3: feature, θ̃.
5. Return: estimated node states, {ûi, i ∈ Ω}.
6. …
7. …
8. for all i ∈ Ω0
9. ûi = argmax_{ui} P(σi = ui | θ̃)
10. end for
11. for l ← 1, …, L−1, for all i ∈ Ωl
12. ûi = argmax_{ui} λ̃(ui) P(σi = ui | σρ(i) = ûρ(i), θ̃) [ Π_{i′ ∈ T↑ρ(i) \ {R(i)}} λ̃(ûi′) P(σi′ = ûi′ | σρ(i′) = ûρ(i′), θ̃) ] λ̃(ûR(i)) P(σR(i) = ûR(i) | θ̃)
13. end for

Figure 4-11. Viterbi algorithm for HMT decoding.

By the Bayesian formula, the terms in Equation (4-39) can be computed by

P(σi′ = ui′ | σρ(i′) = uρ(i′), θ̃) = P(σi′ = ui′, σρ(i′) = uρ(i′) | θ̃) / P(σρ(i′) = uρ(i′) | θ̃).  (4-43)


We use Equations (4-40) and (4-41) in Figure 4-11 to solve Equation (4-43) in a way similar to the HMT transition probability estimation (see Figure 4-10). Having obtained solutions to Equation (4-43) for all nodes, we can solve Equation (4-39). A brute-force solution to Equation (4-39) is to exhaustively compute

[ Π_{i′ ∈ T↑i \ {R(i)}} λ̃(ui′) P(σi′ = ui′ | σρ(i′) = uρ(i′), θ̃) ] λ̃(uR(i)) P(σR(i) = uR(i) | θ̃)

for every state combination, like the dependent model in Figure 4-4, whose complexity is O(2^N). In this chapter, we apply the Viterbi algorithm [40–42] to solve Equation (4-39) in an iterative manner, reducing the computational complexity to O(NB). The motivation of the Viterbi algorithm is the following. It iterates from the top level of an HMT to the bottom level. At each iteration, it estimates the node states at that level in a way that, when combined with the states estimated at the upper levels, "best" explains the observed features. That is, at each iteration, the Viterbi algorithm always selects the local maximum.
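The level-by-level greedy decoding described above can be sketched as follows: fix the root states from their posteriors, then at each lower level pick the state maximizing the gain-weighted conditional posterior given the already-fixed parent state. This is a simplified illustration of the decoding structure, not the dissertation's full criterion; all probability tables are assumed inputs.

```python
def greedy_decode(levels, parent, root_post, cond_post, gain):
    """Top-down, level-by-level state decoding on a tree.

    levels: list of node lists, top level first
    parent: dict node -> parent node (for non-root nodes)
    root_post: dict root -> [P(state=0 | features), P(state=1 | features)]
    cond_post: dict node -> table t with t[u_parent][u] standing in for
               P(state = u | parent state, features)
    gain: [gain(0), gain(1)] tuning factors
    """
    u_hat = {}
    for level, nodes in enumerate(levels):
        for i in nodes:
            if level == 0:
                u_hat[i] = max((0, 1), key=lambda u: gain[u] * root_post[i][u])
            else:
                up = u_hat[parent[i]]   # parent state fixed at an earlier stage
                u_hat[i] = max((0, 1), key=lambda u: gain[u] * cond_post[i][up][u])
    return u_hat
```

Each node is examined once per level sweep, so the cost grows linearly with the number of nodes instead of exponentially with it, mirroring the O(2^N)-to-linear reduction claimed in the text.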


Although it is suboptimal for solving Equation (4-39), the Viterbi algorithm is efficient and has good empirical performance. The computational complexity of the Viterbi algorithm is O(NB), much better than that of the dependent model illustrated in Figure 4-4, whose complexity is O(2^N). The improvement results from the fact that the Viterbi algorithm does not exhaustively test all possible node-state combinations. Instead, we decompose the decoding problem into multiple stages, each of which decodes the node states at one level of the HMT. At level l, we estimate the node states of level l, i.e., {σi, i ∈ Ωl}, based on the results obtained during the previous stages, {σi, i ∈ Ωl′, l′ ∈ {0, …, l−1}}. In such a procedure, each node is accessed only twice. Hence the complexity is linear in the number of nodes, which is (B^L − 1)/(B − 1) for a full B-ary HMT with L levels. The experiment network is shown in Figure 4-12. At each edge router, two monitors are placed to measure the inbound and outbound traffic


Figure 4-12. Experiment network.

between a subnet and the victim network, respectively. For convenience, we refer to the route between an edge router and the victim subnet as a link. We use the same background traffic as in Section 3.7.2 [27]. Since we do not have real data traces obtained from 16 different links, we use the real traffic trace measured on one link (between the Internet and Auckland University) on 16 different days to create traffic traces for 16 different links. For the abnormal traffic, we randomly generate TCP SYN flood attacks into the background trace. Specifically, we generate several attack scenarios. For each scenario, we randomly select the abnormal links and the attack durations. The attack traffic on each link is generated in the same way as in Section 3.7.2. That is, we randomly insert TCP SYN packets with random source IP addresses into the background traffic of that link. The average packet rate of the TCP SYN attack traffic on each selected link is 1% of the total packet rate on the link. For each attack scenario, the attacks on each of the selected links are launched during almost the same period, to simulate synchronized DDoS attacks. Since the attack traffic on each link is low (just 1%), we effectively simulate low-volume attack traffic.


… Chapter 3, i.e., the number of unmatched inbound SYN packets in one time slot. The parameter setting of the two-way matching feature extraction is the same as in Section 3.7.2. For convenience, we summarize the parameters in Table 4-3.

Table 4-3: Parameter setting of feature extraction for network anomaly detection

Notation | Description
…

For comparison purposes, we also measure the number of SYN packets and the SYN/FIN ratio [11] in a slot.

Table 4-4: Performance of different schemes

Feature | Detection algorithm | Detection probability | False alarm probability
SYN/FIN ratio | CUSUM | 0.174 | 0.129
SYN | CUSUM | 0.52 | 0.129
SYN | Machine learning | 0.656 | 0.123
Unmatched SYN | CUSUM | 0.690 | 0.130
Unmatched SYN | Machine learning | 0.973 | 0.115

Table 4-4 compares the performance of the different schemes, where the benchmark is the scheme in [11], i.e., the CUSUM scheme with the SYN/FIN ratio as the feature; for the benchmark scheme, we use the same parameter setting as in [11]. We compare the benchmark with CUSUM and our machine learning algorithm under different features. To make a fair comparison, we make the false alarm probability of each scheme almost the same and compare the detection probabilities. From Table 4-4, it can be seen that the benchmark scheme ('SYN/FIN ratio' + CUSUM) performs very poorly in detecting low-volume DDoS attacks. In contrast, a CUSUM algorithm with the number of SYN


Figure 4-13. Performance of the threshold-based and machine learning algorithms with different feature data.

Figure 4-13 compares the ROC curves of the threshold-based scheme described in Section 4.1.2 and our machine learning algorithm under two different features, i.e., the number of SYN packets (denoted by 'SYN') and the number of unmatched SYN packets (denoted by 'UM-SYN'). We observe that, for the same detection algorithm, using the number of unmatched SYN packets significantly improves the ROC performance, compared to using the number of SYN packets. In other words, given the same false alarm probability, the detection probability is much higher when using the number of unmatched SYN packets as the feature. Another important observation from Figure 4-13 is that, given the same feature data, our machine learning algorithm can significantly improve the ROC, compared to the threshold-based scheme; e.g., for the same false alarm probability of 0.05, our machine learning algorithm achieves a detection probability of 0.93, while the threshold-based scheme only achieves a detection probability of 0.72. This is due to the fact that our


Figure 4-14. Performance of four detection algorithms.

In Figure 4-14, we compare the ROC performance of four detection algorithms (the threshold-based, the single-CUSUM, the dual-CUSUM described in Section 4.1.3, and our machine learning algorithm) under the same feature, i.e., the number of unmatched SYN packets. For the single-CUSUM and the dual-CUSUM algorithms, the detection delay D is chosen from 1 to 10 slots, and the parameter αi of link i is set to the average of the mean feature values of link i in the attack and normal states (cf. Section 4.1.3). The ROC performance (Figure 4-14) of our machine learning algorithm is the best among all the algorithms. We also see that the dual-CUSUM outperforms the simple threshold-based algorithm, and the single-CUSUM algorithm has the worst ROC performance.


… [27]. We tested our machine learning algorithms for a large IP address space, i.e., the IP address space can be the whole IP address space of the Internet.


In Chapters 5 and 6, we focus on the second part of our research, i.e., network centric traffic classification. This chapter motivates the significance of this issue, points out its challenges, and shows the weaknesses of existing solutions.


… [43, 44]. In principle, the TCP and UDP server port numbers can be used to identify the higher-layer application, by simply identifying the server port and mapping this port to an application using the Internet Assigned Numbers Authority (IANA) list of registered ports [45]. However, port-based application classification has limitations due to the emergence of new applications that no longer use fixed, predictable port numbers. For example, non-privileged users often have to use ports above 1024 to circumvent operating-system access control restrictions; common applications like FTP allow the negotiation of unknown and unpredictable server ports to be used for the data transfer; and proprietary applications may deliberately try to hide their existence or bypass port-based filters by using standard ports. For example, server port 80 is being used by a large variety of non-web applications to circumvent firewalls that do not filter port-80 traffic; other applications (e.g., Skype) tend to use dynamic ports. A more reliable technique involves stateful reconstruction of session and application information from packet contents [46–48]. Although this avoids reliance on fixed port numbers, it imposes significant complexity and processing load on the classification device,


… [49, 50] have taken this statistical approach to classify traffic into P2P, multimedia streaming, interactive applications, and bulk transfer applications. Unfortunately, although these papers addressed the problem of distinguishing multimedia traffic from other applications, they have not addressed the problem of distinguishing voice traffic from video traffic. One problem is to separate streaming traffic from other applications; a different problem is to detect and correctly classify voice and video and clearly separate the two applications from each other. In the extreme case, voice and/or video data streams might even be bundled together in the same exact flow with other applications. These problems are common for many applications like Skype, Gtalk and MSN that allow users to mix voice and/or video streams with chat and/or file transfer traffic in the same exact 5-tuple flow, defined as (source IP, destination IP, source port, destination port, transport protocol). In such cases, one flow may carry




In Section 5.2, we introduce the related work in the area of pattern classification methodologies. Section 5.3 describes the weaknesses of metrics previously used by other works when applied in our context and highlights the new traffic features that constitute the foundation of our approach. Section 5.4 summarizes this chapter. In [49], the authors proposed the combination of the average packet size within a flow and the inter-arrival variability metric, defined, e.g., as the ratio of the variance to the average of the inter-arrival times of packets within a flow, as a powerful metric to define fairly distinct boundaries for three groups of applications: (i) bulk data transfer like FTP, (ii) interactive like HTTP, and (iii) streaming like voice, video, gaming, etc. Several classification techniques, like nearest-neighbor and K-nearest-neighbor, were then tested using the above traffic features. Although this preliminary study proved that the approach of pattern classification has great potential for proper application classification, it also shows that much more work still remains, e.g., exploiting alternative traffic features and classification techniques. Moreover, although the features extracted are simple and feasible to implement on-the-fly, the learning algorithm is complex and the resulting boundaries among the three families of applications are heavily non-linear and time-dependent. Similar to Ref. [49], Karagiannis et al. [50] proposed a novel approach, called BLINC, that exploits network-related properties and characteristics. The novelty of this approach


… Although BLINC [50] is interesting from a conceptual perspective and proved to perform reasonably well for a variety of different applications, it is still prone to large estimation errors for streaming applications. Moreover, its high complexity and large memory consumption remain an open issue for high-speed application classification. Other papers using pattern classification have appeared lately in the literature, but they focused more on specific application detection, like peer-to-peer [46] and chat [51]. More importantly, to the best of our knowledge, none of the existing work has been able to separate voice traffic from video traffic, or to indicate the presence of voice traffic or video traffic in a hybrid flow that contains traffic from both voice/video and other applications such as file transfer. In Figure 5-1 we show the results obtained when using the combination of average packet size and the inter-arrival variability metric proposed by Roughan et


al. [49]. Although this metric performed very well in separating streaming, file transfer, transactional and interactive applications, it performs poorly when used to further separate applications within the same family, such as voice, video, or voice and video mixed with other applications like file transfer, i.e., hybrid flows. Figure 5-1 clearly highlights the complete absence of any distinct boundary and the heavy overlapping between voice and video traffic. The reasons why the pair (average packet size, inter-arrival variability metric) cannot separate video from voice are as follows. First, the packet size for video/voice is controlled by the packetization strategy of the video/voice application designer [52]; hence, a video application may produce an average packet size similar to that for voice (Figure 5-1). Second, random end-to-end delay in the Internet causes large variations in the inter-arrival variability metric for different video/voice flows.

Figure 5-1. Average packet size versus inter-arrival variability metric for 5 applications: voice, video, file transfer, mix of file transfer with voice and video.

Table 5-1: Commonly used speech codecs and their specifications

Standard | Codec Method | Inter-Packet Delay (ms)
G.711 [53] | PCM | 0.125
G.726 [54] | ADPCM | 0.125
G.728 [55] | LD-CELP | 0.625
G.729 [56] | CS-ACELP | 10
G.729A [56] | CS-ACELP | 10
G.723.1 [57] | MP-MLQ | 30
G.723.1 [57] | ACELP | 30
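The feature pair discussed above can be computed per flow as follows; this is a minimal sketch of the metric attributed to Roughan et al. (variance of inter-arrival times divided by their mean), with timestamps and packet sizes as assumed inputs.

```python
def flow_features(arrival_times, packet_sizes):
    """Compute (average packet size, inter-arrival variability) for a flow.

    variability = Var(inter-arrival times) / Mean(inter-arrival times).
    Requires at least two packets (one inter-arrival gap).
    """
    gaps = [t2 - t1 for t1, t2 in zip(arrival_times, arrival_times[1:])]
    mean_gap = sum(gaps) / len(gaps)
    var_gap = sum((g - mean_gap) ** 2 for g in gaps) / len(gaps)
    avg_size = sum(packet_sizes) / len(packet_sizes)
    return avg_size, var_gap / mean_gap
```

A constant-rate voice flow yields a variability near zero; the text's point is that network jitter inflates this quantity unpredictably, which is why the pair fails to separate voice from video.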


Figure 5-2. Inter-arrival time distribution for voice and video traffic.

Table 5-1 lists some speech codec standards and the associated inter-packet delays (IPDs) that are required for a correct implementation of those protocols. Packets leaving the transmitter might traverse a large number of links in the Internet before reaching the proper destination. Along this


Figure 5-3. Packet size distribution for voice and video traffic.

traveling, packets might experience random delay due to congestion at routers' interfaces. As a consequence, the inter-arrival times between packets at the receiver might be severely affected by random noise, e.g., jitter, and thus this metric might not represent a reliable candidate feature for a robust classification methodology. Although this problem does exist, we note that the inter-arrival times between packets within the same flow still show a strong regularity when studied in the frequency domain at the receiver side. As an example, in Figure 5-2, we show the distributions of the inter-arrival packet times at the receiver side when using Skype to transmit, respectively, voice only and video only between two hosts, one located at University A on the east coast and the other at University B on the west coast of the USA. As we can see, the distributions for both video and voice are centered around 0.03 second. On the other hand, Figure 5-3 shows the distributions of the packet sizes for both voice and video. As can be seen, both voice and video are characterized by a similar distribution for packet sizes less than 200 bytes. Although video traffic generates packet sizes larger than 200 bytes, these larger packets cannot be reliably used to separate video from voice, since other applications such as chat or file transfer might also generate such larger packets. As a consequence, packet inter-arrival time or packet size is a weak feature when considered in the temporal domain.
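The frequency-domain view invoked above can be sketched with a naive periodogram: a near-constant inter-packet gap shows up as a sharp peak in the power spectral density even when individual gaps are jittered. This is a generic illustration (a direct DFT, not the dissertation's exact estimator), and the sampling parameters are assumptions.

```python
import cmath

def periodogram(x, fs):
    """Naive DFT periodogram of a real sequence sampled at rate fs.

    Returns (frequency, power) pairs for the positive-frequency bins,
    with the DC (mean) component removed first.
    """
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    out = []
    for k in range(1, n // 2):
        X = sum(v * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, v in enumerate(centered))
        out.append((k * fs / n, abs(X) ** 2 / n))
    return out
```

For a 30 ms inter-packet delay, the packet-arrival process sampled at the receiver would concentrate power near 1/0.03 s ≈ 33 Hz, matching the kind of spectral peak the following figures discuss.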


Figures 5-4 and 5-5 show such power spectral densities, respectively, for voice and video. We can see that some regularity exists for both voice and video across different traces, although the regularity is not very strong. This result holds true for all experiments conducted when transmitting Skype voice and video packets over the Internet from University A to University B.

Figure 5-4. Power spectral density of two sequences/traces of time-varying inter-arrival times for voice traffic

Video streams typically contain two types of frames [58], i.e., Intra frames (I-frames) and Predicted frames (P-frames). An I-frame is a frame coded without reference to any frame except itself. It serves as the starting point for a decoder to reconstruct the video stream. A P-frame may contain both image data and motion vector displacements


Figure 5-5. Power spectral density of two sequences of time-varying inter-arrival times for video traffic

and/or combinations of the two. Its decoding requires reference to previously decoded frames. Packets containing I-frames are larger than those containing P-frames. Usually, the number of P-frames between two consecutive I-frames is constant. Hence, one can observe a strong periodic variation of packet size due to the interleaving of I-frames and P-frames composing video data streams. Voice streams exhibit a similar phenomenon if Linear Prediction Coding (LPC), e.g., a code-excited linear prediction (CELP) voice coder, is employed. As an example, Figs. 5-6 and 5-7 show the power spectral density of voice and video packet sizes, respectively.

Figures 5-8 and 5-9 show how the regularities hidden in voice and video data streams can be amplified by combining the two features together in one single stochastic process, which will be described later in this chapter. Note how the two important frequencies are amplified and clearly visible in the PSD plots. The reason why there is a peak in the PSD for voice (see Figure 5-8) is that voice applications usually produce a close-to-constant packet rate, due to the constant inter-packet delay of the widely used speech codecs listed in Table 5-1; e.g., the peak at 33 Hz in Figure 5-8 corresponds to the 30-ms inter-packet


Figure 5-6. Power spectral density of two sequences of discrete-time packet sizes for voice traffic

Figure 5-7. Power spectral density of two sequences of discrete-time packet sizes for video traffic

delay. Compared to voice, video applications have a flatter PSD. The reason is as follows. The number of bits in an I-frame depends on the texture of the image (e.g., the I-frame of a blackboard image produces far fewer bits than that of a complicated flower image), resulting in a large range in the number of packets per I-frame, e.g., from 1 packet to a few hundred packets produced by an I-frame. The frame rate is usually constant (e.g., 30 frames/s is a standard rate in the USA), i.e., a frame will be generated every


Figure 5-8. Power spectral density of two sequences of continuous-time packet sizes for voice traffic

Figure 5-9. Power spectral density of two sequences of continuous-time packet sizes for video traffic


The simple metrics (Section 5.2) are not able to accurately distinguish between voice and video flows. By analyzing the properties of voice and video data streams, our intuition is to exploit the strong regularities residing in packet inter-arrival times and the associated packet sizes. In this chapter, we analyze four types of metrics that exploit these regularities:
1. packet inter-arrival time and packet size in the time domain;
2. packet inter-arrival time in the frequency domain;
3. packet size in the frequency domain;
4. the combination of packet inter-arrival time and packet size in the frequency domain.
By analyzing the properties of the four types of metrics and illustrating them with figures, we show that combining packet inter-arrival time and packet size in one single stochastic process generates a distinctive feature for classifying voice and video streams.


VOVClassifier System Architecture

We first present the overall architecture of our system (VOVClassifier, Figure 6-1) and provide a high-level description of the functionality of each of its modules. Generally speaking, VOVClassifier is an automated learning system that takes packet headers from raw packets collected off the wire, organizes them into transport-layer network flows, and processes them in real time to search for voice and video applications. VOVClassifier is first trained on voice and video data streams separately before being used in real time for classification. During the training phase, VOVClassifier extracts feature vectors, each of which is a summary (also known as a statistic) of the raw traffic bit stream, and maintains their statistics in memory. During the online classification phase, a classifier makes decisions by measuring similarity metrics between the feature vector extracted from on-the-fly network traffic and the feature vectors extracted from training data. Flows with high values of the similarity metric with the voice (or video) features are classified as voice (or video); data streams with low values of similarity with voice/video are classified as other applications. In general, VOVClassifier is composed of four major modules that operate in cascade: (i) Flow Summary Generator (FSG), (ii) Feature Extractor (FE) via Power Spectral


In Section 5.3 we showed that voice and video data streams are characterized by packets that are very small in size. As a consequence, this module filters out all packets whose size is smaller than a pre-specified threshold P. The processed flow is then internally described in terms of the packet sizes and inter-arrival times between packets within any generic flow FS:

FS = \{\langle P_i, A_i \rangle,\ i = 1, \dots, I\},

where P_i and A_i denote the size and arrival time of the i-th packet, respectively. The flow summaries (Equation (6-1)) are used during both the training and the classification phases.




Sections 6.2, 6.3, and 6.4 describe the components Feature Extractor, Voice/Video Subspace Generator, and Voice/Video Classifier, respectively. In Section 6.5, we conduct experiments on traffic collected between two universities using Skype, MSN, and GTalk. Section 6.6 summarizes this chapter.

As shown in Section 5.3, the extraction and processing of simple traffic features does not solve the problem of detecting and separating voice and video data streams from other applications. In this section we first introduce the preliminary steps that we take to transform each generic flow FS obtained from the FSG into a stochastic process that combines the inter-arrival times and packet sizes. Then we describe how to use power spectral density (PSD) analysis as a powerful methodology to extract the hidden key regularities residing in real-time multimedia network traffic.

Figure 6-2. Power spectral density feature extraction module: cascade of processing steps.

Each flow FS extracted by the FSG is forwarded to the FE module, which applies several steps in cascade (Figure 6-2). First, any extracted FS (see Equation (6-1)) is modeled as a continuous stochastic process, as illustrated in Equation (6-2):

P(t) = \sum_{\langle P, A \rangle \in FS} P\, \delta(t - A),
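The construction of this impulse-train process, together with the low-pass filtering and sampling at interval Ts that follow, can be sketched in a few lines. The moving-average filter below is a hypothetical stand-in for hLPF, whose exact design is not reproduced here.

```python
import numpy as np

def sampled_process(sizes, arrivals, Ts, taps=8):
    """Crude sketch of the FE front end: place each packet size as an
    impulse P * delta(t - A) on a grid of step Ts, then smooth with a
    simple moving-average low-pass filter (a stand-in for hLPF) to
    obtain a discrete sequence Pd ready for PSD analysis."""
    n = int(np.ceil(max(arrivals) / Ts)) + taps
    p = np.zeros(n)
    for size, a in zip(sizes, arrivals):
        p[int(round(a / Ts))] += size        # weighted impulse train
    h = np.ones(taps) / taps                 # unit-gain moving-average LPF
    return np.convolve(p, h, mode="same")

# Two packets of 100 and 200 bytes; the filter spreads each impulse
# but preserves the total "mass" of the process.
out = sampled_process([100, 200], [0.005, 0.010], Ts=0.0005)
```

Because the filter has unit DC gain, the sum of the sampled sequence equals the total number of bytes placed on the grid, so no packet information is lost before spectral analysis.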


Since the process in Equation (6-2) is represented as a summation of delta functions, its spectrum spans the whole frequency domain. In order to correctly reshape the spectrum and avoid aliasing when the process is sampled at interval Ts, we apply a low-pass filter (LPF) characterized by its impulse response h_{LPF}(t). The filtered process P_h(t) can then be mathematically described as follows:

P_h(t) = P(t) * h_{LPF}(t) = \sum_{\langle P, A \rangle \in FS} P\, h_{LPF}(t - A).

Power spectral density definition.


The PSD of a second-order stationary discrete-time signal y (see [59]) can be computed as

\Phi(\omega; y) = \sum_{k=-\infty}^{\infty} r(k; y)\, e^{-j\omega k}, \quad \omega \in [-\pi, \pi),

with r(k; y) the autocovariance sequence of y (Equation (6-6)). Although \omega in Equation (6-5) can take any value, we restrict its domain to [-\pi, \pi) because \Phi(\omega; y) = \Phi(\omega + 2\pi; y). According to Equations (6-5) and (6-6), the computation of the PSD for a digital signal theoretically requires access to an infinitely long time sequence. Since in reality we cannot assume to have infinite digital sequences at our disposal, we face the problem of which technique can be used in our context to estimate the power spectral density with admissible accuracy. In the literature, two different families of PSD estimation methods are available: parametric and non-parametric. Parametric methods have been shown to perform better under the assumption that the underlying model is correct and accurate. Furthermore, these methods are more attractive from a computational-complexity perspective, as they require the estimation of fewer variables compared with non-parametric methods. In our research, we employ a parametric method to estimate the PSD. The details are presented in the next section.


The rational-spectrum model of Equation (6-7) can be regarded as obtaining a signal by filtering white noise of power \varepsilon^2 through a filter with transfer function B(\omega)/A(\omega) (Equation (6-10)). Starting from Equation (6-7), three types of methods are derived:
1. if p > 0 and q = 0, one models \{y(i)\}_{i \in \mathbb{Z}} as an autoregressive (AR(p)) signal;
2. if p = 0 and q > 0, one models \{y(i)\}_{i \in \mathbb{Z}} as a moving-average (MA(q)) signal;
3. otherwise, it is modeled as an autoregressive moving-average (ARMA(p, q)) signal.
Based on the AR, MA, or ARMA assumption, one can estimate the coefficients in Equation (6-7) and hence the PSD. In general, none of these three models outperforms the other two; rather, their performance is strictly related to the specific shape of the signal under consideration. Because the signals we process are characterized by strong regularities in the time domain, we adopt the AR model. The reason is that the AR equation can model spectra with narrow peaks by placing the zeros of A(\omega) close to the unit circle.


Under the AR(p) assumption, Equation (6-10) can be written as

y(i) + \sum_{t=1}^{p} a_t\, y(i - t) = \varepsilon(i).

Multiplying both sides of Equation (6-12) by y(i - k) and taking expectations, one obtains

r(k; y) + \sum_{t=1}^{p} a_t\, r(k - t; y) = E\{\varepsilon(i)\, y(i - k)\}.

Equation (6-14) can be rewritten in matrix form, i.e.,

\begin{bmatrix} r(0;y) & r(-1;y) & \cdots & r(-p;y) \\ r(1;y) & r(0;y) & \cdots & \vdots \\ \vdots & & \ddots & \vdots \\ r(p;y) & \cdots & r(1;y) & r(0;y) \end{bmatrix} \begin{bmatrix} 1 \\ a_1 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} \varepsilon^2 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.

Equation (6-15) is known as the Yule-Walker method for AR spectral estimation [59]. Given data \{y(i)\}_{i=1}^{I}, one first estimates the autocovariance sequence,

\hat r(k; y) = \frac{1}{I} \sum_{i=k+1}^{I} y(i)\, y(i - k),

so that Equation (6-15) becomes a system of p + 1 linear equations with p + 1 unknown variables,
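A minimal numerical sketch of the estimate above, i.e., a direct solve of the Yule-Walker system with the biased autocovariance estimator (variable names are ours, not the dissertation's):

```python
import numpy as np

def yule_walker(y, p):
    """Estimate AR(p) coefficients (a_1..a_p) and innovation power eps2
    by solving the Yule-Walker equations with the biased autocovariance
    estimate r_hat(k) = (1/I) * sum_i y(i) y(i-k).  Sign convention:
    y(i) + sum_t a_t y(i-t) = eps(i)."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    I = len(y)
    r = np.array([np.dot(y[k:], y[:I - k]) / I for k in range(p + 1)])
    # Rows k = 1..p of the system give R a = -r[1:], with R Toeplitz.
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, -r[1:])
    eps2 = r[0] + np.dot(a, r[1:])   # row k = 0 of the system
    return a, eps2

# Sanity check on a synthetic AR(1) signal y(i) = 0.8 y(i-1) + e(i);
# in the sign convention above the estimate should be close to -0.8.
rng = np.random.default_rng(0)
e = rng.standard_normal(5000)
y = np.zeros(5000)
for i in range(1, 5000):
    y[i] = 0.8 * y[i - 1] + e[i]
a_hat, eps2_hat = yule_walker(y, 1)
```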


Directly solving this system, i.e., Equations (6-17) and (6-18), is not good enough in terms of time complexity. Equation (6-17) computes the inverse of the covariance matrix, whose time complexity is O(p^3) [60, page 755]. In addition, in most applications there is no a priori information about the true order p. To cope with that, the Yule-Walker system of equations, Equation (6-15), has to be solved for p = 1 up to p = p_max, where p_max is some pre-specified maximum order; the overall time complexity is then O(p_max^4). We use the Levinson-Durbin Algorithm (LDA) [59] to reduce the time complexity. It estimates the AR signal coefficients recursively in the order p. To facilitate further discussion and to emphasize the order p, we denote

R_{p+1} \triangleq \begin{bmatrix} r(0;y) & r(-1;y) & \cdots & r(-p;y) \\ r(1;y) & r(0;y) & \cdots & r(-p+1;y) \\ \vdots & & \ddots & \vdots \\ r(p;y) & \cdots & r(1;y) & r(0;y) \end{bmatrix},


With this notation, one can rewrite Equation (6-15) as

R_{p+1} \begin{bmatrix} 1 \\ \tilde a_p \end{bmatrix} = \begin{bmatrix} \varepsilon_p^2 \\ \mathbf{0} \end{bmatrix}.

1. function LDA(...)
2.   Argument 1: data, \{y(i)\}_{i=1}^{I}.
3.   Argument 2: order, p.
4.   Return: parameters of the AR(p) model, \tilde a_p and \varepsilon_p^2.
5.   Estimate the autocovariances \hat r(k; y), k = 0, \dots, p, as in Equation (6-16).
6.   \tilde a_0 = [\,]   (6-23)
7.   \varepsilon_0^2 = \hat r(0; y)   (6-24)
8.   for t = 1, \dots, p
9.     \tilde r_t \triangleq [\hat r(t;y), \hat r(t-1;y), \dots, \hat r(1;y)]^T
       k_t = -[1\ \tilde a_{t-1}^T]\, \tilde r_t / \varepsilon_{t-1}^2
       \tilde a_t = \begin{bmatrix} \tilde a_{t-1} + k_t J \tilde a_{t-1} \\ k_t \end{bmatrix}, where J reverses the order of the entries
       \varepsilon_t^2 = \varepsilon_{t-1}^2 (1 - k_t^2)
10.  end for
Figure 6-3. Levinson-Durbin Algorithm.

Figure 6-3 gives the LDA pseudocode for estimating the coefficients of the AR(p) model given data \{y(i)\}_{i=1}^{I}.
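The recursion of Figure 6-3 can be sketched directly. The implementation below follows the standard Levinson-Durbin recursion from [59]; it solves the Yule-Walker system in O(p^2) operations instead of the O(p^3) of a direct matrix inversion.

```python
import numpy as np

def levinson_durbin(r, p):
    """Levinson-Durbin recursion.  `r` holds autocovariances r(0..p);
    returns the AR(p) coefficients (convention y(i) + sum_t a_t y(i-t)
    = eps(i)) and the final innovation power eps2."""
    a = np.zeros(0)
    eps2 = r[0]
    for t in range(1, p + 1):
        # reflection coefficient k_t = -(r(t) + sum_j a_j r(t-j)) / eps2
        k = -(r[t] + np.dot(a, r[t - 1:0:-1])) / eps2
        a = np.concatenate([a + k * a[::-1], [k]])   # order-update step
        eps2 *= (1.0 - k * k)
    return a, eps2

# Check against the exact autocovariances of an AR(1) process with
# coefficient 0.8 and unit noise: r(k) = 0.8^k / (1 - 0.64).  An AR(2)
# fit should recover a = [-0.8, 0] and eps2 = 1.
r = np.array([0.8 ** k for k in range(3)]) / (1 - 0.64)
a, eps2 = levinson_durbin(r, 2)
```

Running the recursion up to order p_max also yields every intermediate model AR(1), ..., AR(p_max) for free, which is exactly why the overall cost drops from O(p_max^4) to O(p_max^2).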


1. function PSDEstimate(...)
2.   Argument 1: data, \{y(i)\}_{i=1}^{I}.
3.   Argument 2: order, p.
4.   Return: PSD \Phi(\omega; y).
5.   [\tilde a_p, \varepsilon_p^2] = LDA(\{y(i)\}_{i=1}^{I}, p)   (6-30)
6.   \Phi(\omega; y) = \varepsilon_p^2 \big/ \big| 1 + \sum_{t=1}^{p} a_t e^{-j\omega t} \big|^2
Figure 6-4. Parametric PSD estimation using the Levinson-Durbin Algorithm.

Once the AR model is estimated, one can estimate the PSD of the signal \{y(i)\}_{i=1}^{I}. The procedure is given in Figure 6-4. We assume the sampled process P_d (Equation (6-4)) to be second-order stationary. Then its PSD can be estimated as

\Phi(\omega; P_d) = \mathrm{PSDEstimate}(\{P_d(i)\}_{i=1}^{I_d}, p),

where the sampling rate is F_s = 1/T_s (Figure 6-2). Thus, one can further express the PSD in terms of the real frequency f as

\Phi_f(f; P_d) = \Phi\left( \frac{2\pi f}{F_s}; P_d \right), \quad f \in \left[ -\frac{F_s}{2}, \frac{F_s}{2} \right).
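The final step of Figure 6-4 amounts to evaluating the AR spectrum on a frequency grid; a small sketch:

```python
import numpy as np

def ar_psd(a, eps2, omegas):
    """Evaluate the AR power spectral density
        Phi(w) = eps2 / |1 + sum_t a_t e^{-j w t}|^2
    at angular frequencies `omegas` in [-pi, pi)."""
    a = np.asarray(a, dtype=float)
    t = np.arange(1, len(a) + 1)
    A = 1.0 + np.exp(-1j * np.outer(omegas, t)) @ a
    return eps2 / np.abs(A) ** 2

# For an AR(1) model with a_1 = -0.8 and eps2 = 1 the spectrum has a
# single narrow peak at w = 0 of height 1 / (1 - 0.8)^2 = 25.
omegas = np.linspace(-np.pi, np.pi, 256, endpoint=False)
psd = ar_psd([-0.8], 1.0, omegas)
```

This is precisely the mechanism exploited above: poles of the AR filter (zeros of A) near the unit circle translate into narrow peaks of the estimated PSD, which is what the strong periodicities of voice and video packet streams produce.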


Equation (6-33) shows the relationship between the periodic components of a stochastic process in the continuous-time domain and the shape of its PSD in the frequency domain. \Phi_f(f; P_d) is a continuous function in the frequency domain. To handle it in a computer, we need to sample it in the frequency domain. In other words, we select a series of frequencies, 0 \le f_1 < f_2 < \cdots, at which the PSD is evaluated to form the PSD feature vector.

As shown in Figure 6-1, we collect voice and video training flows during the training phase. After processing the raw packet data through the feature-extraction module via PSD, one obtains two sets of feature vectors,

\Theta^{(1)} \triangleq \{ \vec\varphi^{(1)}(1), \vec\varphi^{(1)}(2), \dots, \vec\varphi^{(1)}(N_1) \},

for voice, and similarly \Theta^{(2)} for video. Classical dimensionality-reduction methods include Principal Component Analysis (PCA) [61] and Metric Multidimensional Scaling (MDS) [62], which identify linear structure, and ISOMAP [63] and Locally Linear Embedding (LLE) [64], which identify non-linear structure. Unfortunately, all these methods assume that the data are embedded in one single low-dimensional subspace. This assumption is not always true. For example, as different software uses different voice codings, it is more reasonable to assume that the PSD feature vector of voice traffic is a random vector generated from a mixture model than from a single model. In such a case, it is more likely that there are several subspaces in which the feature vectors are embedded. The same holds for video feature vectors. As a result, a better scheme is first to cluster the trained feature vectors into several groups, known as subspace decomposition, and second to identify the subspace structure of each group, known as subspace-bases identification. We describe the two steps in the following sections.


Hong [65] proposed a method to decompose subspaces according to the minimum-coding-length criterion. The idea is to view the data-segmentation problem from the perspective of data coding/compression. Suppose one wants to find a coding scheme, C, which maps the data \Theta \in \mathbb{R}^{M \times N} to a bit sequence. As all elements are real numbers, an infinitely long bit sequence would be needed to decode without error. Hence, one has to specify a tolerable decoding error, \epsilon, to obtain a mapping with finite coding length, i.e.,

\left\| \vec\varphi_n - C^{-1} C(\vec\varphi_n) \right\|_2^2 \le \epsilon^2, \quad \forall n = 1, \dots, N.

It is shown in [65] that the coding length L_C(\Theta) is upper-bounded by

L_C(\Theta) \le L(\Theta) = \frac{M + N}{2} \log_2 \det\left( I + \frac{M}{\epsilon^2 N} \Theta \Theta^T \right).
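The log-det coding-length bound can be evaluated directly. The sketch below uses the expression as stated above and should be read as an approximation of the criterion in [65], not its exact constants.

```python
import numpy as np

def coding_length(Theta, eps):
    """Lossy coding length of a data matrix Theta (M x N, one feature
    vector per column) up to distortion eps, via the log-det bound
        L(Theta) = ((M + N) / 2) * log2 det(I + M / (eps^2 N) Theta Theta^T).
    A sketch of the minimum-coding-length criterion, not the
    dissertation's exact constants."""
    M, N = Theta.shape
    G = np.eye(M) + (M / (eps ** 2 * N)) * (Theta @ Theta.T)
    sign, logdet = np.linalg.slogdet(G)   # stable log-determinant
    return 0.5 * (M + N) * logdet / np.log(2.0)

# All-zero data costs nothing; scaling the data up costs more bits.
L_zero = coding_length(np.zeros((3, 4)), 0.1)
L_big = coding_length(np.ones((3, 4)), 0.1)
L_small = coding_length(0.5 * np.ones((3, 4)), 0.1)
```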


An optimal segmentation of \Theta into K groups \Theta_1, \dots, \Theta_K (Equation (6-39)), in terms of the minimum-coding-length criterion, should minimize the coding length of the segmented data, i.e.,

\min \hat L_C(\Theta) = \min \left( \sum_{k=1}^{K} L_C(\Theta_k) + \sum_{k=1}^{K} |\Theta_k| \log_2 \frac{N}{|\Theta_k|} \right),

where the first term in Equation (6-45) is the sum of the coding lengths of the individual groups, and the second is the number of bits needed to encode the membership of each item of \Theta in the K groups. The optimal partition is achieved in the following way. Let the segmentation scheme be represented by the membership matrices

\Pi_k \triangleq \mathrm{diag}([\pi_{1k}, \pi_{2k}, \dots, \pi_{Nk}]) \in \mathbb{R}^{N \times N},

where \pi_{nk} indicates whether the n-th feature vector belongs to group k. Hong [65, page 34] proved that the coding length is bounded by an expression \hat L(\Theta, \Pi) written in terms of \mathrm{tr}(\Pi_k), k = 1, \dots, K. Combining Equations (6-45) and (6-48), one obtains a minimax criterion

\hat\Pi = \arg\min_{\Pi} \left[ \max_{C} \hat L_C(\Theta, \Pi) \right] = \arg\min_{\Pi} \hat L(\Theta, \Pi).


1. function MCLPartition(...)
2.   Argument 1: set of feature vectors, \Theta = \{ \vec\varphi(1), \dots, \vec\varphi(N) \}.
3.   Return: partition of \Theta, \Omega = \{ \Theta_1, \dots, \Theta_K \}, with \Theta_1 \cup \dots \cup \Theta_K = \Theta and \Theta_i \cap \Theta_j = \emptyset for all i \ne j.
4.   Initialization: \Omega = \{ \{\vec\varphi(1)\}, \{\vec\varphi(2)\}, \dots, \{\vec\varphi(N)\} \}.
5.   while true do
6.     (\Theta_1, \Theta_2) = \arg\min_{\Theta_i, \Theta_j \in \Omega} \left[ L(\Theta_i \cup \Theta_j) - L(\Theta_i) - L(\Theta_j) \right]   (6-51)
7.     if L(\Theta_1 \cup \Theta_2) - L(\Theta_1) - L(\Theta_2) \ge 0 then
8.       break
9.     else
10.      \Omega = (\Omega \setminus \{\Theta_1, \Theta_2\}) \cup \{\Theta_1 \cup \Theta_2\}
11.    end if
12.  end while
13.  return \Omega
Figure 6-5. Pairwise steepest-descent method to achieve minimal coding length.

There is no closed-form solution to Equation (6-49). Hong [65, page 41] proposed a pairwise steepest-descent method to solve it (Figure 6-5). It works in a bottom-up way: it starts with a partition scheme that assigns each element of \Theta to its own group. Then, at each iteration, the algorithm finds the two subsets of feature vectors whose merging decreases the coding length the most (Equation (6-51)). The procedure stops when no further decrease of the coding length can be achieved by merging any two subsets. Using the above method, we obtain a partition of the voice feature vector set \Theta^{(1)},
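A compact sketch of this pairwise steepest-descent search, using the log-det coding length as L. The membership term of the objective is omitted for brevity, so the stopping point may differ slightly from [65]; this is an illustrative approximation, not the exact procedure.

```python
import numpy as np

def mcl_partition(vectors, eps):
    """Bottom-up greedy merge: start from singletons and repeatedly
    merge the pair of groups whose union decreases total coding length
    the most; stop when no merge helps."""
    def L(group):
        Theta = np.column_stack([vectors[i] for i in group])
        M, N = Theta.shape
        G = np.eye(M) + (M / (eps ** 2 * N)) * (Theta @ Theta.T)
        return 0.5 * (M + N) * np.linalg.slogdet(G)[1] / np.log(2.0)

    groups = [[i] for i in range(len(vectors))]
    while len(groups) > 1:
        best, best_delta = None, 0.0
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                delta = L(groups[i] + groups[j]) - L(groups[i]) - L(groups[j])
                if delta < best_delta:      # strictly decreasing merges only
                    best, best_delta = (i, j), delta
        if best is None:
            break
        i, j = best
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups

# Two nearly collinear vectors merge; an orthogonal one stays apart.
vecs = [np.array([10.0, 0.0]), np.array([10.1, 0.0]), np.array([0.0, 10.0])]
groups = mcl_partition(vecs, eps=1.0)
```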


We use the PCA [61] algorithm to identify subspace bases for each segment,

\{ \Theta_k^{(i)},\ k = 1, \dots, K_i,\ i = 1, 2 \},

where i = 1 corresponds to voice and i = 2 to video. Figure 6-6 shows the algorithm.

1. function [\bar{\vec\varphi}, \hat U, \hat\Lambda, \tilde U, \tilde\Lambda] = IdentifyBases(\Theta \in \mathbb{R}^{M \times N}, \eta)
2.   \bar{\vec\varphi} = \frac{1}{N} \sum_{n} \vec\varphi(n)
3.   \bar\Theta = [\vec\varphi(1) - \bar{\vec\varphi},\ \vec\varphi(2) - \bar{\vec\varphi},\ \dots,\ \vec\varphi(N) - \bar{\vec\varphi}]
4.   Do eigenvalue decomposition on \bar\Theta \bar\Theta^T such that \bar\Theta \bar\Theta^T = U \Lambda U^T, with eigenvalues \sigma_1^2 \ge \dots \ge \sigma_M^2.
5.   Choose J as the smallest index such that the top J - 1 eigenvalues retain a fraction \eta of the total energy.
6.   \hat U = [\vec u_1, \vec u_2, \dots, \vec u_{J-1}]
7.   \tilde U = [\vec u_J, \vec u_{J+1}, \dots, \vec u_M]
8.   \hat\Lambda = \mathrm{diag}(\sigma_1^2, \dots, \sigma_{J-1}^2)
9.   \tilde\Lambda = \mathrm{diag}(\sigma_J^2, \dots, \sigma_M^2)
10. end function
Figure 6-6. Function IdentifyBases identifies the bases of a subspace.

In Figure 6-6, the argument \Theta represents the feature-vector set of one segment, and \eta is a user-defined parameter which specifies the percentage of energy retained, e.g., 90%.
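A sketch of IdentifyBases in a few lines: the centering, eigendecomposition, and eigenvalue split follow Figure 6-6, while the exact tie-breaking rule for choosing J is an assumption.

```python
import numpy as np

def identify_bases(Theta, eta=0.9):
    """Centre the feature vectors (columns of Theta), eigendecompose
    the scatter matrix, and split the eigenvectors into principal bases
    U_hat (retaining a fraction eta of the energy) and a residual basis
    U_tilde, with matching eigenvalue blocks."""
    mean = Theta.mean(axis=1, keepdims=True)
    C = (Theta - mean) @ (Theta - mean).T
    w, U = np.linalg.eigh(C)            # ascending eigenvalues
    w, U = w[::-1], U[:, ::-1]          # reorder to descending
    energy = np.cumsum(w) / np.sum(w)
    J = int(np.searchsorted(energy, eta)) + 1   # smallest J reaching eta
    return mean.ravel(), U[:, :J], w[:J], U[:, J:], w[J:]

# Synthetic segment: energy concentrated along the first coordinate,
# so a single principal basis vector (close to e1) should be retained.
rng = np.random.default_rng(1)
Theta = np.vstack([rng.standard_normal(200) * 10.0,
                   rng.standard_normal(200) * 0.1])
mean, U_hat, w_hat, U_tilde, w_tilde = identify_bases(Theta, eta=0.9)
```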


Further details appear in Section 6.4. Applying the function IdentifyBases to all segments, we obtain

[\bar{\vec\varphi}_k^{(i)}, \hat U_k^{(i)}, \hat\Lambda_k^{(i)}, \tilde U_k^{(i)}, \tilde\Lambda_k^{(i)}] = \mathrm{IdentifyBases}(\Theta_k^{(i)})   (6-56)

for all k = 1, \dots, K_i and i = 1, 2. These are the outputs of the subspace-identification module, and hence the results of the training phase, in Figure 6-1. During the classification phase, these outputs are used as system parameters, as presented in the next section.

In Section 6.3, we presented an approach to identify the subspaces spanned by the PSD feature vectors of training voice and video flows. Specifically, one obtains the following parameters:

[\bar{\vec\varphi}_k^{(i)}, \hat U_k^{(i)}, \hat\Lambda_k^{(i)}, \tilde U_k^{(i)}, \tilde\Lambda_k^{(i)}]


Figure 6-7 shows the procedure of the voice/video classifier.

1. function type = VoiceVideoClassify(\vec\varphi, \eta_A, \eta_V)
2.   For i = 1, 2 and k = 1, \dots, K_i: d_k^{(i)} = \mathrm{NormalizedDistance}(\vec\varphi, \bar{\vec\varphi}_k^{(i)}, \hat U_k^{(i)}, \hat\Lambda_k^{(i)}).
3.   For i = 1, 2: d_i = \min_k d_k^{(i)}.
4.   if d_1 \le \eta_V then
5.     type = VOICE.
6.   else if d_1 > \eta_A and d_2 \le \eta_V then
7.     type = VIDEO.
8.   else
9.     type = OTHER.
10.  end if
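The classification step can be sketched as below. The NormalizedDistance here is a hypothetical Mahalanobis-style distance in the principal subspace (the dissertation's exact definition is not reproduced), and the threshold logic mirrors Figure 6-7 only approximately.

```python
import numpy as np

def normalized_distance(phi, mean, U_hat, w_hat):
    """Hypothetical normalized distance: project the centred feature
    vector onto the principal bases and weight each coordinate by the
    inverse of its eigenvalue (energy)."""
    z = U_hat.T @ (phi - mean)
    return float(np.sqrt(np.sum(z ** 2 / w_hat)))

def voice_video_classify(d_voice, d_video, eta_A, eta_V):
    """Sketch of the decision logic: d_voice and d_video are the
    minimum normalized distances to the voice and video subspaces."""
    if d_voice < min(eta_V, d_video):
        return "VOICE"
    if d_video < min(eta_A, d_voice):
        return "VIDEO"
    return "OTHER"

# One principal axis e1 with eigenvalue 4: the vector [2, 0] sits one
# "standard deviation" from the subspace mean.
d = normalized_distance(np.array([2.0, 0.0]), np.zeros(2),
                        np.eye(2)[:, :1], np.array([4.0]))
```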

We now apply the system of Figure 6-1 to network traffic classification. Before that, we first describe the experiment settings in Section 6.5.1. In Section 6.5.2, two sets of experiments are conducted on traffic generated by Skype. In Section 6.5.3, another two sets of experiments are conducted on traffic generated by Skype, MSN, and GTalk. For each set of experiments, we use Receiver Operating Characteristic (ROC) curves [28, page 107] as the performance metric. An ROC curve plots the detection probability, P_D, vs. the false-alarm probability, P_{FA}, where

P_D \triangleq P(\text{the estimated state of nature is } H \mid \text{the true state of nature is } H),

and P_{FA} is defined analogously for a false alarm. By tuning the decision thresholds (Figure 6-7), one is able to generate an ROC curve. During the experiments, we collected network traffic from three applications: Skype, MSN, and GTalk. For each application, traffic was collected in two scenarios. In the first scenario, two lab computers, located in University A and University B respectively, communicated with each other over a direct connection between the two peers. In the second, we used a firewall to block the direct connection between the two peers, so that the application was forced to use relay nodes. For classification, we chose the first 10 seconds of each flow, i.e., A_max = 10 seconds. We set T_s = 0.5 milliseconds; hence, I_d = 20,000.
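Given per-flow scores for a class, the (P_FA, P_D) pairs of an ROC curve follow directly from the definitions above. A minimal sketch, in which `scores_pos` and `scores_neg` are hypothetical similarity scores for flows whose true type is, respectively is not, the class under test (in the system, negative normalized distances would play this role):

```python
def roc_points(scores_pos, scores_neg, thresholds):
    """Compute (PFA, PD) pairs: PD is the fraction of positive-class
    flows whose score clears the threshold, PFA the fraction of
    negative-class flows that do."""
    points = []
    for th in thresholds:
        pd = sum(s >= th for s in scores_pos) / len(scores_pos)
        pfa = sum(s >= th for s in scores_neg) / len(scores_neg)
        points.append((pfa, pd))
    return points

# Well-separated scores give the ideal operating point (PFA=0, PD=1).
pts = roc_points([0.9, 0.8, 0.7], [0.2, 0.1], [0.5, 0.85])
```

Sweeping the threshold from the smallest to the largest observed score traces the whole curve, which is how the ROC plots in Figures 6-8 through 6-11 are generated.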


We first consider single-typed flows, i.e., each flow carries only one type of traffic. Figure 6-8 shows the ROC curves of classifying voice and video flows.

Figure 6-8. The ROC curves of single-typed flows generated by Skype: (a) VOICE and (b) VIDEO.

We then conduct experiments on hybrid Skype flows. In other words, each flow may be of type VOICE, VIDEO, FILE+VOICE, FILE+VIDEO, or none of the above. Figure 6-9 plots the ROC curves of these five types, respectively.


Figure 6-9. The ROC curves of hybrid flows generated by Skype: (a) VOICE, (b) VIDEO, (c) FILE+VOICE, and (d) FILE+VIDEO.

Similar to Section 6.5.2, we first consider the scenario in which each flow carries one type of traffic. By tuning the thresholds \eta, \eta_A, and \eta_V, we generate the ROC curves of classifying voice and video flows (Figure 6-10). We then conduct experiments on hybrid flows. Figure 6-11 shows the ROC curves of classifying VOICE, VIDEO, FILE+VOICE, and FILE+VIDEO flows.

To summarize Figures 6-8, 6-9, 6-10, and 6-11, we show some typical values of the P_D and P_{FA} pairs in Table 6-1. One can observe the following phenomena from Table 6-1.


Figure 6-10. The ROC curves of single-typed flows generated by Skype, MSN, and GTalk: (a) VOICE and (b) VIDEO.

Table 6-1: Typical P_D and P_FA values, reported as P_FA (P_D).

             Skype                    Skype + MSN + GTalk
             Single      Hybrid       Single       Hybrid
VOICE        0 (1)       0 (1)        0 (.995)     .002 (.986)
VIDEO        0 (.993)    0 (.965)     0 (.952)     0 (.948)

First, from Table 6-1 one notes that the classification of VOICE traffic is more accurate than that of VIDEO. Specifically, we can achieve 100% accurate classification of Skype voice flows. This is due to the fact that voice traffic has higher regularity than video does (Figure 5-8 and Figure 5-9). One can immediately tell the dominant periodic component at 33 Hz in the voice flows; this frequency corresponds to the 30-millisecond IPD of the employed voice coding. On the other hand, video PSDs have peaks at 0, which means that a non-periodic component dominates in video flows. One can also see that the PSDs of the two video flows are close to each other. That is the reason why our approach achieves high classification accuracy by using PSD features. Second, from Table 6-1, one can see that the classification of single-typed flows is more accurate than that of hybrid flows. Mixing multiple types of traffic together effectively increases the noise; hence, it is not surprising that the classification accuracy is reduced.


Figure 6-11. The ROC curves of hybrid flows generated by Skype, MSN, and GTalk: (a) VOICE, (b) VIDEO, (c) FILE+VOICE, and (d) FILE+VIDEO.


Third, our approach, described in Section 6.3, decomposes multiple subspaces in the original high-dimensional feature space. As a result, the PSD feature vectors of Skype and GTalk are likely to lie in different subspaces from those of MSN. Therefore, we can still classify traffic accurately.










Proof of Equation (4-31). In the following, \rho(i) denotes the parent of node i, c(i) its set of children, \tilde T_i the subtree of observations rooted at node i, and \tilde t_i the observation at node i.

\sum_{u' \in \{0,1\}} P(x_i = u \mid x_{\rho(i)} = u')\, \alpha_{\rho(i)}(u')\, p(\tilde T_{\rho(i)} \setminus \tilde T_i \mid x_{\rho(i)} = u')
= \sum_{u' \in \{0,1\}} P(x_i = u \mid x_{\rho(i)} = u')\, p(x_{\rho(i)} = u', \tilde T \setminus \tilde T_{\rho(i)})\, p(\tilde T_{\rho(i)} \setminus \tilde T_i \mid x_{\rho(i)} = u')
= \sum_{u' \in \{0,1\}} p(x_i = u, x_{\rho(i)} = u', \tilde T \setminus \tilde T_i)
= p(x_i = u, \tilde T \setminus \tilde T_i) = \alpha_i(u).

Proof of Equation (4-32).

p(\tilde t_i \mid x_i = u) \prod_{j \in c(i)} \sum_{u' \in \{0,1\}} P(x_j = u' \mid x_i = u)\, \beta_j(u')
= p(\tilde t_i \mid x_i = u) \prod_{j \in c(i)} \sum_{u' \in \{0,1\}} P(x_j = u' \mid x_i = u)\, p(\tilde T_j \mid x_j = u')
= p(\tilde t_i \mid x_i = u) \prod_{j \in c(i)} p(\tilde T_j \mid x_i = u)
= p(\tilde T_i \mid x_i = u) = \beta_i(u).     (A-2)


Proof of Equation (4-33).

\frac{\alpha_i(u)\, \beta_i(u)}{\sum_{u'} \alpha_i(u')\, \beta_i(u')}
= \frac{p(x_i = u, \tilde T \setminus \tilde T_i)\, p(\tilde T_i \mid x_i = u)}{\sum_{u'} p(x_i = u', \tilde T \setminus \tilde T_i)\, p(\tilde T_i \mid x_i = u')}
= \frac{p(x_i = u, \tilde T)}{p(\tilde T)} = p(x_i = u \mid \tilde T).     (A-3)

Proof of Equation (4-34).

\frac{\beta_i(u)\, P(x_i = u \mid x_{\rho(i)} = u')\, \alpha_{\rho(i)}(u')\, p(\tilde T_{\rho(i)} \setminus \tilde T_i \mid x_{\rho(i)} = u')}{p(\tilde T)}
= \frac{p(\tilde T_i \mid x_i = u)\, p(x_i = u, x_{\rho(i)} = u', \tilde T \setminus \tilde T_i)}{p(\tilde T)}
= \frac{p(x_i = u, x_{\rho(i)} = u', \tilde T)}{p(\tilde T)}
= p(x_i = u, x_{\rho(i)} = u' \mid \tilde T).     (A-4)


[1] P. Mockapetris, "Domain names - concepts and facilities," RFC 1034.
[2] P. Mockapetris, "Domain names - implementation and specification," RFC 1035.
[3] "Video on demand," Wikipedia. [Online].
[4] D. Wu, Y. T. Hou, W. Zhu, Y.-Q. Zhang, and J. M. Peha, "Streaming video over the Internet: Approaches and directions," IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 282-300, Mar. 2001.
[5] K. Nichols, S. Blake, F. Baker, and D. Black, "Definition of the differentiated services field (DS field) in the IPv4 and IPv6 headers," RFC 2474.
[6] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss, "An architecture for differentiated services," RFC 2475.
[7] P. Almquist, "Type of service in the Internet protocol suite," RFC 1349.
[8] S. S. Kim, A. L. N. Reddy, and M. Vannucci, "Detecting traffic anomalies using discrete wavelet transform," in Proceedings of International Conference on Information Networking (ICOIN), vol. III, Busan, Korea, Feb. 2004, pp. 1375-1384.
[9] C.-M. Cheng, H. T. Kung, and K.-S. Tan, "Use of spectral analysis in defense against DoS attacks," in Proceedings of IEEE Globecom 2002, vol. 3, Taipei, Taiwan, Nov. 2002, pp. 2143-2148.
[10] A. Hussain, J. Heidemann, and C. Papadopoulos, "A framework for classifying denial of service attacks," in Proceedings of ACM SIGCOMM, Karlsruhe, Germany, Aug. 2003.
[11] H. Wang, D. Zhang, and K. G. Shin, "Detecting SYN flooding attacks," in Proc. IEEE INFOCOM '02, New York City, NY, June 2002, pp. 1530-1539.
[12] T. Peng, C. Leckie, and K. Ramamohanarao, "Detecting distributed denial of service attacks using source IP address monitoring," Department of Computer Science and Software Engineering, The University of Melbourne, Tech. Rep., 2002. [Online].
[13] R. B. Blazek, H. Kim, B. Rozovskii, and A. Tartakovsky, "A novel approach to detection of 'denial-of-service' attacks via adaptive sequential and batch-sequential change-point detection methods," in Proc. IEEE Workshop on Information Assurance and Security, West Point, NY, June 2001, pp. 220-226.
[14] S. Mukkamala and A. H. Sung, "Detecting denial of service attacks using support vector machines," in Proceedings of IEEE International Conference on Fuzzy Systems, May 2003.


[15] S. Savage, D. Wetherall, A. Karlin, and T. Anderson, "Practical network support for IP traceback," in Proc. of ACM SIGCOMM '2000, Aug. 2000.
[16] A. Lakhina, M. Crovella, and C. Diot, "Characterization of network-wide anomalies in traffic flows," in Proc. ACM SIGCOMM Conference on Internet Measurement '04, Oct. 2004.
[17] H. Wang, D. Zhang, and K. G. Shin, "Change-point monitoring for the detection of DoS attacks," IEEE Transactions on Dependable and Secure Computing, no. 4, pp. 193-208, Oct. 2004.
[18] J. Mirkovic and P. Reiher, "A taxonomy of DDoS attacks and DDoS defense mechanisms," in Proc. ACM SIGCOMM Computer Communications Review '04, vol. 34, Apr. 2004, pp. 39-53.
[19] J. B. Postel and J. Reynolds, "File transfer protocol," RFC 959, Oct. 1985. [Online].
[20] K. Lu, J. Fan, J. Greco, D. Wu, S. Todorovic, and A. Nucci, "A novel anti-DDoS system for large-scale Internet," in ACM SIGCOMM 2005, Philadelphia, PA, Aug. 2005.
[21] B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors," Commun. ACM, vol. 13, no. 7, pp. 422-426, July 1970.
[22] L. Fan, P. Cao, J. Almeida, and A. Z. Broder, "Summary cache: A scalable wide-area web cache sharing protocol," IEEE/ACM Trans. Netw., vol. 8, no. 3, June 2000.
[23] F. Chang, W.-c. Feng, and K. Li, "Approximate caches for packet classification," in IEEE INFOCOM 2004, vol. 4, Mar. 2004, pp. 2196-2207.
[24] R. Rivest, "The MD5 message-digest algorithm," RFC 1321, Apr. 1992. [Online].
[25] 7910.pdf
[26] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface. San Francisco, CA: Morgan Kaufmann, 1998, ch. 5, 6.
[27] "Auckland-IV trace data," 2001. [Online].
[28] L. L. Scharf, Statistical Signal Processing: Detection, Estimation, and Time Series Analysis. Addison Wesley, 1991.
[29] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Wiley-Interscience, Oct. 2000.
[30] G. Casella and R. L. Berger, Statistical Inference, 2nd ed. Duxbury Press, June 2001.


[31] B. J. Frey, Graphical Models for Machine Learning and Digital Communication. Cambridge, MA: MIT Press, 1998.
[32] M. C. Nechyba, "Maximum-likelihood estimation for mixture models: the EM algorithm," 2003, course note. [Online].
[33] Y. Weiss, "Correctness of local probability propagation in graphical models with loops," Neural Computation, vol. 12, pp. 1-4, 2000.
[34] J. S. Yedidia, W. T. Freeman, and Y. Weiss, "Generalized belief propagation," Advances in Neural Information Processing Systems, vol. 13, pp. 689-695, Dec. 2000.
[35] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, Sept. 1988.
[36] S. M. Aji and R. J. McEliece, "The generalized distributive law," IEEE Trans. Inform. Theory, vol. 46, pp. 325-343, Mar. 2000.
[37] T. Richardson, "The geometry of turbo-decoding dynamics," IEEE Trans. Inform. Theory, vol. 46, pp. 9-23, Jan. 2000.
[38] R. J. McEliece, D. J. C. McKay, and J. F. Cheng, "Turbo decoding as an instance of Pearl's belief propagation algorithm," IEEE J. Select. Areas Commun., vol. 16, pp. 140-152, Feb. 1998.
[39] F. Kschischang and B. Frey, "Iterative decoding of compound codes by probability propagation in graphical models," IEEE J. Select. Areas Commun., vol. 16, pp. 219-230, Feb. 1998.
[40] A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Trans. Inform. Theory, vol. 13, pp. 260-269, Apr. 1967.
[41] G. D. Forney, Jr., "The Viterbi algorithm," in Proceedings of the IEEE, vol. 61, Mar. 1973, pp. 268-278.
[42] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," in Proceedings of the IEEE, vol. 77, Feb. 1989, pp. 257-286.
[43] D. Moore, K. Keys, R. Koga, E. Lagache, and kc claffy, "The CoralReef software suite as a tool for system and network administrators," in Usenix LISA, Dec. 2001. [Online].
[44] C. Logg, "Characterization of the traffic between SLAC and the Internet," July 2003. [Online].
[45] Internet Assigned Numbers Authority, "Port numbers," Aug. 2006. [Online].


[46] T. Karagiannis, A. Broido, N. Brownlee, kc claffy, and M. Faloutsos, "Is P2P dying or just hiding?" in IEEE Globecom 2004, 2004.
[47] S. Sen, O. Spatscheck, and D. Wang, "Accurate, scalable in-network identification of P2P traffic using application signatures," in WWW, 2004.
[48] K. Wang, G. Cretu, and S. J. Stolfo, "Anomalous payload-based network intrusion detection," in 7th International Symposium on Recent Advances in Intrusion Detection, Sept. 2004, pp. 201-222.
[49] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, "Class-of-service mapping for QoS: A statistical signature-based approach to IP traffic classification," in ACM Internet Measurement Conference, Taormina, Italy, 2004.
[50] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, "BLINC: multilevel traffic classification in the dark," in SIGCOMM '05: Proceedings of the 2005 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications. New York, NY, USA: ACM Press, 2005, pp. 229-240.
[51] C. Dewes, A. Wichmann, and A. Feldmann, "An analysis of Internet chat systems," in IMC '03: Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement. New York, NY, USA: ACM Press, 2003, pp. 51-64.
[52] D. Wu, T. Hou, and Y.-Q. Zhang, "Transporting real-time video over the Internet: Challenges and approaches," Proceedings of the IEEE, vol. 88, no. 12, pp. 1855-1875, December 2000.
[53] ITU-T, "G.711: Pulse code modulation (PCM) of voice frequencies," ITU-T Recommendation G.711, 1989. [Online].
[54] ITU-T, "G.726: 40, 32, 24, 16 kbit/s adaptive differential pulse code modulation (ADPCM)," ITU-T Recommendation G.726, 1990. [Online].
[55] ITU-T, "G.728: Coding of speech at 16 kbit/s using low-delay code excited linear prediction," ITU-T Recommendation G.728, 1992. [Online].
[56] ITU-T, "G.729: Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP)," ITU-T Recommendation G.729, 1996. [Online].
[57] ITU-T, "G.723.1: Dual rate speech coder for multimedia communications transmitting at 5.3 and 6.3 kbit/s," ITU-T Recommendation G.723.1, 2006. [Online].
[58] Y. Wang, J. Ostermann, and Y.-Q. Zhang, Video Processing and Communications, 1st ed. Prentice Hall, 2002.


[59] P. Stoica and R. Moses, Spectral Analysis of Signals, 1st ed. Upper Saddle River, NJ: Prentice Hall, 2005.
[60] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. MIT Press, Sept. 2001.
[61] L. I. Smith, "A tutorial on principal components analysis," Feb. 2002. [Online].
[62] K. V. Deun and L. Delbeke, "Multidimensional scaling," University of Leuven. [Online].
[63] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, pp. 2319-2323, Dec. 2000.
[64] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, pp. 2323-2326, Dec. 2000.
[65] W. Hong, "Hybrid models for representation of imagery data," Ph.D. dissertation, University of Illinois at Urbana-Champaign, Aug. 2006.


Jieyan Fan was born on July 26, 1979, in Shanghai, China. The only child in the family, he grew up mostly in his hometown, graduating from the High School Affiliated to Fudan University in 1997. He earned his B.S. and M.S. in electrical engineering from Shanghai Jiao Tong University, Shanghai, China, in 2001 and 2004, respectively. He is currently a Ph.D. candidate in electrical and computer engineering at the University of Florida, Gainesville, FL. His research interests are network security and pattern classification. Upon completion of his Ph.D. program, Jieyan will be working at Yahoo! Inc., Sunnyvale, CA.