
Learning-Aided System Performance Modeling in Support of Self-Optimized Resource Scheduling in Distributed Environments

Permanent Link: http://ufdc.ufl.edu/UFE0021738/00001

Material Information

Title: Learning-Aided System Performance Modeling in Support of Self-Optimized Resource Scheduling in Distributed Environments
Physical Description: 1 online resource (146 p.)
Language: english
Creator: Zhang, Jian
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2007

Subjects

Subjects / Keywords: bayesian, classification, clustering, distributed, knn, learning, pca, performance, prediction, scheduling
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre: Electrical and Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: With the goal of autonomic computing, it is desirable to have a resource scheduler that is capable of self-optimization, which means that, given a high-level objective, the scheduler can automatically adapt its scheduling decisions to the changing workload. This self-optimization capability poses challenges for system performance modeling because of the increasing size and complexity of computing systems. Our goals were twofold: to design performance models that can derive applications' resource consumption patterns in a systematic way, and to develop performance prediction models that can adapt to changing workloads. A novelty in the system performance model design is the use of various machine learning techniques to efficiently deal with the complexity of dynamic workloads based on monitoring and mining of historical performance data. In the environments considered in this thesis, virtual machines (VMs) are used as resource containers to host application executions because of their flexibility in supporting resource provisioning and load balancing. Our study introduced three performance models to support self-optimized scheduling and decision-making. First, a novel approach is introduced for application classification based on Principal Component Analysis (PCA) and the k-Nearest Neighbor (k-NN) classifier. It helps to reduce the dimensionality of the performance feature space and classify applications based on extracted features. In addition, a feature selection model is designed based on a Bayesian Network (BN) to systematically identify the feature subset that can provide optimal classification accuracy and adapt to changing workloads. Second, an adaptive system performance prediction model is investigated based on a learning-aided predictor integration technique. Supervised learning techniques are used to learn the correlations between the statistical properties of the workload and the best-suited predictors. In addition to a one-step-ahead prediction model, a phase characterization model is studied to explore the large-scale behavior of applications' resource consumption patterns. Our study provides novel methodologies to model system and application performance. The performance models can self-optimize over time based on learning of historical runs, and therefore adapt better to the changing workload and achieve better prediction accuracy than traditional methods with static parameters.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Jian Zhang.
Thesis: Thesis (Ph.D.)--University of Florida, 2007.
Local: Adviser: Figueiredo, Renato J.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2007
System ID: UFE0021738:00001



LEARNING-AIDED SYSTEM PERFORMANCE MODELING IN SUPPORT OF
SELF-OPTIMIZED RESOURCE SCHEDULING IN
DISTRIBUTED ENVIRONMENTS


















By

JIAN ZHANG


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007






























© 2007 Jian Zhang



































To my family.









ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my advisor, Professor Renato J.

Figueiredo, for his invaluable advice, encouragement, and support. This dissertation would

not have been possible without his guidance and support. My deep appreciation goes to

Professor Jose A.B. Fortes for participating in my supervisory committee and for all the

guidance and opportunities to work in the In-VIGO team that he gave me during my

Ph.D study. My deep recognition also goes to Professor Malay Ghosh and Professor Alan

George for serving on my supervisory committee and for their valuable suggestions. Many

thanks go to Dr. Mazin Yousif and Mr. Robert Carpenter from Intel Corporation for their

valuable input and generous funding for this research. Thanks also go to my colleagues

in the Advanced Computing Information Systems (ACIS) Laboratory for their discussion

of ideas and years of friendship. Last but not least, I owe a special debt of gratitude to

my family. Without their selfless love and support, I cannot imagine what I would have

achieved.









TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

    1.1 Resource Performance Modeling
    1.2 Autonomic Computing
    1.3 Learning
        1.3.1 Supervised Learning
        1.3.2 Unsupervised Learning
        1.3.3 Reinforcement Learning
        1.3.4 Other Learning Paradigms
    1.4 Virtual Machines
        1.4.1 Virtual Machine Characteristics
        1.4.2 Virtual Machine Plant

2 APPLICATION CLASSIFICATION BASED ON MONITORING AND LEARNING OF
  RESOURCE CONSUMPTION PATTERNS

    2.1 Introduction
    2.2 Classification Algorithms
        2.2.1 Principal Component Analysis
        2.2.2 k-Nearest Neighbor Algorithm
    2.3 Application Classification Framework
        2.3.1 Performance Profiler
        2.3.2 Classification Center
            2.3.2.1 Data preprocessing based on expert knowledge
            2.3.2.2 Feature selection based on principal component analysis
            2.3.2.3 Training and classification
        2.3.3 Post Processing and Application Database
    2.4 Experimental Results
        2.4.1 Classification Ability
        2.4.2 Scheduling Performance Improvement
        2.4.3 Classification Cost
    2.5 Related Work
    2.6 Conclusion

3 AUTONOMIC FEATURE SELECTION FOR APPLICATION CLASSIFICATION

    3.1 Introduction
    3.2 Statistical Inference
        3.2.1 Feature Selection
        3.2.2 Bayesian Network
        3.2.3 Mahalanobis Distance
        3.2.4 Confusion Matrix
    3.3 Autonomic Feature Selection Framework
        3.3.1 Data Quality Assuror
        3.3.2 Feature Selector
        3.3.3 Trainer
    3.4 Experimental Results
        3.4.1 Feature Selection and Classification Accuracy
        3.4.2 Classification Validation
        3.4.3 Training Data Quality Assurance
    3.5 Related Work
    3.6 Conclusion

4 ADAPTIVE PREDICTOR INTEGRATION FOR SYSTEM PERFORMANCE PREDICTIONS

    4.1 Introduction
    4.2 Related Work
    4.3 Virtual Machine Resource Prediction Overview
    4.4 Time Series Models for Resource Performance Prediction
    4.5 Algorithms for Prediction Model Selection
        4.5.1 k-Nearest Neighbor
        4.5.2 Bayesian Classification
        4.5.3 Principal Component Analysis
    4.6 Learning-Aided Adaptive Resource Predictor
        4.6.1 Training Phase
        4.6.2 Testing Phase
    4.7 Empirical Evaluation
        4.7.1 Best Predictor Selection
        4.7.2 Virtual Machine Performance Trace Prediction
            4.7.2.1 Performance of k-NN based LARPredictor
            4.7.2.2 Performance comparison of k-NN and Bayesian-classifier based LARPredictor
            4.7.2.3 Performance comparison of the LARPredictors and the cumulative-MSE based predictor used in the NWS
        4.7.3 Discussion
    4.8 Conclusion

5 APPLICATION RESOURCE DEMAND PHASE ANALYSIS AND PREDICTIONS

    5.1 Introduction
    5.2 Application Resource Demand Phase Analysis and Prediction Prototype
    5.3 Data Clustering
        5.3.1 Stages in Clustering
        5.3.2 Definitions and Notation
        5.3.3 k-means Clustering
        5.3.4 Finding the Optimal Number of Clusters
    5.4 Phase Prediction
    5.5 Empirical Evaluation
        5.5.1 Phase Behavior Analysis
            5.5.1.1 SPECseis96 benchmark
            5.5.1.2 World Cup web log replay
        5.5.2 Phase Prediction Accuracy
        5.5.3 Discussion
    5.6 Related Work
    5.7 Conclusion

6 CONCLUSION

REFERENCES

BIOGRAPHICAL SKETCH










LIST OF TABLES

Table

2-1 Performance metric list
2-2 List of training and testing applications
2-3 Experimental data: application class compositions
2-4 System throughput: concurrent vs. sequential executions
3-1 Sample confusion matrix with two classes (L = 2)
3-2 Sample performance metrics in the original feature set
3-3 Confusion matrix of classification results
3-4 Performance metric correlation matrixes of test applications
4-1 Normalized prediction MSE statistics for resources of VM1
4-2 Normalized prediction MSE statistics for resources of VM2
4-3 Normalized prediction MSE statistics for resources of VM3
4-4 Normalized prediction MSE statistics for resources of VM4
4-5 Normalized prediction MSE statistics for resources of VM5
4-6 Best predictors of all the trace data
5-1 Performance feature list
5-2 SPECseis96 total cost ratio schedule for the eight performance features
5-3 Average phase prediction accuracy
5-4 Performance feature list of VM traces
5-5 Average phase prediction accuracy of the five VMs









LIST OF FIGURES

Figure

1-1 Structure of an autonomic element
1-2 Classification system representation
1-3 Virtual machine structure
1-4 VMPlant architecture
2-1 Sample of principal component analysis
2-2 k-nearest neighbor classification example
2-3 Application classification model
2-4 Performance feature space dimension reductions in the application classification process
2-5 Sample clustering diagrams of application classifications
2-6 Application class composition diagram
2-7 System throughput comparisons for ten different schedules
2-8 Application throughput comparisons of different schedules
3-1 Sample Bayesian network generated by feature selector
3-2 Feature selection model
3-3 Bayesian-network based feature selection algorithm for application classification
3-4 Average classification accuracy of 10 sets of test data versus number of features selected in the first experiment
3-5 Two-class test data distribution with the first two selected features
3-6 Five-class test data distribution with first two selected features
3-7 Comparison of distances between cluster centers derived from expert-selected and automatically selected feature sets
3-8 Training data clustering diagram derived from expert-selected and automatically selected feature sets
3-9 Classification results of benchmark programs
4-1 Virtual machine resource usage prediction prototype
4-2 Sample XML schema of the VM performance DB
4-3 Learning-aided adaptive resource predictor workflow
4-4 Learning-aided adaptive resource predictor dataflow
4-5 Best predictor selection for trace VM2_load15
4-6 Best predictor selection for trace VM2_PktIn
4-7 Best predictor selection for trace VM2_Swap
4-8 Best predictor selection for trace VM2_Disk
4-9 Predictor performance comparison (VM1)
4-10 Predictor performance comparison (VM2)
4-11 Predictor performance comparison (VM3)
4-12 Predictor performance comparison (VM4)
4-13 Predictor performance comparison (VM5)
5-1 Application resource demand phase analysis and prediction prototype
5-2 Resource allocation strategy comparison
5-3 Application resource demand phase prediction workflow
5-4 Phase analysis of SPECseis96 CPU_user
5-5 Phase analysis of WorldCup'98 BytesIn
5-6 Phase analysis of WorldCup'98 BytesOut









Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

LEARNING-AIDED SYSTEM PERFORMANCE MODELING IN SUPPORT OF
SELF-OPTIMIZED RESOURCE SCHEDULING IN
DISTRIBUTED ENVIRONMENTS

By

Jian Zhang

December 2007

Chair: Renato J. Figueiredo
Major: Electrical and Computer Engineering

With the goal of autonomic computing, it is desirable to have a resource scheduler

that is capable of self-optimization, which means that with a given high-level objective the

scheduler can automatically adapt its scheduling decisions to the changing workload. This

self-optimization capability poses challenges for system performance modeling because of the

increasing size and complexity of computing systems.

Our goals were twofold: to design performance models that can derive applications'

resource consumption patterns in a systematic way, and to develop performance prediction

models that can adapt to changing workloads. A novelty in the system performance

model design is the use of various machine learning techniques to efficiently deal with

the complexity of dynamic workloads based on monitoring and mining of historical

performance data. In the environments considered in this thesis, virtual machines (VMs)

are used as resource containers to host application executions because of their flexibility in

supporting resource provisioning and load balancing.

Our study introduced three performance models to support self-optimized scheduling

and decision-making. First, a novel approach is introduced for application classification

based on the Principal Component Analysis (PCA) and the k-Nearest Neighbor (k-NN)

classifier. It helps to reduce the dimensionality of the performance feature space and

classify applications based on extracted features. In addition, a feature selection model is









designed based on a Bayesian Network (BN) to systematically identify the feature subset,

which can provide optimal classification accuracy and adapt to changing workloads.

Second, an adaptive system performance prediction model is investigated based

on a learning-aided predictor integration technique. Supervised learning techniques are

used to learn the correlations between the statistical properties of the workload and the

best-suited predictors.

In addition to a one-step ahead prediction model, a phase characterization model is

studied to explore the large-scale behavior of applications' resource consumption patterns.

Our study provides novel methodologies to model system and application perfor-

mance. The performance models can self-optimize over time based on learning of historical

runs, and therefore adapt better to the changing workload and achieve better prediction

accuracy than traditional methods with static parameters.









CHAPTER 1
INTRODUCTION

The vision of autonomic computing [1] is to improve manageability of complex

IT systems to a far greater extent than current practice through self-configuration,

self-healing, self-optimization, and self-protection. To perform the self-configuration

and self-optimization of applications and associated execution environments and to realize

dynamic resource allocation, both resource awareness and application awareness are

important. In this context, there has been substantial research on effective scheduling

policies [2-6] with given resource and application specifications. While there are several

methods for obtaining resource specification parameters (e.g., CPU, memory, and disk

information from the /proc file system in Unix systems), application specification is

challenging to describe due to the following factors: 1) lack of knowledge and control of

the application source codes, 2) multi-dimensionality of application resource consumption

patterns, and 3) multi-stage resource consumption patterns of long-running applications.

Furthermore, the dynamics of system performance aggravate the difficulties of performance

description and prediction.

In this dissertation, an integrated framework consisting of algorithms and middleware

for resource performance modeling is developed. It includes system performance prediction

models and application resource demand models based on learning of historical executions.

A novelty of the performance model designs is their use of machine learning techniques

to efficiently and robustly deal with the complex dynamical phenomena of the workload

and resource availability. In addition, virtual machines (VMs) are used as resource

containers because they provide a flexible management platform that is useful for both the

encapsulation of application execution environments and the aggregation and accounting

of resources consumed by an application. In this context, resource scheduling becomes

a problem of how to dynamically allocate resources to virtual machines (which host

application executions) to meet the applications' resource demands.









The rest of this chapter is organized as follows: Section 1.1 gives an overview of

resource performance modeling. Sections 1.2, 1.3, and 1.4, briefly introduce autonomic

computing, machine learning, and virtual machine concepts.

1.1 Resource Performance Modeling

Performance is a key criterion in the design, procurement, and use of computer

systems. As such, achieving the highest performance for a given cost becomes the goal

of system designers. In the context of computing, a system could be any collection

of resources including hardware and software components. To measure the system

performance, a set of metrics, which refer to the criteria used to evaluate the performance

of the system, are selected. The following are the definitions of some commonly used

performance metrics [7]:

Response time: The interval between a user's request and the system response.

Throughput: The rate (requests per unit of time) at which requests can be serviced

by the system.

Utilization: The fraction of time the resource is busy servicing requests.

Reliability: The probability that the system will satisfactorily perform the task for

which it was designed or intended, for a specified time and in a specified environment.

Availability: The fraction of the time the system is available to service users' requests.
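To make these definitions concrete, here is a minimal Python sketch that computes the first four metrics from a hypothetical request log; the log format and its values are illustrative assumptions, not data from this work.

    # Hypothetical request log: (arrival, start_service, finish, succeeded).
    # Illustrative values only.
    requests = [
        (0.0, 0.0, 1.2, True),
        (0.5, 1.2, 1.8, True),
        (2.0, 2.0, 2.9, False),
    ]
    window = 3.0  # observation window in seconds

    # Response time: interval between a user's request and the system response.
    response_times = [finish - arrival for arrival, _, finish, _ in requests]
    # Throughput: requests serviced per unit of time.
    throughput = len(requests) / window
    # Utilization: fraction of time the resource is busy servicing requests.
    busy_time = sum(finish - start for _, start, finish, _ in requests)
    utilization = busy_time / window
    # Reliability: fraction of requests performed satisfactorily.
    reliability = sum(ok for _, _, _, ok in requests) / len(requests)

    print(response_times, throughput, utilization, reliability)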

In system procurement studies, the cost/performance ratio is commonly used as a

metric for comparing systems. Three techniques for performance evaluation are analytical

modeling, simulation, and measurement. Sometimes it is helpful to use two or more

techniques, either simultaneously or sequentially.

Computer system performance measurements involve monitoring the system while it

is being subjected to a particular workload. In order to perform meaningful measurements,

the workload should be carefully selected based on the services exercised by the workload,

the level of detail, representativeness, and timeliness. Since a real user environment is

generally not repeatable, it is necessary to study the real user environments, observe the

key characteristics, and develop a workload model that can be used repeatedly. This









process is called workload characterization. Once a workload model is available, the effect

of changes in the workload and system can be studied in a controlled manner by simply

changing the parameters of the model. Various workload characterization techniques such

as Principal Component Analysis (PCA) and classifications are used to characterize the

workloads under study in this work and will be discussed in the following chapters. In

addition, various machine learning techniques are used to learn the parameters of the

performance model from historical data.

1.2 Autonomic Computing

With technology advances, the number of computing devices keeps increasing.

Management complexity grows with the increasing device volumes, increasing demand

for IT professionals and the corresponding labor cost. With the motivation to free IT

administrators from details of system operation and maintenance while providing 24 x 7

service, IBM started an initiative in 2001 which has been termed autonomic computing [1].

The essence of autonomic computing is to enable self-managed systems, which includes the

following aspects:

Self-configuration: Automated system configuration in accordance with high-level

policies.

Self-healing: Automated system fault detection, diagnosis, and recovery (including

both hardware and software).

Self-optimization: Continuous system and component performance and efficiency

improvement.

Self-protection: Automated system defense against malicious attacks or cascading

failures.

Autonomic computing presents challenges and opportunities in various areas such

as learning and optimization theory, automated statistical learning, and behavioral

abstraction and models [8]. This dissertation addresses some of the challenges in









































Figure 1-1. Structure of an autonomic element.


the application resource performance modeling to support self-configuration and

self-optimization of application execution environments.

Generally, an autonomic system is an interactive collection of autonomic elements:

individual system constituents that contain resources and deliver services to humans and

other autonomic elements. As Figure 1-1 shows, an autonomic element will typically

consist of one or more managed elements coupled with a single autonomic manager that

controls and represents them. The managed element could be a hardware resource, such

as storage, a CPU, or a software resource, such as a database, or a directory service, or

a large legacy system [1]. The monitoring process collects the performance data of the









managed element. Inference can be used to analyze the system performance and plan

accordingly. At last, it executes the plan based on the decisions made. In this work,

machine learning is used to gain the knowledge of the system performance under different

circumstances over historical runs.
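The monitor-analyze-plan-execute cycle described above can be summarized as a simple control loop. The sketch below is schematic and assumes hypothetical read_metrics() and scale_up() operations on the managed element, as well as an illustrative overload threshold; it is not an implementation from this work.

    import time

    def autonomic_manager(element, interval=5.0):
        """Schematic monitor-analyze-plan-execute loop of an autonomic manager."""
        history = []
        while True:
            # Monitor: collect performance data of the managed element.
            snapshot = element.read_metrics()      # hypothetical, e.g. {'cpu': 0.92}
            history.append(snapshot)
            # Analyze: infer the system state from current and historical data.
            overloaded = snapshot['cpu'] > 0.9     # illustrative threshold
            # Plan: choose an action according to the high-level objective.
            action = 'add_capacity' if overloaded else 'no_op'
            # Execute: carry out the plan on the managed element.
            if action == 'add_capacity':
                element.scale_up()                 # hypothetical actuator
            time.sleep(interval)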

1.3 Learning

The science of learning plays a key role in the fields of statistics, data mining, and

artificial intelligence, intersecting with areas of engineering and other disciplines. With

advances in computer technology, we currently have the ability to collect, store, and

process large amounts of data, as well as to access them from geographically distributed

locations over computer networks. Machine Learning is programming computers to

optimize a performance criterion using example data or past experience. It uses the theory

of statistics in building mathematical models.

Machine learning is a natural solution to automation. It avoids knowledge-intensive

model building and reduces the reliance on expert knowledge. In addition, it can deal

with complex dynamical phenomena and enable the system to adapt to the changing

environments.

Traditionally there are three types of learning: supervised learning,

unsupervised learning, and reinforcement learning.

1.3.1 Supervised Learning

In the context of supervised learning, a learning system is a computer program that

makes decisions based on the accumulated experience contained in successfully solved

cases [9]. "Learniing consists of choosing or adapting parameters within the model

structure that work best on the samples at hand and others like them. One of the most

prominent and basic learning tasks is classification or prediction, which is used extensively

in this work. For classification problems, a learning system can be viewed as a higher-level

system that helps build the decision-making system itself, called the classifier. Figure 1-2

illustrates the structure of a classification system and its learning process.









The set of potential observations relevant to a particular problem are called features,

which also go by a host of other names, including attributes and variables. Only correctly

solved cases will be used in building the specific classifier, which is called the training

phase of the classification. The pattern of feature values for each case is associated with

the correct classification or decision to form the sample cases, a set which is also called

the training data. Thus, learning in any of these systems can be viewed as a process of

generalizing these observed empirical associations subject to the constraints imposed by

the chosen classifier model. During the testing phase, the customized classifier is used

to associate a specific pattern of observations with a specific class. The learning method

introduced above is a form of supervised learning, which learns by being presented with

preclassified training data.

1.3.2 Unsupervised Learning

Unsupervised learning methods can learn without any human intervention. This

method is particularly useful in situations where data need to be classified or clustered

into a set of classifications but where the classifications are not known in advance. In

other words, it fits the model to observations. It differs from supervised learning by the

fact that there is no a priori output.

1.3.3 Reinforcement Learning

Reinforcement learning refers to a class of problems in machine learning which

postulate an agent exploring an environment in which the agent perceives its current

state and takes actions. A system that uses reinforcement learning is given a positive

reinforcement when it performs correctly and a negative reinforcement when it performs

incorrectly. However, the information of why and how the learning system performed

correctly is not provided to it.

Reinforcement learning algorithms attempt to find a policy for maximizing cumulative

reward for the agent over the course of the problem. The environment is typically












Figure 1-2. Classification system representation
[Diagram: sample cases, given as patterns of feature values paired with decisions in a case format of features and classes, are used with a general classifier model by the learning system during training; during testing, a case to be classified is passed to the application-specific classifier, which produces the decision on the class assignment of the case.]
During the training phase, labeled sample cases are used to derive the unknown parameters of the classifier model. During the testing phase, the customized classifier is used to associate a specific pattern of observations with a specific class.


formulated as a finite-state Markov decision process (MDP), and reinforcement learning

algorithms for this context are highly related to dynamic programming techniques.

1.3.4 Other Learning Paradigms

In addition to the above three traditional learning methods, there are some other

learning paradigms:

Relational Learning / Structured Prediction: It predicts structure on sets of objects.

For example, it is trained on genome/proteome data with known relationships and can

predict graph structure on new sets of genomes/proteomes.









Semi-Supervised Learning: Given a mix of labeled and unlabeled data, it can get

a better predictor than one trained on labeled data alone.

Transductive Learning: It trains a classifier to give the best predictor on a specific set of

test data.

Active Learning: It chooses or constructs optimal samples to train on next with the

objective of achieving the best predictor with the fewest labeled samples.

Nonlinear Dimensionality Reduction: It learns underlying complex manifolds of data

in high dimensional spaces.

In this work, various learning techniques are used to model the application resource

demand and system performance. These models can help the system adapt to the

changing workload and achieve higher performance.

1.4 Virtual Machines

Virtual machines were first developed and used in the 1960s, with the best-known

example being IBM's VM/370 [10]. A "
independent, isolated operating systems (guest VM) to run on one physical machine (host

server), efficiently multiplexing system resources of the host machine [10].

A virtual-machine monitor (VMM) is a software layer that runs on a host platform

and provides an abstraction of a complete computer system to higher-level software.

The abstraction created by the VMM is called a virtual machine. Figure 1-3 shows the

structure of virtual machines.

1.4.1 Virtual Machine Characteristics

Virtual machines can greatly simplify system management (especially in environments

such as Grid computing) by raising the level of abstraction from that of the operating

system user to that of the virtual machine to the benefit of the resource providers and

users [11]. The following characteristics of virtual machines make them a highly flexible

and manageable application execution platform:











Figure 1-3. Virtual machine structure
[Diagram: guest applications run on a guest operating system, which runs on the virtual-machine monitor, which runs on the host hardware.]
A virtual-machine monitor is a software layer that runs on a host platform and provides an abstraction of a complete computer system to higher-level software. The host platform may be the bare hardware (Type I VMM) or a host operating system (Type II VMM). The software running above the virtual-machine abstraction is called guest software (operating system and applications).


Configurability: Virtual machines are highly configurable in terms of hardware

(memory, hard disk, and devices) as well as software (operating system, user applications,
and data). It is possible to use on-demand provisioning to adapt the machine configura-

tions to dynamic workloads. For example, recent technical advances of hotplug memory

[12] can support dynamic memory extension of a guest VM without shutting down the

system.

Security: Virtual machines allow multiple operating systems (OS) to run on a

physical machine in a secure and isolated fashion. High utilization of physical resources

can be achieved by sharing them among multiple virtual machines.

Checkpointing: Virtual machine state can be easily encapsulated into a set of files, which

are called VM checkpoints. In case of a fault, application execution can be resumed from

the last checkpoint instead of restarted from the beginning. This checkpoint capacity can









help to shorten fault recovery times, especially for long-running applications, and maintain

Service Level Agreements (SLA).

Migration: With the checkpoint capability, migrating a virtual machine is simplified,

and can be achieved by moving a set of files across servers. VM migration can be used

to optimize server load distribution dynamically. Recent technical advances have enabled

instant VM migrations. For example, Xen virtual machines can be migrated in the

order of seconds, and with millisecond downtimes [13]. VMware's VMotion can support

migration with zero down time [14]. Techniques based on the Virtual File System (VFS) have

been studied in [15] to support VM migration across Wide-Area Networks (WANs).

1.4.2 Virtual Machine Plant

VMPlant Grid Service [16] handles virtual machine creation and hosting for classic

virtual machines (e.g. VMware [17]) and user-mode Linux platforms (e.g., UML [18])

via dynamic cloning, instantiation and configuration. The VMPlant has three major

components: Virtual Machine Production Center (VMPC), Virtual Machine Warehouse

(VMWH) and Virtual Machine Information System (VMIS). The VMPC handles the

virtual machine's creation, configuration and destruction. It employs a configuration

pattern recognition technique to identify opportunities to apply the pre-cached virtual

machine state to accelerate the machine configuration process. The VMWH stores the

pre-cached machine images, monitors them and their host server's performance and

performs the maintenance activity. The VMIS stores the static and dynamic information

of the virtual machines and their host server. The architecture of the VMPlant is shown in

Figure 1-4.

The VMPlant provides an API to the VMShop for virtual machine creation, deconstruction,

and monitoring. The VMShop has three major components: VMCreator, VMCollector, and

VMReporter. The VMCreator handles the virtual machines' creation; the VMCollector

handles the machines' deconstruction and suspension; the VMReporter handles

information requests. In combination with a virtual machine shop service, VMPlants










Figure 1-4. VMPlant architecture


deployed across physical resources of a site allow clients (users and/or middleware

acting on their behalf) to instantiate and control client-customized virtual execution

environments. The plant can be integrated with virtual networking techniques (such as

VNET [19]) to allow client-side network management. Customized, application-specific

VMs can be defined in VMPlant with the use of a directed acyclic graph (DAG)

configuration. VM execution environments defined within this framework can then be

cloned and dynamically instantiated to provide a homogeneous application execution

environment across distributed resources.

In the context of the VMPlant, an application can be scheduled to run in a specific

virtual machine, which is called an applicationVM. Therefore, the system performance metrics

collected from the applicationVM can reflect and summarize the resource consumption of

the application.









CHAPTER 2
APPLICATION CLASSIFICATION BASED ON MONITORING AND LEARNING OF
RESOURCE CONSUMPTION PATTERNS

Application awareness is an important factor of efficient resource scheduling. This

chapter introduces a novel approach for application classification based on the Principal

Component Analysis (PCA) and the k-Nearest Neighbor (k-NN) classifier. This approach

is used to assist scheduling in heterogeneous computing environments. It helps to reduce

the dimensionality of the performance feature space and classify applications based on

extracted features. The classification considers four dimensions: CPU-intensive, I/O

and paging-intensive, network-intensive, and idle. Application class information and the

statistical abstracts of the application behavior are learned over historical runs and used to

assist multi-dimensional resource scheduling.

2.1 Introduction

Heterogeneous distributed systems that serve application needs from diverse users

face the challenge of providing effective resource scheduling to applications. Resource

awareness and application awareness are necessary to exploit the heterogeneities of

resources and applications to perform adaptive resource scheduling. In this context, there

has been substantial research on effective scheduling policies [2-4] with given resource and

application specifications. There are several methods for obtaining resource specification

parameters (e.g., CPU, memory, disk information from /proc in Unix systems). However,

application specification is challenging to describe because of the following factors:

Numerous types of applications: In a closed environment where only a limited number

of applications are running, it is possible to analyze the source codes of each application

or even plug in codes to indicate the application execution stages for effective resource

scheduling. However, in an open environment such as in Grid computing, the growing

number of applications and lack of knowledge or control of the source codes present

the necessity of a general method of learning application behaviors without source code

modifications.









Multi-dimensionality of application resource consumption: An application's execution

resource requirement is often multi-dimensional. That is, different applications may stretch

the use of CPU, memory, hard disk or network bandwidth to different degrees. The

knowledge of which kind of resource is the key component in the resource consumption

pattern can assist resource scheduling.

Multi-stage applications: There are cases where long-running scientific applications

exhibit multiple execution stages. Different execution stages may stress different kinds of

resources to different degrees, hence characterizing an application requires knowledge of

its dynamic run-time behavior. The identification of such stages presents opportunities to

exploit better matching of resource availability and application resource requirement across

different execution stages and across different nodes. For instance, with process migration

techniques [20] [21] it is possible to migrate an application during its execution for load

balancing.

The above characteristics of grid applications present a challenge to resource

scheduling: How to learn and make use of an application's multi-dimensional resource

consumption patterns for resource allocation? This chapter introduces a novel approach

to solve this problem: application classification based on the feature selection algorithm,

Principal Component Analysis (PCA), and K-Nearest Neighbor (k-NN) classifier [22][23].

The PCA is applied to reduce the dimensionality of application performance metrics, while

preserving the maximum amount of variance in the metrics. Then, the k-Nearest Neighbor

algorithm is used to categorize the application execution states into different classes

based on the application's resource consumption pattern. The learned application class

information is used to assist the resource scheduling decision-making in heterogeneous

computing environments.

The VMPlant service introduced in Section 1.4.2 provides automated cloning

and configuration of application-centric Virtual Machines (VMs). Problem-solving

environments such as In-VIGO [24] can submit requests to the VMPlant service, which









is capable of cloning an application-specific virtual machine and configuring it with an

appropriate execution environment. In the context of VMPlant, the application can be

scheduled to run on a dedicated virtual machine, which is hosted by a shared physical

machine. Within the VM, system performance metrics such as CPU load, memory usage,

I/O activity and network bandwidth utilization, reflect the application's resource usage.

The classification system described in this chapter leverages the capability of

summarizing application performance data by collecting system-level data within a

VM, as follows. During the application execution, snapshots of performance metrics are

taken at a desired frequency. A PCA processor analyzes the performance snapshots and

extracts the key components of the application's resource usage. Based on the extracted

features, a k-NN classifier categorizes each snapshot into one of the following classes:

CPU-intensive, IO-intensive, memory-intensive, network-intensive and idle.

By using this system, resource scheduling can be based on a comprehensive diagnosis

of the application resource utilization, which conveys more information than CPU load

in isolation. Experiments reported in this chapter show that the resource scheduling

facilitated with application class composition knowledge can achieve better average system

throughput than scheduling without the knowledge.

The rest of the chapter is organized as follows: Section 2.2 introduces the PCA and

the k-NN classifier in the context of application classification. Section 2.3 presents the

classification model and implementation. Section 2.4 presents and discusses experimental

results of classification performance measurements. Section 2.5 discusses related work.

Conclusions and future work are discussed in Section 2.6.

2.2 Classification Algorithms

Application behavior can be defined by its resource utilization, such as CPU load,

memory usage, network and disk bandwidth utilization. In principle, the more information

a scheduler knows about an application, the better scheduling decisions it can make.

However, there is a tradeoff between the complexity of decision-making process and the









optimality of the decision. The key challenge here is how to find a representation of the

application, which can describe multiple dimensions of resource consumption, in a simple

way. This section describes how the pattern classification techniques, the PCA and the

K-NN classifier, are applied to achieve this goal.

A pattern classification system consists of pre-processing, feature extraction,

classification, and post-processing. The pre-processing and feature extraction are known

to significantly affect the classification, because the error caused by wrong features may

propagate to the next steps and stays predominant in terms of the overall classification

error. In this work, a set of application performance metrics are chosen based on expert

knowledge and the principle of increasing relevance and reducing redundancy [25].

2.2.1 Principal Component Analysis

Principal Component Analysis (PCA) [22] is a linear transformation representing

data in a least-square sense. It is designed to capture the variance in a dataset in terms of

principal components and reduce the dimensionality of the data. It has been widely used

in data analysis and compression.

When a set of vector samples are represented by a set of lines passing through

the mean of the samples, the best linear directions result in eigenvectors of the scatter

matrix, the so-called "principal components," as shown in Figure 2-1. The corresponding

eigenvalues represent the contribution to the variance of data. When the k largest

eigenvalues of n principal components are chosen to represent the data, the dimensionality

of the data reduces from n to k.

Principal component analysis is based on the statistical representation of a random

variable. Suppose we have a random vector population x, where


x = (x_1, ..., x_n)^T    (2-1)


and the mean of that population is denoted by


μ_x = E{x}    (2-2)
















Figure 2-1. Sample of principal component analysis
[Scatter plot of two-dimensional samples (Dimension 1 vs. Dimension 2) with the principal component direction overlaid.]


and the covariance matrix of the same data set is


C_x = E{(x - μ_x)(x - μ_x)^T}    (2-3)


The components of C_x, denoted by c_ij, represent the covariances between the random

variable components x_i and x_j. The component c_ii is the variance of the component x_i.

From a sample of vectors x_1, ..., x_M, we can calculate the sample mean and the

sample covariance matrix as the estimates of the mean and the covariance matrix.

The eigenvectors e_i and the corresponding eigenvalues λ_i can be obtained by solving

the equation


C_x e_i = λ_i e_i,    i = 1, ..., n    (2-4)









For simplicity we assume that the λ_i are distinct. These values can be found, for example,

by finding the solutions of the characteristic equation


|C_x - λI| = 0    (2-5)


where I is the identity matrix having the same order as C_x and |·| denotes

the determinant of the matrix. If the data vector has n components, the characteristic

equation becomes of order n.

By ordering the eigenvectors in the order of descending eigenvalues (largest first), one

can create an ordered orthogonal basis with the first eigenvector having the direction of

largest variance of the data. In this way, we can find directions in which the data set has

the most significant amounts of energy.

Suppose one has a data set of which the sample mean and the covariance matrix have

been calculated. Let A be a matrix consisting of eigenvectors of the covariance matrix as

the row vectors.

By transforming a data vector x, we get


y = A(x - μ_x)    (2-6)


which is a point in the orthogonal coordinate system defined by the eigenvectors.

Components of y can be seen as the coordinates in the orthogonal basis. We can

reconstruct the original data vector x from y by


x = A^T y + μ_x    (2-7)


using the property of an orthogonal matrix, A^(-1) = A^T. Here A^T is the transpose of the

matrix A. The original vector x was projected on the coordinate axes defined by the

orthogonal basis. The original vector was then reconstructed by a linear combination of

the orthogonal basis vectors.









Instead of using all the eigenvectors of the covariance matrix, we may represent the

data in terms of only a few basis vectors of the orthogonal basis. If we denote the matrix

having the K first eigenvectors as rows by AK, we can create a similar transformation as

seen above


y = A_K(x - μ_x)    (2-8)


and


x = A_K^T y + μ_x    (2-9)


It means that we project the original data vector on the coordinate axes having the

dimension K and transform the vector back by a linear combination of the basis

vectors. This method minimizes the mean-square error between the data and the

representation with given number of eigenvectors.

If the data is concentrated in a linear subspace, this method provides a way to

compress data without losing much information and simplifying the representation. By

picking the eigenvectors having the largest eigenvalues we lose as little information as

possible in the mean-square sense.
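For reference, a minimal NumPy sketch of the transformation defined by Eq. 2-1 through 2-9 is given below. It illustrates the method itself, not the implementation used in this work; the sample data are synthetic.

    import numpy as np

    def pca_transform(X, K):
        """PCA per Eq. 2-1 to 2-9. X is an (M, n) array holding M samples of an
        n-dimensional random vector x; returns the projection y onto the K
        eigenvectors with the largest eigenvalues and the reconstruction of x."""
        mu = X.mean(axis=0)                    # sample mean (Eq. 2-2)
        C = np.cov(X, rowvar=False)            # sample covariance matrix (Eq. 2-3)
        eigvals, eigvecs = np.linalg.eigh(C)   # eigenpairs of C_x (Eq. 2-4)
        order = np.argsort(eigvals)[::-1]      # descending eigenvalues (largest first)
        A_K = eigvecs[:, order[:K]].T          # first K eigenvectors as rows
        Y = (X - mu) @ A_K.T                   # y = A_K (x - mu_x)    (Eq. 2-8)
        X_hat = Y @ A_K + mu                   # x ~ A_K^T y + mu_x    (Eq. 2-9)
        return Y, X_hat

    # Example: 200 correlated 8-dimensional samples reduced to K = 2 dimensions.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))
    Y, X_hat = pca_transform(X, K=2)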

2.2.2 k-Nearest Neighbor Algorithm

K-Nearest Neighbor classifier (k-NN) is a supervised learning algorithm where the

result of a new instance query is classified based on the majority of the k nearest neighbors' categories

[26]. It has been used in many applications in the field of data mining, statistical pattern

recognition, image processing, and many others. The purpose of this algorithm is to

classify a new object based on attributes and training samples. The classifier does not

fit any model and is based only on memory. Given a query point, we find the k objects

(training points) closest to the query point. The k-NN classifier decides

the class by considering the votes of k (an odd number) nearest neighbors. The nearest










Figure 2-2. k-nearest neighbor classification example
[Scatter plot of Class 1 and Class 2 training points with their class centroids and a test data point X; d1, d2, and d3 denote the distances from X to the class centroids. If min(|d1 - d2|, |d1 - d3|, |d2 - d3|) > γ (a predefined threshold), test data X is qualified as training data with the class whose centroid gives min(d1, d2, d3).]

neighbor is picked as the training data geometrically closest to the test data in the feature

space as illustrated in Figure 2-2.
In this work, a vector of the application's resource consumption snapshots is used

to represent the application. Each snapshot consists of a chosen set of performance

metrics. The PCA is used to preprocess the raw data to independent features for the

classifier. Then, a 3-NN classifier is used to classify each snapshot. The majority vote of

the snapshots' classes is used to represent the class of the applications: CPU-intensive,

I/O and paging-intensive, network-intensive, or idle. A machine with no load except for

background load from system daemons is considered to be in the idle state.
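The snapshot-level 3-NN vote and the application-level majority vote just described can be sketched as follows; this is a minimal illustration assuming Euclidean distance in the PCA-reduced feature space, not the exact implementation of this work.

    import numpy as np
    from collections import Counter

    def knn_classify(x, train_X, train_y, k=3):
        """Classify one snapshot's feature vector x by majority vote of its k
        nearest training points (Euclidean distance assumed)."""
        dists = np.linalg.norm(train_X - x, axis=1)
        nearest = np.argsort(dists)[:k]
        votes = Counter(train_y[i] for i in nearest)
        return votes.most_common(1)[0][0]

    def classify_application(snapshots, train_X, train_y, k=3):
        """The application's class is the majority vote over the classes of its
        snapshots, e.g. 'CPU-intensive', 'I/O and paging-intensive',
        'network-intensive', or 'idle'."""
        labels = [knn_classify(s, train_X, train_y, k) for s in snapshots]
        return Counter(labels).most_common(1)[0][0]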

2.3 Application Classification Framework

The application classifier is composed of a performance profiler, a classification center,

and an application database (DB) as shown in Figure 2-3. In addition, a monitoring





































Figure 2-3. Application classification model
The performance profiler collects performance metrics of the target application node. The classification center classifies the application using extracted key components and performs statistical analysis of the classification results. The application DB stores the application class information. (m is the number of snapshots taken in one application run, t0/t1 are the beginning/ending times of the application execution, VM IP is the IP address of the application's host machine.)


system is used to sample the system performance of a computing node running an

application of interest.

2.3.1 Performance Profiler

The performance profiler is responsible for collecting performance data of the

application node. It interfaces with the resource manager to receive data collection

instructions, including the target node and when to start and stop.









The performance profiler can be installed on any node with access to the performance

metric information of the application node. In our implementation, the Ganglia [27]

distributed monitoring system is used to monitor application nodes. The performance

sampler takes snapshots of the performance metrics collected by Ganglia at a predefined

frequency (currently, 5 seconds) between the application's starting time t0 and ending

time t1. Since Ganglia uses multicast based on a listen / announce protocol to monitor the

machine state, the collected samples consist of the performance data of all the nodes in a

subnet. The performance filter extracts the snapshots of the target application for future

processing. At the end of profiling, an application performance data pool is generated.

The data pool consists of a set of n-dimensional samples A_{n×m} = (a_1, a_2, ..., a_m), where

m = (t1 - t0)/d is the number of snapshots taken in one application run and d is the

sampling time interval. Each sample a_i consists of n performance metrics, which include

all the default 29 metrics monitored by Ganglia and the 4 metrics that we added based

on the need of classification, including the number of I/O blocks read from/written to

disk, and the number of memory pages swapped in/out. A program was developed to

collect these four metrics (using vmstat) and the metrics were added to the metric list of

Ganglia's gmond.
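A minimal sketch of this sampling step is shown below. It reads the four added metrics directly from vmstat rather than through Ganglia, and the vmstat column positions assumed here follow the common Linux layout; both are simplifying assumptions.

    import subprocess
    import time

    def vmstat_snapshot():
        """One snapshot of the four metrics added to gmond in this work:
        pages swapped in/out (si, so) and I/O blocks in/out (bi, bo)."""
        out = subprocess.run(['vmstat', '1', '2'], capture_output=True, text=True)
        fields = out.stdout.strip().splitlines()[-1].split()
        si, so, bi, bo = (int(f) for f in fields[6:10])  # assumed column layout
        return {'Swap_In': si, 'Swap_Out': so, 'IO_BI': bi, 'IO_BO': bo}

    def profile(t0, t1, d=5):
        """Build the application performance data pool by sampling every d
        seconds between the application's start time t0 and end time t1."""
        pool = []
        time.sleep(max(0.0, t0 - time.time()))  # wait for the application start
        while time.time() < t1:
            pool.append(vmstat_snapshot())
            time.sleep(d)
        return pool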

2.3.2 Classification Center

The classification center has three components: the data preprocessor, the PCA

processor, and the classifier. To reduce the computation intensity and improve the

classification accuracy, it employs the PCA algorithm to extract the principal components

of the resource usage data collected and then performs classification based on extracted

data of the principal components.

2.3.2.1 Data preprocessing based on expert knowledge

Based on the expert knowledge, we identified 4 pairs of performance metrics as

shown in Table 2-1. Each pair of the performance metrics correlates to the resource

consumption behavior of the specific application class and has limited redundancies.










Figure 2-4. Performance feature space dimension reductions in the application
classification process.
m: the number of snapshots taken in one application run,
n: the number of performance metrics,
A_{n×m}: all performance metrics collected by the monitoring system,
A′_{p×m}: the selected relevant performance metrics after the zero-mean and
unit-variance normalization,
B_{q×m}: the extracted key component metrics,
C_{1×m}: the class vector of the snapshots,
Class: the application class, which is the majority vote of the snapshots' classes.


For example, performance metrics of CPU_System and CPU_User are correlated to

CPU-intensive applications; Bytes_In and Bytes_Out are correlated to Network-intensive

applications; IO_BI and IO_BO are correlated to the I/O-intensive applications; Swap_In

and Swap_Out are correlated to Memory-intensive applications. The data preprocessor

extracts these eight metrics of the target application node from the data pool based on our

expert knowledge. Thus it reduces the dimension of the performance metric space from n = 33

to p = 8 and generates A′_{p×m} as shown in Figure 2-4. In addition, the preprocessor also

normalizes the selected metrics to zero-mean and unit-variance.

2.3.2.2 Feature selection based on principal component analysis

The PCA processor takes the data collected for the performance metrics listed in

Table 2-1 as inputs. It conducts the linear transformation of the performance data and

selects the principal components based on the predefined minimal fraction variance. In

our implementation, the minimal fraction variance was set to extract exactly two principal

components. Therefore, at the end of processing, the data dimension gets further reduced

from p = 8 to q = 2 and the vector B_{q×m} is generated, as shown in Figure 2-4.
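The preprocessing and PCA steps can be summarized by the following Python sketch. It is illustrative only (the prototype's classification center was implemented in Matlab), and the variance threshold value is an assumption tuned to yield q = 2 components:

import numpy as np

# Normalize the p selected metrics to zero mean / unit variance, then
# keep the smallest number of principal components whose cumulative
# variance fraction exceeds the threshold.
def reduce_features(A, var_fraction=0.9):
    # A: p x m matrix; one row per metric, one column per snapshot.
    # Assumes every metric has nonzero variance over the run.
    A = (A - A.mean(axis=1, keepdims=True)) / A.std(axis=1, keepdims=True)
    vals, vecs = np.linalg.eigh(np.cov(A))     # eigenvalues ascending
    order = np.argsort(vals)[::-1]             # sort descending
    vals, vecs = vals[order], vecs[:, order]
    explained = np.cumsum(vals) / vals.sum()
    q = int(np.searchsorted(explained, var_fraction) + 1)
    return vecs[:, :q].T @ A                   # B: q x m component matrix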









Table 2-1. Performance metric list
Performance Metrics   Description
CPU_System / User     Percent CPU system / user
Bytes_In / Out        Number of bytes per second into / out of the network
IO_BI / BO            Blocks sent to / received from a block device (blocks/s)
Swap_In / Out         Amount of memory swapped in / out from / to disk (kB/s)


2.3.2.3 Training and classification

The 3-Nearest Neighbor classifier is used for the application classification in our

implementation. It is trained by a set of carefully chosen applications based on expert

knowledge. Each application represents the key performance characteristics of a class. For

example, an I/O benchmark program, PostMark [28], is used to represent the IO-intensive

class. SPECseis96 [29], a scientific computing intensive program, is used to represent

the CPU-intensive class. A synthetic application, Pagebench, is used to represent the

Paging-intensive class. It initializes and updates an array whose size is bigger than the

memory of the VM, thereby inducing frequent paging activity. Ettcp [30], a benchmark

that measures the network throughput over TCP or UDP between two nodes, is used as

the training application of the Network-intensive class. The performance data of all these

four applications and the idle state are used to train the classifier. For each test data, the

trained classifier calculates its distance to all the training data. The 3-NN classification

identifies only three training data sets with the shortest distance to the test data. Then

the test data's class is decided by the majority vote of the three nearest neighbors.
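A compact Python sketch of this step follows (for illustration; train_X and train_y stand for the labeled principal-component training data described above). It also shows the post-processing vote over snapshots described in the next section:

from collections import Counter
import numpy as np

# 3-NN classification of each snapshot, followed by the majority vote
# over snapshots that yields the application class and its composition.
def knn_classify(train_X, train_y, test_X, k=3):
    snapshot_classes = []
    for x in test_X:
        dist = np.linalg.norm(train_X - x, axis=1)  # Euclidean distances
        nearest = np.argsort(dist)[:k]              # k closest training points
        votes = Counter(train_y[i] for i in nearest)
        snapshot_classes.append(votes.most_common(1)[0][0])
    composition = Counter(snapshot_classes)         # per-class snapshot counts
    app_class = composition.most_common(1)[0][0]    # majority vote
    return app_class, composition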

2.3.3 Post Processing and Application Database

At the end of classification, an m-dimensional class vector C_{1×m} = (c_1, c_2, …, c_m)

is generated. Each element of the vector C_{1×m} represents the class of the corresponding

application performance snapshot. The majority vote of the snapshot classes determines

the application Class. The complete performance data dimension reduction process is

shown in Figure 2-4. In addition to a single value (Class) the application classifier also









outputs class composition, which can be used to support application cost models (Section

4.4). The post processed classification results together with the corresponding execution

time (t1 − t0) are stored in the application database and can be used to assist future

resource scheduling.

2.4 Experimental Results

We have implemented a prototype for application classification including a Perl

implementation of the performance profiler and a Matlab implementation of the

classification center. In addition, Ganglia was used to monitor the working status of

the virtual machines. This section evaluates our approach from the following three aspects:

the classification ability, the scheduling decision improvement and the classification cost.

2.4.1 Classification Ability

The application class set in this experiment has four classes: CPU-intensive, I/O and

paging-intensive, network-intensive, and idle. Applications of the I/O and paging-intensive

class can be further divided into two groups based on whether or not they have

substantial memory-intensive activities. Various synthetic and benchmark programs,

scientific computing applications and user interactive applications are used to test

the classification ability. These programs represent typical application behaviors of

their classes. Table 2-2 summarizes the set of applications used as the training and the

testing applications in the experiments [28-38]. The 3-NN classifier was trained with the

performance data collected from the executions of the training applications highlighted in

the table. All the application executions were hosted by a VMware GSX virtual machine

(VM1). The host server of the virtual machine was an Intel(R) Xeon(TM) dual-CPU

1.80GHz machine with 512KB cache and 1GB RAM. In addition, a second virtual

machine with the same specification was used to run the server applications of the network

benchmarks.














Table 2-2. Training and testing applications used in the classification experiments.


Initially the performance profiler collected data of all the thirty-three (n = 33)

performance metrics once every five seconds (d = 5) during the application execution.

Then the data preprocessor extracted the data of the eight (p = 8) metrics listed in

Table 2-1 based on the expert knowledge of the correlation between these metrics and the

application classes. After that, the PCA processor conducted the linear transformation of

the performance data and selected principal components based on the minimal fraction

variance defined. In this experiment, the variance contribution threshold was set to extract

two (q = 2) principal components. It helps to reduce the computational requirements of

the classifier. Then, the trained 3-NN classifier conducts classification based on the data of

the two principal components.

The training data's class clustering diagram is shown in Figure 2-5 (a). The diagram

shows a PCA-based two-dimensional representation of the data corresponding to the five

classes targeted by our system. After being trained with the training data, the classifier

classifies the remaining benchmark programs shown in Table 2-2. The classifier provides

outputs in two kinds of formats: the application class-clustering diagram, which helps to

visualize the classification results, and the application class composition, which can be

used to calculate the unit application cost.

Figure 2-5 shows the sample clustering diagrams for three test applications. For

example, the interactive VMD application (Figure 2-5(d)) shows a mix of the idle class

when user is not interacting with the application, the I/O-intensive class when the user

is uploading an input file, and the Network-intensive class while the user is interacting

with the GUI through a VNC remote display. Table 2-3 summarizes the class compositions

of all the test applications. Figure 2-6 visualizes the class composition of some sample

benchmark programs. These classification results match the class expectations gained from

empirical experience with these programs. They are used to calculate the unit application

cost shown in section 4.4.































Figure 2-5. Sample clustering diagrams of application classifications. A) Training
data: mixture. B) SimpleScalar: CPU-intensive. C) Autobench: Network-intensive.
D) VMD: Interactive. Principal Components 1 and 2 are the principal component
metrics extracted by PCA.


Table 2-3. Class compositions of the test applications.


In addition, the experimental data also demonstrate the impact of changing execution

environment configurations on the application's class composition. For example, in

Table 2-3 when SPECseis96 with medium size input data was executed in VM1 with

256MB memory (SPECseis96_A), it was classified as a CPU-intensive application. In the

SPECseis96_B experiment, the smaller physical memory (32MB) resulted in increased

paging and I/O activity. The increased I/O activity is due to the fact that less physical

memory is available to the O/S buffer cache for I/O blocks. The buffer cache size at run

time was observed to be as small as 1MB in SPECseis96_B, and as large as 200MB in

SPECseis96_A. In addition, the execution time increased from 291 minutes 42

seconds in the first case to 426 minutes 58 seconds in the second case.

Similarly, in the experiments with PostMark, different execution environment

configurations changed the application's resource consumption pattern from one class to

another. Table 2-3 shows that if a local file directory was used to store the files to be read

and written during the program execution, the PostMark benchmark showed the resource

consumption pattern of the I/O-intensive class. In contrast, with an NFS mounted file

directory, it (PostMark_NFS) was turned into a Network-intensive application.

2.4.2 Scheduling Performance Improvement

Two sets of experiments are used to illustrate the performance improvement that a

scheduler can achieve with the knowledge of application class. These experiments were

performed on 4 VMware GSX 2.5 virtual machines with 256MB memory each. One of

these virtual machines (VM1) was hosted on an Intel(R) Xeon(TM) dual-CPU 1.80GHz

machine with 512KB cache and 1GB RAM. The other three (VM2, VM3, and VM4) were

hosted on an Intel(R) Xeon(TM) dual-CPU 2.40GHz machine with 512KB cache and 4GB

RAM. The host servers were connected by Gigabit Ethernet.

The first set of experiments demonstrates that the application class information can

help the scheduler to optimize resource sharing among applications running in parallel to

improve system throughput and reduce throughput variances. In the experiments, three










Figure 2-6. Application class composition diagram


applications SPECseis96 (S) with small data size, PostMark (P) with local file directory

and NetPIPE Client (N) were selected, and three instances of each application were

executed. The scheduler's task was to decide how to allocate the nine application instances

to run on the 3 virtual machines (VM1, VM2 and VM3) in parallel, each of which hosted

3 jobs. The VM4 was used to host the NetPIPE server. There are ten possible schedules

available, as shown in Figure 2-7.
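The count of ten can be checked mechanically; the short Python sketch below (purely illustrative) enumerates the distinct ways to partition the multiset {S,S,S,P,P,P,N,N,N} into three unordered groups of three:

from itertools import combinations

# Enumerate distinct 3-way partitions of the nine jobs; the VMs are
# interchangeable here, so each schedule is a sorted tuple of groups.
jobs = list("SSSPPPNNN")
schedules = set()
for g1 in combinations(range(9), 3):
    rest = [i for i in range(9) if i not in g1]
    for g2 in combinations(rest, 3):
        g3 = [i for i in rest if i not in g2]
        groups = sorted("".join(sorted(jobs[i] for i in g)) for g in (g1, g2, g3))
        schedules.add(tuple(groups))
print(len(schedules))  # 10, matching Figure 2-7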

When multiple applications run on the same host machine at the same time, there

are resource contentions among them. Two scenarios were compared: in the first scenario,

the scheduler did not use class information, and one of the ten possible schedules was











Figure 2-7. System throughput comparisons for ten different schedules.
1:{(SSS),(PPP),(NNN)}, 2:{(SSS),(PPN),(PNN)}, 3:{(SSP),(SPP),(NNN)},
4:{(SSP),(SPN),(PNN)}, 5:{(SSP),(SNN),(PPN)}, 6:{(SSN),(SPP),(PNN)},
7:{(SSN),(SPN),(PPN)}, 8:{(SSN),(SNN),(PPP)}, 9:{(SPP),(SPN),(SNN)},
10:{(SPN),(SPN),(SPN)}.
S - SPECseis96 (CPU-intensive), P - PostMark (I/O-intensive),
N - NetPIPE (Network-intensive).

selected at random. The other scenario used application class knowledge, always allocating
applications of different classes (CPU, I/O and network) to run on the same machine
(Schedule 10, Figure 2-7). The system throughputs obtained from runs of all possible
schedules in the experimental environment are shown in Figure 2-7.
The average system throughput of the schedule chosen with class knowledge was
1391 jobs per day. It achieved the highest throughput among the ten possible schedules,
22.1% larger than the weighted average of the system throughputs of all the ten possible
schedules. In addition, the random selection of the possible schedules resulted in large
variances of system throughput. The application class information can be used to facilitate
the scheduler to pick the optimal schedule consistently. The application throughput
comparison of different schedules on one machine is shown in Figure 2-8. It compares the











Figure 2-8. Application throughput comparisons of different schedules. MIN, MAX, and
AVG are the minimum, maximum, and average application throughputs of all the
ten possible schedules. SPN is the proposed schedule 10 {(SPN), (SPN),
(SPN)} in Figure 2-7.

Table 2-4. System throughput: concurrent vs. sequential executions
Execution    Elapsed Time (sec)       Time Taken to
             CH3D       PostMark      Finish 2 Jobs (sec)
Concurrent   613        310           613
Sequential   488        264           752


throughput of schedule ID 10 (labeled SPN in Figure 2-8) with the minimum, maximum,

and average throughputs of all the ten possible schedules. By allocating jobs from different

classes to the machine, the three applications' throughputs were higher than average by

different degrees: SPECseis96 Small by 24.9%, PostMark by 48.1%, and NetPIPE by

4.2%. Figure 2-8 also shows that the maximum application throughputs were achieved

by sub-schedule (SSN) for SPECseis96 and (PPN) for NetPIPE instead of the proposed

(SPN). However, the low throughputs of the other applications in the sub-schedule make

their total throughputs sub-optimal.









The second set of experiments illustrates the improved throughput achieved by

scheduling applications of different classes to run concurrently over running them

sequentially. In the experiments, a CPU intensive application (CH3D) and an I/O

intensive application (PostMark) were scheduled to run in one machine. The execution

time for concurrent and sequential executions is shown in Table 2-4. The experiment

results show that the execution efficiency losses caused by the relatively moderate resource

contentions between applications of different classes were offset by the gains from the

utilization of idle capacity. The resource sharing of applications of different classes

improved the overall system throughput.

2.4.3 Classification Cost

The classification cost is evaluated based on the unit sample processing time in the

data extraction, PCA, and classification stage. Two physical machines were used in this

experiment: The performance filter in Figure 2-3 was running on an Intel(R) Pentium(R)

4 CPU 1.70GHz machine with 512MB memory. In addition, the application classifier was

running on an Intel(R) Pentium(R) III 750MHz machine with 256MB RAM.

In this experiment, a total of 8000 snapshots were taken with five-second intervals

for the virtual machine, which hosted the execution of SPECseis96 (medium). It took the

performance filter 72 seconds to extract the performance data of the target application

VM. In addition, it took another 50 seconds for the classification center to train the

classifier, perform the PCA feature selection and the application classification. Therefore

the unit classification cost is roughly 15 ms per sample (122 seconds for 8000 samples), demonstrating that it is possible to

consider the classifier for online training.

2.5 Related Work

Feature selection [39] [25] and classification techniques have been applied to many

areas successfully, such as intrusion detection [40][41][42][43], text categorization [44], and

image and speech analysis. Kapadia's evaluation of learning algorithms for application

performance prediction in [45] shows that the nearest-neighbor algorithm has better









performance than the locally weighted regression algorithms for the tools tested. Our

choice of k-NN classification is based on conclusions from [45]. This thesis differs from

Kapadia's work in the following ways. First, the application class knowledge is used to

facilitate the resource scheduling to improve the overall system throughput in contrast

with Kapadia's work, which focuses on application CPU time prediction. Second, the

application classifier takes performance metrics as inputs. In contrast, in [45] the CPU

time prediction is based on the input parameters of the application. Third, the application

classifier employs PCA to reduce the dimensionality of the performance feature space. It is

especially helpful when the number of input features of the classifier is not trivial.

Condor uses process checkpoint and migration techniques [20] to allow an allocation

to be created and preempted at any time. The transfer of checkpoints may occupy

significant network bandwidth. Basney's study in [46] shows that co-scheduling of CPU

and network resources can improve the Condor resource pool's goodput, which is defined

as the allocation time when a remotely executing application uses the CPU to make

forward progress. The application classifier presented in this thesis performs learning of

the application's resource consumption of memory and I/O in addition to CPU and network

usage. It provides a way to extract the key performance features and generate an abstract

of the application resource consumption pattern in the form of application class. The

application class information and resource consumption statistics can be used together

with recent multi-lateral resource scheduling techniques, such as Condor's Gang-matching

[47], to facilitate the resource scheduling and improve system throughput.
Conservative Scheduling [4] uses the prediction of the average and variance of the

CPU load of some future point of time and time interval to facilitate scheduling. The

application classifier shares the common technique of resource consumption pattern

analysis of a time window, which is defined as the time of one application run. However,

the application classifier is capable of taking into account usage patterns of multiple kinds of

resources, such as CPU, I/O, network and memory.









The skeleton-based performance prediction work introduced in [48] uses a synthetic

skeleton program to reproduce the CPU utilization and communication behaviors of

message passing parallel programs to predict application performance. In contrast, the

application classifier provides application behavior learning in more dimensions.

Prophesy [49] employs a performance-modeling component, which uses coupling

parameters to quantify the interactions between kernels that compose an application.

However, to be able to collect data at the level of basic blocks, procedures, and loops,

it requires insertion of instrumentation code into the application source code. In

contrast, the classification approach uses the system performance data collected from

the application host to infer the application resource consumption pattern. It does not

require the modification of the application source code.

Statistical clustering techniques have been applied to learn application behavior

at various levels. Nickolayev et al. applied clustering techniques to efficiently reduce

the processor event trace data volume in a cluster environment [50]. Ahn and Vetter

conducted application performance analysis by using clustering techniques to identify the

representative performance counter metrics [51]. Both Cohen and Chase's [52] and our

work perform statistical clustering using system-level metrics. However, their work focuses

on system performance anomaly detection. Our work focuses on application classification

for resource scheduling.

Our work can be used to learn the resource consumption patterns of a parallel

application's child processes and a multi-stage application's sub-stages. However, in this

study we focus on sequential and single-stage applications.

2.6 Conclusion

The application classification prototype presented in this chapter shows how to apply

the Principal Component Analysis and K-Nearest Neighbor techniques to reduce the

dimensions of application resource consumption feature space and assist the resource

scheduling. In addition to the CPU load, it also takes the I/O, network, and memory









activities into account for the resource scheduling in an effective way. It does not require

modifications of the application source code. Experiments with various benchmark

applications suggest that with the application class knowledge, a scheduler can improve

the system throughput 22.1% on average by allocating the applications of different classes

to share the system resources.

In this work, the input performance metrics are selected manually based on expert

knowledge. In the next chapter, the techniques for automatically selecting features for

application classification are discussed.









CHAPTER 3
AUTONOMIC FEATURE SELECTION FOR APPLICATION CLASSIFICATION

Application classification techniques based on monitoring and learning of resource

usage (e.g., CPU, memory, disk, and network) have been proposed in Chapter 2 to aid in

resource scheduling decisions. An important problem that arises in application classifiers

is how to decide which subset of numerous performance metrics collected from monitoring

tools should be used for the classification. This chapter presents an approach based on

a probabilistic model (Bayesian Network) to systematically select the representative

performance features, which can provide optimal classification accuracy and adapt to

changing workloads.

3.1 Introduction

Awareness of application resource consumption patterns (such as CPU-intensive, I/O

and paging-intensive and network-intensive) can facilitate the mapping of workloads to

appropriate resources. Techniques of application classification based on monitoring and

learning of resource usage can be used to gain application awareness [53]. Well-known

monitoring tools such as the open source packages Ganglia [54] and dproc [55], and

commercial products such as HP's Open View [56] provide the capability of monitoring

a rich set of system level performance metrics. An important problem that arises is how

to decide which subset of numerous performance metrics collected from monitoring tools

should be used for the classification in a dynamic environment. In this chapter we address

this problem. Our approach is based on autonomic feature selection and can help to

improve the system's self-manageability [1] by reducing the reliance on expert knowledge

and increasing the system's adaptability.

The need for autonomic feature selection and application classification is motivated by

systems such as VMPlant [16], which provides automated resource provisioning of Virtual

Machine (VM). In the context of VMPlant, the application can be scheduled to run on a

dedicated virtual machine, whose system level performance metrics reflect the application's









resource usage. An application classifier categorizes the application into different classes

such as CPU-intensive, disk I/O-intensive, and network-intensive based on the selected

VM performance metrics.

To build an autonomic classification system with self-configurability, it is critical

to devise a systematic feature selection scheme that can automatically choose the most

representative features for application classification and adapt to changing workloads. This

chapter presents an approach of using a probabilistic model, the Bayesian Network, to

automatically select the performance metrics that correlate with application classes and

optimize the classification accuracy. The approach also uses the Mahalanobis distance to

support online selection of training data, which enables the feature selection to adapt to

dynamic workloads. In the rest of this dissertation, we will use the terms "metrics" and

"features" interchangeably.

In chapter 2, a subset of performance metrics were manually selected based on expert

knowledge to correlate to the resource consumption behavior of the application class.

However, expert knowledge is not always available. In case of a highly dynamic workload or

a massive volume of performance data, the approach of manual configuration by a human expert

is also not feasible. These present a need for a systematic way to select the representative

metrics in the absence of sufficient expert knowledge. On the other hand, the use of

the Bayesian Network leaves the option open to integrate expert knowledge with the

automatic feature selection to improve the classification accuracy and efficiency.

Feature selection based on static selected application performance data, which are

used as the training set, may not always provide the optimal classification results in

dynamic environments. To enable the feature selection to adapt to the changing workload,

it requires the system to be able to dynamically update the training set with data from

recent workload. A question that arises is how to decide which data should be selected

as training data. In this work, an algorithm based on Mahalanobis distance is used









to identify the training data which can represent the resource consumption pattern of

corresponding application class.

Our experimental results show the following. First, we observe correlations between

pairs of selected performance metrics, which justifies the use of Mahalanobis distance

as a means of taking the correlation into account in the training data selection process.

Second, there is a diminishing return of classification utility function (i.e. the ratio of

classification accuracy over the number of selected metrics) as more features are selected.

The experiments showed that above 90% application classification accuracy can be

achieved with a small subset of performance metrics which are highly correlated with the

application class. Third, the application classification based on the selected features for a

set of benchmark programs and scientific applications matched our empirical experience

with these applications.

The rest of the chapter is organized as follows: The statistical techniques used

are described in Section 3.2. Section 3.3 presents the feature selection model. Section

3.4 presents and discusses the experimental results. Section 3.5 discusses related work.

Conclusions and future work are discussed in Section 3.6.

3.2 Statistical Inference

3.2.1 Feature Selection

Feature selection is a process that selects an optimal subset of original features based

on an evaluating criterion. The evaluation criterion in this work is the classification

accuracy. A typical feature selection process consists of four steps: subset generation,

subset evaluation, stopping criterion, and result validation [57]. Subset generation is a

process of heuristic search of candidate subsets. Each subset is evaluated based on the

evaluation criterion. Then the evaluation result is compared with the previously computed

best result. If it is better, it will replace the best result and the process continues until

the stop criterion is reached. The selection result is validated by different tests or prior

knowledge.









There are two major types of feature selection algorithms: the filter model and the

wrapper model. The filter model relies on general characteristics of data to evaluate

and select feature subsets without involving any mining algorithm. However, the

wrapper model requires one predetermined mining algorithm and uses its performance

as the evaluation criterion. In this work, a wrapper model is used to search for features

better suited to the classification algorithm (Bayesian Network) aiming to improve the

classification accuracy. Our model employs a forward wrapper model based on the Bayesian

Network. This model is introduced in detail in Section 3.3.2.

3.2.2 Bayesian Network

A Bayesian Network (BN) is a directed acyclic graph (DAG) with a conditional

probability distribution for each node. Each node represents a domain variable, and each

arc between nodes represents a probabilistic dependency [58]. It can be used to compute

the conditional probability of a node, given the values of its predecessors; hence, a BN can

be used as a classifier that gives the posterior probability distribution of the class decision

node given the values of other nodes.

Bayesian Networks are based on the work of the mathematician and theologian

Rev. Thomas Bayes, who worked with conditional probability theory in the late 1700s

to discover a basic law of probability, which was then called Bayes' rule. Bayes' rule

includes a hypothesis, past experience, and evidence:

P(H | E, c) = P(E | H, c) P(H | c) / P(E | c)

where we can update our belief in hypothesis H given the additional evidence E, and

the background context (past experience), c.

The left-hand term, P(H | E, c), is called the posterior probability, or the probability of

hypothesis H after considering the effect of the evidence E on past experience c.

The term P(H | c) is called the a-priori probability of H given c alone.

The term P(E | H, c) is called the likelihood and gives the probability of the evidence

assuming the hypothesis H and the background information c is true.









Finally, the last term, P(E | c), is independent of H and can be regarded as a

normalizing or scaling factor.
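A toy numeric illustration (all probabilities made up for this example) shows the rule at work; here H is "the application is I/O-intensive" and E is "a high io_bi reading was observed":

# Bayes' rule with hypothetical numbers, purely for illustration.
p_H = 0.2                         # prior P(H|c)
p_E_given_H = 0.9                 # likelihood P(E|H,c)
p_E = 0.25                        # evidence P(E|c), the scaling factor
p_H_given_E = p_E_given_H * p_H / p_E
print(p_H_given_E)                # posterior P(H|E,c) = 0.72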

Bayesian Networks capture Bayes' rule in a graphical model. They are very effective

for modeling situations where some information is already known and incoming data is

uncertain or partially unavailable (unlike rule-based or "expert" systems, where uncertain

or unavailable data results in ineffective or inaccurate reasoning). This robustness in

the face of imperfect knowledge is one of the many reasons why Bayesian Networks are

increasingly used as an alternative to other AI representational formalisms. Bayesian

networks have been applied to many areas successfully, including map learning [59],

medical diagnosis [60][61], and speech and vision processing [62][63]. Compared with

other predictive models, such as decision trees and neural networks, and standard feature

selection model that is based on Principal Component Analysis (PCA), Bayesian networks

also have the advantage of interpretability. Human experts can easily understand the

network structure and modify it to obtain better predictive models. By adding decision

nodes and utility nodes, BN models can also be extended to decision networks for decision

analysis [64].

Consider a domain U of n variables, x_1, …, x_n. Each variable may be discrete, having

a finite or countable number of states, or continuous. Given a subset X of variables x_i,

where x_i ∈ U, if one can observe the state of every variable in X, then this observation

is called an instance of X and is denoted as X = k_X for the observations x_i = k_i, x_i ∈ X.

The "joint space" of U is the set of all instances of U. p(X = k_X | Y = k_Y, ξ) denotes the

generalized probability density that X = k_X given Y = k_Y for a person with current state of

information ξ. p(X | Y, ξ) then denotes the "Generalized Probability Density Function"

(gpdf) for X, given all possible observations of Y. The joint gpdf over U is the gpdf for U.
A Bayesian network for domain U represents a joint gpdf over U. This representation

consists of a set of local conditional gpdfs combined with a set of conditional independence










Figure 3-1. Sample Bayesian network generated by the feature selector


assertions that allow the construction of a global gpdf from the local gpdfs. As shown

previously, the chain rule of probability can be used to ascertain these values:

p(x_1, …, x_k | ξ) = ∏_{i=1}^{k} p(x_i | x_1, …, x_{i−1}, ξ)    (3-1)

One assumption imposed by Bayesian Network theory (and indirectly by the Product

Rule of probability theory) is that for each variable x_i, Π_i ⊆ {x_1, …, x_{i−1}} must be a set of

variables that renders x_i and {x_1, …, x_{i−1}} conditionally independent. In this way:

p(x_i | x_1, …, x_{i−1}, ξ) = p(x_i | Π_i, ξ)    (3-2)

A Bayesian Network Structure then encodes the assertions of conditional independence

in Equation 3-1 above. Essentially then, a Bayesian Network Structure B_S is a directed

acyclic graph such that each variable in U corresponds to a node in B_S, and the parents of

the node corresponding to x_i are the nodes corresponding to the variables in Π_i.

Depending on the problem that is defined, either (or both) of the topology and

the probability distribution of the Bayesian Network can be pre-defined by hand or may be

learned. In this work, a Bayesian Network with a tree structure and full observability

is assumed. Figure 3-1 gives a sample BN learned in the experiment. The root is the

application class decision node, which is used to decide an application class given the value

of the leaf nodes. The root node is the parent of all other nodes. The leaf nodes represent

selected performance metrics, such as network packets sent and bytes written to disk.

They are connected one to another in a series.
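The scoring performed by such a network can be sketched as follows (illustrative Python; the conditional probability tables cpts, and the simplification in the demo of ignoring the previous leaf, are assumptions and not the trained model itself):

# Posterior over classes for one discretized snapshot x, following
# the structure of Figure 3-1: the class node is the parent of every
# leaf and the leaves form a chain, so
#   p(Class|x) is proportional to p(Class) * prod_i p(x_i | Class, x_{i-1}).
def class_posterior(x, classes, prior, cpts):
    scores = {}
    for c in classes:
        p, prev = prior[c], None
        for i, xi in enumerate(x):
            p *= cpts[c][i](xi, prev)  # p(x_i | Class=c, x_{i-1})
            prev = xi
        scores[c] = p
    z = sum(scores.values())           # normalize over classes
    return {c: s / z for c, s in scores.items()}

# Tiny demo with two classes and hand-made tables that ignore x_{i-1}:
classes = ["CPU", "IO"]
prior = {"CPU": 0.5, "IO": 0.5}
cpts = {c: [lambda xi, prev, c=c: 0.8 if xi == c else 0.2
            for _ in range(2)] for c in classes}
print(class_posterior(["CPU", "CPU"], classes, prior, cpts))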

3.2.3 Mahalanobis Distance

The Mahalanobis distance is a measure of distance between two points in the

multidimensional space defined by multidimensional correlated variables [22] [65]. For

example, if x_1 and x_2 are two points from the distribution which is characterized by

covariance matrix Σ, then the quantity

d(x_1, x_2) = ((x_1 − x_2)^T Σ^{−1} (x_1 − x_2))^{1/2}    (3-3)

is called the Mahalanobis distance from x_1 to x_2, where T denotes the transpose of a

matrix.

In the cases where there are correlations between variables, simple Euclidean distance

is not an appropriate measure, whereas the Mahalanobis distance can adequately account

for the correlations and is scale-invariant. Statistical analysis of the performance data

in Section 3.4.3 shows that there are correlations between the application performance

metrics to various degrees. Therefore, the Mahalanobis distance between the unlabeled

performance sample and the class centroid, which represents the average of all existing

training data of the class, is used in the training data qualification process in Section 3.3.1.
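Equation 3-3 translates directly into NumPy, as in this small sketch (mu and sigma stand for a class centroid and its covariance matrix):

import numpy as np

def mahalanobis(x, mu, sigma):
    diff = np.asarray(x) - np.asarray(mu)
    return float(np.sqrt(diff @ np.linalg.inv(sigma) @ diff))

# With identity covariance it reduces to the Euclidean distance:
print(mahalanobis([1.0, 2.0], [0.0, 0.0], np.eye(2)))  # sqrt(5)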

3.2.4 Confusion Matrix

Confusion matrix [66] is commonly used to evaluate the performance of classification

systems. It shows the predicted and actual classification done by the system. The matrix

size is L×L, where L is the number of different classes. In our case, where there are five

target application classes, L is equal to 5.









The classification accuracy is measured as the proportion of the total number of

predictions that are correct. A prediction is considered correct if the data is classified to

the same class as its actual class. Table 3-1 shows a sample confusion matrix with L=2.

There are only two possible classes in this example: Positive and negative. Therefore, its

classification accuracy can be calculated as (a+d)/(a+b+c+d).
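In general, correct classifications lie on the diagonal of the matrix, so for any L the accuracy is the trace divided by the total, as in this minimal sketch (the sample matrix values are made up):

import numpy as np

def accuracy(cm):
    cm = np.asarray(cm)            # L x L confusion matrix
    return cm.trace() / cm.sum()   # equals (a+d)/(a+b+c+d) when L = 2

print(accuracy([[40, 10],
                [5, 45]]))         # 0.85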

3.3 Autonomic Feature Selection Framework

Figure 3-2 shows the autonomic feature selection framework in the context of

application classification. In this section, we are going to focus on introducing the

classification training center, which enables the self-configurability for online application

classification. The training center has two major functions: quality assurance of training

data, which enables the classifier to adapt to changing workloads, and systematic feature

selection, which supports automatic feature selection. The training center consists of three

components: the data quality assuror, the feature selector, and the trainer.

3.3.1 Data Quality Assuror

The data quality assuror (DataQA) is responsible for selecting the training data for

application classification. The inputs of the DataQA are the performance snapshots taken

during the application execution. The outputs are the qualified training data with its

class, such as CPU-intensive.

The training data pool consists of representative data of five application classes

including CPU-intensive, I/O-intensive, memory-intensive, network-intensive, and idle.

Training data of each class c is a set of K_c m-dimensional points, where m is the number

of application-specific performance metrics reported by the monitoring tools. To select the

Table 3-1. Sample confusion matrix with two classes (L = 2)
               Predicted
Actual Class   Negative   Positive
Negative       a          b
Positive       c          d











Figure 3-2. Feature selection model. The Performance profiler collects performance
metrics of the target application node. The Application classifier classifies the
application using extracted key components and performs statistical analysis of
the classification results. The DataQA selects the training data for the
classification. The Feature selector selects performance metrics which can
provide optimal classification accuracy. The Trainer trains the classifier using
the selected metrics of training data. The Application DB stores the application
class information. (CTC: Classification Training Center; DataQA: Data Quality
Assuror; t0/t1: the beginning / ending times of the application execution;
VMIP: the IP address of the application's host machine).


training data from the application snapshots, only n out of m metrics are extracted based

on the previous feature selection result to form a set of K_c n-dimensional training points




{x_{k_c,1}, x_{k_c,2}, …, x_{k_c,n}},  k_c = 1, 2, …, K_c    (3-4)

that comprise a cluster C_c. From [50], it follows that the n-tuple

μ_c = (x̄_1^(c), x̄_2^(c), …, x̄_n^(c))    (3-5)

where

x̄_i^(c) = (1/K_c) Σ_{k_c=1}^{K_c} x_{k_c,i},  i = 1, 2, …, n    (3-6)

is called the centroid of the cluster C_c.

The training data selection is a three-step process: First the DataQA extracts the n

out of m metrics of the input performance snapshot to form a training data candidate.

Thus each candidate is represented by an n-dimensional point x = (x_1, x_2, …, x_n).

Second, it evaluates whether the input candidate is qualified to be training data

representing one of the application classes. Finally, the qualified training data candidate

is associated with a scalar value Class, which defines the application class.

The first step is straight-forward. In the second and third steps, the Mahalanobis

distance between the training data candidate x and the centroid ec of cluster Cc is

calculated as follows:

d_c(x) = ((x − μ_c)^T Σ_c^{−1} (x − μ_c))^{1/2}    (3-7)

where c = 1, 2, …, 5 represents the application class and Σ_c^{−1} denotes the inverse

covariance matrix of the cluster C_c. The distance from the training data candidate x to

the boundary between two class clusters, for example C_1 and C_2, is |d_1(x) − d_2(x)|. If

|d_1(x) − d_2(x)| = 0, it means that the candidate is exactly at the boundary between

class 1 and 2. The further away the candidate is from the class boundaries, the better it

can represent a class. In other words, there is less probability for it to be misclassified.

Therefore, the DataQA calculates the distance from the candidate to boundaries of all

possible pairs of the classes. If the minimal distance to class boundaries, min(|d_1 −

d_2|, |d_1 − d_3|, …, |d_4 − d_5|), is bigger than a predefined threshold τ, the corresponding

m-dimensional snapshot of the candidate is determined to be qualified training data of








Table 3-2. Sample performance metrics in the original feature set
Performance Metrics         Description
cpu_system / user / idle    Percent CPU system / user / idle
cpu_nice                    Percent CPU nice
bytes_in / out              Number of bytes per second into / out of the network
io_bi / bo                  Blocks sent to / received from a block device (blocks/s)
swap_in / out               Amount of memory swapped in / out from / to disk (kB/s)
pkts_in / out               Packets in / out per second
proc_run                    Total number of running processes
load_one / five / fifteen   One / five / fifteen minute load average


the class, whose centroid has the smallest Mahalanobis distance min(d_1, d_2, …, d_5) to the

snapshot. Automated and adaptive threshold setting is discussed in detail in [67].
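The qualification test can be sketched in Python as follows (illustrative; centroids, inv_covs and tau stand for the per-class statistics of the current training pool and the threshold τ, and the function names are assumptions):

import numpy as np
from itertools import combinations

def mahalanobis(x, mu, inv_sigma):
    diff = np.asarray(x) - np.asarray(mu)
    return float(np.sqrt(diff @ inv_sigma @ diff))

# Returns the class label for a qualified candidate, or None if the
# candidate lies too close to some class boundary to be trusted.
def qualify(x, centroids, inv_covs, tau):
    d = {c: mahalanobis(x, centroids[c], inv_covs[c]) for c in centroids}
    # distance to each pairwise class boundary: |d_i(x) - d_j(x)|
    boundary = min(abs(d[i] - d[j]) for i, j in combinations(d, 2))
    if boundary > tau:               # far from every boundary
        return min(d, key=d.get)     # class of the nearest centroid
    return None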

In our implementation, Ganglia is used as the monitoring tool and twenty (m = 20)

performance metrics, which are related to resource usage, are included in the training

data. These performance metrics include 16 out of 33 default metrics monitored by

Ganglia and the 4 metrics that we added based on the need of classification. The four

metrics include the number of I/O blocks read from/written to disk, and the number of

memory pages swapped in/out. A program was developed to collect these four metrics

(using vmstat) and added them to the metric list of Ganglia's monitoring daemon gmond.

Table 3-2 shows some sample performance metrics of the training candidate.

The first round of quality assurance was performed by a human expert at initialization.

The subsequent assurance can be conducted automatically by following the above steps to

select representative training data for each class.

3.3.2 Feature Selector

Feature selector is responsible for selecting the features which are correlated with

the application's resource consumption pattern from the numerous performance metrics










Input: C(F_0, F_1, …, F_{N-1})  // training data set with N features
Input: Class  // class of training data (teacher for learning)
Output: S_best  // selected feature subset
begin
    initialize S_best = {};
    initialize A_max = 0;  // maximum accuracy
    D = discretize( C );  // convert continuous to discrete features
    repeat
        initialize A_node = 0;  // max accuracy for each node
        initialize F_node = 0;  // selected feature for each node
        foreach F in ({F_0, F_1, …, F_{N-1}} − S_best) do
            Accuracy = eval(D, Class, S_best ∪ {F});
            // evaluate Bayesian network with extra feature F
            if Accuracy > A_node then  // store the current feature
                A_node = Accuracy;
                F_node = F;
            end
        end
        if A_node > A_max then
            S_best = S_best ∪ {F_node};
            A_max = A_node;
            A_node = A_node + 1;  // force another pass of the search
        end
    until (A_node < A_max);
end

Figure 3-3. Bayesian-network based feature selection algorithm for application
classification


collected from monitoring tools. By filtering out metrics which contribute less to the

classification, it can help to not only reduce the computational complexity of subsequent

classifications, but also improve classification accuracy.

In our previous work [53], representative features were selected manually based

on expert knowledge. For example, performance metrics of cpu_system and cpu_user

are correlated to the behavior of CPU-intensive applications; bytes_in and bytes_out

are correlated to network-intensive applications; io_bi and io_bo are correlated to the

I/O-intensive applications; swap_in and swap_out are correlated to memory-intensive

applications. However, to support on-line classification, it is necessary for feature selection

to have the ability to adapt to changing workloads. Therefore, the static selection









conducted by human expert may not be sufficient in a highly dynamic environment.

A feature selection scheme, which can automatically select the representative features

for application classification in a systematic way, can help to solve the problem. This

automated feature selection enables the application classifier to self-configure its input

feature subset to adapt to the changing workload.

A wrapper algorithm based on the Bayesian network is employed by the feature selector

to conduct the feature selection. As introduced in Section 3.2.1, although this feature

selection scheme reduces the reliance on human experts' knowledge, the Bayesian

network's interpretability leaves the option open to integrate the expert knowledge

into the selection scheme to build a better classification model.

Figure 3-3 shows the feature selection algorithm. It starts with an empty feature

subset S_best = {}. To search for the best feature F, it uses the temporary feature set

{S_best ∪ F} to perform Bayesian Network classification for the discrete training data D.

The classification accuracy is calculated by comparing the classification result and the

true answer of the Class information contained in the training data. After the evaluation

of accuracy using all remaining features ({F_0, F_1, …, F_{N-1}} − S_best), the best accuracy

is stored to A_node. If A_node is better than the previous best accuracy A_max achieved, the

corresponding feature node is added to the feature subset to form the new subset. This

process is repeated until the classification accuracy cannot be improved any more by

adding any of the remaining features into the subset.
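In Python, the same forward wrapper search can be sketched as below (eval_bn is an assumed stand-in for "train the Bayesian-network classifier on these features and return its accuracy"; it is not the Matlab prototype):

def forward_select(features, eval_bn):
    best_set, best_acc = [], 0.0
    while True:
        node_acc, node_feat = 0.0, None
        for f in features:
            if f in best_set:
                continue
            acc = eval_bn(best_set + [f])   # wrapper evaluation
            if acc > node_acc:
                node_acc, node_feat = acc, f
        if node_feat is None or node_acc <= best_acc:
            return best_set, best_acc       # no further improvement
        best_set.append(node_feat)          # grow the subset
        best_acc = node_acc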

3.3.3 Trainer

The classification trainer manages the training of the application classifier. It

monitors the updating status of the training data pool. Every time the DataQA qualifies

new training data, it replaces the oldest data in the training data pool with the new data.

When the percentage of new training data in the pool reaches a predefined threshold,

the trainer sends a request to the feature selector to start the feature

selection process to generate the updated feature subset. After receiving the updated









feature subset, it calls the classifier to perform classification of the data in the updated

training data pool using the old and new feature subsets respectively. Then it compares

the classification accuracy of the two. If the accuracy achieved by the new feature subset

is higher than the one achieved by the previous subset, the selected feature is updated.

Otherwise, it remains the same.
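The trainer's policy can be sketched as follows (a minimal illustration; the pool handling, the 10% trigger value, and the helper names select_features and accuracy_with are assumptions, not the original implementation):

class Trainer:
    def __init__(self, pool, select_features, accuracy_with, threshold=0.1):
        self.pool = list(pool)           # fixed-size training data pool
        self.new_count = 0
        self.threshold = threshold       # fraction of new data triggering reselection
        self.select_features = select_features
        self.accuracy_with = accuracy_with
        self.features = []               # currently selected feature subset

    def on_qualified_data(self, sample):
        self.pool.pop(0)                 # replace the oldest training data
        self.pool.append(sample)
        self.new_count += 1
        if self.new_count / len(self.pool) >= self.threshold:
            self.new_count = 0
            candidate = self.select_features(self.pool)
            # keep the new subset only if it classifies the updated
            # pool more accurately than the current one
            if (self.accuracy_with(candidate, self.pool) >
                    self.accuracy_with(self.features, self.pool)):
                self.features = candidate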

3.4 Experimental Results

We have implemented a prototype for the feature selector based on Matlab. This

section shows the experimental results of feature selection for data collected during the

execution of a set of applications representative of each class (CPU, I/O, memory and

network intensive) and the classification accuracy achieved. In addition, statistical analysis

of the performance metrics was conducted to justify the reason of using Mahalanobis

distance in the training data quality assurance process.

In the experiments, all the applications were executed in a VMware GSX 2.5 virtual

machine with 256MB memory. The virtual machine was hosted on an Intel(R) Xeon(TM)

dual-CPU 1.80GHz machine with 512KB cache and 1GB RAM. The CTC and application

classifier were running on an Intel(R) Pentium(R) III 750MHz machine with 256MB RAM.

3.4.1 Feature Selection and Classification Accuracy

Two sets of experiments were conducted offline to evaluate our feature selection

algorithm. In both experiments, the training data, described by 20 performance metrics,

consists of performance snapshots of applications belonging to different classes. In the

experiments, tenfold cross validation was performed. The training data was randomly

divided into two parts: a combination of 50% of the data from each class was used to

train the feature selector (training set) to derive the feature subset, and the other 50% was

used as test set to validate the features selected by calculating its classification accuracy.

The first experiment was designed to show the relationship between classification

accuracy and the number of features selected. The second experiment was designed to














Figure 3-4. Average classification accuracy of 10 sets of test data versus number of
features selected in the first experiment




Figure 3-5. Two-class test data distribution with the first two selected features









show that prior-class information can be used to achieve higher classification accuracy

with a smaller number of features.

In the first experiment, the training data consist of performance snapshots of five

classes of applications, including CPU-intensive, I/O-intensive, memory-intensive, and

network-intensive applications, and the snapshots collected from an idle application-VM,

which has only "background noise" from system activity (i.e., without any application

execution during the monitoring period). The feature selector's task is to select those

metrics which can be used to classify the test set into five classes with optimal accuracy.

In all the ten iterations of cross validation, two performance metrics (cpu_system and

load_fifteen) were always selected as the best two features. Figure 3-6 shows a sample test

data distribution with these two features. If we project the data to x-axis or y-axis, we

can see that it is more difficult to differentiate the data from each class by using either

cpu_system or loadfifteen alone than using both metrics. For example, the cpusystem

value ranges of network-intensive application and I/O-intensive application largely

overlap. This makes it hard to classify these two applications with only cpu_system metric.

Compared with the one-metric classification, it is much easier to decide which class the

test data belong to by using information of both metrics. In other words, the combination

of multiple features is more descriptive than a single feature.

The classification accuracy versus the number of features selected for the above

learned Bayesian network is plotted in Figure 3-4. It shows that with a small number

of features (3 to 4), it can achieve above 90% classification accuracy for this 5-class

classification.

In the second experiment, the training data consist of performance snapshots of two

classes of applications, I/O-intensive and memory-intensive. Figure 3-5 shows its test

data distribution with the first two selected features, bytes_in and pkts_in. A comparison

of Figure 3-6 and Figure 3-5 shows that with a reduced number of application classes,

higher classification accuracy can be achieved with fewer features. For example,









Table 3-3. Confusion matrix of classification results with expert-selected and
automatically-selected feature sets. A)Automatic B)Expert
Actual Classified as
Class Idle CPU IO Net Mem
Idle 4938 0 62 0 0
CPU 231 4746 23 0 0
IO 20 86 2888 6 0
Net 0 12 8 4980 0
Mem 0 0 0 0 5000
A

Actual Classified as
Class Idle CPU IO Net Mem
Idle 4962 0 38 0 0
CPU 4 4882 10 0 104
IO 20 10 2797 0 173
Net 0 0 24 4970 6
Mem 3 0 36 0 4961
B

The bold numbers along the diagonal are the number of
correctly classified data.


in this experiment, if we know that the application belongs to either the I/O-intensive or

memory-intensive class, with two selected features, above 90% classification accuracy can be

achieved versus 87% accuracy in the 5-class case. It shows the potential of using pair-wise

classification to improve the classification accuracy for multi-class cases. Using pair-wise

approach for multi-class classification is a topic of future research.

3.4.2 Classification Validation

This set of experiments aims to validate the feature selection experiment results

with the Principal Component Analysis (PCA) and k-Nearest Neighbor (k-NN) based

application classification framework described in [53].

First, the training data distributions based on principal components, which are

derived from automatically selected features in Section 3.4.1 and manually selected

features in previous work [53], are shown in Figure 3-8. Distances between each pair

of class centroids in Figure 3-8 are calculated and plotted in Figure 3-7. It shows that



















Figure 3-6. Five-class test data distribution with first two selected features


Figure 3-7. Comparison of distances between cluster centers derived from expert-selected
and automatically selected feature sets.
1: idle-cpu, 2: idle-I/O, 3: idle-net, 4: idle-mem, 5: cpu-I/O,
6: cpu-net, 7: cpu-mem, 8: I/O-net, 9: I/O-mem, 10: net-mem


Figure 3-8. Training data clustering diagram derived from expert-selected and
automatically selected feature sets. A) Automatic B) Expert











the distances between 9 out of 10 pairs of cluster centroids are bigger in the automatic

selection case than in the expert's manual selection case. This means that the two principal

components derived from the automatically selected features form class clusters that are at

least as distinct as those formed with the expert-selected features.

Second, the PCA and k-NN based classifications were conducted with both the expert

selected 8 features in previous work [53] and the automatically selected features in Section

3.4.1. Table 3-3 shows the confusion matrices of the classification results. If data are

classified to the same classes as their actual classes, the classifications are considered as

correct. The classification accuracy is the proportion of the total number of classifications

that were correct. The confusion matrices show that a classification accuracy of 98.05%

can be achieved with the automatically selected feature set, which is similar to the 98.1%

accuracy achieved with the expert-selected feature set. Thus the automatic feature selection

that is based on the Bayesian Network can reduce the reliance on expert knowledge while

offering competitive classification accuracy compared to manual selection by a human

expert.

In addition, a set of 8 features selected in the 5-class feature selection experiment

in Section 3.4.1 was used to configure the application classifier and the same training

data used in the feature selection experiment were used to train the application classifier.

Then the trained classifier conducted classification for a set of three benchmark programs:

SPECseis96 [29], PostMark and PostMark_NFS [28]. SPECseis96 is a scientific application

which is computing-intensive but also exercises disk I/O in the initial and end phases of its

execution. PostMark is originally a disk I/O benchmark program. In PostMark_NFS,

a network file system (NFS) mounted directory was used to store the files which

were read/written by the benchmark. Therefore, PostMark_NFS performs substantial

network-I/O rather than disk I/O. The classification results are shown in Figure 3-9. The

results show that most of the SPECseis96 test data were classified as CPU-intensive, 95%

of the PostMark data were classified as I/O-intensive, and 61% of the PostMark_NFS















Figure 3-9. Classification results of benchmark programs. A) SPECseis96 B) PostMark
C) PostMark_NFS. Principal components 1 and 2 are the principal component
metrics extracted by PCA.









Table 3-4. Performance metric correlation matrices of test applications. A) Correlation matrix of SPECseis96 performance data B) Correlation matrix of PostMark performance data C) Correlation matrix of NetPIPE performance data
Metric 1 2 3 4 5 6
1 1.00 -0.21 -0.34 0.74 0.20 -0.02
2 -0.21 1.00 -0.16 -0.02 -0.17 -0.06
3 -0.34 -0.16 1.00 -0.60 0.20 -0.05
4 0.74 -0.02 -0.60 1.00 -0.19 0.04
5 0.20 -0.17 0.20 -0.19 1.00 0.12
6 -0.02 -0.06 -0.05 0.04 0.12 1.00
A

Metric 1 2 3 4 5 6
1 1.00 -0.24 0.22 0.34 -0.08 -0.13
2 -0.24 1.00 -0.22 0.18 0.04 -0.02
3 0.22 -0.22 1.00 0.33 0.30 0.18
4 0.34 0.18 0.33 1.00 0.42 0.47
5 -0.08 0.04 0.30 0.42 1.00 0.20
6 -0.13 -0.02 0.18 0.47 0.20 1.00
B

Metric 1 2 3 4 5 6
1 1.00 0.29 0.31 0.48 0.27 0.30
2 0.29 1.00 0.49 0.39 0.75 0.95
3 0.31 0.49 1.00 0.50 0.59 0.52
4 0.48 0.39 0.50 1.00 0.42 0.39
5 0.28 0.75 0.59 0.42 1.00 0.75
6 0.30 0.95 0.52 0.39 0.75 1.00
C

Metrics: 1 loadfive, 2 pkts_in, 3 cpu_system, 4 loadfifteen, 5 pkts_out, 6 bytes_out. Correlations larger than 0.5 are highlighted in bold.


data were classified as network-intensive. These results matched our empirical experience with these programs and are close to the results of the expert-selected-feature based classification, which shows 85% cpu-intensive for SPECseis96, 97% I/O-intensive for PostMark, and a majority network-intensive for PostMark_NFS.









3.4.3 Training Data Quality Assurance

This set of experiments shows the need for using the Mahalanobis distance in the training data's quality assurance testing process.

The data quality assuror classifies each unlabeled test data by identifying its nearest

neighbor among all class centroids. Its performance thus depends crucially on the distance

metric used to identify the nearest class centroid. In fact, a number of researchers have

demonstrated that nearest neighbor classification can be greatly improved by learning an

appropriate distance metric from labeled examples [65].

Table 3-4 shows the correlation coefficients of each pair of the first six performance metrics collected during application execution: loadfive, pkts_in, cpu_system, loadfifteen, pkts_out, and bytes_out. Three applications were used in these experiments: SPECseis96 [29], PostMark [28], and NetPIPE [34].

The experiments show that there are correlations of various degrees between pairs of performance metrics. For example, NetPIPE's bytes_out metric is highly correlated with its pkts_in, pkts_out, and cpu_system metrics. In cases where metrics are correlated, a distance metric that takes the correlation into account when determining a point's distance from the class centroid should be used. Therefore, the Mahalanobis distance is used in the training data selection process.
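As a concrete illustration, the sketch below shows a nearest-centroid check using the Mahalanobis distance; the data, class names, and centroids are hypothetical placeholders, not the actual assuror implementation.

import numpy as np

def mahalanobis(x, centroid, cov):
    # Distance of sample x from a class centroid, accounting for
    # correlations between metrics via the class covariance matrix.
    diff = x - centroid
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Hypothetical labeled training data: rows are samples, columns are the
# six performance metrics of Table 3-4 (loadfive, pkts_in, cpu_system, ...).
rng = np.random.default_rng(0)
classes = {
    "cpu-intensive": rng.normal([2, 0, 2, 2, 0, 0], 1.0, (50, 6)),
    "net-intensive": rng.normal([0, 3, 1, 0, 3, 3], 1.0, (50, 6)),
}

def nearest_class(x):
    # Assign an unlabeled sample to the class of its Mahalanobis-nearest centroid.
    return min(classes, key=lambda c: mahalanobis(
        x, classes[c].mean(axis=0), np.cov(classes[c].T)))

print(nearest_class(np.array([2.1, 0.2, 1.8, 2.2, 0.1, -0.3])))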

3.5 Related Work

Feature selection [39] [68] and classification techniques have been applied to many

areas successfully, such as intrusion detection [69][40][42], text categorization [44], speech

and image processing [62] [63], and medical diagnosis [60] [61].

The following works applied these techniques to analyze system performance. However, they differ from each other in the following aspects: the goals of feature selection, the features under study, and implementation complexity.

Nickolayev et al. used statistical clustering techniques to identify representative processors for parallel application performance tuning [50]. Only event tracing of the









selected processors is needed to capture the interaction between application and system components. This helps to reduce the large event-data volume, which can perturb system performance. Their approach does not require modification of application source code.

Ahn et al. applied various statistical techniques to extract the important performance counter metrics for application performance analysis [51]. Their prototype can support parallel applications' performance analysis by collecting and aggregating local data. It requires annotation of application source code as well as appropriate operating system and library support to collect process information, which is based on hardware counters.

Cohen et al. [52] studied the correlation between component performance metrics and SLO violations in Internet server platforms. There are some similarities between their work and ours in terms of the level of performance metrics under study and the type of classifier used. However, our study differs from theirs in the following ways. First, our study focuses on application classification (CPU-intensive, I/O- and paging-intensive, and network-intensive) for resource scheduling, whereas their study focused on performance anomaly detection (SLO violation and compliance). Second, our prototype targets online classification; it addresses the training data qualification problem to adapt the feature selection to changing workloads, whereas online training data selection was not the focus of [52]. Third, in our prototype, virtual machines are used to host application executions and summarize applications' resource usage. The prototype supports a wide range of applications, such as scientific programs and online business transaction systems, whereas [52] studied web applications in three-tier client/server systems.

In addition to [52], Aguilera et al. [70] and Magpie [71] also studied performance analysis of distributed systems. However, they considered message-level traces of system activities instead of system-level performance metrics. Both treated the components of distributed systems as black boxes; therefore, their approaches do not require application and middleware modifications.









3.6 Conclusion

The autonomic feature selection prototype presented in this chapter shows how

to apply statistical analysis techniques to support online application classification. We

envision that this classification approach can be used to provide first-order analysis of

the dominant resource consumption patterns of an application. This chapter shows that

autonomic feature selection enables classification without requiring expert knowledge in

the selection of relevant low-level performance metrics.









CHAPTER 4
ADAPTIVE PREDICTOR INTEGRATION FOR SYSTEM PERFORMANCE
PREDICTIONS

The integration of multiple predictors promises higher prediction accuracy than the

accuracy that can be obtained with a single predictor. The challenge is how to select the

best predictor at any given moment. Traditionally, multiple predictors are run in parallel

and the one that generates the best result is selected for prediction. In this chapter, we

propose a novel approach for predictor integration based on the learning of historical

predictions. Compared with the traditional approach, it does not require running all the

predictors simultaneously. Instead, it uses classification algorithms such as k-Nearest Neighbor (k-NN) and Bayesian classification, together with a dimensionality reduction technique, Principal Component Analysis (PCA), to forecast the best predictor for the workload under study based on the learning of historical predictions. Then only the forecasted best

predictor is run for prediction.

4.1 Introduction

Grid computing [72] enables entities to create a Virtual Organization (VO) to share

their computation resources such as CPU time, memory, network bandwidth, and disk

bandwidth. Predicting the dynamic resource availability is critical to adaptive resource

scheduling. However, determining the most appropriate resource prediction model a priori

is difficult due to the multi-dimensionality and variability of system resource usage. First,

applications may exercise different types of resources during their executions. Some resource usages, such as CPU load, may be relatively smooth, whereas others, such as network bandwidth, are burstier. It is hard to find a single prediction model which works best for all types of resources. Second, different applications may have different resource usage patterns; the best prediction model for a specific resource of one machine may not work best for another machine. Third, the resource performance fluctuates dynamically due

to the contention created by competing applications. Indeed, in the absence of a perfect

prediction model, the best predictor for any particular resource may change over time.









This chapter introduces a Learning-Aided Adaptive Resource Predictor (LARPredictor), which can dynamically choose the best prediction model suited to the workload at

any given moment. By integrating the prediction results generated by the best predictor

of each moment during the application run, the LARPredictor can outperform any single

predictor in the pool. It differs from the traditional mix-of-expert resource prediction

approach in that it does not require running multiple prediction models in parallel all

the time to identify the best predictors. Instead, the Principal Component Analysis

(PCA) and classification algorithm such as k-Nearest Neighbor (k-NN) are used to

forecast the best prediction model from a pool based on the monitoring and learning of

the historical resource availability and the corresponding prediction performance. The

learning-aided adaptive resource performance prediction can be used to support dynamic

VM provisioning by providing accurate prediction of the resource availability of the host

server and the resource demand of the applications that are reflected by the hosting

virtual machines.

Our experimental results based on the analysis of a set of virtual machine trace data

show:

1. The best prediction model is workload specific. In the absence of a perfect

prediction model, it is hard to find a single predictor which works best across virtual

machines which have different resource usage patterns.

2. The best prediction model is resource specific. It is hard to find a single predictor

which works best across different resource types.

3. The best prediction model for a specific type of resource of a given VM trace varies

as a function of time. The LARPredictor can adapt the predictor selection to the change

of the resource consumption pattern.

4. In the experiments with a set of trace data, the LARPredictor outperformed the observed single best predictor in the pool for 44.2% of the traces and outperformed the cumulative-MSE based prediction model used in the Network Weather Service (NWS) system [73] for 66.1% of the traces. It has the potential to consistently outperform any single predictor for variable workloads and to achieve more than 18% lower MSE than the model used in the NWS.

The rest of the chapter is organized as follows: Section 4.2 gives an overview of related work. Section 4.3 gives an overview of virtual machine resource prediction. Section 4.4 describes the linear time series prediction models used to construct the LARPredictor, and Section 4.5 describes the learning techniques used for predictor selection. Section 4.6 details the workflow of the learning-aided adaptive resource predictor. Section 4.7 discusses the experimental results. Section 4.8 summarizes the work and describes future directions.

4.2 Related Work

Time series analysis has been studied in many areas such as financial forecasting [74],

biomedical signal processing [75], and geoscience [76]. In this work, we focus on the time

series modeling for computer resource performance prediction.

In [77] and [78], Dinda et al. conducted an extensive study of the statistical properties and prediction of host load. Their work indicates that CPU load is strongly correlated over time, which implies that history-based load prediction schemes are feasible. They evaluated the predictive power of a set of linear models including autoregressive (AR), moving average (MA), autoregressive integrated moving average (ARIMA), autoregressive fractionally integrated moving average (ARFIMA), and window-mean models. Their results show that the AR model is the best in terms of high prediction accuracy and low overhead among the models they studied. Based on their conclusion, the AR model is included in our predictor pool to leverage its performance.

To improve the prediction accuracy, various adaptive techniques have been exploited

by the research community. In [4], Yang et al. developed a tendency-based prediction model that predicts the next value according to the tendency of the time series change: an increment or decrement is added to or subtracted from the current measurement, based on the current measurement and other dynamic information, to predict the next value. Zhang et al. improved the performance of the tendency-based model by using a polynomial fitting method to generate predictions based on the data several steps backward [79]. In addition, in [80], Liang et al. proposed a multi-resource prediction model that uses both the autocorrelation of the CPU load and the cross correlation between the CPU load and free memory to achieve higher CPU load prediction accuracy. Vazhkudai et al. [81] [82] used linear regression to predict the data transfer time from network bandwidth or disk throughput.

The Network Weather Service (NWS) [73] performs prediction of both network

throughput and latency for host machines distributed with different geographic distances.

Both the NWS and the LARPredictor use the mix-of-expert approach to select the

best predictor at any given moment. However, they differ from each other in the way

of best predictor selection. The prediction model used in the NWS system runs a set of predictors in parallel to track their prediction accuracies. A cumulative error measurement, the Mean Square Error (MSE), is calculated for each predictor. The one that generates the lowest prediction error for the known measurements is chosen to make a forecast of future measurement values. Section 4.6 shows that the LARPredictor only uses

parallel prediction during the training phase. In the testing phase, it uses the PCA and

k-NN classifier to forecast the best predictor for the next value based on the learning of

historical prediction performances. Only the forecasted best predictor is run to predict the

next value.

The mix-of-expert approach has also been applied in the text recognition and categorization area. Combining multiple classifiers has been shown to increase the recognition rate in difficult problems compared with a single classifier [83]. Different combination strategies, such as weighted voting, probability-based voting, and dimensionality reduction based on concept indexing, are introduced in [84].

4.3 Virtual Machine Resource Prediction Overview

This section gives an overview of virtual machine resource prediction.














Figure 4-1. Virtual machine resource usage prediction prototype
VM: Virtual Machine; VMM: Virtual Machine Monitor; DB: Database; QA: Quality Assuror; m: prediction window size; j: quality assurance window size; ts/te: starting/ending time stamps
The monitor agent, which is installed in the Virtual Machine Monitor (VMM), collects the VM resource performance data and stores them in the round robin VM Performance Database. The profiler extracts the performance data of a given time frame for the VM indicated by VMID and deviceID. The LARPredictor selects the best prediction model based on the learning of historical predictions, predicts the resource performance for time t+1, and stores the prediction results in the prediction database. The prediction results can be used to support the resource manager in performing dynamic VM resource allocation. The Performance Quality Assuror (QA) audits the LARPredictor's performance and orders re-training of the predictor if the performance drops below a predefined threshold.


Our virtual machine resource prediction prototype, illustrated in Figure 4-1, models

how the VM performance data are collected and used to predict the value for future time

to support resource allocation decision-making.

A performance monitoring agent is installed in the Virtual Machine Monitor (VMM)

to collect the performance data of the guest VMs. In our implementation, VMware's ESX

virtual machines are used to host the application execution, and the vmkusage tool [85] of ESX is used to monitor and collect the performance data of the VM guests and host









from the server host machine's /proc nodes. The vmkusage tool samples every minute and updates its data every five minutes with an average of the one-minute statistics over the given five-minute interval. The collected data are stored in a Round Robin Database (RRD). Table 2-1 shows the list of performance features under study in this work.

The profiler retrieves the VM performance data, which are identified by vmID, deviceID, and a time window, from the round robin performance database. The data of each VM device's performance metric form a time series (x_{t-m+1}, ..., x_t) with an identical interval, where m is the data retrieval window size. The retrieved performance data with the corresponding time stamps are stored in the prediction database. The tuple [vmID, deviceID, timeStamp, metricName] forms the combined primary key of the database. Figure 4-2 shows the XML schema of the database and sample database records of virtual machines such as VM1, which has one CPU, two Network Interface Cards (NICs), and two virtual hard disks.

The LARPredictor takes the time series performance data (y_{t-m}, ..., y_{t-1}) as input, selects the best prediction model based on the learning of historical prediction results, and predicts the resource performance y_t at the future time step. A detailed description of the LARPredictor's workflow is given in Section 4.6. The predicted results are stored in the prediction DB and can be used to support the resource manager's dynamic VM provisioning decision-making.

The Prediction Quality Assuror (QA) is responsible for monitoring the LARPredictor's performance in terms of MSE. It periodically audits the prediction performance by calculating the average MSE of the historical prediction data stored in the prediction DB. When the average MSE of the data in the audit window exceeds a predefined threshold, it directs the LARPredictor to re-train the predictors and the classifier using recent performance data stored in the database.
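A minimal sketch of this audit logic is given below; the window size, threshold, and the retrain() hook are hypothetical placeholders rather than the actual implementation.

from collections import deque

class PredictionQualityAssuror:
    # Sketch of the QA loop: track recent squared prediction errors and
    # trigger re-training when the windowed average MSE exceeds a threshold.
    def __init__(self, predictor, window=50, threshold=1.0):
        self.predictor = predictor          # assumed to expose retrain()
        self.errors = deque(maxlen=window)  # audit window of squared errors
        self.threshold = threshold

    def audit(self, predicted, observed):
        self.errors.append((predicted - observed) ** 2)
        mse = sum(self.errors) / len(self.errors)
        if len(self.errors) == self.errors.maxlen and mse > self.threshold:
            self.predictor.retrain()        # re-fit the predictors and classifier
            self.errors.clear()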















1154979300   % Time stamp of first sample
300          % Sampling interval
1155066000   % Time stamp of last sample
290          % Total number of samples
12           % Total number of performance features

cpu_usedsec
cpu_ready
mem_size
mem_swapped
net1_rKB
net1_wKB
net2_rKB
net2_wKB
hd1_r
hd1_w
hd2_r
hd2_w

Figure 4-2. Sample XML schema of the VM performance DB


4.4 Time Series Models for Resource Performance Prediction

Time series is defined as an ordered sequence of values of a variable at equally spaced time intervals. A general linear process {Z_t} is one that can be represented as a weighted linear combination of the present and past terms of a white noise process:

    Z_t = a_t + ψ_1 a_{t-1} + ψ_2 a_{t-2} + ···    (4-1)

where {Z_t} denotes the observed time series, {a_t} denotes an unobserved white noise series, and {ψ_i} denotes the weights. In this thesis, performance snapshots of a virtual machine's resources, including CPU, memory, disk, and network bandwidth, are taken periodically to form the time series {Z_t} under study.









Time series analysis accounts for the fact that data points taken over time may have an internal structure (such as autocorrelation, trend, or seasonal variation) that should be accounted for. The purpose of time series analysis is generally twofold: to understand or model the stochastic mechanism that gives rise to an observed series, and to predict or forecast future values of a series based on the history of that series [86]. Time series analysis techniques have been widely applied to forecasting in many areas such as economic forecasting, sales forecasting, stock market analysis, communication traffic control, and workload projection. In this work, simple time series models, such as LAST, sliding-window average (SW_AVG), and autoregressive (AR) models, are used to construct the LARPredictor to support online prediction. However, the LARPredictor prototype may be generally used with other prediction models studied in [78] [73] [4].

LAST model: The LAST model predicts all future values to be the same as the last measured value:

    Ẑ_t = Z_{t-1}    (4-2)

SW_AVG model: The sliding-window average model predicts future values by taking the average over a fixed-length history of the n most recent values:

    Ẑ_t = (1/n) Σ_{i=t-n}^{t-1} Z_i    (4-3)

AR model: A pth-order autoregressive process Z_t can be represented as follows:

    Z_t = φ_1 Z_{t-1} + φ_2 Z_{t-2} + ··· + φ_p Z_{t-p} + a_t    (4-4)

The current value of the series, Z_t, is a linear combination of the p latest past values of itself plus a term a_t, which incorporates everything new in the series at time t that is not explained by the past values. The Yule-Walker technique is used for AR model fitting in this work.
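To make the three models concrete, a minimal sketch follows; the AR coefficients are fit with the Yule-Walker equations, and the window lengths and order shown are illustrative, not the values used in the experiments.

import numpy as np

def last(history):
    # LAST: the next value equals the most recent observation (Eq. 4-2).
    return history[-1]

def sw_avg(history, n=5):
    # SW_AVG: the next value is the mean of the last n observations (Eq. 4-3).
    return float(np.mean(history[-n:]))

def ar_predict(history, p=3):
    # AR(p): fit coefficients via the Yule-Walker equations, then combine
    # the p latest (demeaned) values linearly (Eq. 4-4).
    x = np.asarray(history, dtype=float)
    mean = x.mean()
    x = x - mean
    # Biased autocovariance estimates r_0 .. r_p.
    r = np.array([x[:len(x) - k] @ x[k:] for k in range(p + 1)]) / len(x)
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    phi = np.linalg.solve(R, r[1:])
    return float(phi @ x[-1:-p - 1:-1]) + mean

series = np.sin(np.arange(100) / 5.0) + 0.1 * np.random.randn(100)
print(last(series), sw_avg(series), ar_predict(series))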









Generally, LAST performs better for smooth trace data and AR performs better for peaky data. In this thesis, an approach that dynamically constructs a resource predictor from multiple predictors such as LAST, AR, and SW_AVG is proposed to predict VM resource performance.

The prediction performance is measured by the mean squared error (MSE) [87], which is defined as the average squared difference between independent observations and the predictions from the fitted equation for the corresponding values of the independent variables:

    MSE(θ̂) = E[(θ̂ − θ)²]    (4-5)

where θ̂ is the estimator of a parameter θ in a statistical model.

4.5 Algorithms for Prediction Model Selection

In the absence of a perfect generation model, the best resource prediction model

varies with the machine workload. Learning algorithms are used to learn the relationship

between the workload and suited prediction model. In this work, classification algorithms

are used to forecast the best prediction model for a given workload based on the learning

of historical predictions. In addition, the Principal Component Analysis (PCA) technique

is used to reduce the computational cost of the classification process by reducing the

dimension of the feature space of the input data of the classifiers.

There are two types of classifiers: nonparametric and parametric. A parametric classifier exploits prior information to model the feature space; when the assumed model is correct, parametric classifiers outperform nonparametric ones. In contrast, nonparametric classifiers do not make such assumptions and are more robust. However, nonparametric classifiers tend to suffer from the curse of dimensionality, which means that the number of samples required grows exponentially with the dimensionality of the feature space. In this section, we introduce a nonparametric classifier, the k-NN classifier, and a parametric classifier, the Bayesian classifier, which are used for best-predictor selection in this work. While we have chosen the k-NN and Bayesian classification algorithms due to their prior success in a large number of classification problems, such as handwritten digit and satellite image scene recognition, our methodology may be generally used with other types of classification algorithms.

4.5.1 k-Nearest Neighbor

The k-Nearest Neighbor (k-NN) classifier is memory-based. Its training data consist of the N pairs (x_1, p_1), ..., (x_N, p_N), where p_i is a class label taking values in {1, 2, ..., P}. In this work, P represents the number of prediction models in the pool. The training data are represented by a set of points in the feature space, where each point x_i is associated with its class label p_i. A test point x_j is classified to the class of the closest training data: given a test point x_j, the k training points x_r, r = 1, ..., k, closest in distance to x_j are identified, and the test point is classified by the majority vote among the k (an odd number) neighbors.

Since the features under study, such as CPU percentage and network received bytes/sec, have different units of measure, all features are normalized to have zero mean and unit variance [88]. In this work, "closeness" is determined by the Euclidean distance (Equation 4-6):

    d_ij = ||x_i − x_j||    (4-6)

As a nonparametric method, the k-NN classifier can be applied to different time series without modification. To address the problem associated with high dimensionality, various dimension reduction techniques can be used in the data preprocessing.
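A minimal sketch of this k-NN scheme, with z-score normalization and majority voting, is shown below; the data shapes and labels are illustrative.

import numpy as np
from collections import Counter

def knn_classify(train_x, train_y, test_x, k=3):
    # Classify each test point by majority vote of its k Euclidean-nearest
    # training points (Eq. 4-6); features are normalized to zero mean and
    # unit variance using the training statistics.
    mu, sigma = train_x.mean(axis=0), train_x.std(axis=0)
    tr, te = (train_x - mu) / sigma, (test_x - mu) / sigma
    labels = []
    for x in te:
        d = np.linalg.norm(tr - x, axis=1)
        nearest = train_y[np.argsort(d)[:k]]
        labels.append(Counter(nearest).most_common(1)[0][0])
    return labels

# Illustrative data: 2-D features and labels 1..3 standing for three classes.
train_x = np.repeat([[0., 0.], [3., 3.], [0., 3.]], 10, axis=0) + np.random.randn(30, 2)
train_y = np.repeat([1, 2, 3], 10)
print(knn_classify(train_x, train_y, np.array([[2.8, 3.1]])))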

4.5.2 Bayesian Classification

The Bayesian classifier is based on the well-known probability theorem, "Bayes formula". Suppose that we know both the prior probabilities P(ω_j) and the conditional densities p(x|ω_j), where x and ω represent a feature vector and its state (e.g., class), respectively. The joint probability density can be written in two ways: p(ω_j, x) = P(ω_j|x) p(x) = p(x|ω_j) P(ω_j). Rearranging these leads us to "Bayes formula":

    P(ω_j|x) = p(x|ω_j) P(ω_j) / p(x)    (4-7)

where in this case of c categories

    p(x) = Σ_{j=1}^{c} p(x|ω_j) P(ω_j).    (4-8)

Then, the posterior probabilities P(ω_j|x) can be computed from p(x|ω_j) by Bayes formula. In addition, Bayes formula can be expressed informally in English by saying that

    posterior = (likelihood × prior) / evidence    (4-9)

The multivariate normal density has been applied successfully to a number of

classification problems. In this work the feature vector can be modeled as a multivariate

normal random variable.

The general multivariate normal density in d dimensions is written as

    p(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp( −(1/2)(x − μ)^T Σ^{-1} (x − μ) )    (4-10)

where x is a d-component column vector, μ is the d-component mean vector, Σ is the d-by-d covariance matrix, |Σ| and Σ^{-1} are its determinant and inverse, respectively, and (x − μ)^T denotes the transpose of (x − μ).

The minimization of the probability of error can be achieved by use of the discriminant functions

    g_i(x) = ln p(x, ω_i) = ln p(x|ω_i) + ln P(ω_i).    (4-11)

This expression can be evaluated if the densities p(x|ω_i) are multivariate normal. In this case, we have

    g_i(x) = −(1/2)(x − μ_i)^T Σ_i^{-1} (x − μ_i) − (d/2) ln 2π − (1/2) ln |Σ_i| + ln P(ω_i).    (4-12)

The resulting classification is performed by evaluating the discriminant functions. When workloads have similar statistical properties, the Bayesian classifier derived from one workload trace can be applied to another directly. In the case of highly variable workloads, retraining of the classifier is necessary.
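A minimal sketch of the discriminant evaluation in Equation 4-12 follows; the class priors, means, and covariances are assumed to have been estimated from labeled training data, and the two-class parameters shown are illustrative.

import numpy as np

def discriminant(x, mu, cov, prior):
    # g_i(x) from Eq. 4-12 for one class.
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(cov) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(cov))
            + np.log(prior))

def bayes_classify(x, params):
    # params maps a class label to (mean, covariance, prior); the sample is
    # assigned to the class with the largest discriminant value.
    return max(params, key=lambda c: discriminant(x, *params[c]))

params = {
    1: (np.zeros(2), np.eye(2), 0.5),            # e.g., class "LAST"
    2: (np.array([2.0, 2.0]), np.eye(2), 0.5),   # e.g., class "AR"
}
print(bayes_classify(np.array([1.8, 2.2]), params))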

4.5.3 Principal Component Analysis

The Principal Component Analysis (PCA) [22][88], also called the Karhunen-Loève transform, is a linear transformation representing data in a least-squares sense. The principal components of a set of data in R^p provide a sequence of best linear approximations to those data, of all ranks q < p.

Denote the observations by x_1, x_2, ..., x_N; the parametric representation of the rank-q linear model is as follows:

    f(λ) = μ + V_q λ    (4-13)

where μ is a location vector in R^p, V_q is a p × q matrix with q orthogonal unit vectors as columns, which are called eigenvectors, and λ is a vector of q parameters. These eigenvectors are the principal components, and the corresponding eigenvalues represent their contributions to the variance of the data. Often there will be just a few (= k) large eigenvalues, which implies that k is the inherent dimensionality of the subspace governing the data. When the k largest eigenvalues of the q principal components are chosen to represent the data, the dimensionality of the data is reduced from q to k.









In this work, the PCA is used to reduce the dimensions of the prediction input data. This helps to reduce the computational cost of the subsequent classification process.
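A minimal sketch of this dimension-reduction step, projecting framed prediction windows from m to n dimensions, is given below; the sizes are illustrative.

import numpy as np

def pca_reduce(data, n):
    # Project the rows of `data` (samples x m features) onto the n
    # eigenvectors of the covariance matrix with the largest eigenvalues.
    centered = data - data.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    V = eigvecs[:, np.argsort(eigvals)[::-1][:n]]  # top-n principal components
    return centered @ V

# Illustrative framed time series: 100 windows of size m = 5, reduced to
# n = 2 dimensions before classification.
windows = np.random.randn(100, 5)
print(pca_reduce(windows, 2).shape)   # -> (100, 2)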

4.6 Learning-Aided Adaptive Resource Predictor

This section describes the workflow of the Learning-Aided Adaptive Resource Predictor (LARPredictor), illustrated in Figure 4-3. The prediction consists of two phases: a training phase and a testing phase. During the training phase, the best predictors for each set of training data are identified using the traditional mix-of-expert approach. During the testing phase, the classifier forecasts the best predictor for the test data based on the knowledge gained from the training data and historical prediction performance. Then only the selected best predictor is run to predict the resource performance. Both phases include data pre-processing and the Principal Component Analysis (PCA) process.

The features under study in this work, as shown in Table 2-1, include CPU, memory, network bandwidth, and disk I/O usages. Figure 4-4 illustrates how the features are processed to form the prediction database. Since the features have different units of measure, a data pre-processor was used to normalize the input data to zero mean and unit variance. The normalized data are framed according to the prediction window size to feed the PCA processor.

4.6.1 Training Phase

The training phase of both the k-NN and the Bayesian classifiers mainly consists of two processes: prediction model fitting and best predictor identification. The set of training data with the corresponding best predictors is used for the k-NN classification in the testing phase, and the unknown parameters of the Bayesian classifier are estimated from the training data.

The LAST and SW_AVG models do not involve any unknown parameters. They can

be used for predictions directly. The parametric prediction models such as the AR model,

which contain unknown parameters, require model fitting. The model fitting is a process















































Figure 4-3. Learning-aided adaptive resource predictor workflow
The input data are normalized and framed with the prediction window size m. The Principal Component Analysis (PCA) is used to reduce the dimension of the input data from the window size m to n (n < m). All prediction models are run in parallel in the training phase to identify the best predictor for each set of training data. The classifier is used to forecast the best predictor for the testing data based on the knowledge gained from the training data. Only the best predictor is used to predict the future value of the testing data.









Figure 4-4. Learning-aided adaptive resource predictor dataflow
First, the u training data X_{1×u} are normalized to X'_{1×u} and subsequently framed to X'_{(u-m+1)×m} according to the predictor order m. The PCA processor is used to reduce the dimension of each set of training data from m to n before prediction. Then the predictors are run in parallel with the inputs X''_{(u-m+1)×n}, and the one that gives the smallest MSE is identified as the best predictor to be associated with the corresponding training data in the prediction database. The dimension reduction of the testing data is similar to the training data's and is not shown here.


to estimate the unknown parameters of the models. The Yule-Walker equation [86] is used

in the AR model fitting in this work.

For window-based prediction models, such as SW_AVG and AR, the PCA algorithm is applied to reduce the input data dimension. The naive mix-of-expert approach is used to identify the best predictor p_i for each set of pre-processed training data (e.g., (x'_i, x'_{i+1}, ..., x'_{i+m-1})). All prediction models are run in parallel with the training data, and the one which generates the least prediction MSE is identified as the best predictor p_i, a class label taking values in {LAST, AR, SW_AVG}, to be associated with the training data. The u pairs of PCA-processed training data and the corresponding best predictors [(x''_1, p_1), ..., (x''_u, p_u)] form the training data of the classifiers.

As a non-parametric classifier, the k-NN classifier does not have an explicit training phase; the major task of its training phase is to label the training data with class definitions. As a parametric classifier, the Bayesian classifier uses the training data to derive its unknown parameters, the mean and covariance matrix of the training data of each class, to form the classification model.
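The best-predictor labeling of the training data can be sketched as follows; the predictor stand-ins below are simplified placeholders (in particular, the AR stand-in uses a least-squares fit rather than Yule-Walker for brevity), and the window size is illustrative.

import numpy as np

def _ar(h, p=3):
    # Least-squares AR(p) stand-in: fit on overlapping windows of h.
    X = np.array([h[i:i + p] for i in range(len(h) - p)])
    y = np.array(h[p:])
    phi, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(phi @ h[-p:])

predictors = {
    1: lambda h: h[-1],                  # LAST
    2: _ar,                              # AR (least-squares stand-in)
    3: lambda h: float(np.mean(h)),      # SW_AVG over the full window
}

def label_best_predictors(series, m=8):
    # Run all predictors in parallel over each training window and label the
    # window with the predictor whose next-value prediction errs least.
    labels = []
    for t in range(m, len(series) - 1):
        history, actual = series[t - m:t], series[t]
        err = {c: (f(history) - actual) ** 2 for c, f in predictors.items()}
        labels.append(min(err, key=err.get))
    return labels

print(label_best_predictors(np.sin(np.arange(60) / 4.0))[:10])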









4.6.2 Testing Phase

Similar to the training phase, the testing data are normalized using the normalization coefficients derived from the training phase and framed with the prediction window size m. Then the PCA is used to reduce the dimension of the preprocessed testing data (y'_{t-m}, ..., y'_{t-1}) from m to n.

In the testing phase of the LARPredictor that is based on the k-NN classifier, the Euclidean distances between the PCA-processed test data (y''_{t-m}, ..., y''_{t-1}) and all training data X''_{(u-m+1)×n} in the reduced n-dimensional feature space are calculated, and the k (k = 3 in our implementation) training data which have the shortest distances to the testing data are identified. The majority vote of the k nearest neighbors' best predictors is chosen as the best predictor, which predicts y'_t based on (y'_{t-m}, ..., y'_{t-1}) in the case of the AR or SW_AVG models, or as y'_t = y'_{t-1} in the case of the LAST model. The prediction performance can be obtained by comparing the predicted value with the normalized observed value y'_t.

In the testing phase of the LARPredictor that is based on the Bayesian classifier, the test data are preprocessed in the same way as for the k-NN classifier. The PCA-processed test data (y''_{t-m}, ..., y''_{t-1}) are plugged into the discriminant function (4-12) derived in Section 4.5.2. The parameters in the discriminant function for each class, the mean vector and covariance matrix, are obtained during the training phase. Then each test data point is classified to the class with the largest discriminant function value.

The testing phase differs from the training phase in that it does not require running multiple predictors in parallel to identify the one which is best suited to the data and gives the smallest MSE. Instead, it forecasts the best predictor by learning from historical predictions. The reasoning here is that the nearest neighbors' workload characteristics are closest to the testing data's, and the predictor that works best for these neighbors should also work best for the testing data.
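Tying these steps together, the sketch below outlines the k-NN testing-phase flow; it reuses the hypothetical helpers from the earlier sketches in this chapter (pca_reduce, knn_classify, and the predictors table), and simplifies by projecting training and test windows jointly, whereas the prototype derives the projection from training data alone.

import numpy as np

# Assumed available from the earlier sketches in this chapter:
#   pca_reduce(data, n)                  - Section 4.5.3 sketch
#   knn_classify(train_x, train_y, x, k) - Section 4.5.1 sketch
#   predictors = {1: LAST, 2: AR, 3: SW_AVG stand-ins} - Section 4.6.1 sketch

def lar_forecast(train_windows, train_labels, history, m=8, n=2):
    # Forecast the best predictor for the latest m-window with 3-NN in the
    # PCA-reduced space, then run only that predictor on the raw history.
    window = np.asarray(history[-m:], dtype=float).reshape(1, -1)
    reduced = pca_reduce(np.vstack([train_windows, window]), n)
    best = knn_classify(reduced[:-1], np.asarray(train_labels),
                        reduced[-1:], k=3)[0]
    return predictors[best](np.asarray(history, dtype=float))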









4.7 Empirical Evaluation

We have implemented a prototype for the LARPredictor including Perl and Shell

scripts of the profiler to extract and profile the performance data from the round robin

performance database, and a Matlab implementation of the LARPredictor. This section

evaluates the prediction performance of the LARPredictor using traces of five virtual

machines as follows:

VM1: Hosts a web server, Globus GRAM/MDS and GridFTP services, and a PBS head node.

VM2: Hosts a Linux-based port-forwarding proxy for VNC sessions.

VM3: Hosts a Windows XP based calendar.

VM4: Hosts a web server, a list server, and a Wiki server.

VM5: Hosts a web server.

These virtual machines were hosted by a physical machine with an Intel(R)

Xeon(TM) 2.0GHz CPU, 4GB memory, and 36GB SCSI disk. VMware ESX server

2.5.2 was running on the physical host. The vmkusage tool was run on the ESX server to collect the resource performance data of the guest virtual machines every minute and store them in a round robin database. The profiler was used to extract the data with a given VMID, DeviceID, performance metric, starting and ending time stamps, and interval. In this experiment, the performance data of a 24-hour period with 5-minute intervals were extracted for VM2, VM3, VM4, and VM5; the data of a 7-day period with 30-minute intervals were extracted for VM1. The data of a given VMID, DeviceID, and performance metric form a time series under study. The time series data were normalized to zero mean and unit variance.

4.7.1 Best Predictor Selection

This set of experiments illustrates the adaptive predictor selection of the LARPredictor. The k-NN classifier was used to forecast the best predictor among LAST, AR, and SW_AVG for the workload under study. Only the selected best predictor is









used for performance prediction. VM2 was used in the experiments. Fig. 4-5 shows the predictor selections for the CPU fifteen-minute load average during a 12-hour period with a sampling interval of 5 minutes. The top plot shows the observed best predictor obtained by running the three prediction models in parallel. The middle plot shows the predictor selection of the LARPredictor, and the bottom plot shows the cumulative-MSE based predictor selection used in the NWS. Similarly, the predictor selection results for the trace data of other resources are shown as follows: network packets in per second in Fig. 4-6, total amount of swap memory in Fig. 4-7, and total disk space in Fig. 4-8.

These experimental results show that the best prediction model for a specific type of resource of a given trace varies as a function of time. In the experiment, the LARPredictor adapted the predictor selection to the changing workload better than the cumulative-MSE based approach used in the NWS. The LARPredictor's average best-predictor forecasting accuracy over all the performance traces of the five virtual machines is 51.5%, which is 20.1% higher than the accuracy of 42.9% achieved by the cumulative-MSE based predictor used in the NWS for the workload studied.

4.7.2 Virtual Machine Performance Trace Prediction

This set of experiments examines the prediction performance of the LARPredictor. Section 4.7.2.1 shows the prediction accuracy of the k-NN based LARPredictor and of all the predictors in the pool. Section 4.7.2.2 compares the prediction accuracy and execution time of the k-NN based LARPredictor and the Bayesian based LARPredictor. In addition, Section 4.7.2.3 benchmarks the performance of the LARPredictors against the cumulative-MSE based prediction model used in the NWS.

In the experiments, ten-fold cross validation was performed for each set of time series data. A time stamp was randomly chosen to divide the performance data of a virtual machine into two parts: 50% of the data was used to train the LARPredictor, and the other 50% was used as the test set to measure the prediction performance by calculating its prediction MSE.








Figure 4-5. Best predictor selection for trace VM2_load15
Predictor Class: 1 LAST, 2 AR, 3 SW_AVG

4.7.2.1 Performance of k-NN based LARPredictor
The k-NN algorithm was used for classification in this experiment. In the training
phase, the training data were used to derive the regression coefficients of the AR model.
In addition, the three prediction models were run in parallel. The prediction error was
calculated by comparing the predicted value with the observed value. For each prediction,
the model that gave the smallest absolute value of the error was identified as the best
predictor to be associated with the corresponding training data.
In the testing phase, the 3-NN (k = 3) classifier was used to forecast the best predictors of
the testing data. First, for each set of testing data of the prediction window size, the
PCA was applied to reduce the data dimension from m, which was 5 or 16, to n = 2 in


Figure 4-6. Best predictor selection for trace VM2_PktIn
Predictor Class: 1 LAST, 2 AR, 3 SW_AVG

this experiment. Then the Euclidean distances between the test data and all the training data in the reduced feature space were calculated. The three training data which had the shortest distances to the testing data were identified, and the majority vote of their associated best predictors was forecasted to be the best predictor of the testing data. Finally, the forecasted best predictor was run to predict the future value of the testing data. The MSE of each time series was calculated to measure the performance of the LARPredictor. Tables 4-1, 4-2, 4-3, 4-4, and 4-5 show the prediction performance of the LARPredictor with the current implementation (LAR) and of the three prediction models LAST, AR, and SW_AVG for all resource performance traces of the five virtual machines. Also shown in these tables is the computed MSE for a perfect LARPredictor








Figure 4-7. Best predictor selection for trace VM2_Swap
Predictor Class: 1 LAST, 2 AR, 3 SW_AVG

(P-LAR). The MSE of the P-LAR model shows the upper bound of the prediction
accuracy that can be achieved by the LARPredictor. The MSE of the best predictor
among LAR, LAST, AR, and SW_AVG is highlighted with italic bold numbers.
Table 4-6 shows the best predictor among LAST, AR, and SW_AVG for all the
resource performance metrics and VM traces. The symbol "*" indicates the cases in which
the LARPredictor achieved equal or higher prediction accuracy than the best of the three
predictors. Overall, the AR model performed better than the LAST and SW_AVG models.
The above experimental results show:








Figure 4-8. Best predictor selection for trace VM2_Disk
Predictor Class: 1 LAST, 2 AR, 3 SW_AVG

1. It is hard to find a single prediction model among LAST, AR, and SW_AVG
that performs best for all types of resource performance data for a given VM trace. For
example, for the VMl's trace data shown in Table 4-1, each of the three models (LAST,
AR, and SW) outperformed the other two for a subset of the performance metrics. In this
experiment, only the AR model worked best for the trace data of VM3.
2. It is hard to find a single prediction model among the three that perform best
consistently for a given type of resources across all the VM traces. In the experiment, only
the AR model worked best for the CPU performance predictions.
3. The LARPredictor achieved better-than-expert performance using the mix-of-expert approach for 44.2% of the workload traces. This shows the potential for the








Table 4-1. Normalized prediction MSE statistics for resources of VM1
Predictors
Perf.Metrics P-LAR LAR LAST AR SW
CPU usedsec 0.6976 0.9508 1.1436 0.9456 1.0352
CPU ready 0.6775 0.9632 1.1699 0.9579 1.0333
Memory_size 0.2071 0.2389 0.2298 0.2379 0.4883
Memory_swapped 0.2071 0.2386 0.2298 0.2379 0.4883
NIC1 received 0.3981 0.5436 1.836 0.5436 0.9831
NIC1 transmitted 0.3776 0.5845 1.8236 0.5845 0.9829
NIC2 received 0.9788 0.9912 1.4392 0.9966 1.0397
NIC2 transmitted 0.3983 0.5463 1.8406 0.5463 0.9843
VD1 read 0.9062 1.0215 1.2849 0.9754 1.0511
VD1 write 0.7969 0.9587 1.1905 0.9473 1.0566
VD2 read 1 1.2156 1.4191 1.1536 1.035
VD2 write 0.662 0.9931 1.1572 0.9929 1.0292


duration = 168 hours, interval = 30 minutes, prediction order = 16


LARPredictor to outperform any single predictor in the pool and to approach the prediction accuracy of the P-LAR by improving the best-predictor forecasting / classification accuracy. How to further improve the predictor classification accuracy is a topic of our future research.

4.7.2.2 Performance comparison of k-NN and Bayesian-classifier based LARPredictor

In this experiment, a set of VM traces with 138,240 performance data points was used to feed the LARPredictor. Half of the data were used for training and the other half for testing. A Bayesian-classifier based LARPredictor was implemented. Fig. 4-9 shows the prediction performance comparison between it and the k-NN based LARPredictor for all the resources of VM1. The profile report of the Matlab program execution showed that the k-NN based LARPredictor cost 205.8 seconds of CPU time, with 193.5 seconds in the testing phase and 12.3 seconds in the training phase. It took 132.1








Table 4-2. Normalized prediction MSE statistics for resources of VM2
Predictors
Perf.Metrics P-LAR LAR LAST AR SW
CPU usedsec 0.8142 1.1158 1.2476 1.0311 1.0912
CPU ready 0.7873 1.0128 1.2167 1.0166 1.0948
Memory_size 0.5328 0.6213 0.637 0.6262 0.79
Memory_swapped 0.5328 0.6214 0.637 0.6262 0.7901
NIC1 received 0.4872 0.6189 0.6663 0.611 0.6831
NIC1 transmitted 0.7581 1.0138 1.0303 1.0209 1.0737
NIC2 received 0.6626 0.89 0.8765 0.8923 1.0242
NIC2 transmitted 0.7434 0.9924 1.0266 0.9949 1.0775
VD1 read 0.9582 1.0467 1.2249 1.0264 1.0912
VD1 write 0.7733 1.0744 1.1574 1.0129 1.0748
VD2 read 1.0208 1.4153 1.4155 1.0843 1.0972
VD2 write 0.7389 0.9941 1.0816 0.9372 1.0792


duration = 24 hours, interval = 5 minutes, prediction order = 5


seconds of CPU time for the Bayesian based LARPredictor to finish execution, with a 120.8-second testing phase and an 11.3-second training phase.

The experimental results show that the prediction accuracy, in terms of normalized MSE, of the Bayesian-classifier based LARPredictor is about 3% worse than that of the k-NN based one. However, it shortened the CPU time of the testing phase by 37.57%.
4.7.2.3 Performance comparison of the LARPredictors and the cumulative-
MSE based predictor used in the NWS
This section compares the prediction accuracy of the LARPredictors and the NWS predictor. Figs. 4-9, 4-10, 4-11, 4-12, and 4-13 show the prediction accuracy of the perfect LARPredictor that has 100% best-predictor forecasting accuracy (P-LARP), the k-NN and Bayesian based LARPredictors (Knn-LARP and Bays-LARP), the predictor used in the NWS based on the cumulative MSE over all history (Cum.MSE), and the cumulative-MSE







Table 4-3. Normalized prediction MSE statistics for resources of VM3
Predictors
Perf.Metrics P-LAR LAR LAST AR SW
CPU usedsec 0.9883 1.0395 1.4341 1.0376 1.0989
CPU_ready 0.6826 0.9502 1.6594 0.9502 1.0921
Memory_size 0.5009 0.6169 0.6818 0.6216 0.7481
Memory_swapped 0 0 0 NaN 0

NIC1 transmitted 0.9931 1.0514 1.3068 1.0665 1.0943
VD1 read 0 0 0 NaN 0
VD1 write 0 0 0 NaN 0
VD2 read 0.9728 1.0276 1.3969 1.0281 1.1016
VD2 write 0.8696 0.9938 1.245 0.9946 1.0815
duration = 24 hours, interval = 5 minutes, prediction order = 5

based predictor with a fixed window size (n = 2 in this experiment) used in the NWS (W-Cum.MSE).

The experimental results show that, without running all the predictors in parallel all the time, the LARPredictor outperformed the cumulative-MSE based predictor used in the NWS for 66.1% of the traces. The perfect LARPredictor shows the potential to achieve more than 18% lower MSE on average than the cumulative-MSE based predictor.
4.7.3 Discussion
PCA is an optimal way to project data in the mean-square sense. The computational complexity of estimating the PCA is O(d²W) + O(d³) for an original set of W d-dimensional data points [89]. In the context of resource performance time series prediction, W = 1 and d is the prediction window size. The typically small input data size in this context makes the use of the PCA feasible. There also exist computationally less expensive methods [90] for finding only a few eigenvectors and eigenvalues of a large matrix; in our experiments, we use appropriate Matlab routines to realize these.








Table 4-4. Normalized prediction MSE statistics for resources of VM4
Predictors
Perf.Metrics P-LAR LAR LAST AR SW
CPU usedsec 0.2819 0.3781 1.7 0.3787 1.1859
CPU_ready 0.4339 0.59 1.6385 0.5904 1.1689
Memory_size 0.3453 0.4638 0.4615 0.4624 0.6628
Memory_swapped 0.2042 0.2595 0.2571 0.2596 0.3592
NIC1 received 0.7175 0.9853 1.0552 0.9231 0.9313
NIC1 transmitted 0.8713 1.0169 1.2649 1.0075 1.0501
NIC2 received 0.7026 1.0695 1.1324 1.0253 1.0699
NIC2 transmitted 0.8423 1.0276 1.3369 1.0363 1.0753
VD1 read 0.7452 0.9679 1.2066 0.9658 0.9832
VD1 write 0.6985 0.9766 1.136 0.9836 0.98
VD2 read 1.01 1.1296 1.4181 1.0608 1.0973
VD2 write 0.8121 1.0134 1.2204 1.0152 1.0474
duration = 24 hours, interval = 5 minutes, prediction order = 5
Table 4-5. Normalized prediction MSE statistics for resources of VM5
Predictors
Perf.Metrics P-LAR LAR LAST AR SW
CPU usedsec 0.9165 1.0731 1.503 1.0875 1.1023
CPU ready 0.5578 0.964 1.7314 0.8673 1.108
Memory_size 0.5569 0.6094 0.66 0.6115 0.8873
Memory_swapped 0.5499 0.6043 0.6498 0.6073 0.8719
NIC1 received 0 0 0 NaN 0
NIC1 transmitted 0 0 0 NaN 0
NIC2 received 0.7894 0.9264 1.1422 0.9232 0.887
NIC2 transmitted 0.967 1.0162 1.3807 1.014 1.1051
VD1 read 1.0165 1.2856 1.3429 1.078 1.0744
VD1 write 0.811 0.946 1.0775 0.9376 1.0391
VD2 read 0 0 0 NaN 0
VD2 write 1.0115 1.0691 1.4048 1.0653 1.0969
duration = 24 hours, interval = 5 minutes, prediction order = 5









Table 4-6. Best predictors of all the trace data.
The predictors shown in the table have the smallest MSE among all the three
predictors (LAST, AR, and SW_AVG). The "*" symbol indicates that the
LARPredictor outperforms the best predictor in the predictor pool.
Perf. Metrics VM1 VM2 VM3 VM4 VM5
CPU_usedsec AR AR AR AR* AR*
CPUready AR AR* AR* AR* AR
Mem_size LAST AR* AR* LAST AR*
Mem_swap LAST AR* LAST AR*
NIC1_Rx AR* AR AR* AR
NIC1_Tx AR* AR* AR* AR
NIC2_Rx AR* LAST AR SW_AVG
NIC2_Tx AR* AR* AR* AR
VD1_read AR AR AR SW_AVG
VD1_write AR AR SW_AVG* AR
VD2_read SW_AVG AR AR* AR
VD2_write AR AR AR* AR* AR


The k-NN does not have an off-line learning phase. The "training phase" in k-NN is simply to index the N training data for later use. Therefore, the training complexity of k-NN is O(N) in both time and space. In the testing phase, the k nearest neighbors of a testing data point can be obtained in O(N) time by using a modified version of quicksort [91]. There are also fast algorithms for finding nearest neighbors [92] [93].

Three simple time series models were used in this experiment to show the potential of using learning-based dynamic predictor selection to improve prediction accuracy. However, the LARPredictor prototype may be generally used with other, more sophisticated prediction models such as those studied in [78] [73] [4]. Generally, the more predictors in the pool and the more complex the predictors are, the more beneficial the LARPredictor is, because the classification overhead can be better amortized by running only a single predictor at any given time.

4.8 Conclusion

The best prediction model varies with the types of resources and workload from

time to time. We have developed a time series resource prediction model, LARPredictor,

which can adapt the predictor selection to the changing workload. The k-NN classifier and










Figure 4-9. Predictor performance comparison (VM1)
Legend: P-LARP, Knn-LARP, Bays-LARP, Cum.MSE, W-Cum.MSE; y-axis: normalized MSE.
Performance metric IDs: 1 CPU_usedsec, 2 CPU_ready, 3 Mem_size, 4 Mem_swap, 5 NIC1_rx, 6 NIC1_tx, 7 NIC2_rx, 8 NIC2_tx, 9 VD1_read, 10 VD1_write, 11 VD2_read, 12 VD2_write


the Bayesian classifier are used to forecast the best predictor for the workload based on the learning of historical load characteristics and prediction performance. The principal component analysis technique has been applied to reduce the input data dimension of the classification process. Our experimental results with traces of the full range of virtual machine resources, including CPU, memory, network, and disk, show that the LARPredictor can effectively identify the best predictor for the workload and achieve prediction accuracies that are close to, or even better than, any single best predictor.
























Figure 4-10. Predictor performance comparison (VM2)
Legend: P-LARP, Knn-LARP, Bays-LARP, Cum.MSE, W-Cum.MSE; y-axis: normalized MSE.
Performance metric IDs: 1 CPU_usedsec, 2 CPU_ready, 3 Mem_size, 4 Mem_swap, 5 NIC1_rx, 6 NIC1_tx, 7 NIC2_rx, 8 NIC2_tx, 9 VD1_read, 10 VD1_write, 11 VD2_read, 12 VD2_write























Figure 4-11. Predictor performance comparison (VM3)
Legend: P-LARP, kNN-LARP, Bays-LARP, Cum.MSE, W-Cum.MSE; y-axis: normalized MSE.
Performance metric IDs: 1 CPU_usedsec, 2 CPU_ready, 3 Mem_size, 4 Mem_swap, 5 NIC1_rx, 6 NIC1_tx, 7 NIC2_rx, 8 NIC2_tx, 9 VD1_read, 10 VD1_write, 11 VD2_read, 12 VD2_write
















Figure 4-12. Predictor performance comparison (VM4)
Legend: P-LARP, Knn-LARP, Bays-LARP, Cum.MSE, W-Cum.MSE; y-axis: normalized MSE.
Performance metric IDs: 1 CPU_usedsec, 2 CPU_ready, 3 Mem_size, 4 Mem_swap, 5 NIC1_rx, 6 NIC1_tx, 7 NIC2_rx, 8 NIC2_tx, 9 VD1_read, 10 VD1_write, 11 VD2_read, 12 VD2_write






















Figure 4-13. Predictor performance comparison (VM5)
Legend: P-LARP, Knn-LARP, Bays-LARP, Cum.MSE, W-Cum.MSE; y-axis: normalized MSE.
Performance metric IDs: 1 CPU_usedsec, 2 CPU_ready, 3 Mem_size, 4 Mem_swap, 5 NIC1_rx, 6 NIC1_tx, 7 NIC2_rx, 8 NIC2_tx, 9 VD1_read, 10 VD1_write, 11 VD2_read, 12 VD2_write









CHAPTER 5
APPLICATION RESOURCE DEMAND PHASE ANALYSIS AND PREDICTIONS

Profiling the execution phases of applications can help to optimize the utilization of the underlying resources. This chapter presents a novel system-level application-resource-demand phase analysis and prediction approach in support of on-demand resource provisioning. This approach explores the large-scale behavior of applications' resource consumption, followed by analysis using a set of clustering-based algorithms. The phase profile, which is learned from historical runs, is used to classify and predict future phase behavior. This process takes into consideration applications' resource consumption patterns, phase transition costs, and penalties associated with Service-Level Agreement (SLA) violations.

5.1 Introduction

Recently there has been renewed interest, both in academia and industry, in using virtual machines (VMs) as containers [94] for applications' execution environments [11] [16] [95]. This is motivated by the idea of providing computing resources as a utility and charging users for their specific usage. For example, in August 2006, Amazon launched the Beta version of its VM-based Elastic Compute Cloud (EC2) web service. EC2 allows users to rent virtual machines with specific configurations from Amazon and can support changes in resource configurations on the order of minutes. In systems that allow users to reserve and reconfigure resource allocations and charge based upon such allocations, users have an incentive to request no more than the amount of resources an application needs. A question which arises here is: how can the resource provisioning be adapted to the changing workload?

In this chapter, we focus on modeling and analyzing long-running applications' phase behavior. The modeling is based on monitoring and learning of the applications' historical resource consumption patterns, which likely vary over time. Understanding such behavior is critical to optimizing resource scheduling. To self-optimize the configuration of an application's execution environment, we first develop the analytical tools necessary to automatically and efficiently discover similarities and changes in an application's resource consumption over time, which is referred to as phase behavior.

In this context, a phase is defined as a set of intervals within an application's execution that have similar system-level resource consumption behavior, regardless of temporal adjacency. This means that a phase may reappear many times as an application executes. Phase classification partitions a set of intervals into phases with similar behavior. In this chapter, we introduce an application resource demand phase analysis and prediction prototype, which uses a combination of clustering and supervised learning techniques to investigate the following questions:

1) Is there a phase behavior in the application's resource consumption patterns? If so,

how many phases should be used to provide optimal resource provisioning?

2) Based on the observations of historical phase behaviors, what is the predicted next

phase of the application's execution?

3) How do phase transition frequency and prediction accuracy affect resource

allocation? Answers to these questions can be used to decide the time and space allocation

of resources.

This prototype takes the application's resource consumption patterns, phase transition costs, and penalties associated with Service-Level Agreement (SLA) violations into account while making optimization decisions. The prediction accuracy is fed back to guide future phase analysis. This prototype does not require any instrumentation of the application source code and can generally work with both physical and virtual machines that provide monitoring of system-level performance metrics.

Our experimental results with the CPU and the network performance traces of

SPECseis96 and WorldCup98 access log replay show that:









1. The total cost is a function of the number of phases. To best determine the number of phases used for prediction, it is necessary to account for the application's resource usage patterns, the unit resource cost, the unit resource re-provisioning cost associated with phase transitions, and the penalty associated with SLA violations caused by mispredictions.

2. For applications with phase behavior, typically with a small number of phases, the savings gained from phase-based resource reservation can outweigh the costs associated with the increased number of re-provisionings and the penalties caused by mispredictions.

3. The phase prediction accuracy decreases as the number of phases increases. With the current prototype, an average of above 90% phase prediction accuracy can be achieved for the CPU and network performance features when four phases are considered.

The rest of this chapter is organized as follows: Section 5.2 presents the application phase analysis and prediction model. Sections 5.3 and 5.4 detail the algorithms used for phase analysis and prediction. Section 5.5 presents experimental results. Section 5.6 discusses related work. Section 5.7 draws conclusions and discusses future work.

5.2 Application Resource Demand Phase Analysis and Prediction Prototype

Our application phase analysis and prediction prototype, illustrated in Figure

5-1, models how the application VM's performance data are collected and analyzed to

construct the corresponding application's phase profile and how the profile is used to

predict its next phase. In addition, it shows how process quality indicators, such as

phase prediction accuracy, are monitored and used as feedback signals to tune the system

performance (such as application response time) towards the goal defined in the SLA.

A performance monitoring agent is used to collect the performance data of the

application VM, which serves as the application container. The monitoring agent can

be implemented in various v--i In this work, Ganglia [54], a distributed monitoring

system, and the ;. n,';.i., tool [85] provided by VMware ESX server, are used to monitor



























VM: Virtual Machine
VMM: Virtual Machine Monitor
DB: Database
ARM: Application Resource Manager
CQ: Clustering Quality

Figure 5-1. Application resource demand phase analysis and prediction prototype
The phase analyzer analyzes the performance data collected by the monitoring agent to find the optimal number of phases n ∈ [1, m]. The output phase profile is stored in the application phase database (DB) and is used as training data for the phase predictor. The predictor predicts the next phase of the application resource usage based on the learning of its historical phase behaviors. The predicted phase can be used to support the application resource manager's (ARM's) decisions regarding resource provisioning. The auditor monitors and evaluates the performance of the analyzer and predictor and orders re-training of the phase predictor with the updated workload profile when the performance measurements drop below a predefined threshold.


the application containers. The collected performance data are stored in the performance database.

The phase analyzer retrieves the time-series VM performance data, which are identified by vmID, FeatureID, and a time window (ts, te), from the performance database. Then it performs phase analysis using algorithms based on clustering to check whether there is a phase behavior in the application's resource consumption patterns. If so, it continues to find out how many phases in a numeric range are best in terms of providing the minimal resource reservation costs. The output phase profile, which consists of the defined number of phases, the corresponding cluster centroids, and the resource usage statistics of each phase, is stored in the application phase database. The details of the data clustering algorithms are described in Section 5.3.

The phase profile is used as training data for the phase predictor. In the presence of phase behavior, the phase predictor can perform on-line prediction of the next phase of the application's resource usage based on the learning of historical phase behaviors, as shown in Section 5.4. The predicted phase information can be used to support the application resource manager's decisions regarding resource re-provisioning requests to the resource scheduler.

The auditor monitors and evaluates the health of the phase analysis and prediction process by performing quality control of each component. Clustering quality can be measured by the similarity and compactness of the clusters using various internal indices introduced in [96]. The phase predictor's performance is measured by its prediction accuracy. The application response time is used as an external signal for total quality control and checked against the Quality of Service (QoS) defined in the SLA. Local performance tuning is triggered when the auditor observes that the component-level service quality drops below a predefined threshold. For example, when the real-time workload varies to a degree which makes it statistically significantly different from the training workload, the phase prediction accuracies may drop. Upon detection, the auditor can order a phase analysis based on the recent workload to update the phase profile and subsequently order a re-training of the phase predictor. If the re-training still cannot improve the total quality of service to a satisfactory level, the resource reservation strategy falls back from the phase-based reservation to a conservative strategy, which reserves the largest amount of resources the user is willing to pay for during the whole application run. Automated and adaptive threshold setting is discussed in detail in [67].









5.3 Data Clustering

Clustering is an important data mining technique for discovering patterns in the data. It has been used effectively in many disciplines such as pattern recognition, biology, geology, and marketing.

At a high level, the problem of clustering is defined as follows: given a set U of n samples u_1, u_2, ..., u_n, we would like to partition U into k subsets U_1, U_2, ..., U_k such that the samples assigned to each subset are more similar to each other than to the samples assigned to different subsets. Here, we assume that two samples are similar if they correspond to the same phase.

5.3.1 Stages in Clustering

A typical pattern clustering activity involves the following steps [97]:

(1) Pattern representation, which is used to obtain an appropriate set of features to use in clustering. It optionally consists of feature extraction and/or selection. Feature selection is the process of identifying the most effective subset of the original features to use in clustering. Feature extraction is the use of one or more transformations of the input features to produce new salient features.

In the context of resource demand phase analysis, the features under study are the system-level resource performance metrics shown in Table 5-1. For one-dimensional analysis, which is the case in this work, feature selection is as simple as choosing the performance metric which is instructive to the allocation of the corresponding system resource. For clustering based on multiple performance metrics, feature extraction techniques such as Principal Component Analysis (PCA) may be used to transform the input performance metrics to a lower-dimensional space to reduce the computing intensity of the subsequent clustering and improve the clustering quality.

(2) Definition of a pattern proximity measure appropriate to the data domain. The pattern proximity is usually measured by a distance function defined on pairs of patterns. In this work, the most popular metric for continuous features, the Euclidean distance, is used to measure the dissimilarity between two patterns. It works well when a data set has "compact" or "isolated" clusters. In the case of clustering in a multi-dimensional space, normalization of the continuous features can be used to remove the tendency of the largest-scaled feature to dominate the others. In addition, the Mahalanobis distance can be used to remove the distortion caused by linear correlation among features, as discussed in Chapter 3.

(3) Clustering or grouping: the clustering can be performed in a number of ways [97]. The output clustering can be hard (a partition of the data into groups) or fuzzy (where each pattern has a variable degree of membership in each of the output clusters). A hard clustering can be obtained from a fuzzy partition by thresholding the membership values. In this work, one of the most popular iterative clustering methods, the k-means algorithm, detailed in Section 5.3.3, is used.

5.3.2 Definitions and Notation

In this chapter, we follow the terms and notation defined in [97]:

A pattern (or feature vector, or observation) is a single data item used by the clustering algorithm. It typically consists of a vector of d measurements.

The individual scalar components of a pattern are called features (or attributes).

d is the dimensionality of the pattern or of the pattern space.

A class refers to a state of nature that governs the pattern generation process in some cases. More concretely, a class can be viewed as a source of patterns whose distribution in feature space is governed by a probability density specific to the class. Clustering techniques attempt to group patterns so that the classes thereby obtained reflect the different pattern generation processes represented in the pattern set.

A distance measure is a metric on the feature space used to quantify the similarity of patterns.









5.3.3 k-means Clustering

The k-means algorithm is one of the most popular clustering algorithms. It is intended for situations in which all variables are of the quantitative type, and the squared Euclidean distance is chosen as the dissimilarity measure. The Euclidean distance between two points x_i and x_j in a d-dimensional space is written as

d^2(x_i, x_j) = \sum_{k=1}^{d} (x_{ik} - x_{jk})^2 = \|x_i - x_j\|^2    (5-1)

The k-means algorithm works as follows [97]:

(1) Choose k cluster centers to coincide with k randomly chosen patterns inside the hypervolume containing the pattern set.

(2) Assign each pattern to the closest cluster center.

(3) Recompute the cluster centers using the current cluster memberships.

(4) If a convergence criterion is not met, go to step 2. Typical convergence criteria are: no (or minimal) reassignment of patterns to new cluster centers, or minimal decrease in squared error.

The algorithm has a time complexity of O(n), where n is the number of patterns, and a space complexity of O(k), where k is the number of clusters. The algorithm is order-independent; for a given initial seed set of cluster centers, it generates the same partition of the data irrespective of the order in which the patterns are presented to the algorithm. However, the algorithm is sensitive to the initial seed selection, and even in the best case it can produce only hyperspherical clusters.
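To make the procedure concrete, the following is a minimal sketch of the k-means steps above. The original prototype was implemented in Matlab; this sketch uses Python with NumPy, and the function name kmeans is of our choosing:

    import numpy as np

    def kmeans(patterns, k, max_iter=100, seed=0):
        """Minimal k-means over an (n, d) pattern matrix, following steps (1)-(4)."""
        rng = np.random.default_rng(seed)
        # (1) Choose k centers to coincide with k randomly chosen patterns.
        centers = patterns[rng.choice(len(patterns), size=k, replace=False)].astype(float)
        labels = np.full(len(patterns), -1)
        for _ in range(max_iter):
            # (2) Assign each pattern to the closest center (squared Euclidean, Eq. 5-1).
            d2 = ((patterns[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            new_labels = d2.argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break  # (4) converged: no reassignment of patterns
            labels = new_labels
            # (3) Recompute each center as the mean of its current members.
            for i in range(k):
                if np.any(labels == i):
                    centers[i] = patterns[labels == i].mean(axis=0)
        return centers, labels

A one-dimensional resource trace, as used in this work, can be clustered by passing it as trace.reshape(-1, 1).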









5.3.4 Finding the Optimal Number of Clusters

One of the most venerable problems in cluster analysis is to find the optimal number of clusters in the data. Many statistical methods and computational algorithms have been developed to answer this question using external indices and/or internal indices [96]. The best number of clusters in the context of the phase analysis discussed in this work is the one that gives the minimal total cost. The process to find the optimal number of clusters for the application workload is explained as follows.

Let u_n = u(t_0 + nΔt) denote the resource usage sampled at time t = t_0 + nΔt during the execution of an application. As shown in Section 5.3.3, when clustering with input parameter k (i.e., the number of clusters) is performed for a resource usage set U = {u_1, u_2, ...}, the subset U_i of resource usages that belong to the ith phase can be written as:

U_i = \{u \mid u \in \text{phase } i\}, \quad 1 \le i \le k    (5-2)

Resource reservation strategy: Phase-based resource reservation is performed. For intervals whose resource usages belong to the ith phase, the local maximum resource usage U_i^{max} of phase i is reserved:

U_i^{max} = \max\{u \mid u \in U_i\}, \quad 1 \le i \le k    (5-3)

and the total resource reservation R over the whole execution period can be written as

R(k) = \sum_{i=1}^{k} U_i^{max} \times (\text{size of } U_i)    (5-4)

where k is the number of clusters used for the clustering algorithm and the size of U_i is defined as the number of elements of the subset U_i. Compared to the conservative reservation strategy, which reserves the global maximum amount of resources over the whole execution period, the phase-based reservation strategy can better adapt the resource reservation to the actual resource usage and reduce the resource reservation cost, as shown in Figure 5-2, which illustrates the difference between the two reservation strategies using a hypothetical workload.










[Figure 5-2 appears here: actual usage, phase-based reservation, and conservative reservation plotted against time, with time granularity Δt and space granularity Δs.]

Figure 5-2. Resource allocation strategy comparison
The phase-based resource allocation strategy can adapt the time (Δt) and space (Δs) granularity of the allocation to the actual resource usage. It presents a cost reduction opportunity compared to the coarse-grained conservative strategy.


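As an illustration of Equations 5-2 to 5-4, a minimal sketch of the phase-based reservation computation follows (Python/NumPy; the function and variable names are ours, with usage holding the sampled values u_n and labels the phase assignments produced by clustering):

    import numpy as np

    def phase_based_reservation(usage, labels, k):
        """Total reservation R(k): for each phase i, reserve the local maximum
        U_i^max (Eq. 5-3) for every interval assigned to that phase (Eq. 5-4)."""
        total = 0.0
        for i in range(k):
            members = usage[labels == i]  # the subset U_i of Eq. 5-2
            if members.size:
                total += members.max() * members.size
        return total

    # The conservative strategy reserves the global maximum for every interval:
    # usage.max() * usage.size, an upper bound on R(k) for any k.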

Phase transition cost: The second factor in determining the number of phases is the transition cost caused by switching between different phases. Define the transition cost TR(k) as the number of transitions among the k phases. The total cost TC(k) can be calculated from the resource reservation R(k) and the phase transitions TR(k) as

TC(k) = C_1 R(k) + C_2 TR(k)    (5-5)

where C_1 and C_2 denote the unit cost per resource usage and per transition, respectively. The best number of phases, k_{best}, should minimize the total cost. Therefore, k_{best} is derived as

k_{best} = \arg\min_{1 \le k \le K} TC(k) = \arg\min_{1 \le k \le K} \left[ R(k) + C \times TR(k) \right]    (5-6)

Encoding the misprediction penalty cost: The algorithm can be extended to phase prediction as well as phase analysis of resource usage. The determination of the best number of phases remains the same, whereas the cost function has to be changed to take over- or under-provisioning caused by prediction errors into account. Generally, mispredictions consist of two possible cases: over-provisioning and under-provisioning. Over-provisioning refers to the case where the resource reservation based on prediction is larger than the actual usage. It guarantees that the application response time is equal to or less than the time defined in the SLA. In this case, the penalty is the cost of the over-reserved resource, which is already encoded in the cost model. In the case of under-provisioning, the application's execution time is prolonged because of the resource constraint. The performance degradation is approximated by the penalty term in the total cost function. The penalty is defined as the difference between the under-reserved resource and the actual resource usage, and can be written as

u^{penalty} = \begin{cases} 0, & u \le U_i^{max} \\ u - U_i^{max}, & u > U_i^{max} \end{cases}    (5-7)

P(k) = \sum_{i=1}^{k} U_i^{penalty}    (5-8)

where k is the number of phases and U_i^{penalty} is the sum of the penalties incurred in phase i. Taking both the phase transition and misprediction costs into account, the general total cost function is modified as

TC'(k) = C_1 R(k) + C_2 TR(k) + C_3 P(k)    (5-9)

where C_1, C_2, and C_3 denote the unit cost per resource usage, per transition, and per unit of penalty, respectively. Therefore, k'_{best} is derived as

k'_{best} = \arg\min_{1 \le k \le K} TC'(k) = \arg\min_{1 \le k \le K} \left[ R(k) + C \times TR(k) + C_p P(k) \right]    (5-10)

where C is the transition factor, C_p denotes the discount factor for the misprediction penalty, which is the ratio of C_3 to C_1, and K is the maximum number of phases.
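The following sketch illustrates how Equation 5-10 can be evaluated in practice (Python/NumPy, with names of our choosing; cluster can be any clustering routine, such as the k-means sketch in Section 5.3.3). Note that when each phase's reserved level is its observed local maximum, the penalty term is zero; it becomes non-zero once reservations are driven by predicted rather than actual phases:

    import numpy as np

    def total_cost(usage, labels, reserved, C, Cp):
        """TC'(k) in units of C1 (Eqs. 5-7 to 5-10): reservation + transitions + penalty."""
        k = len(reserved)
        R = sum(reserved[i] * np.sum(labels == i) for i in range(k))
        TR = np.sum(labels[1:] != labels[:-1])  # number of phase transitions
        # Under-provisioning penalty (Eqs. 5-7, 5-8); zero if reservation covers usage.
        P = sum(max(u - reserved[l], 0.0) for u, l in zip(usage, labels))
        return R + C * TR + Cp * P

    def best_k(usage, K, C, Cp, cluster):
        """k'_best = argmin over 1 <= k <= K of TC'(k), per Eq. 5-10."""
        costs = {}
        for k in range(1, K + 1):
            _, labels = cluster(usage.reshape(-1, 1), k)
            reserved = [usage[labels == i].max() if np.any(labels == i) else 0.0
                        for i in range(k)]
            costs[k] = total_cost(usage, labels, reserved, C, Cp)
        return min(costs, key=costs.get), costs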

5.4 Phase Prediction

This section describes the workflow of the application resource demand phase prediction illustrated in Figure 5-3. The prediction consists of two stages: a training stage and a testing stage. During the training stage, the number of clusters in the application resource usage, the corresponding cluster centroids, and the unknown parameters of the time-series prediction model of the resource usage are determined. During the testing stage, the one-step-ahead resource usage is predicted and classified as one of the clusters.

Both stages start from pattern representation and framing. In the pattern representation step, the collected performance data of the application VM are profiled to extract only the features which will be used for clustering and future resource provisioning. For example, in the one-dimensional case discussed in this thesis, the training data of a specific performance feature (X_{1×u}, see Table 5-1) are extracted, where u is the total number of input data points. Then the extracted performance data X_{1×u} are framed with the prediction window size m to form the data X'_{(u-m+1)×m}.
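For illustration, framing a series into sliding windows can be sketched as follows (Python/NumPy; the function name frame is ours):

    import numpy as np

    def frame(x, m):
        """Frame a length-u series X_{1xu} into the (u-m+1) x m matrix X'."""
        x = np.asarray(x)
        return np.stack([x[i:i + m] for i in range(len(x) - m + 1)])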

The training stage mainly consists of two processes: prediction model fitting and phase behavior analysis. The algorithms defined in Sections 5.3.3 and 5.3.4 are used to find the number of phases k which gives the lowest total resource provisioning cost. The output phase profile is used to train the phase predictor. In addition, the unknown parameters of the resource predictor are estimated from the training data. In this thesis, a time-series prediction model, the autoregressive (AR) model, is used for its simplicity and proven success in computer system resource prediction [78]. However, this prototype can generally work with any other time-series prediction model. In the case of highly dynamic workloads, the Learning-Aided Resource Predictor (LARPredictor) developed in Chapter 4 can be used. The LARPredictor uses a mix-of-experts approach, which adaptively chooses the best prediction model from a pool of models based on learning of the correlations between the workload and the fitted prediction models of historical runs.

Similar to the training stage, the testing data Y_{1×v} are extracted and framed with the prediction window size m. The framed testing data Y'_{(v-m+1)×m} are used as input to the fitted resource predictor to predict the future resource usage Ŷ_{1×v}. The phase predictor classifies the predicted resource usages Ŷ_{1×v} into the phases P'_{1×v} based on the phase profile learned in the training stage. Similarly, phase predictions for the actual resource usage Y_{1×v} are performed to generate P_{1×v}. Then the corresponding predicted phases P'_{1×v} (which are based on the predicted resource usage) and P_{1×v} (which are based on the actual resource usage) are compared to evaluate the phase prediction accuracy, which is defined as the ratio of the number of matched phase predictions to the total number of phase predictions.
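To make the workflow concrete, the sketch below fits an AR(m) predictor by least squares, classifies predicted and actual usages against the phase centroids, and computes the phase prediction accuracy. It is a simplified illustration under our own naming (it builds on the frame sketch above, and centroids is a 1-D array of cluster centroids from the training stage), not the exact Matlab implementation used in the experiments:

    import numpy as np

    def fit_ar(x, m):
        """Least-squares AR(m): coefficients a with x_{t+1} ~ a . (x_{t-m+1}, ..., x_t)."""
        F = frame(x, m)  # framed windows; see the frame() sketch above
        coeffs, *_ = np.linalg.lstsq(F[:-1], np.asarray(x)[m:], rcond=None)
        return coeffs

    def classify_phase(values, centroids):
        """Assign each usage value to the phase with the nearest centroid (1-D case)."""
        return np.abs(np.asarray(values)[:, None] - centroids[None, :]).argmin(axis=1)

    def phase_prediction_accuracy(x_test, m, coeffs, centroids):
        """Ratio of matched phase predictions to total phase predictions."""
        F = frame(x_test, m)
        y_hat = F[:-1] @ coeffs          # one-step-ahead predicted usage Y'
        y_act = np.asarray(x_test)[m:]   # actual usage at the predicted steps
        p_hat = classify_phase(y_hat, centroids)
        p_act = classify_phase(y_act, centroids)
        return float(np.mean(p_hat == p_act))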

5.5 Empirical Evaluation

We have implemented a prototype of the phase analysis and prediction model, including Perl and shell scripts to extract and profile the performance data from the performance database, and a Matlab implementation of the phase analyzer and predictor. This section shows the experimental results of the phase analysis and prediction performance evaluations using traces collected from batch executions of SPECseis96, a scientific benchmark program, and a replay of the WorldCup98 web access log. In all the experiments, ten-fold cross validation was performed for each set of time-series performance data.









Table 5-1. Performance feature list
Performance Features    Description
CPU_System / User       Percent of CPU time in system / user mode
BytesIn / Out           Number of bytes per second into / out of the network
IO_BI / BO              Blocks sent to / received from a block device (blocks/s)
Swap_In / Out           Amount of memory swapped in / out from / to disk (kB/s)


5.5.1 Phase Behavior Analysis

This set of experiments illustrates how the cost model presented in Section 5.3.4 can be used to find the best number of clusters for an application workload. The Ganglia monitoring daemon was used to collect the performance data of the application container. Table 5-1 shows the list of performance features under study in the experiments.

5.5.1.1 SPECseis96 benchmark

In this experiment, the SPECseis96 benchmark, which is a CPU-intensive workload representing a scientific application [53], was hosted by a VMware GSX virtual machine. The host server of the virtual machine was an Intel(R) Xeon(TM) dual-CPU 1.80GHz machine with 512KB cache and 1GB RAM. The Ganglia daemon was installed in the guest VM and run to collect the resource performance data once every five seconds (5 secs/interval) and store them in the performance database. During feature representation, the data were extracted based on the given vmID, FeatureID, and starting and ending time stamps to form the time-series data under study. Then the subsequent phase analysis was performed for the 8000 performance snapshots collected during the monitoring periods.

Figure 5-4A shows a sample set of training data of the CPU_user (%) of the VM, including the actual resource usage (Actual Rsc), the resources reserved based on k-means clustering with k=3 (Rsvd Rsc), and those reserved based on the conservative reservation strategy (Consrv Rsc). Figure 5-4B shows a sample set of the corresponding testing data, including the actual resource usage (Actual Rsc), the resource reservation based on the actual resource usage (Rsvd Rsc (Actual)), the resource usage predicted by the AR model (Predicted Rsc), and the resource reservation based on the predicted usage (Rsvd Rsc (Predict)).

Figures 5-4C and 5-4D show that, with an increasing number of phases, two of the determinants in the cost model, the number of phase transitions TR(k) and the misprediction penalty P(k), increase monotonically. The other determinant of the cost model, the amount of reserved resources R(k), is shown by the lowest curve, with index C = 0, in Figure 5-4E. It indicates that with an increasing number of phases the total reserved resources of the training set decrease monotonically. This is because, with an increasing number of phases, the resource allocation can be performed at time scales of finer granularity. However, there is a diminishing return on the increased number of phases because of the increasing phase transition costs and misprediction penalties.

In the first analysis, we assume each resource reservation scheme to be clairvoyant, i.e., it reserves resources based on exact knowledge of future workload requirements. This assumption eliminates the impact of inaccuracies introduced by the phase predictor. In this case, Equation (5-6), which takes the resource reservation cost and the phase transition cost into account while deciding the optimal number of phases, can be applied as shown in Figure 5-4E. In this figure, the total cost over the whole testing period is measured by CPU usage in percentage. The discount factor C denotes the CPU percentage that each phase transition costs: C = CPU(%) x TransitionDuration. For example, the bottom line of C = 0 shows the case of no transition cost, which gives the lower bound of the total cost. For another instance, C = 260 implies a 13-second transition period (2.6 intervals x 5 secs/interval) with the assumption of 100% CPU consumption during the transition period. When the discount factor C increases from 0 to 260, the best number of phases k_{best}, which can provide the lowest total cost, decreases gradually from 10 to 2. The phase profile depicted in Figure 5-4E can be used to decide the number of phases that should be used in the phase-based resource reservation to minimize the total cost given the available transition options. For example, VMware ESX supports









on-line resource re-provisioning on the same cluster node, so the transition time can be virtually close to zero (C = 0). In this case, 10 phases can be used. If the transition takes 8 seconds (C = 156), which is achievable with intra-cluster VM migration for resource re-provisioning, four phases work best. When the transition cost exceeds the level that the reduced resource reservation can justify for the workload under study, the total cost is an increasing function of the number of phases. In this case, it is better to fall back from the phase-based resource reservation strategy to the conservative one.

The impact of the inaccuracies introduced by the phase predictor is shown in Figure 5-4F. In addition to the resource reservation costs and the phase transition costs, this experiment also took the phase misprediction penalty costs into account while calculating the total cost. For example, for each unit of under-predicted resource, a penalty of 8 times (C_p = 8) the unit resource cost is imposed. Comparing Figure 5-4E to Figure 5-4F, we can see that adding the penalty to the cost model increases the final costs to the user for the same set of k and C and potentially reduces the workload's best number of phases k'_{best} for the same set of C and C_p.

Finally, a total cost ratio ρ is defined as the total cost using k phases, TC'(k), relative to the total cost of 1 phase, TC'(1):

\rho = TC'(k) / TC'(1)    (5-11)

Intuitively, ρ measures the cost savings achieved using the phase-based reservation strategy over the conservative one. Thus, the smaller the value of ρ, the more efficient the phase-based reservation scheme. Table 5-2 gives a sample total cost schedule (C = 52 and C_p = 8) for each of the eight performance features of SPECseis96. It shows that by changing the resource provisioning strategy from the conservative approach (k = 1) to phase-based provisioning (k = 3), a 29.5% total cost reduction for CPU usage can be achieved. For spiky trace data such as disk I/O and memory usage, the total cost reduction can be as high as 53%.









Table 5-2. SPECseis96 total cost ratio schedule for the eight performance features

Performance   Number of phases (k)
Features      1     2     3     4     5     6     7     8     9     10
CPU_user      1.00  0.80  0.75  0.75  0.75  0.77  0.78  0.78  0.80  0.83
CPU_system    1.00  0.67  0.66  0.65  0.64  0.66  0.67  0.69  0.70  0.71
Bytes_in      1.00  0.97  0.96  0.96  0.96  0.96  0.96  0.95  0.95  0.95
Bytes_out     1.00  0.95  0.90  0.88  0.90  0.87  0.87  0.87  0.87  0.87
IO_BI         1.00  0.57  0.52  0.55  0.56  0.58  0.62  0.63  0.62  0.64
IO_BO         1.00  0.57  0.53  0.55  0.57  0.61  0.60  0.61  0.64  0.63
Swap_in       1.00  0.54  0.55  0.59  0.59  0.60  0.61  0.63  0.64  0.65
Swap_out      1.00  0.51  0.47  0.49  0.54  0.55  0.57  0.58  0.59  0.61
(Total cost ratio \rho = TC'(k)/TC'(1), where C = 52 and C_p = 8)

5.5.1.2 World Cup web log replay

In this experiment, phase characterization was performed for the performance data collected from a network-intensive application, a replay of the 1998 World Cup web access log. The workload used in this experiment was based on the 1998 World Cup trace [98]. This openly available trace, containing a log of requests to web servers, was used as input to a client replay tool, which enabled us to exercise a realistic web-based workload and collect system-level performance metrics using Ganglia in the same manner as was done for the SPECseis96 workload. For this study, we chose to replay the five-hour (from 22:00:01 Jun. 23 to 3:11:20 Jun. 24) log of the least loaded server (serverID 101), which contained 130,000 web requests.

The phase analysis and prediction techniques can be used to characterize performance data collected not only from virtual machines but also from physical machines. During the experiment, a physical server with sixteen Intel(R) Xeon(TM) MP 3.00GHz CPUs and 32GB memory was used to execute the replay clients, which submitted requests based on the submission intervals, HTTP protocol types (1.0 or 1.1), and document sizes defined in the log file. A physical machine with an Intel(R) Pentium(R) 4 1.70GHz CPU and 512MB memory was used to host the Apache web server and a set of files which were created based on the file sizes described in the log.











To perform the web log replay, a Matlab program was developed to profile the binary access log file and extract the entries of the target web server. The "recreate" tool provided by [98] was used to convert the binary log into the Common Log Format. A modified version of the Real-Time Web Log Replayer [99] was used to analyze and generate the files needed by the log replayer and to perform the replay.

Figures 5-5 and 5-6 show the phase characterization results for the performance features bytes_in and bytes_out of the web server. The interesting observation from Figures 5-5A and 5-5B is that the number of phase transitions and the misprediction penalties do not always increase monotonically with the increasing number of phases. As a result, the phase profile shown in Figure 5-5C argues that three-phase based resource provisioning gives the lowest total cost for C = [150k, 750k] and C_p = 8. The results imply that the phase profile is highly workload dependent. The prototype presented in this thesis can help to construct and analyze the phase profile of the application's resource consumption and decide the proper resource provisioning strategy.

5.5.2 Phase Prediction Accuracy

As one of the cost determinants, the misprediction penalty is a function of the phase prediction accuracy. This section evaluates the performance of the phase prediction model introduced in Section 5.4. A performance measurement, the prediction accuracy, is defined as the ratio of the number of performance snapshots whose predicted phases match the observed phases to the total number of performance snapshots collected during the testing period.

Table 5-3 shows the phase prediction accuracies for the performance traces of the main resources consumed by the SPECseis96 and WorldCup98 workloads. Generally, the phase prediction accuracy of each performance feature decreases with an increasing number of phases. This explains why the penalty curve rises monotonically with the increasing number of phases in Figure 5-4D. With the current implementation, an average of 95% accuracy can be achieved for the network performance traces of the WorldCup98 log









Table 5-3. Average phase prediction accuracy
                          Number of phases (k)
Application  Features     1     2     3     4     5     6     7     8     9     10
WorldCup98   Bytes_in     1.00  0.99  0.99  0.98  0.98  0.97  0.97  0.96  0.96  0.96
             Bytes_out    1.00  0.94  0.94  0.92  0.91  0.89  0.87  0.88  0.86  0.84
SPECseis96   CPU_user     1.00  0.95  0.90  0.87  0.85  0.81  0.78  0.77  0.74  0.69
             CPU_system   1.00  0.94  0.87  0.83  0.83  0.79  0.76  0.74  0.73  0.69


Table 5-4. Performance feature list of VM traces
Perf. Features Description
CPU_Ready The percentage of time that the virtual machine
was ready but could not get scheduled to run on
a physical CPU.
CPU_Used The percentage of physical CPU resources used
by a virtual CPU.
Mem_Size Current amount of memory in bytes the virtual
machine has.
Mem_Swap Amount of swap space in bytes used by the
virtual machine.
Net_RX/TX The number of packets and the MBytes per
second that are transmitted and received by a NIC.
Disk_RD/WR The number of I/Os and KBytes per second
that are read from and written to the disk.


replay, and an average of 85% accuracy can be achieved for the CPU performance traces of SPECseis96 for the four-phase cases.

In addition to the above two applications, we also evaluated the prediction performance of the phase predictor using the traces of a set of five virtual machines. These virtual machines were hosted by a physical machine with an Intel(R) Xeon(TM) 2.0GHz CPU, 4GB memory, and a 36GB SCSI disk. VMware ESX server 2.5.2 was running on the physical host. The vmkusage tool was run on the ESX server to collect the resource performance data of the guest virtual machines every minute and store them in a round-robin database. The performance features under study in this experiment are shown in Table 5-4.









In this experiment, a virtual machine (VM1), which hosts a web server, Globus GRAM/MDS and GridFTP services, and a PBS head node, was used. Its trace data over a 7-day period with 30-minute intervals were extracted. During this period, a total of 310 jobs were executed, with a mix of 93.55% short-running jobs (1 to 2 seconds), 3.7% medium-running jobs (2 to 10 minutes), and 2.75% long-running jobs (45 to 50 minutes). In addition to VM1, 24-hour performance traces with 5-minute intervals of four additional virtual machines were evaluated as well: VM2, which hosts a Linux-based port-forwarding proxy for VNC sessions; VM3, which hosts a WindowsXP-based calendar; VM4, which hosts a web server, a list server, and a Wiki server; and VM5, which hosts a web server.

Table 5-5 shows the average phase prediction accuracies for each of the 12 performance features over all five VMs. It shows that with an increasing number of phases the phase prediction accuracy of each performance feature decreases monotonically. The prediction accuracies vary with the performance features under study. With the current implementation, an average of 83.25% accuracy can be achieved across the phase predictions of all twelve performance features for the two-phase cases.

5.5.3 Discussion

In the phase analysis and prediction experiments, the following assumptions regarding the components of the cost model are made:

1. A clear mapping between resource consumption and response time is assumed for the application container. This might not always be true for all types of applications. More complex performance/queuing models may be needed to provide an accurate mapping in the case of complex applications.

2. A dedicated machine is assumed for the application container to collect the performance data. In case multiple applications co-exist on the same hosting machine, a more sophisticated method of data collection, for example aggregating the performance data of the processes that belong to the same application, may be needed.








Table 5-5. Average phase prediction accuracy of the five VMs
Performance    Number of phases (k)
Features       1     2     3     4     5     6     7     8     9     10
CPU_Used       1.00  0.85  0.69  0.60  0.51  0.48  0.43  0.44  0.38  0.35
CPU_Ready      1.00  0.81  0.67  0.52  0.45  0.36  0.36  0.32  0.33  0.32
Mem_Size       1.00  0.91  0.84  0.71  0.70  0.59  0.57  0.52  0.50  0.48
Mem_Swap       1.00  0.96  0.89  0.89  0.83  0.75  0.71  0.70  0.66  0.64
NIC#1_RX       1.00  0.58  0.54  0.47  0.41  0.39  0.37  0.34  0.30  0.28
NIC#1_TX       1.00  0.56  0.48  0.42  0.39  0.35  0.29  0.26  0.29  0.25
NIC#2_RX       1.00  0.93  0.77  0.70  0.61  0.55  0.46  0.33  0.31  0.24
NIC#2_TX       1.00  0.88  0.81  0.76  0.71  0.63  0.53  0.48  0.56  0.45
Disk1_Read     1.00  0.97  0.92  0.86  0.80  0.73  0.64  0.56  0.52  0.44
Disk1_Write    1.00  0.94  0.87  0.78  0.70  0.67  0.63  0.59  0.58  0.55
Disk2_Read     1.00  0.67  0.61  0.55  0.50  0.49  0.47  0.46  0.41  0.38
Disk2_Write    1.00  0.93  0.84  0.76  0.60  0.57  0.51  0.46  0.41  0.38


3. In this work, one-dimensional phase analysis and prediction is performed. However, the prototype can generally work for multi-dimensional resource provisioning cases as well. For clustering in a multi-dimensional space, additional pattern representation techniques such as Principal Component Analysis (PCA) can be used to project the data to a lower-dimensional space to reduce the computing intensity. In addition, the transition factor C will then represent the unit transition cost defined in the pricing schedule of the resource provider.

Developing prediction models for parallel and multi-tier applications is part of our future research.
5.6 Related Work

Recently, applications' phase behavior has drawn growing research interest for different reasons. First, tracking application phases enables workload-dependent dynamic management of power/performance trade-offs [100][101]. Second, phase characterization that summarizes application behavior with representative execution regions can be used









to reduce the high computation costs of large-scale simulations [102] [103]. Our purpose in studying phase behavior is to support dynamic resource provisioning of the application containers.

In addition to the purpose of study, our approach differs from traditional program phase analysis in the following ways:

1) Performance metrics under study: In the area of power management and simulation optimization for computer architecture research, the metrics used for workload characterization are typically Basic Block Vectors (BBV) [102] [101], conditional branch counters [104], and instruction working sets [105]. In the context of application VM/container resource provisioning, the metrics under study are the system-level performance features which are instructive to VM resource provisioning, such as those shown in Table 5-1.

2) Knowledge of the program code: While [102] [101] [104] at least require profiling of program binary code, our approach requires neither instrumentation nor access to program code.

3) This thesis answers the question "how many clusters are best" in the context of system-level resource provisioning.

In [106], Dhodapkar et al. compared three dynamic program phase detection

techniques discussed in [102], [104], and [105] using a variety of performance metrics, such

as sensitivity, stability, performance variance and correlations between phase detection

techniques.

In addition, other related work on resource provisioning includes the following: Urgaonkar et al. studied resource provisioning in a multi-tier web environment [107]. Wildstrom et al. developed a method to identify the best CPU and memory configuration from a pool of configurations for a specific workload [108]. Chase et al. proposed a hierarchical architecture that allocates virtual clusters to a group of applications [109]. Kusic et al. developed an optimization framework to decide the number of servers to allocate to each cluster to maximize system revenue [110]. Tesauro et al. used a combination of reinforcement learning and queuing models for system performance management [5].

5.7 Conclusion

The application resource demand phase analysis and prediction prototype presented in this chapter shows how to apply statistical learning techniques to support on-demand resource provisioning. This chapter shows how to define phases in the context of system-level resource provisioning and provides an approach to automatically find the number of phases which provides the optimal cost. The proposed cost model takes the resource cost, the phase transition cost, and the prediction accuracy into account. The experimental results show that an average of above 90% phase prediction accuracy can be achieved in the experiments across the CPU and network performance features under study for the four-phase cases. With the knowledge of the system-level application phase behavior, we envision that dynamic optimization of resource scheduling during the application run can be performed to improve system utilization and reduce the cost for the user. Providing more informative phase prediction can help to achieve this goal and is part of our future research.











[Figure 5-3 appears here: the workflow diagram showing the training data flow (framed data X', phase labels P, centroids C) and the testing data flow (framed data Y', predicted phases P'), ending in the phase prediction accuracy.]

Figure 5-3. Application resource demand phase prediction workflow
In the training stage, the u performance data X_{1×u} of the feature(s) used in the subsequent phase analysis are extracted (pattern representation) and framed with prediction window size m. The unknown parameters of the resource predictor are estimated during model fitting using the framed training data X'_{(u-m+1)×m}. In addition, the clustering algorithms introduced in Section 5.3 are used to construct the application phase profile, including the phase labels P_{1×u} for all the samples and the calculated cluster centroids C_{1×k}. In the testing stage, the phase predictor uses the knowledge learned from the phase profile to predict the future phases P'_{1×v} based on the predicted resource usage Ŷ_{1×v}, and P_{1×v} based on the observed actual resource usage Y_{1×v}, and compares them to evaluate the phase prediction accuracy.















Figure 5-4. Phase analysis of SPECseis96 CPU_user. A) Sample training data. B) Sample testing data. C) Phase transitions. D) Misprediction penalties. E) Total cost without penalty. F) Total cost with penalty (C_p = 8)







































Figure 5-5. Phase analysis of WorldCup'98 Bytes_in. A) Phase transitions. B) Misprediction penalties. C) Total cost with penalty (C_p = 8)


Figure 5-6. Phase analysis of WorldCup'98 Bytes_out. A) Phase transitions. B) Misprediction penalties. C) Total cost with penalty (C_p = 8)









CHAPTER 6
CONCLUSION

Self-management has drawn increasing attention in the last few years due to the increasing size and complexity of computing systems. A resource scheduler that can perform self-optimization and self-configuration can help to improve the system throughput and free system administrators from labor-intensive and error-prone tasks. However, it is challenging to equip a resource scheduler with such self-* capacities because of the dynamic nature of system performance and workload.

In this dissertation, we propose to use machine learning techniques to assist system

performance modeling and application workload characterization, which can provide

support for on-demand resource scheduling. In addition, virtual machines are used

as resource containers to host application executions for the ease of dynamic resource

provisioning and load balancing.

The application classification framework presented in Chapter 2 used Principal Component Analysis (PCA) to reduce the dimension of the performance data space. Then the k-Nearest Neighbor (k-NN) algorithm is used to classify the data into different classes such as CPU-intensive, I/O-intensive, memory-intensive, and network-intensive. It does not require modifications of the application source code. Experiments with various benchmark applications suggest that, with the application class knowledge, a scheduler can improve the system throughput by 22.1% on average by allocating applications of different classes to share the system resources.

The feature selection prototype presented in Chapter 3 uses a probabilistic model (Bayesian Network) to systematically select the representative performance features, which can provide optimal classification accuracy and adapt to changing workloads. It shows that autonomic feature selection enables classification without requiring expert knowledge in the selection of relevant low-level performance metrics. This approach requires neither application source code modification nor execution intervention. Results from experiments show that the proposed scheme can effectively select a performance metric subset providing above 90% classification accuracy for a set of benchmark applications.

In addition to the application resource demand modeling, Chapter 4 proposes a learning-based adaptive predictor, which can be used to predict resource availability. It uses the k-NN classifier and PCA to learn the relationship between workload characteristics and the suited predictor based on historical predictions, and to forecast the best predictor for the workload under study. Then only the selected best predictor is run to predict the next value of the performance metric, instead of running multiple predictors in parallel to identify the best one. The experimental results show that this learning-aided adaptive resource predictor can often outperform the single best predictor in the pool without a priori knowledge of which model best fits the data.

The application classification and feature selection techniques can be used to define the application resource consumption patterns at any given moment. The experimental results of the application classification suggest that allocating applications which have complementary resource consumption patterns to the same server can improve the system throughput.

In addition to one-step-ahead performance prediction, Chapter 5 studied the large-scale behavior of application resource consumption. Clustering-based algorithms have been explored to provide a mechanism to define and predict the phase behavior of the application resource usage to support on-demand resource allocation. The experimental results show that an average of above 90% phase prediction accuracy can be achieved for the four-phase cases of the benchmark workloads.









REFERENCES


[1] J. Kephart and D. Chess, "The vision of autonomic computing," Computer, vol. 36, no. 1, pp. 41-50, 2003.

[2] Y. Yang and H. Casanova, "Rumr: Robust scheduling for divisible workloads," in Proc. 12th High-Performance Distributed Computing, Seattle, WA, June 22-24, 2003, pp. 114-125.

[3] J. M. Schopf and F. Berman, "Stochastic scheduling," in Proc. ACM/IEEE Conference on Supercomputing, Portland, OR, Nov. 14-19, 1999, p. 48.

[4] L. Yang, J. M. Schopf, and I. Foster, "Conservative scheduling: Using predicted variance to improve scheduling decisions in dynamic environments," in Proc. ACM/IEEE Conference on Supercomputing, Nov. 15-21, 2003, p. 31.

[5] G. Tesauro, N. Jong, R. Das, and M. Bennani, "A hybrid reinforcement learning approach to autonomic resource allocation," in Proc. IEEE International Conference on Autonomic Computing (ICAC'06), 2006, pp. 65-73.

[6] G. Tesauro, R. Das, W. Walsh, and J. Kephart, "Utility-function-driven resource allocation in autonomic systems," in Proc. Second International Conference on Autonomic Computing (ICAC'05), 2005, pp. 342-343.

[7] R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, Wiley-Interscience, New York, NY, Apr. 1991.

[8] J. O. Kephart, "Research challenges of autonomic computing," in Proc. 27th International Conference on Software Engineering (ICSE), May 2005, pp. 15-22.

[9] S. M. Weiss and C. A. Kulikowski, Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems, Morgan Kaufmann, San Mateo, CA, 1990.

[10] R. P. Goldberg, "Survey of virtual machine research," IEEE Computer Magazine, vol. 7, no. 6, pp. 34-45, June 1974.

[11] R. Figueiredo, P. Dinda, and J. Fortes, "A case for grid computing on virtual machines," in Proc. 23rd International Conference on Distributed Computing Systems, May 19-22, 2003, pp. 550-559.

[12] S. Pinter, Y. Aridor, S. Shultz, and S. Guenender, "Improving machine virtualization with 'hotplug memory'," Proc. 17th International Symposium on Computer Architecture and High Performance Computing, pp. 168-175, 2005.

[13] C. Clark, K. Fraser, S. Hand, J. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield, "Live migration of virtual machines," in Proc. 2nd Symposium on Networked Systems Design & Implementation (NSDI'05), Boston, MA, 2005.









[14] "Vmotion," http://www.vmware.com/products/vi/vc/vmotion.html.

[15] M. Zhao, J. Z!h iw- and R. Figueiredo, "Distributed file system support for virtual
machines in grid computing," Proc. 13th International Symposium on High Perfor-
mance Distributed CorT,,';,.:. pp. 202-211, 2004.

[16] I. Krsul, A. Ganguly, J. Zhang, J. Fortes, and R. Figueiredo, "Vmplants: Providing
and managing virtual machine execution environments for grid computing," in Proc.
Super, .i,,;jl/.:. Washington, DC, Nov. 6-12, 2004.

[17] J. Sugerman, G. Venkitachalan, and B. Lim, "Virtualizing i/o devices on vmware
workstation's hosted virtual machine monitor," in Proc. USENIX Annual Technical
Conference, 2001.

[18] J. Dike, "A user-mode port of the linux kernel," in Proc. 4th Annual Linux Showcase
and Conference, USENIX Association, Atlanta, GA, Oct. 2000.

[19] A. Sundararaj and P. Dinda, "Towards virtual networks for virtual machine grid
computing," in Proc. 3rd USENIX Virtual Machine Research and T / ',,. 4..,i;
Symposium, T i_- 2004.

[20] M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny, "C' I. l.point and
migration of UNIX processes in the Condor distributed processing system," Tech.
Rep. UW-CS-TR-1346, University of Wisconsin Madison Computer Sciences
Department, Apr. 1997.

[21] A. Barak, O. Laden, and Y. Yarom, "The now mosix and its preemptive process
migration scheme," Bulletin of the IEEE Technical Committee on Operating Sl-/' i
and Application Environments, vol. 7, no. 2, pp. 5-11, 1995.

[22] R. Duda, P. Hart, and D. Stork, Pattern Classification, Wiley-Interscience, New York, NY, 2001, 2nd edition.

[23] C. G. Atkeson, A. W. Moore, and S. Schaal, "Locally weighted learning," Artificial Intelligence Review, vol. 11, no. 1-5, pp. 11-73, 1997.

[24] S. Adabala, V. Chadha, P. Chawla, R. J. O. Figueiredo, J. A. B. Fortes, I. Krsul, A. M. Matsunaga, M. O. Tsugawa, J. Zhang, M. Zhao, L. Zhu, and X. Zhu, "From virtualized resources to virtual computing grids: the In-VIGO system," Future Generation Comp. Syst., vol. 21, no. 6, pp. 896-909, 2005.

[25] L. Yu and H. Liu, "Efficient feature selection via analysis of relevance and redundancy," Journal of Machine Learning Research, vol. 5, pp. 1205-1224, Oct. 2004.

[26] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Trans. Inf. Theory, vol. 13, no. 1, pp. 21-27, Jan. 1967.

[27] M. L. Massie, B. N. Chun, and D. E. Culler, "The ganglia distributed monitoring system: Design, implementation, and experience," Parallel Computing, vol. 30, no. 5-6, pp. 817-840, 2004.

[28] "PostMark," http://www.netapp.com/techlibrary/3022.html.

[29] R. Eigenmann and S. Hassanzadeh, "Benchmarking with real industrial applications: the SPEC high-performance group," IEEE Computational Science and Engineering, vol. 3, no. 1, pp. 18-23, 1996.

[30] "Ettcp," http://sourceforge.net/projects/ettcp/.

[31] "SimpleScalar," http://www.cs.wisc.edu/~mscalar/simplescalar.html.

[32] "CH3D," http://users.coastal.ufl.edu/~pete/CH3D/ch3d.html.

[33] "Bonnie," http://www.textuality.com/bonnie/.

[34] Q. Snell, A. Mikler, and J. Gustafson, "Netpipe: A network protocol independent performance evaluator," June 1996.

[35] "VMD," http://www.ks.uiuc.edu/Research/vmd/.

[36] "SPIM," http://www.cs.wisc.edu/~larus/spim.html.

[37] "Reference of STREAM," http://www.cs.virginia.edu/stream/ref.html.

[38] "Autobench," http://www.xenoclast.org/autobench/.

[39] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3, pp. 1157-1182, Mar. 2003.

[40] Y. Liao and V. R. Vemuri, "Using text categorization techniques for intrusion detection," in 11th USENIX Security Symposium, San Francisco, CA, Aug. 5-9, 2002, pp. 51-59.

[41] A. K. Ghosh, A. Schwartzbard, and M. Schatz, "Learning program behavior profiles for intrusion detection," in Proc. Workshop on Intrusion Detection and Network Monitoring, Santa Clara, CA, Apr. 9-12, 1999, pp. 51-62.

[42] M. Almgren and E. Jonsson, "Using active learning in intrusion detection," in Proc. 17th IEEE Computer Security Foundations Workshop, June 28-30, 2004, pp. 88-98.

[43] S. C. Lee and D. V. Heinbuch, "Training a neural-network based intrusion detector to recognize novel attacks," IEEE Transactions on Systems, Man, and Cybernetics, Part A, vol. 31, no. 4, pp. 294-299, 2001.

[44] G. Forman, "An extensive empirical study of feature selection metrics for text classification," J. Mach. Learn. Res., vol. 3, pp. 1289-1305, 2003.









[45] N. H. Kapadia, J. A. B. Fortes, and C. E. Brodley, "Predictive application-performance modeling in a computational grid environment," in Proc. 8th IEEE International Symposium on High Performance Distributed Computing, Redondo Beach, CA, Aug. 3-6, 1999, p. 6.

[46] J. Basney and M. Livny, "Improving goodput by coscheduling CPU and network capacity," Int. J. High Perform. Comput. Appl., vol. 13, no. 3, pp. 220-230, Aug. 1999.

[47] R. Raman, M. Livny, and M. Solomon, "Policy driven heterogeneous resource co-allocation with gangmatching," in Proc. 12th IEEE International Symposium on High Performance Distributed Computing (HPDC'03), Seattle, WA, June 22-24, 2003, p. 80.

[48] S. Sodhi and J. Subhlok, "Skeleton based performance prediction on shared networks," in IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2004), 2004, pp. 723-730.

[49] V. Taylor, X. Wu, and R. Stevens, "Prophesy: an infrastructure for performance analysis and modeling of parallel and grid applications," SIGMETRICS Perform. Eval. Rev., vol. 30, no. 4, pp. 13-18, 2003.

[50] O. Y. Nickolayev, P. C. Roth, and D. A. Reed, "Real-time statistical clustering for event trace reduction," The International Journal of Supercomputer Applications and High Performance Computing, vol. 11, no. 2, pp. 144-159, Summer 1997.

[51] D. H. Ahn and J. S. Vetter, "Scalable analysis techniques for microprocessor performance counter metrics," in Proc. SuperComputing, Baltimore, MD, Nov. 16-22, 2002, pp. 1-16.

[52] I. Cohen, J. S. Chase, M. Goldszmidt, T. Kelly, and J. Symons, "Correlating instrumentation data to system states: A building block for automated diagnosis and control," in 6th USENIX Symposium on Operating Systems Design and Implementation, 2004, pp. 231-244.

[53] J. Zhang and R. Figueiredo, "Application classification through monitoring and learning of resource consumption patterns," in Proc. 20th International Parallel & Distributed Processing Symposium, Rhodes Island, Greece, Apr. 25-29, 2006.

[54] M. Massie, B. Chun, and D. Culler, The Ganglia Distributed Monitoring System: Design, Implementation, and Experience, Addison-Wesley, Reading, MA, 2003.

[55] S. Agarwala, C. Poellabauer, J. Kong, K. Schwan, and M. Wolf, "Resource-aware stream management with the customizable dproc distributed monitoring mechanisms," in Proc. 12th IEEE International Symposium on High Performance Distributed Computing, June 22-24, 2003, pp. 250-259.

[56] "HP," http://www.managementsoftware.hp.com.

[57] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering," IEEE Trans. Knowl. Data Eng., vol. 17, no. 4, pp. 491-502, Apr. 2005.

[58] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers, San Francisco, CA, 1988.

[59] T. Dean, K. B .-,-., R. Chekaluk, S. Hyun, M. Lejter, and M. Randazza, "Coping
with uncertainty in a control system for navigation and exploration.," in Proc. 8th
National Conference on Ar'.:, .:,l Intelligence, Boston, MA, July 29-Aug. 3, 1990,
pp. 1010-1015.

[60] D. Heckerman, "Probabilistic similarity networks," Tech. Rep., Depts. of Computer
Science and Medicine, Stanford University, 1990.

[61] D. J. Spiegelhalter, R. C. Franklin, and K. Bull, "Assessment criticism and
improvement of imprecise subjective probabilities for a medical expert system,"
in Proc. Fifth Workshop on Uncertainty in Artificial Intelligence, 1989, pp. 335-342.

[62] E. Charniak and D. McDermott, Introduction to Artificial Intelligence,
Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1985.

[63] T. S. Levitt, J. Mullin, and T. O. Binford, "Model-based influence diagrams for
machine vision," in Proc. 5th Workshop on Uncertainty in Artificial Intelligence,
1989, pp. 233-244.

[64] R. E. Neapolitan, Probabilistic reasoning in expert systems: theory and algorithms,
John Wiley & Sons, Inc., New York, NY, USA, 1990.

[65] K. Weinberger, J. Blitzer, and L. Saul, "Distance metric learning for large margin
nearest neighbor classification," in Proc. 19th annual Conference on Neural
Information Processing Systems, Vancouver, Canada, Dec. 2005.

[66] R. Kohavi and F. Provost, "Glossary of terms," Machine Learning, vol. 30, pp.
271-274, 1998.

[67] B. Ziebart, D. Roth, R. Campbell, and A. Dey, "Automated and adaptive threshold
setting: Enabling technology for autonomy and self-management," in Proc. 2nd
International Conference of Autonomic Computing, June 13-16, 2005, pp. 204-215.

[68] P. Mitra, C. Murthy, and S. Pal, "Unsupervised feature selection using feature
similarity," IEEE Trans. Pat. Anal. Mach. Intel., vol. 24, no. 3, pp. 301-312, Mar.
2002.

[69] W. Lee, S. J. Stolfo, and K. W. Mok, "Adaptive intrusion detection: A data mining
approach," Ar'.:l i.:.1 Intelligence Review, vol. 14, no. 6, pp. 533-567, 2000.

[70] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen,
"Performance debugging for distributed systems of black boxes," in Proc. 19th ACM
Symposium on Operating Systems Principles, Bolton Landing, NY, Oct. 19-22, 2003,
pp. 74-89.

[71] R. Isaacs and P. Barham, "Performance analysis in loosely-coupled distributed
systems," in Proc. 7th CaberNet Radicals Workshop, Bertinoro, Italy, Oct. 2002.

[72] I. Foster, "The anatomy of the grid: enabling scalable virtual organizations," in
Proc. 1st IEEE/ACM International Symposium on Cluster Computing and the Grid,
2001, pp. 6-7.

[73] R. Wolski, "Dynamically forecasting network performance using the network weather
service," Journal of Cluster Computing, 1998.

[74] I. Matsuba, H. Suyari, S. Weon, and D. Sato, "Practical chaos time series analysis
with financial applications," in Proc. 5th International Conference on Signal
Processing, Beijing, 2000, vol. 1, pp. 265-271.

[75] P. Magni and R. Bellazzi, "A stochastic model to assess the variability of blood
glucose time series in diabetic patients self-monitoring," IEEE Trans. Biomed. Eng.,
vol. 53, no. 6, pp. 977-985, 2006.

[76] K. Didan and A. Huete, "Analysis of the global vegetation dynamic metrics using
modis vegetation index and land cover products," in IEEE International Geoscience
and Remote Sensing Symposium (IGARSS'04), 2004, vol. 3, pp. 2058-2061.

[77] P. Dinda, "The statistical properties of host load," Scientific Programming, vol. 7,
no. 3-4, 1999.

[78] P. Dinda, "Host load prediction using linear models," Cluster Computing, vol. 3, no.
4, 2000.

[79] Y. Zhang, W. Sun, and Y. Inoguchi, "CPU load predictions on the computational
grid," in Proc. 6th IEEE International Symposium on Cluster Computing and the
Grid, May 2006, vol. 1, pp. 321-326.

[80] J. Liang, K. Nahrstedt, and Y. Zhou, "Adaptive multi-resource prediction in
distributed resource sharing environment," in Proc. IEEE International Symposium
on Cluster Computing and the Grid, 2004, pp. 293-300.

[81] S. Vazhkudai and J. Schopf, "Predicting sporadic grid data transfers," Proc.
International Symposium on High Performance Distributed Computing, pp. 188-196,
2002.

[82] S. Vazhkudai, J. Schopf, and I. Foster, "Using disk throughput data in predictions
of end-to-end grid data transfers," in Proc. 3rd International Workshop on Grid
Computing, Nov. 2002.

[83] S. Gunter and H. Bunke, "An evaluation of ensemble methods in handwritten word
recognition based on feature selection," in Proc. 17th International Conference on
Pattern Recognition, Aug. 2004, vol. 1, pp. 388-392.

[84] G. Jain, A. Ginwala, and Y. Aslandogan, "An approach to text classification using
dimensionality reduction and combination of classifiers," in Proc. IEEE International
Conference on Information Reuse and Integration, Nov. 2004, pp. 564-569.

[85] VMware white paper, "Comparing the MUI, VirtualCenter, and vmkusage."

[86] J. D. Cryer, Time Series Analysis, Duxbury Press, Boston, MA, 1986.

[87] J. O. Rawlings, S. G. Pantula, and D. A. Dickey, Applied Regression Analysis,
Springer, 2001.

[88] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning,
Springer, 2001.

[89] E. Bingham and H. Mannila, "Random projection in dimensionality reduction:
applications to image and text data," in Knowledge Discovery and Data Mining,
2001, pp. 245-250.

[90] L. Sirovich and R. Everson, "Management and analysis of large scientific datasets,"
Int. Journal of Supercomputer Applications, vol. 6, no. 1, pp. 50-68, 1992.

[91] Y. Yang, J. Zhang, and B. Kisiel, "A scalability analysis of classifiers in text
categorization," in ACM SIGIR'03, 2003, pp. 96-103.

[92] J. H. Friedman, F. Baskett, and L. J. Shustek, "An algorithm for finding nearest
neighbors," IEEE Transactions on Computers, vol. C-24, no. 10, pp. 1000-1006, Oct.
1975.

[93] J. H. Friedman, J. L. Bentley, and R. A. Finkel, "An algorithm for finding best
matches in logarithmic expected time," ACM Transactions on Mathematical
Software, vol. 3, pp. 209-226, 1977.

[94] G. Banga, P. Druschel, and J. Mogul, "Resource containers: A new facility for
resource management in server systems," in Proc. 3rd Symposium on Operating
System Design and Implementation, New Orleans, Feb. 1999.

[95] L. Ramakrishnan, L. Grit, A. Iamnitchi, D. Irwin, A. Yumerefendi, and J. Chase,
"Towards a doctrine of containment: Grid hosting with adaptive resource control,"
in Proc. Supercomputing, Tampa, FL, Nov. 2006.

[96] R. Dubes, "How many clusters are best? -an experiment," Pattern Recogn., vol. 20,
no. 6, pp. 645-663, Nov. 1987.

[97] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM
Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.

[98] "WorldCup98," http://ita.ee.lbl.gov/html/contrib/WorldCup.html.

[99] "Logreplayer," http://www.cs.virginia.edu/ rz5b/software/logreplayer-manual.htm.

[100] C. Isci, A. Buyuktosunoglu, and M. Martonosi, "Long-term workload phases:
duration predictions and applications to dvfs," IEEE Micro, vol. 25, no. 5, pp.
39-51, 2005.

[101] C. Isci and M. Martonosi, "Phase characterization for power: evaluating
control-flow-based and event-counter-based techniques," Proc. 12th International
Symposium on High-Performance Computer Architecture, pp. 121-132, 2006.

[102] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically
characterizing large scale program behavior," in Proc. 10th International Conference
on Architectural Support for Programming Languages and Operating Systems,
2002, pp. 45-57.

[103] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi,
"Pinpointing representative portions of large Intel Itanium programs with dynamic
instrumentation," in Proc. 37th annual international symposium on
Microarchitecture, 2004.

[104] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas,
"Memory hierarchy reconfiguration for energy and performance in general purpose
architectures," in Proc. 33rd annual international symposium on microarchitecture,
Dec. 2000, pp. 245-257.

[105] A. Dhodapkar and J. Smith, "Managing multi-configuration hardware via dynamic
working set analysis," in Proc. 29th Annual International Symposium on Computer
Architecture, Anchorage, AK, May 2002, pp. 233-244.

[106] A. Dhodapkar and J. Smith, "Comparing program phase detection techniques," in
Proc. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003,
pp. 217-227.

[107] B. Urgaonkar, P. Shenoy, A. Chandra, and P. Goyal, "Dynamic provisioning
of multi-tier internet applications," in Proc. 2nd International Conference of
Autonomic Computing, June 2005, pp. 217-228.

[108] J. Wildstrom, P. Stone, E. Witchel, R. J. Mooney, and M. Dahlin, "Towards
self-configuring hardware for distributed computer systems," in Proc. 2nd
International Conference of Autonomic Computing, June 2005, pp. 241-249.

[109] J. S. Chase, D. E. Irwin, L. E. Grit, J. D. Moore, and S. E. Sprenkle, "Dynamic
virtual clusters in a grid site manager," Proc. 12th IEEE International Symposium
on High Performance Distributed Computing, pp. 90-100, June 2003.

[110] D. Kusic and N. Kandasamy, "Risk-aware limited lookahead control for dynamic
resource provisioning in enterprise computing systems," in Proc. 3rd International
Conference of Autonomic Computing, 2006, pp. 74-83.

BIOGRAPHICAL SKETCH

Jian Zhang was born in China. She received her B.S. degree in 1995 from the
University of Electronic Science and Technology of China, majoring in computer
communication. She received her M.S. degree in 2001 from the University of Florida,
majoring in electrical and computer engineering. Since 2002, she has been with the
Advanced Computing and Information Systems Laboratory (ACIS) at the University of
Florida, pursuing her Ph.D. degree. Her research interests include distributed systems,
autonomic computing, virtualization technologies, and information systems.





PAGE 1

1

PAGE 2

2

PAGE 3

3

PAGE 4

Iwouldliketoexpressmysinceregratitudetomyadvisor,ProfessorRenatoJ.Figueiredo,forhisinvaluableadvice,encouragement,andsupport.Thisdissertationwouldnothavebeenpossiblewithouthisguidanceandsupport.MydeepappreciationgoestoProfessorJoseA.B.FortesforparticipatinginmysupervisorycommitteeandforalltheguidanceandopportunitiestoworkintheIn-VIGOteamthathegavemeduringmyPh.Dstudy.MydeeprecognitionalsogoestoProfessorMalayGhoshandProfessorAlanGeorgeforservingonmysupervisorycommitteeandfortheirvaluablesuggestions.ManythanksgotoDr.MazinYousifandMr.RobertCarpenterfromIntelCorporationfortheirvaluableinputandgenerousfundingforthisresearch.ThanksalsogotomycolleaguesintheAdvancedComputingInformationSystems(ACIS)Laboratoryfortheirdiscussionofideasandyearsoffriendship.Lastbutnotleast,Ioweaspecialdebtofgratitudetomyfamily.Withouttheirselessloveandsupport,IcannotimaginewhatIwouldhaveachieved. 4

PAGE 5

page ACKNOWLEDGMENTS ................................. 4 LISTOFTABLES ..................................... 8 LISTOFFIGURES .................................... 9 ABSTRACT ........................................ 11 CHAPTER 1INTRODUCTION .................................. 13 1.1ResourcePerformanceModeling ........................ 14 1.2AutonomicComputing ............................. 15 1.3Learning ..................................... 17 1.3.1SupervisedLearning ........................... 17 1.3.2UnsupervisedLearning ......................... 18 1.3.3ReinforcementLearning ......................... 18 1.3.4OtherLearningParadigms ....................... 19 1.4VirtualMachines ................................ 20 1.4.1VirtualMachineCharacteristics .................... 20 1.4.2VirtualMachinePlant ......................... 22 2APPLICATIONCLASSIFICATIONBASEDONMONITORINGANDLEA-RNINGOFRESOURCECONSUMPTIONPATTERNS ............. 24 2.1Introduction ................................... 24 2.2ClassicationAlgorithms ............................ 26 2.2.1PrincipalComponentAnalysis ..................... 27 2.2.2k-NearestNeighborAlgorithm ..................... 30 2.3ApplicationClassicationFramework ..................... 31 2.3.1PerformanceProler .......................... 32 2.3.2ClassicationCenter .......................... 33 2.3.2.1Datapreprocessingbasedonexpertknowledge ....... 33 2.3.2.2Featureselectionbasedonprincipalcomponentanalysis 34 2.3.2.3Trainingandclassication .................. 35 2.3.3PostProcessingandApplicationDatabase .............. 35 2.4ExperimentalResults .............................. 36 2.4.1ClassicationAbility .......................... 36 2.4.2SchedulingPerformanceImprovement ................. 41 2.4.3ClassicationCost ............................ 45 2.5RelatedWork .................................. 45 2.6Conclusion .................................... 47 5

PAGE 6

......................................... 49 3.1Introduction ................................... 49 3.2StatisticalInference ............................... 51 3.2.1FeatureSelection ............................ 51 3.2.2BayesianNetwork ............................ 52 3.2.3MahalanobisDistance .......................... 55 3.2.4ConfusionMatrix ............................ 55 3.3AutonomicFeatureSelectionFramework ................... 56 3.3.1DataQualityAssuror .......................... 56 3.3.2FeatureSelector ............................. 59 3.3.3Trainer .................................. 61 3.4ExperimentalResults .............................. 62 3.4.1FeatureSelectionandClassicationAccuracy ............. 62 3.4.2ClassicationValidation ........................ 65 3.4.3TrainingDataQualityAssurance ................... 71 3.5RelatedWork .................................. 71 3.6Conclusion .................................... 73 4ADAPTIVEPREDICTORINTEGRATIONFORSYSTEMPERFORMANCEPREDICTIONS .................................... 74 4.1Introduction ................................... 74 4.2RelatedWork .................................. 76 4.3VirtualMachineResourcePredictionOverview ............... 77 4.4TimeSeriesModelsforResourcePerformancePrediction .......... 80 4.5AlgorithmsforPredictionModelSelection .................. 82 4.5.1k-NearestNeighbor ........................... 83 4.5.2BayesianClassication ......................... 83 4.5.3PrincipalComponentAnalysis ..................... 85 4.6Learning-AidedAdaptiveResourcePredictor ................. 86 4.6.1TrainingPhase ............................. 86 4.6.2TestingPhase .............................. 89 4.7EmpiricalEvaluation .............................. 90 4.7.1BestPredictorSelection ........................ 90 4.7.2VirtualMachinePerformanceTracePrediction ............ 91 4.7.2.1Performanceofk-NNbasedLARPredictor ......... 92 4.7.2.2Performancecomparisonofk-NNandBayesian-classierbasedLARPredictor ..................... 96 4.7.2.3PerformancecomparisonoftheLARPredictorsandthecumulative-MSEbasedpredictorusedintheNWS .... 97 4.7.3Discussion ................................ 98 4.8Conclusion .................................... 100 6

PAGE 7

......................................... 106 5.1Introduction ................................... 106 5.2ApplicationResourceDemandPhaseAnalysisandPredictionPrototype 108 5.3DataClustering ................................. 111 5.3.1StagesinClustering ........................... 111 5.3.2DenitionsandNotation ........................ 112 5.3.3k-meansClustering ........................... 113 5.3.4FindingtheOptimalNumberofClusters ............... 114 5.4PhasePrediction ................................ 117 5.5EmpiricalEvaluation .............................. 118 5.5.1PhaseBehaviorAnalysis ........................ 119 5.5.1.1SPECseis96benchmark ................... 119 5.5.1.2WorldCupweblogreplay .................. 122 5.5.2PhasePredictionAccuracy ....................... 123 5.5.3Discussion ................................ 125 5.6RelatedWork .................................. 126 5.7Conclusion .................................... 128 6CONCLUSION .................................... 135 REFERENCES ....................................... 137 BIOGRAPHICALSKETCH ................................ 146 7

PAGE 8

Table page 2-1Performancemetriclist ................................ 35 2-2Listoftrainingandtestingapplications ....................... 37 2-3Experimentaldata:applicationclasscompositions ................. 40 2-4Systemthroughput:concurrentvs.sequentialexecutions ............. 44 3-1Sampleconfusionmatrixwithtwoclasses(L=2) .................. 56 3-2Sampleperformancemetricsintheoriginalfeatureset .............. 59 3-3Confusionmatrixofclassicationresults ...................... 65 3-4Performancemetriccorrelationmatrixesoftestapplications ........... 70 4-1NormalizedpredictionMSEstatisticsforresourcesofVM1 ............ 96 4-2NormalizedpredictionMSEstatisticsforresourcesofVM2 ............ 97 4-3NormalizedpredictionMSEstatisticsforresourcesofVM3 ............ 98 4-4NormalizedpredictionMSEstatisticsforresourcesofVM4 ............ 99 4-5NormalizedpredictionMSEstatisticsforresourcesofVM5 ............ 99 4-6Bestpredictorsofallthetracedata ......................... 100 5-1Performancefeaturelist ............................... 119 5-2SPECseis96totalcostratioschedulefortheeightperformancefeatures ..... 122 5-3Averagephasepredictionaccuracy ......................... 124 5-4PerformancefeaturelistofVMtraces ........................ 124 5-5AveragephasepredictionaccuracyoftheveVMs ................ 126 8

PAGE 9

Figure page 1-1Structureofanautonomicelement. ......................... 16 1-2Classicationsystemrepresentation ......................... 19 1-3Virtualmachinestructure .............................. 21 1-4VMPlantarchitecture ................................ 23 2-1Sampleofprincipalcomponentanalysis ....................... 28 2-2k-nearestneighborclassicationexample ...................... 31 2-3Applicationclassicationmodel ........................... 32 2-4Performancefeaturespacedimensionreductionsintheapplicationclassicationprocess ......................................... 34 2-5Sampleclusteringdiagramsofapplicationclassications ............. 39 2-6Applicationclasscompositiondiagram ....................... 42 2-7Systemthroughputcomparisonsfortendierentschedules ............ 43 2-8Applicationthroughputcomparisonsofdierentschedules ............ 44 3-1SampleBayesiannetworkgeneratedbyfeatureselector .............. 54 3-2Featureselectionmodel ............................... 57 3-3Bayesian-networkbasedfeatureselectionalgorithmforapplicationclassication 60 3-4Averageclassicationaccuracyof10setsoftestdataversusnumberoffeaturesselectedintherstexperiment ........................... 63 3-5Two-classtestdatadistributionwiththersttwoselectedfeatures ....... 63 3-6Five-classtestdatadistributionwithrsttwoselectedfeatures .......... 66 3-7Comparisonofdistancesbetweenclustercentersderivedfromexpert-selectedandautomaticallyselectedfeaturesets ....................... 66 3-8Trainingdataclusteringdiagramderivedfromexpert-selectedandautomat-icallyselectedfeaturesets .............................. 67 3-9Classicationresultsofbenchmarkprograms .................... 69 4-1Virtualmachineresourceusagepredictionprototype ............... 78 4-2SampleXMLschemaoftheVMperformanceDB ................. 80 9

PAGE 10

............... 87 4-4Learning-aidedadaptiveresourcepredictordataow ................ 88 4-5BestpredictorselectionfortraceVM2 load15 ................... 92 4-6BestpredictorselectionfortraceVM2 PktIn .................... 93 4-7BestpredictorselectionfortraceVM2 Swap .................... 94 4-8BestpredictorselectionfortraceVM2 Disk .................... 95 4-9Predictorperformancecomparison(VM1) ..................... 101 4-10Predictorperformancecomparison(VM2) ..................... 102 4-11Predictorperformancecomparison(VM3) ..................... 103 4-12Predictorperformancecomparison(VM4) ..................... 104 4-13Predictorperformancecomparison(VM5) ..................... 105 5-1Applicationresourcedemandphaseanalysisandpredictionprototype ...... 109 5-2Resourceallocationstrategycomparison ...................... 115 5-3Applicationresourcedemandphasepredictionworkow .............. 129 5-4PhaseanalysisofSPECseis96CPU user ...................... 130 5-5PhaseanalysisofWorldCup'98Bytes In ...................... 133 5-6PhaseanalysisofWorldCup'98Bytes out ...................... 134 10

PAGE 11

Withthegoalofautonomiccomputing,itisdesirabletohavearesourceschedulerthatiscapableofself-optimization,whichmeansthatwithagivenhigh-levelobjectivetheschedulercanautomaticallyadaptitsschedulingdecisionstothechangingworkload.Thisself-optimizationcapacityimposeschallengestosystemperformancemodelingbecauseofincreasingsizeandcomplexityofcomputingsystems. Ourgoalsweretwofold:todesignperformancemodelsthatcanderiveapplications'resourceconsumptionpatternsinasystematicway,andtodevelopperformancepredictionmodelsthatcanadapttochangingworkloads.Anoveltyinthesystemperformancemodeldesignistheuseofvariousmachinelearningtechniquestoecientlydealwiththecomplexityofdynamicworkloadsbasedonmonitoringandminingofhistoricalperformancedata.Intheenvironmentsconsideredinthisthesis,virtualmachines(VMs)areusedasresourcecontainerstohostapplicationexecutionsbecauseoftheirexibilityinsupportingresourceprovisioningandloadbalancing. Ourstudyintroducedthreeperformancemodelstosupportself-optimizedschedulinganddecision-making.First,anovelapproachisintroducedforapplicationclassicationbasedonthePrincipalComponentAnalysis(PCA)andthek-NearestNeighbor(k-NN)classier.Ithelpstoreducethedimensionalityoftheperformancefeaturespaceandclassifyapplicationsbasedonextractedfeatures.Inaddition,afeatureselectionmodelis 11

PAGE 12

Second,anadaptivesystemperformancepredictionmodelisinvestigatedbasedonalearning-aidedpredictorintegrationtechnique.Supervisedlearningtechniquesareusedtolearnthecorrelationsbetweenthestatisticalpropertiesoftheworkloadandthebest-suitedpredictors. Inadditiontoaone-stepaheadpredictionmodel,aphasecharacterizationmodelisstudiedtoexplorethelarge-scalebehaviorofapplication'sresourceconsumptionpatterns. Ourstudyprovidesnovelmethodologiestomodelsystemandapplicationperfor-mance.Theperformancemodelscanself-optimizeovertimebasedonlearningofhistoricalruns,thereforebetteradapttothechangingworkloadandachievebetterpredictionaccuracythantraditionalmethodswithstaticparameters. 12

PAGE 13

Thevisionofautonomiccomputing[ 1 ]istoimprovemanageabilityofcomplexITsystemstoafargreaterextentthancurrentpracticethroughself-conguring,self-healing,self-optimization,andself-protection.Toperformtheself-congurationandself-optimizationofapplicationsandassociatedexecutionenvironmentsandtorealizedynamicresourceallocation,bothresourceawarenessandapplicationawarenessareimportant.Inthiscontext,therehasbeensubstantialresearchoneectiveschedulingpolicies[ 2 { 6 ]withgivenresourceandapplicationspecications.Whilethereareseveralmethodsforobtainingresourcespecicationparameters(e.g.,CPU,memory,anddiskinformationfromthe/proclesysteminUnixsystems),applicationspecicationischallengingtodescribeduetothefollowingfactors:1)lackofknowledgeandcontroloftheapplicationsourcecodes,2)multi-dimensionalityofapplicationresourceconsumptionpatterns,and3)multi-stageresourceconsumptionpatternsoflong-runningapplications.Furthermore,thedynamicsofsystemperformanceaggravatethedicultiesofperformancedescriptionandprediction. Inthisdissertation,anintegratedframeworkconsistingofalgorithmsandmiddlewareforresourceperformancemodelingisdeveloped.Itincludessystemperformancepredictionmodelsandapplicationresourcedemandmodelsbasedonlearningofhistoricalexecutions.Anoveltyoftheperformancemodeldesignsistheiruseofmachinelearningtechniquestoecientlyandrobustlydealwiththecomplexdynamicalphenomenaoftheworkloadandresourceavailability.Inaddition,virtualmachines(VMs)areusedasresourcecontainersbecausetheyprovideaexiblemanagementplatformthatisusefulforboththeencapsulationofapplicationexecutionenvironmentsandtheaggregationandaccountingofresourcesconsumedbyanapplication.Inthiscontext,resourceschedulingbecomesaproblemofhowtodynamicallyallocateresourcestovirtualmachines(whichhostapplicationexecutions)tomeettheapplications'resourcedemands. 13

PAGE 14

1.1 givesanoverviewofresourceperformancemodeling.Sections 1.2 1.3 ,and 1.4 ,brieyintroduceautonomiccomputing,machinelearning,andvirtualmachineconcepts. 7 ]: Insystemprocurementstudies,thecost/performanceratioiscommonlyusedasametricforcomparingsystems.Threetechniquesforperformanceevaluationareanalyticalmodeling,simulation,andmeasurement.Sometimesitishelpfultousetwoormoretechniques,eithersimultaneouslyorsequentially. Computersystemperformancemeasurementsinvolvemonitoringthesystemwhileitisbeingsubjectedtoaparticularworkload.Inordertoperformmeaningfulmeasurements,theworkloadshouldbecarefullyselectedbasedontheservicesexercisedbytheworkload,thelevelofdetail,representativeness,andtimeliness.Sincearealuserenvironmentisgenerallynotrepeatable,itisnecessarytostudytherealuserenvironments,observethekeycharacteristics,anddevelopaworkloadmodelthatcanbeusedrepeatedly.This 14

PAGE 15

1 ].Theessenceofautonomiccomputingistoenableself-managedsystems,whichincludesthefollowingaspects: Autonomiccomputingpresentschallengesandopportunitiesinvariousareassuchaslearningandoptimizationtheory,automatedstatisticallearning,andbehavioralabstractionandmodels[ 8 ].Thisdissertationaddressessomeofthechallengesin 15

PAGE 16

Structureofanautonomicelement. theapplicationresourceperformancemodelingtosupportself-congurationandself-optimizationofapplicationexecutionenvironments. Generally,anautonomicsystemisaninteractivecollectionofautonomicelements:individualsystemconstituentsthatcontainresourcesanddeliverservicestohumansandotherautonomicelements.AsFigure 1-1 shows,anautonomicelementwilltypicallyconsistofoneormoremanagedelementscoupledwithasingleautonomicmanagerthatcontrolsandrepresentsthem.Themanagedelementcouldbeahardwareresource,suchasstorage,aCPU,orasoftwareresource,suchasadatabase,oradirectoryservice,oralargelegacysystem[ 1 ].Themonitoringprocesscollectstheperformancedataofthe 16

PAGE 17

Machinelearningisanaturalsolutiontoautomation.Itavoidsknowledge-intensivemodelbuildingandreducestherelianceonexpertknowledge.Inaddition,itcandealwithcomplexdynamicalphenomenaandenablethesystemtoadapttothechangingenvironments. Traditionallytherearegenerallythreetypesoflearning:supervisedlearning,unsupervisedlearning,andreinforcementlearning. 9 ].\Learning"consistsofchoosingoradaptingparameterswithinthemodelstructurethatworkbestonthesamplesathandandotherslikethem.Oneofthemostprominentandbasiclearningtasksisclassicationorprediction,whichisusedextensivelyinthiswork.Forclassicationproblems,alearningsystemcanbeviewedasahigher-levelsystemthathelpsbuildthedecision-makingsystemitself,calledtheclassier.Figure 1-2 illustratesthestructureofaclassicationsystemanditslearningprocess. 17

PAGE 18

Reinforcementlearningalgorithmsattempttondapolicyformaximizingcumulativerewardfortheagentoverthecourseoftheproblem.Theenvironmentistypically 18

PAGE 19

ClassicationsystemrepresentationDuringthetrainingphase,labeledsamplecasesareusedtoderivetheunknownparametersoftheclassiermodel.Duringthetestingphase,thecustomizedclassierisusedtoassociateaspecicpatternofobservationswithaspecicclass. 19

PAGE 20

Inthiswork,variouslearningtechniquesareusedtomodeltheapplicationresourcedemandandsystemperformance.Thesemodelscanhelptosystemtoadapttothechangingworkloadandachievehigherperformance. 10 ].A\classic"virtualmachine(VM)enablesmultipleindependent,isolatedoperatingsystems(guestVM)torunononephysicalmachine(hostserver),ecientlymultiplexingsystemresourcesofthehostmachine[ 10 ]. Avirtual-machinemonitor(VMM)isasoftwarelayerthatrunsonahostplatformandprovidesanabstractionofacompletecomputersystemtohigher-levelsoftware.TheabstractioncreatedbytheVMMiscalledavirtualmachine.Figure 1-3 showsthestructureofvirtualmachines. 11 ].Thefollowingcharacteristicsofvirtualmachinesmakethemahighlyexibleandmanageableapplicationexecutionplatform: 20

PAGE 21

VirtualmachinestructureAvirtual-machinemonitorisasoftwarelayerthatrunsonahostplatformandprovidesanabstractionofacompletecomputersystemtohigher-levelsoftware.Thehostplatformmaybethebarehardware(TypeIVMM)orahostoperatingsystem(TypeIIVMM).Thesoftwarerunningabovethevirtual-machineabstractioniscalledguestsoftware(operatingsystemandapplications). 12 ]cansupportdynamicmemoryextensionofVMguestwithoutshuttingdownthesystem. 21

PAGE 22

13 ].VMware'sVMotioncansupportmigrationwithzerodowntime[ 14 ].TechniquesbasedonVirtualFileSystem(VFS)hasbeenstudiedin[ 15 ]tosupportVMmigrationacrossWide-AreaNetworks(WANs). 16 ]handlesvirtualmachinecreationandhostingforclassicvirtualmachines(e.g.VMware[ 17 ])anduser-modeLinuxplatforms(e.g.,UML[ 18 ])viadynamiccloning,instantiationandconguration.TheVMPlanthasthreemajorcomponents:VirtualMachineProductionCenter(VMPC),VirtualMachineWarehouse(VMWH)andVirtualMachineInformationSystem(VMIS).TheVMPChandlesthevirtualmachine'screation,congurationanddestruction.Itemploysacongurationpatternrecognitiontechniquetoidentifyopportunitiestoapplythepre-cachedvirtualmachinestatetoacceleratethemachinecongurationprocess.TheVMWHstoresthepre-cachedmachineimages,monitorsthemandtheirhostserver'sperformanceandperformsthemaintenanceactivity.TheVMISstoresthestaticanddynamicinformationofthevirtualmachinesandtheirhostserver.ThearchitectureoftheVMPlantisshowninFigure 1-4 TheVMPlantprovidesAPItoVMShopforvirtualmachinecreation,deconstruction,andmonitoring.TheVMShophasthreemajorcomponents:VMCreater,VMCollecterandVMReporter.TheVMCreaterhandlesthevirtualmachines'creation;TheVMCollecterhandlesthemachines'deconstructionandsuspension;TheVMReporterhandlesinformationrequest.Incombinationwithavirtualmachineshopservice,VMPlants 22

PAGE 23

VMPlantarchitecture deployedacrossphysicalresourcesofasiteallowclients(usersand/ormiddlewareactingontheirbehalf)toinstantiateandcontrolclient-customizedvirtualexecutionenvironments.Theplantcanbeintegratedwithvirtualnetworkingtechniques(suchasVNET[ 19 ])toallowclient-sidenetworkmanagement.Customized,application-specicVMscanbedenedinVMPlantwiththeuseofadirectedacyclicgraph(DAG)conguration.VMexecutionenvironmentsdenedwithinthisframeworkcanthenbeclonedanddynamicallyinstantiatedtoprovideahomogeneousapplicationexecutionenvironmentacrossdistributedresources. InthecontextoftheVMPlant,anapplicationcanbescheduledtoruninaspecicvirtualmachine,whichiscalledapplicationVM.Therefore,thesystemperformancemetriccollectedfromtheapplicationVMcanreectandsummarizetheresourceconsumptionoftheapplication. 23

PAGE 24

Applicationawarenessisanimportantfactorofecientresourcescheduling.ThischapterintroducesanovelapproachforapplicationclassicationbasedonthePrincipalComponentAnalysis(PCA)andthek-NearestNeighbor(k-NN)classier.Thisapproachisusedtoassistschedulinginheterogeneouscomputingenvironments.Ithelpstoreducethedimensionalityoftheperformancefeaturespaceandclassifyapplicationsbasedonextractedfeatures.Theclassicationconsidersfourdimensions:CPU-intensive,I/Oandpaging-intensive,network-intensive,andidle.Applicationclassinformationandthestatisticalabstractsoftheapplicationbehaviorarelearnedoverhistoricalrunsandusedtoassistmulti-dimensionalresourcescheduling. 2 { 4 ]withgivenresourceandapplicationspecications.Thereareseveralmethodsforobtainingresourcespecicationparameters(e.g.,CPU,memory,diskinformationfrom/procinUnixsystems).However,applicationspecicationischallengingtodescribebecauseofthefollowingfactors: 24

PAGE 25

20 ][ 21 ]itispossibletomigrateanapplicationduringitsexecutionforloadbalancing. Theabovecharacteristicsofgridapplicationspresentachallengetoresourcescheduling:Howtolearnandmakeuseofanapplication'smulti-dimensionalresourceconsumptionpatternsforresourceallocation?Thischapterintroducesanovelapproachtosolvethisproblem:applicationclassicationbasedonthefeatureselectionalgorithm,PrincipalComponentAnalysis(PCA),andK-NearestNeighbor(k-NN)classier[ 22 ][ 23 ].ThePCAisappliedtoreducethedimensionalityofapplicationperformancemetrics,whilepreservingthemaximumamountofvarianceinthemetrics.Then,thek-NearestNeighboralgorithmisusedtocategorizetheapplicationexecutionstatesintodierentclassesbasedontheapplication'sresourceconsumptionpattern.Thelearnedapplicationclassinformationisusedtoassisttheresourceschedulingdecision-makinginheterogeneouscomputingenvironments. TheVMPlantserviceintroducedinChapter 1.4.2 providesautomatedcloningandcongurationofapplication-centricVirtualMachines(VMs).Problem-solvingenvironmentssuchasIn-VIGO[ 24 ]cansubmitrequeststotheVMPlantservice,which 25

PAGE 26

Theclassicationsystemdescribedinthischapterleveragesthecapabilityofsummarizingapplicationperformancedatabycollectingsystem-leveldatawithinaVM,asfollows.Duringtheapplicationexecution,snapshotsofperformancemetricsaretakenatadesiredfrequency.APCAprocessoranalyzestheperformancesnapshotsandextractsthekeycomponentsoftheapplication'sresourceusage.Basedontheextractedfeatures,ak-NNclassiercategorizeseachsnapshotintooneofthefollowingclasses:CPU-intensive,IO-intensive,memory-intensive,network-intensiveandidle. Byusingthissystem,resourceschedulingcanbebasedonacomprehensivediagnosisoftheapplicationresourceutilization,whichconveysmoreinformationthanCPUloadinisolation.Experimentsreportedinthischaptershowthattheresourceschedulingfacilitatedwithapplicationclasscompositionknowledgecanachievebetteraveragesystemthroughputthanschedulingwithouttheknowledge. Therestofthechapterisorganizedasfollows:Section 2.2 introducesthePCAandthek-NNclassierinthecontextofapplicationclassication.Section 2.3 presentstheclassicationmodelandimplementation.Section 2.4 presentsanddiscussesexperimentalresultsofclassicationperformancemeasurements.Section 2.5 discussesrelatedwork.ConclusionsandfutureworkarediscussedinSection 2.6 26

PAGE 27

Apatternclassicationsystemconsistsofpre-processing,featureextraction,classication,andpost-processing.Thepre-processingandfeatureextractionareknowntosignicantlyaecttheclassication,becausetheerrorcausedbywrongfeaturesmaypropagatetothenextstepsandstayspredominantintermsoftheoverallclassicationerror.Inthiswork,asetofapplicationperformancemetricsarechosenbasedonexpertknowledgeandtheprincipleofincreasingrelevanceandreducingredundancy[ 25 ]. 22 ]isalineartransformationrepresentingdatainaleast-squaresense.Itisdesignedtocapturethevarianceinadatasetintermsofprincipalcomponentsandreducethedimensionalityofthedata.Ithasbeenwidelyusedindataanalysisandcompression. Whenasetofvectorsamplesarerepresentedbyasetoflinespassingthroughthemeanofthesamples,thebestlineardirectionsresultineigenvectorsofthescattermatrix-theso-called\principalcomponents"asshowninFigure 2-1 .Thecorrespondingeigenvaluesrepresentthecontributiontothevarianceofdata.Whentheklargesteigenvaluesofnprincipalcomponentsarechosentorepresentthedata,thedimensionalityofthedatareducesfromntok. Principalcomponentanalysisisbasedonthestatisticalrepresentationofarandomvariable.Supposewehavearandomvectorpopulationx,where andthemeanofthatpopulationisdenotedby 27

PAGE 28

Sampleofprincipalcomponentanalysis andthecovariancematrixofthesamedatasetis ThecomponentsofCx,denotedbycij,representthecovariancesbetweentherandomvariablecomponentsxiandxj.Thecomponentciiisthevarianceofthecomponentxi. Fromasampleofvectorsx1;;xM,wecancalculatethesamplemeanandthesamplecovariancematrixastheestimatesofthemeanandthecovariancematrix. Theeigenvectorseiandthecorrespondingeigenvaluesicanbeobtainedbysolvingtheequation 28

PAGE 29

(2{5) wheretheIistheidentifymatrixhavingthesameorderthanCxandthejjdenotesthedeterminantofthematrix.Ifthedatavectorhasncomponents,thecharacteristicequationbecomesofordern. Byorderingtheeigenvectorsintheorderofdescendingeigenvalues(largestrst),onecancreateanorderedorthogonalbasiswiththersteigenvectorhavingthedirectionoflargestvarianceofthedata.Inthisway,wecannddirectionsinwhichthedatasethasthemostsignicantamountsofenergy. Supposeonehasadatasetofwhichthesamplemeanandthecovariancematrixhavebeencalculated.LetAbeamatrixconsistingofeigenvectorsofthecovariancematrixastherowvectors. Bytransformingadatavectorx,weget (2{6) whichisapointintheorthogonalcoordinatesystemdenedbytheeigenvectors.Componentsofycanbeseenasthecoordinatesintheorthogonalbase.Wecanreconstructtheoriginaldatavectorxfromyby usingthepropertyofanorthogonalmatrixA1=AT.TheATisthetransposeofamatrixA.Theoriginalvectorxwasprojectedonthecoordinateaxesdenedbytheorthogonalbasis.Theoriginalvectorwasthenreconstructedbyalinearcombinationoftheorthogonalbasisvectors. 29

PAGE 30

(2{8) and ItmeansthatweprojecttheoriginaldatavectoronthecoordinateaxeshavingthedimensionKandtransformingthevectorbackbyalinearcombinationofthebasisvectors.Thismethodminimizesthemean-squareerrorbetweenthedataandtherepresentationwithgivennumberofeigenvectors. Ifthedataisconcentratedinalinearsubspace,thismethodprovidesawaytocompressdatawithoutlosingmuchinformationandsimplifyingtherepresentation.Bypickingtheeigenvectorshavingthelargesteigenvaluesweloseaslittleinformationaspossibleinthemean-squaresense. 26 ].Ithasbeenusedinmanyapplicationsintheeldofdatamining,statisticalpatternrecognition,imageprocessing,andmanyothers.Thepurposeofthisalgorithmistoclassifyanewobjectbasedonattributesandtrainingsamples.Theclassiersdonotuseanymodeltotandonlybasedonmemory.Givenaquerypoint,wendknumberofobjectsor(trainingpoints)closesttothequerypoint.Thek-NNclassierdecidestheclassbyconsideringthevotesofk(anoddnumber)nearestneighbors.Thenearest 30

PAGE 31

k-nearestneighborclassicationexample neighborispickedasthetrainingdatageometricallyclosesttothetestdatainthefeaturespaceasillustratedinFigure 2-2 Inthiswork,avectoroftheapplication'sresourceconsumptionsnapshotsisusedtorepresenttheapplication.Eachsnapshotconsistsofachosensetofperformancemetrics.ThePCAisusedtopreprocesstherawdatatoindependentfeaturesfortheclassier.Then,a3-NNclassierisusedtoclassifyeachsnapshot.Themajorityvoteofthesnapshots'classesisusedtorepresenttheclassoftheapplications:CPU-intensive,I/Oandpaging-intensive,network-intensive,oridle.Amachinewithnoloadexceptforbackgroundloadfromsystemdaemonsisconsideredasinidlestate. 2-3 .Inaddition,amonitoring 31

PAGE 32

ApplicationclassicationmodelThePerformanceprolercollectsperformancemetricsofthetargetapplicationnode.TheClassicationcenterclassiestheapplicationusingextractedkeycomponentsandperformsstatisticanalysisoftheclassicationresults.TheApplicationDBstorestheapplicationclassinformation.(misthenumberofsnapshotstakeninoneapplicationrun,t0=t1:arethebeginningendingtimesoftheapplicationexecution,VMIPistheIPaddressoftheapplication'shostmachine). systemisusedtosamplethesystemperformanceofacomputingnoderunninganapplicationofinterest. 32

PAGE 33

27 ]distributedmonitoringsystemisusedtomonitorapplicationnodes.TheperformancesamplertakessnapshotsoftheperformancemetricscollectedbyGangliaatapredenedfrequency(currently,5seconds)betweentheapplication'sstartingtimet0andendingtimet1.SinceGangliausesmulticastbasedonalisten/announceprotocoltomonitorthemachinestate,thecollectedsamplesconsistoftheperformancedataofallthenodesinasubnet.Theperformancelterextractsthesnapshotsofthetargetapplicationforfutureprocessing.Attheendofproling,anapplicationperformancedatapoolisgenerated.ThedatapoolconsistsofasetofndimensionalsamplesAnm=(a1;a2;;am),wherem=(t1t0)=disthenumberofsnapshotstakeninoneapplicationrunanddisthesamplingtimeinterval.Eachsampleaiconsistsofnperformancemetrics,whichincludeallthedefault29metricsmonitoredbyGangliaandthe4metricsthatweaddedbasedontheneedofclassication,includingthenumberofI/Oblocksreadfrom/writtentodisk,andthenumberofmemorypagesswappedin/out.Aprogramwasdevelopedtocollectthesefourmetrics(usingvmstat)andthemetricswereaddedtothemetriclistofGanglia'sgmond. 2-1 .Eachpairoftheperformancemetricscorrelatestotheresourceconsumptionbehaviorofthespecicapplicationclassandhaslimitedredundancies. 33

PAGE 34

Performancefeaturespacedimensionreductionsintheapplicationclassicationprocessm:Thenumberofsnapshotstakeninoneapplicationrun,n:Thenumberofperformancemetrics,Anm:Allperformancemetricscollectedbymonitoringsystem,A'pm:Theselectedrelevantperformancemetricsafterthezero-meanandunit-variancenormalization,Bqm:Theextractedkeycomponentmetrics,C1m:Theclassvectorofthesnapshots,Class:Theapplicationclass,whichisthemajorityvoteofsnapshots'classes. Forexample,performancemetricsofCPU SystemandCPU UserarecorrelatedtoCPU-intensiveapplications;Bytes InandBytes OutarecorrelatedtoNetwork-intensiveapplications;IO BIandIO BOarecorrelatedtotheIO-intensiveapplications;Swap InandSwap OutarecorrelatedtoMemory-intensiveapplications.Thedatapreprocessorextractstheseeightmetricsofthetargetapplicationnodefromthedatapoolbasedonourexpertknowledge.Thusitreducesthedimensionoftheperformancemetricfromn=33top=8andgeneratesA'pmasshowninFigure 2-4 .Inaddition,thepreprocessoralsonormalizestheselectedmetricstozero-meanandunit-variance. 2-1 asinputs.Itconductsthelineartransformationoftheperformancedataandselectstheprincipalcomponentsbasedonthepredenedminimalfractionvariance.Inourimplementation,theminimalfractionvariancewassettoextractexactlytwoprincipalcomponents.Therefore,attheendofprocessing,thedatadimensiongetsfurtherreducedfromp=8toq=2andthevectorBqmisgenerated,asshowninFigure 2-4 34

PAGE 35

Performancemetriclist Description System/User PercentCPU System/User Bytes In/Out Numberofbytespersecond into/outofthenetwork IO BI/BO Blockssentto/receivedfrom blockdevice(blocks/s) Swap In/Out Amountofmemoryswapped in/outfrom/todisk(kB/s) 28 ],isusedtorepresenttheIO-intensiveclass.SPECseis96[ 29 ],ascienticcomputingintensiveprogram,isusedtorepresenttheCPU-intensiveclass.Asyntheticapplication,Pagebench,isusedtorepresentthePaging-intensiveclass.ItinitializesandupdatesanarraywhosesizeisbiggerthanthememoryoftheVM,therebyinducingfrequentpagingactivity.Ettcp[ 30 ],abenchmarkthatmeasuresthenetworkthroughputoverTCPorUDPbetweentwonodes,isusedasthetrainingapplicationoftheNetwork-intensiveclass.Theperformancedataofallthesefourapplicationsandtheidlestateareusedtotraintheclassier.Foreachtestdata,thetrainedclassiercalculatesitsdistancetoallthetrainingdata.The3-NNclassicationidentiesonlythreetrainingdatasetswiththeshortestdistancetothetestdata.Thenthetestdata'sclassisdecidedbythemajorityvoteofthethreenearestneighbors. 2-4 .Inadditiontoasinglevalue(Class)theapplicationclassieralso 35

PAGE 36

2-2 summarizesthesetofapplicationsusedasthetrainingandthetestingapplicationsintheexperiments[ 28 { 38 ].The3-NNclassierwastrainedwiththeperformancedatacollectedfromtheexecutionsofthetrainingapplicationshighlightedinthetable.AlltheapplicationexecutionswerehostedbyaVMwareGSXvirtualmachine(VM1).ThehostserverofthevirtualmachinewasanIntel(R)Xeon(TM)dual-CPU1.80GHzmachinewith512KBcacheand1GBRAM.Inaddition,asecondvirtualmachinewiththesamespecicationwasusedtoruntheserverapplicationsofthenetworkbenchmarks. 36

PAGE 37

Listoftrainingandtestingapplications

PAGE 38

2-1 basedontheexpertknowledgeofthecorrelationbetweenthesemetricsandtheapplicationclasses.Afterthat,thePCAprocessorconductedthelineartransformationoftheperformancedataandselectedprincipalcomponentsbasedontheminimalfractionvariancedened.Inthisexperiment,thevariancecontributionthresholdwassettoextracttwo(q=2)principalcomponents.Ithelpstoreducethecomputationalrequirementsoftheclassier.Then,thetrained3-NNclassierconductsclassicationbasedonthedataofthetwoprincipalcomponents. Thetrainingdata'sclassclusteringdiagramisshowninFigure 2-5 (a).ThediagramshowsaPCA-basedtwo-dimensionalrepresentationofthedatacorrespondingtotheveclassestargetedbyoursystem.Afterbeingtrainedwiththetrainingdata,theclassierclassiestheremainingbenchmarkprogramsshowninTable 2-2 .Theclassierprovidesoutputsintwokindsofformats:theapplicationclass-clusteringdiagram,whichhelpstovisualizetheclassicationresults,andtheapplicationclasscomposition,whichcanbeusedtocalculatetheunitapplicationcost. Figure 2-5 showsthesampleclusteringdiagramsforthreetestapplications.Forexample,theinteractiveVMDapplication(Figure 2-5 (d))showsamixoftheidleclasswhenuserisnotinteractingwiththeapplication,theI/O-intensiveclasswhentheuserisuploadinganinputle,andtheNetwork-intensiveclasswhiletheuserisinteractingwiththeGUIthroughaVNCremotedisplay.Table 2-3 summarizestheclasscompositionsofallthetestapplications.Figure 2-6 visualizestheclasscompositionofsomesamplebenchmarkprograms.Theseclassicationresultsmatchtheclassexpectationsgainedfromempiricalexperiencewiththeseprograms.Theyareusedtocalculatetheunitapplicationcostshowninsection4.4. 38

PAGE 39

B C D Sampleclusteringdiagramsofapplicationclassications.A)Trainingdata:MixtureB)SimpleScalar:CPU-intensiveC)Autobench:Network-intensiveD)VMD:InteractivePrincipalComponent1and2aretheprincipalcomponentmetricsextractedbyPCA. 39

PAGE 40

Experimentaldata:applicationclasscompositions

PAGE 41

2-3 whenSPECseis96withmediumsizeinputdatawasexecutedinVM1with256MBmemory(SPECseis96 A),itisclassiedasCPU-intensiveapplication.IntheSPECseis96 Bexperiment,thesmallerphysicalmemory(32MB)resultedinincreasedpagingandI/Oactivity.TheincreasedI/OactivityisduetothefactthatlessphysicalmemoryisavailabletotheO/SbuercacheforI/Oblocks.Thebuercachesizeatruntimewasobservedtobeassmallas1MBinSPECseis96 B,andaslargeas200MBinSPECseis96 A.Inaddition,theexecutiontimegetsincreasedfrom291minutesand42secondsintherstcaseto426minutes58secondsinthesecondcase. Similarly,intheexperimentswithPostMark,dierentexecutionenvironmentcongurationschangedtheapplication'sresourceconsumptionpatternfromoneclasstoanother.Table 2-3 showsthatifalocalledirectorywasusedtostorethelestobereadandwrittenduringtheprogramexecution,thePostMarkbenchmarkshowedtheresourceconsumptionpatternoftheI/O-intensiveclass.Incontrast,withanNFSmountedledirectory,it(PostMark NFS)wasturnedintoaNetwork-intensiveapplication. Therstsetofexperimentsdemonstratesthattheapplicationclassinformationcanhelptheschedulertooptimizeresourcesharingamongapplicationsrunninginparalleltoimprovesystemthroughputandreducethroughputvariances.Intheexperiments,three 41

PAGE 42

Applicationclasscompositiondiagram applications{SPECseis96(S)withsmalldatasize,PostMark(P)withlocalledirectoryandNetPIPEClient(N){wereselected,andthreeinstancesofeachapplicationwereexecuted.Thescheduler'staskwastodecidehowtoallocatethenineapplicationinstancestorunonthe3virtualmachines(VM1,VM2andVM3)inparallel,eachofwhichhosted3jobs.TheVM4wasusedtohosttheNetPIPEserver.Therearetenpossibleschedulesavailable,asshowninFigure 2-7 Whenmultipleapplicationsrunonthesamehostmachineatthesametime,thereareresourcecontentionsamongthem.Twoscenarioswerecompared:intherstscenario,theschedulerdidnotuseclassinformation,andoneofthetenpossiblescheduleswas 42

PAGE 43

Systemthroughputcomparisonsfortendierentschedules1:f(SSS),(PPP),(NNN)g,2:f(SSS),(PPN),(PNN)g,3:f(SSP),(SPP),(NNN)g,4:f(SSP),(SPN),(PNN)g,5:f(SSP),(SNN),(PPN)g,6:f(SSN),(SPP),(PNN)g,7:f(SSN),(SPN),(PPN)g,8:f(SSN),(SNN),(PPP)g,9:f(SPP),(SPN),(SNN)g,10:f(SPN),(SPN),(SPN)gS{SPECseis96(CPU-intensive),P{PostMark(I/O-intensive),N{NetPIPE(Network-intensive). selectedatrandom.Theotherscenariousedapplicationclassknowledge,alwaysallocatingapplicationsofdierentclasses(CPU,I/Oandnetwork)torunonthesamemachine(Schedule10,Figure 2-7 ).ThesystemthroughputsobtainedfromrunsofallpossibleschedulesintheexperimentalenvironmentareshowninFigure 2-7 Theaveragesystemthroughputoftheschedulechosenwithclassknowledgewas1391jobsperday.Itachievedthehighestthroughputamongthetenpossibleschedules{22.11%largerthantheweightedaverageofthesystemthroughputsofallthetenpossibleschedules.Inaddition,therandomselectionofthepossibleschedulesresultedinlargevariancesofsystemthroughput.Theapplicationclassinformationcanbeusedtofacilitatetheschedulertopicktheoptimalscheduleconsistently.TheapplicationthroughputcomparisonofdierentschedulesononemachineisshowninFigure 2-8 .Itcomparesthe 43

PAGE 44

Applicationthroughputcomparisonsofdierentschedules.MIN,MAX,andAVGaretheminimum,maximum,averageapplicationthroughputofallthetenpossibleschedules.SPNistheproposedschedule10f(SPN),(SPN),(SPN)ginFigure 2-7 Table2-4. Systemthroughput:concurrentvs.sequentialexecutions CH3D PostMark TimeTakento Time(sec) Finish2Jobs Concurrent 310 613 264 752 throughputofscheduleID10(labeledSPNinFigure 2-8 )withtheminimum,maximum,andaveragethroughputsofallthetenpossibleschedules.Byallocatingjobsfromdierentclassestothemachine,thethreeapplications'throughputswerehigherthanaveragebydierentdegrees:SPECseis96Smallby24.90%,Postmarkby48.13%,andNetPIPEby4.29%.Figure 2-8 alsoshowsthatthemaximumapplicationthroughputswereachievedbysub-schedule(SSN)forSPECseis96and(PPN)forNetPIPEinsteadoftheproposed(SPN).However,thelowthroughputsoftheotherapplicationsinthesub-schedulemaketheirtotalthroughputssub-optimal. 44

PAGE 45

2-4 .Theexperimentresultsshowthattheexecutioneciencylossescausedbytherelativelymoderateresourcecontentionsbetweenapplicationsofdierentclasseswereosetbythegainsfromtheutilizationofidlecapacity.Theresourcesharingofapplicationsofdierentclassesimprovedtheoverallsystemthroughput. 2-3 wasrunningonanIntel(R)Pentium(R)4CPU1.70GHzmachinewith512MBmemory.Inaddition,theapplicationclassierwasrunningonanIntel(R)Pentium(R)III750MHzmachinewith256MBRAM. Inthisexperiment,atotalof8000snapshotsweretakenwithve-secondintervalsforthevirtualmachine,whichhostedtheexecutionofSPECseis96(medium).Ittooktheperformancelter72secondstoextracttheperformancedataofthetargetapplicationVM.Inaddition,ittookanother50secondsfortheclassicationcentertotraintheclassier,performthePCAfeatureselectionandtheapplicationclassication.Thereforetheunitclassicationcostis15mspersampledata,demonstratingthatitispossibletoconsidertheclassierforonlinetraining. 39 ][ 25 ]andclassicationtechniqueshavebeenappliedtomanyareassuccessfully,suchasintrusiondetection[ 40 ][ 41 ][ 42 ][ 43 ],textcategorization[ 44 ],andimageandspeechanalysis.Kapadia'sevaluationoflearningalgorithmsforapplicationperformancepredictionin[ 45 ]showsthatthenearest-neighboralgorithmhasbetter 45

PAGE 46

45 ].ThisthesisdiersfromKapadia'sworkinthefollowingways:First,theapplicationclassknowledgeisusedtofacilitatetheresourceschedulingtoimprovetheoverallsystemthroughputincontrastwithKapadia'swork,whichfocusesonapplicationCPUtimeprediction.Second,theapplicationclassiertakesperformancemetricsasinputs.Incontrast,in[ 45 ]theCPUtimepredictionisbasedontheinputparametersoftheapplication.Third,theapplicationclassieremploysPCAtoreducethedimensionalityoftheperformancefeaturespace.Itisespeciallyhelpfulwhenthenumberofinputfeaturesoftheclassierisnottrivial. Condorusesprocesscheckpointandmigrationtechniques[ 20 ]toallowanallocationtobecreatedandpreemptedatanytime.Thetransferofcheckpointsmayoccupysignicantnetworkbandwidth.Basney'sstudyin[ 46 ]showsthatco-schedulingofCPUandnetworkresourcescanimprovetheCondorresourcepool'sgoodput,whichisdenedastheallocationtimewhenaremotelyexecutingapplicationusestheCPUtomakeforwardprogress.Theapplicationclassierpresentedinthisthesisperformslearningofapplication'sresourceconsumptionofmemoryandI/OinadditiontoCPUandnetworkusage.Itprovidesawaytoextractthekeyperformancefeaturesandgenerateanabstractoftheapplicationresourceconsumptionpatternintheformofapplicationclass.Theapplicationclassinformationandresourceconsumptionstatisticscanbeusedtogetherwithrecentmulti-lateralresourceschedulingtechniques,suchasCondor'sGang-matching[ 47 ],tofacilitatetheresourceschedulingandimprovesystemthroughput. ConservativeScheduling[ 4 ]usesthepredictionoftheaverageandvarianceoftheCPUloadofsomefuturepointoftimeandtimeintervaltofacilitatescheduling.Theapplicationclassiersharesthecommontechniqueofresourceconsumptionpatternanalysisofatimewindow,whichisdenedasthetimeofoneapplicationrun.However,theapplicationclassieriscapabletotakeintoaccountusagepatternsofmultiplekindsofresources,suchasCPU,I/O,networkandmemory. 46

PAGE 47

48 ]usesasyntheticskeletonprogramtoreproducetheCPUutilizationandcommunicationbehaviorsofmessagepassingparallelprogramstopredictapplicationperformance.Incontrast,theapplicationclassierprovidesapplicationbehaviorlearninginmoredimensions. Prophesy[ 49 ]employsaperformance-modelingcomponent,whichusescouplingparameterstoquantifytheinteractionsbetweenkernelsthatcomposeanapplication.However,tobeabletocollectdataatthelevelofbasicblocks,procedures,andloops,itrequiresinsertionofinstrumentationcodeintotheapplicationsourcecode.Incontrast,theclassicationapproachusesthesystemperformancedatacollectedfromtheapplicationhosttoinfertheapplicationresourceconsumptionpattern.Itdoesnotrequirethemodicationoftheapplicationsourcecode. Statisticalclusteringtechniqueshavebeenappliedtolearnapplicationbehavioratvariouslevels.Nickolayevetalappliedclusteringtechniquestoecientlyreducetheprocessoreventtracedatavolumeinclusterenvironment[ 50 ].AhnandVetterconductedapplicationperformanceanalysisbyusingclusteringtechniquestoidentifytherepresentativeperformancecountermetrics[ 51 ].BothCohenandChase's[ 52 ]andourworkperformstatisticalclusteringusingsystem-levelmetrics.However,theirworkfocusesonsystemperformanceanomalydetection.Ourworkfocusesonapplicationclassicationforresourcescheduling. Ourworkcanbeusedtolearntheresourceconsumptionpatternsofparallelapplication'schildprocessandmulti-stageapplication'ssub-stage.However,inthisstudywefocusonsequentialandsingle-stageapplications. 47

PAGE 48

Inthiswork,theinputperformancemetricsareselectedmanuallybasedonexpertknowledge.Inthenextchapter,thetechniquesforautomaticallyselectingfeaturesforapplicationclassicationarediscussed. 48

PAGE 49

Applicationclassicationtechniquesbasedonmonitoringandlearningofresourceusage(e.g.,CPU,memory,disk,andnetwork)havebeenproposedinChapter2toaidinresourceschedulingdecisions.Animportantproblemthatarisesinapplicationclassiersishowtodecidewhichsubsetofnumerousperformancemetricscollectedfrommonitoringtoolsshouldbeusedfortheclassication.Thischapterpresentsanapproachbasedonaprobabilisticmodel(BayesianNetwork)tosystematicallyselecttherepresentativeperformancefeatures,whichcanprovideoptimalclassicationaccuracyandadapttochangingworkloads. 53 ].Well-knownmonitoringtoolssuchastheopensourcepackagesGanglia[ 54 ]anddproc[ 55 ],andcommercialproductssuchasHP'sOpenView[ 56 ]providethecapabilityofmonitoringarichsetofsystemlevelperformancemetrics.Animportantproblemthatarisesishowtodecidewhichsubsetofnumerousperformancemetricscollectedfrommonitoringtoolsshouldbeusedfortheclassicationinadynamicenvironment.Inthischapterweaddressthisproblem.Ourapproachisbasedonautonomicfeatureselectionandcanhelptoimprovethesystem'sself-manageability[ 1 ]byreducingtherelianceonexpertknowledgeandincreasingthesystem'sadaptability. TheneedforautonomicfeatureselectionandapplicationclassicationismotivatedbysystemssuchasVMPlant[ 16 ],whichprovidesautomatedresourceprovisioningofVirtualMachine(VM).InthecontextofVMPlant,theapplicationcanbescheduledtorunonadedicatedvirtualmachine,whosesystemlevelperformancemetricsreecttheapplication's 49

PAGE 50

Tobuildanautonomicclassicationsystemwithself-congurability,itiscriticaltodeviseasystematicfeatureselectionschemethatcanautomaticallychoosethemostrepresentativefeaturesforapplicationclassicationandadapttochangingworkloads.Thischapterpresentsanapproachofusingaprobabilisticmodel,theBayesianNetwork,toautomaticallyselecttheperformancemetricsthatcorrelatewithapplicationclassesandoptimizetheclassicationaccuracy.TheapproachalsousestheMahalanobisdistancetosupportonlineselectionoftrainingdata,whichenablesthefeatureselectiontoadapttodynamicworkloads.Intherestofthisdissertation,wewillusetheterms\metrics"and\features"interchangeably. Inchapter2,asubsetofperformancemetricsweremanuallyselectedbasedonexpertknowledgetocorrelatetotheresourceconsumptionbehavioroftheapplicationclass.However,expertknowledgeisnotalwaysavailable.Incaseofhighlydynamicworkloadormassvolumeofperformancedata,theapproachofmanualcongurationbyhumanexpertisalsonotfeasible.Thesepresentaneedforasystematicwaytoselecttherepresentativemetricsintheabsenceofsucientexpertknowledge.Ontheotherhand,theuseoftheBayesianNetworkleavestheoptionopentointegrateexpertknowledgewiththeautomaticfeatureselectiontoimprovetheclassicationaccuracyandeciency. Featureselectionbasedonstaticselectedapplicationperformancedata,whichareusedasthetrainingset,maynotalwaysprovidetheoptimalclassicationresultsindynamicenvironments.Toenablethefeatureselectiontoadapttothechangingworkload,itrequiresthesystemtobeabletodynamicallyupdatethetrainingsetwithdatafromrecentworkload.Aquestionthatarisesishowtodecidewhichdatashouldbeselectedastrainingdata.Inthiswork,analgorithmbasedonMahalanobisdistanceisused 50

PAGE 51

Ourexperimentalresultsshowthefollowing.First,weobservecorrelationsbetweenpairsofselectedperformancemetrics,whichjustiestheuseofMahalanobisdistanceasameansoftakingthecorrelationintoaccountinthetrainingdataselectionprocess.Second,thereisadiminishingreturnofclassicationutilityfunction(i.e.theratioofclassicationaccuracyoverthenumberofselectedmetrics)asmorefeaturesareselected.Theexperimentsshowedthatabove90%applicationclassicationaccuracycanbeachievedwithasmallsubsetofperformancemetricswhicharehighlycorrelatedwiththeapplicationclass.Third,theapplicationclassicationbasedontheselectedfeaturesforasetofbenchmarkprogramsandscienticapplicationsmatchedourempiricalexperiencewiththeseapplications. Therestofthechapterisorganizedasfollows:ThestatisticaltechniquesusedaredescribedinSection 3.2 .Section 3.3 presentsthefeatureselectionmodel.Section 3.4 presentsanddiscussestheexperimentalresults.Section 3.5 discussesrelatedwork.ConclusionsandfutureworkarediscussedinSection 3.6 3.2.1FeatureSelection 57 ].Subsetgenerationisaprocessofheuristicsearchofcandidatesubsets.Eachsubsetisevaluatedbasedontheevaluationcriterion.Thentheevaluationresultiscomparedwiththepreviouslycomputedbestresult.Ifitisbetter,itwillreplacethebestresultandtheprocesscontinuesuntilthestopcriterionisreached.Theselectionresultisvalidatedbydierenttestsorpriorknowledge. 51


A Bayesian Network (BN) is a directed graphical model that encodes probabilistic relationships among a set of variables [58]. It can be used to compute the conditional probability of a node, given the values of its predecessors; hence, a BN can be used as a classifier that gives the posterior probability distribution of the class decision node given the values of the other nodes.

Bayesian Networks are based on the work of the mathematician and theologian Rev. Thomas Bayes, who worked with conditional probability theory in the late 1700s to discover a basic law of probability, which was then called Bayes' rule. Bayes' rule relates a hypothesis, past experience, and evidence:

    P(H | E, c) = P(E | H, c) P(H | c) / P(E | c)

where we can update our belief in hypothesis H given the additional evidence E and the background context (past experience) c.

The left-hand term, P(H | E, c), is called the posterior probability, or the probability of hypothesis H after considering the effect of the evidence E on past experience c.

The term P(H | c) is called the a-priori probability of H given c alone.

The term P(E | H, c) is called the likelihood and gives the probability of the evidence assuming the hypothesis H and the background information c is true.
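As a small numeric illustration of the rule (the values below are hypothetical and not taken from the experiments), the following Python snippet updates a prior belief given new evidence:

    # Hypothetical values: prior belief P(H|c), likelihood P(E|H,c),
    # and overall evidence probability P(E|c).
    prior = 0.2
    likelihood = 0.9
    evidence = 0.3

    # Bayes' rule: P(H|E,c) = P(E|H,c) * P(H|c) / P(E|c)
    posterior = likelihood * prior / evidence
    print(posterior)  # 0.6

Observing evidence that is much more probable under the hypothesis than in general triples the belief in the hypothesis in this example.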


Bayesian Networks capture Bayes' rule in a graphical model. They are very effective for modeling situations where some information is already known and incoming data are uncertain or partially unavailable (unlike rule-based or "expert" systems, where uncertain or unavailable data result in ineffective or inaccurate reasoning). This robustness in the face of imperfect knowledge is one of the many reasons why Bayesian Networks are increasingly used as an alternative to other AI representational formalisms. Bayesian networks have been applied to many areas successfully, including map learning [59], medical diagnosis [60][61], and speech and vision processing [62][63]. Compared with other predictive models, such as decision trees and neural networks, and with the standard feature selection model that is based on Principal Component Analysis (PCA), Bayesian networks also have the advantage of interpretability. Human experts can easily understand the network structure and modify it to obtain better predictive models. By adding decision nodes and utility nodes, BN models can also be extended to decision networks for decision analysis [64].

Consider a domain U of n variables, x_1, ..., x_n. Each variable may be discrete, having a finite or countable number of states, or continuous. Given a subset X of variables x_i, where x_i ∈ U, if one can observe the state of every variable in X, then this observation is called an instance of X and is denoted as X = k_X for the observations x_i = k_i, x_i ∈ X. The "joint space" of U is the set of all instances of U. p(X = k_X | Y = k_Y, ξ) denotes the "generalized probability density" that X = k_X given Y = k_Y for a person with current state information ξ. p(X | Y, ξ) then denotes the "Generalized Probability Density Function" (gpdf) for X, given all possible observations of Y. The joint gpdf over U is the gpdf for U.

A Bayesian network for domain U represents a joint gpdf over U. This representation consists of a set of local conditional gpdfs combined with a set of conditional independence assertions that allow the construction of a global gpdf from the local gpdfs. As shown previously, the chain rule of probability can be used to ascertain these values:

    p(x_1, ..., x_n | ξ) = \prod_{i=1}^{n} p(x_i | x_1, ..., x_{i-1}, ξ)        (3-1)

One assumption imposed by Bayesian Network theory (and indirectly by the product rule of probability theory) is that for each variable x_i there must be some subset Π_i ⊆ {x_1, ..., x_{i-1}} that renders x_i and {x_1, ..., x_{i-1}} conditionally independent. In this way:

    p(x_i | x_1, ..., x_{i-1}, ξ) = p(x_i | Π_i, ξ)        (3-2)

A Bayesian Network Structure then encodes the assertions of conditional independence in Equation 3-1 above. Essentially then, a Bayesian Network Structure B_S is a directed acyclic graph such that each variable in U corresponds to a node in B_S, and the parents of the node corresponding to x_i are the nodes corresponding to the variables in Π_i.

Depending on the problem that is defined, either (or both) of the topology and the probability distribution of the Bayesian Network can be pre-defined by hand or may be learned from the data.

Figure 3-1. Sample Bayesian network generated by the feature selector.


Figure 3-1 gives a sample BN learned in the experiment. The root is the application class decision node, which is used to decide an application class given the values of the leaf nodes. The root node is the parent of all other nodes. The leaf nodes represent selected performance metrics, such as network packets sent and bytes written to disk. They are connected one to another in a series.

The Mahalanobis distance is a distance measure that takes the covariance structure of the data into account [22][65]. For example, if x_1 and x_2 are two points from the distribution which is characterized by covariance matrix Σ, then the quantity

    d(x_1, x_2) = ( (x_1 - x_2)^T Σ^{-1} (x_1 - x_2) )^{1/2}

is called the Mahalanobis distance from x_1 to x_2, where T denotes the transpose of a matrix.

In the cases where there are correlations between variables, simple Euclidean distance is not an appropriate measure, whereas the Mahalanobis distance can adequately account for the correlations and is scale-invariant. Statistical analysis of the performance data in Section 3.4.3 shows that there are correlations of various degrees between the application performance metrics. Therefore, the Mahalanobis distance between the unlabeled performance sample and the class centroid, which represents the average of all existing training data of the class, is used in the training data qualification process in Section 3.3.1.

A confusion matrix [66] is commonly used to evaluate the performance of classification systems. It shows the predicted and actual classifications done by the system. The matrix size is L x L, where L is the number of different classes. In our case, where there are five target application classes, L is equal to 5.


Table 3-1 shows a sample confusion matrix with L = 2. There are only two possible classes in this example: positive and negative. Therefore, its classification accuracy can be calculated as (a + d)/(a + b + c + d).

Figure 3-2 shows the autonomic feature selection framework in the context of application classification. In this section, we focus on introducing the classification training center, which enables the self-configurability for online application classification. The training center has two major functions: quality assurance of training data, which enables the classifier to adapt to changing workloads, and systematic feature selection, which supports automatic feature selection. The training center consists of three components: the data quality assuror, the feature selector, and the trainer.

The training data pool consists of representative data of five application classes including CPU-intensive, I/O-intensive, memory-intensive, network-intensive, and idle. Training data of each class c is a set of K_c m-dimensional points, where m is the number of application-specific performance metrics reported by the monitoring tools.

Table 3-1. Sample confusion matrix with two classes (L = 2)

                        Predicted
    Actual class        Negative    Positive
    Negative            a           b
    Positive            c           d
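The accuracy computation generalizes directly to the L x L case: correct classifications lie on the diagonal. A minimal Python sketch (the counts below are hypothetical):

    # Confusion matrix from Table 3-1: rows are actual classes,
    # columns are predicted classes (order: negative, positive).
    matrix = [[50, 5],    # a, b (hypothetical counts)
              [3, 42]]    # c, d

    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    accuracy = correct / total   # (a + d) / (a + b + c + d)
    print(accuracy)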


Figure 3-2. Feature selection model. The Performance profiler collects performance metrics of the target application node. The Application classifier classifies the application using extracted key components and performs statistical analysis of the classification results. The Data QA selects the training data for the classification. The Feature selector selects performance metrics which can provide optimal classification accuracy. The Trainer trains the classifier using the selected metrics of the training data. The Application DB stores the application class information. (t0/t1 are the beginning/ending times of the application execution; VMIP is the IP address of the application's host machine.)

To select the training data from the application snapshots, only n out of m metrics are extracted based on the previous feature selection result to form a set of K_c n-dimensional training points that comprise a cluster C_c. From [50], it follows that the n-tuple

    \bar{x}_c = ( (1/K_c) \sum_{t=1}^{K_c} x_{1c}(t), ..., (1/K_c) \sum_{t=1}^{K_c} x_{nc}(t) )        (3-5)


is called the centroid of the cluster C_c.

The training data selection is a three-step process. First, the Data QA extracts the n out of m metrics of the input performance snapshot to form a training data candidate. Thus each candidate is represented by an n-dimensional point x = (x_1, x_2, ..., x_n). Second, it evaluates whether the input candidate is qualified to be training data representing one of the application classes. At last, the qualified training data candidate is associated with a scalar value Class, which defines the application class.

The first step is straightforward. In the second and third steps, the Mahalanobis distance between the training data candidate x and the centroid μ_c of cluster C_c is calculated as follows:

    d_c(x) = ( (x - μ_c)^T Σ_c^{-1} (x - μ_c) )^{1/2}

where c = 1, 2, ..., 5 represents the application class and Σ_c^{-1} denotes the inverse covariance matrix of the cluster C_c. The distance from the training data candidate x to the boundary between two class clusters, for example C_1 and C_2, is |d_1(x) - d_2(x)|. If |d_1(x) - d_2(x)| = 0, it means that the candidate is exactly at the boundary between classes 1 and 2. The further away the candidate is from the class boundaries, the better it can represent a class. In other words, there is less probability for it to be misclassified. Therefore, the Data QA calculates the distance from the candidate to the boundaries of all possible pairs of classes. If the minimal distance to the class boundaries, min(|d_1 - d_2|, |d_1 - d_3|, ..., |d_4 - d_5|), is bigger than a predefined threshold, the corresponding m-dimensional snapshot of the candidate is determined to be qualified training data of the class whose centroid has the smallest Mahalanobis distance, min(d_1, d_2, ..., d_5), to the snapshot.


Automated and adaptive threshold setting is discussed in detail in [67].

In our implementation, Ganglia is used as the monitoring tool and twenty (m = 20) performance metrics, which are related to resource usage, are included in the training data. These performance metrics include 16 out of the 33 default metrics monitored by Ganglia and 4 metrics that we added based on the needs of classification. The four added metrics include the number of I/O blocks read from/written to disk, and the number of memory pages swapped in/out. A program was developed to collect these four metrics (using vmstat) and add them to the metric list of Ganglia's monitoring daemon gmond. Table 3-2 shows some sample performance metrics of the training candidate.

Table 3-2. Sample performance metrics in the original feature set

    Metrics                   Description
    cpu_system/user/idle      Percent CPU system/user/idle
    cpu_nice                  Percent CPU nice
    bytes_in/out              Number of bytes per second into/out of the network
    io_bi/bo                  Blocks sent to/received from a block device (blocks/s)
    swap_in/out               Amount of memory swapped in/out from/to disk (kB/s)
    pkts_in/out               Packets in/out per second
    proc_run                  Total number of running processes
    load_one/five/fifteen     One/five/fifteen minute load average

The first-time quality assurance was performed by a human expert at initialization. The subsequent assurance can be conducted automatically by following the above steps to select representative training data for each class.
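To make the qualification procedure concrete, the following is a minimal Python sketch of the Data QA step described above (numpy is assumed; the class centroids, inverse covariance matrices, and threshold are hypothetical inputs estimated from the existing training data pool):

    import numpy as np

    def mahalanobis(x, centroid, inv_cov):
        d = x - centroid
        return float(np.sqrt(d @ inv_cov @ d))

    def qualify(x, centroids, inv_covs, threshold):
        """Return the class label for candidate x, or None if it lies
        too close to a class boundary to be useful as training data."""
        dists = {c: mahalanobis(x, centroids[c], inv_covs[c])
                 for c in centroids}
        labels = list(dists)
        # distance to the boundary between classes a and b is |d_a - d_b|
        min_boundary = min(abs(dists[a] - dists[b])
                           for i, a in enumerate(labels)
                           for b in labels[i + 1:])
        if min_boundary > threshold:
            return min(dists, key=dists.get)  # nearest class centroid
        return None

Here each centroid is the cluster mean of Equation 3-5 and inv_covs[c] is Σ_c^{-1}; a candidate that passes the boundary test is added to the training pool under the returned label.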


    Input:  C       training data set with N features
            Class   class of training data (teacher for learning)
    Output: S_best  selected feature subset
            A_max   maximum accuracy

    D = discretize(C);                       // convert continuous to discrete features
    repeat
        initialize A_node = 0;               // max accuracy for each node
        initialize F_node = 0;               // selected feature for each node
        foreach F in ({F_0, F_1, ..., F_{N-1}} \ S_best) do
            Accuracy = eval(D, Class, S_best ∪ {F});   // evaluate Bayesian network with extra feature F
            if Accuracy > A_node then        // store the current feature
                A_node = Accuracy; F_node = F;
            end
        end
        if A_node > A_max then
            S_best = S_best ∪ {F_node}; A_max = A_node; node = node + 1;
        end
    until (A_node <= A_max);
    end

Figure 3-3. Bayesian-network based feature selection algorithm for application classification.

The feature selector identifies the most representative subset of the numerous performance metrics collected from monitoring tools. By filtering out metrics which contribute less to the classification, it helps to not only reduce the computational complexity of subsequent classifications, but also improve classification accuracy.

In our previous work [53], representative features were selected manually based on expert knowledge. For example, the performance metrics cpu_system and cpu_user are correlated to the behavior of CPU-intensive applications; bytes_in and bytes_out are correlated to network-intensive applications; io_bi and io_bo are correlated to I/O-intensive applications; swap_in and swap_out are correlated to memory-intensive applications. However, to support on-line classification, it is necessary for feature selection to have the ability to adapt to changing workloads. Therefore, the static selection of features based on expert knowledge cannot satisfy the needs of classification under dynamic workloads.


A wrapper algorithm based on the Bayesian network is employed by the feature selector to conduct the feature selection. As introduced in Section 3.2.1, although this feature selection scheme reduces the reliance on human experts' knowledge, the Bayesian network's interpretability leaves the options open to integrate the expert knowledge into the selection scheme to build a better classification model.

Figure 3-3 shows the feature selection algorithm. It starts with an empty feature subset S_best = {}. To search for the best feature F, it uses the temporary feature set {S_best ∪ F} to perform Bayesian Network classification for the discrete training data D. The classification accuracy is calculated by comparing the classification result and the true answer of the Class information contained in the training data. After the evaluation of accuracy using all remaining features ({F_1, F_2, ..., F_{N-1}} \ S_best), the best accuracy is stored in A_node. If A_node is better than the previous best accuracy A_max achieved, the corresponding feature node is added to the feature subset to form the new subset. This process is repeated until the classification accuracy cannot be improved any more by adding any of the remaining features into the subset.
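The greedy forward search of Figure 3-3 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: `features` is a set of candidate feature names, and `evaluate(subset)` is a stand-in for training and scoring the Bayesian network classifier on that subset (any classifier-accuracy function works here):

    def select_features(features, evaluate):
        """Greedy forward wrapper selection, following Figure 3-3.
        Returns the selected subset and its accuracy."""
        s_best, a_max = set(), 0.0
        while True:
            a_node, f_node = 0.0, None
            for f in features - s_best:
                acc = evaluate(s_best | {f})   # accuracy with one extra feature
                if acc > a_node:
                    a_node, f_node = acc, f
            if f_node is None or a_node <= a_max:
                return s_best, a_max           # no further improvement
            s_best.add(f_node)
            a_max = a_node

The loop stops exactly when adding any remaining feature fails to improve accuracy, which is the stopping criterion described above.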


In the experiments, all the applications were executed in a VMware GSX 2.5 virtual machine with 256 MB memory. The virtual machine was hosted on an Intel(R) Xeon(TM) dual-CPU 1.80 GHz machine with 512 KB cache and 1 GB RAM. The CTC and application classifier were running on an Intel(R) Pentium(R) III 750 MHz machine with 256 MB RAM.

The first experiment was designed to show the relationship between classification accuracy and the number of features selected. The second experiment was designed to study classification with a reduced number of application classes.


Figure 3-4. Average classification accuracy of 10 sets of test data versus number of features selected in the first experiment.

Figure 3-5. Two-class test data distribution with the first two selected features.


In the first experiment, the training data consist of performance snapshots of five classes of applications, including CPU-intensive, I/O-intensive, memory-intensive, and network-intensive applications, and the snapshots collected from an idle application VM, which has only "background noise" from system activity (i.e., without any application execution during the monitoring period). The feature selector's task is to select those metrics which can be used to classify the test set into five classes with optimal accuracy.

In all the ten iterations of cross validation, two performance metrics (cpu_system and load_fifteen) were always selected as the best two features. Figure 3-6 shows a sample test data distribution with these two features. If we project the data to the x-axis or y-axis, we can see that it is more difficult to differentiate the data from each class by using either cpu_system or load_fifteen alone than by using both metrics. For example, the cpu_system value ranges of the network-intensive application and the I/O-intensive application largely overlap. This makes it hard to classify these two applications with only the cpu_system metric. Compared with the one-metric classification, it is much easier to decide which class the test data belong to by using information from both metrics. In other words, the combination of multiple features is more descriptive than a single feature.

The classification accuracy versus the number of features selected for the above learned Bayesian network is plotted in Figure 3-4. It shows that with a small number of features (3 to 4), it can achieve above 90% classification accuracy for this 5-class classification.

In the second experiment, the training data consist of performance snapshots of two classes of applications, I/O-intensive and memory-intensive. Figure 3-5 shows its test data distribution with the first two selected features, bytes_in and pkts_in. A comparison of Figure 3-6 and Figure 3-5 shows that with a reduced number of application classes, higher classification accuracy can be achieved with a smaller number of features.


For example, in this experiment, if we know that the application belongs to either the I/O-intensive or the memory-intensive class, with two selected features, 96% classification accuracy can be achieved versus 87% accuracy in the 5-class case. It shows the potential of using pair-wise classification to improve the classification accuracy for multi-class cases. Using the pair-wise approach for multi-class classification is a topic of future research.

Table 3-3. Confusion matrices of classification results with the expert-selected and automatically selected feature sets. A) Automatic, B) Expert. The bold numbers along the diagonal are the numbers of correctly classified data.

This section compares the automatic feature selection with the manual selection performed by human experts in previous work [53]. First, the training data distributions based on principal components, which are derived from the automatically selected features in Section 3.4.1 and the manually selected features in previous work [53], are shown in Figure 3-8. Distances between each pair of class centroids in Figure 3-8 are calculated and plotted in Figure 3-7.


Figure 3-6. Five-class test data distribution with the first two selected features.

Figure 3-7. Comparison of distances between cluster centers derived from expert-selected and automatically selected feature sets. 1: idle-cpu, 2: idle-I/O, 3: idle-net, 4: idle-mem, 5: cpu-I/O, 6: cpu-net, 7: cpu-mem, 8: I/O-net, 9: I/O-mem, 10: net-mem.


Figure 3-8. Training data clustering diagrams derived from expert-selected and automatically selected feature sets. A) Automatic, B) Expert.


Second, the PCA and k-NN based classifications were conducted with both the expert-selected 8 features in previous work [53] and the automatically selected features in Section 3.4.1. Table 3-3 shows the confusion matrices of the classification results. If data are classified to the same classes as their actual classes, the classifications are considered as correct. The classification accuracy is the proportion of the total number of classifications that were correct. The confusion matrices show that a classification accuracy of 98.05% can be achieved with the automatically selected feature set, which is similar to the 98.14% accuracy achieved with the expert-selected feature set. Thus the automatic feature selection that is based on the Bayesian Network can reduce the reliance on expert knowledge while offering competitive classification accuracy compared to manual selection by a human expert.

In addition, a set of 8 features selected in the 5-class feature selection experiment in Section 3.4.1 was used to configure the application classifier, and the same training data used in the feature selection experiment were used to train the application classifier. Then the trained classifier conducted classification for a set of three benchmark programs: SPECseis96 [29], PostMark and PostMark_NFS [28]. SPECseis96 is a scientific application which is computing-intensive but also exercises disk I/O in the initial and end phases of its execution. PostMark originally is a disk I/O benchmark program. In PostMark_NFS, a network file system (NFS) mounted directory was used to store the files which were read/written by the benchmark. Therefore, PostMark_NFS performs substantial network I/O rather than disk I/O. The classification results are shown in Figure 3-9. The results show that 86% of the SPECseis96 test data were classified as CPU-intensive, 95% of the PostMark data were classified as I/O-intensive, and 61% of the PostMark_NFS data were classified as network-intensive.


Figure 3-9. Classification results of benchmark programs. A) SPECseis96, B) PostMark, C) PostMark_NFS. Principal components 1 and 2 are the principal component metrics extracted by PCA.


The results matched our empirical experience with these programs and are close to the results of the expert-selected-feature based classification, which shows 85% CPU-intensive for SPECseis96, 97% I/O-intensive for PostMark, and 62% network-intensive for PostMark_NFS.

Table 3-4. Performance metric correlation matrices of test applications. Metrics: 1 - load_five, 2 - pkts_in, 3 - cpu_system, 4 - load_fifteen, 5 - pkts_out, 6 - bytes_out. Correlations that are larger than 0.5 are highlighted with bold characters.

A) Correlation matrix of SPECseis96 performance data

    Metric   1      2      3      4      5      6
    1        1.00   -0.21  -0.34  —      0.20   -0.02
    2        -0.21  1.00   -0.16  -0.02  -0.17  -0.06
    3        -0.34  -0.16  1.00   —      0.20   -0.05
    4        —      -0.02  —      1.00   -0.19  0.04
    5        0.20   -0.17  0.20   -0.19  1.00   0.12
    6        -0.02  -0.06  -0.05  0.04   0.12   1.00

B) Correlation matrix of PostMark performance data

    Metric   1      2      3      4      5      6
    1        1.00   -0.24  0.22   0.34   -0.08  -0.13
    2        -0.24  1.00   -0.22  0.18   0.04   -0.02
    3        0.22   -0.22  1.00   0.33   0.30   0.18
    4        0.34   0.18   0.33   1.00   0.42   0.47
    5        -0.08  0.04   0.30   0.42   1.00   0.20
    6        -0.13  -0.02  0.18   0.47   0.20   1.00

C) Correlation matrix of NetPIPE performance data

    Metric   1      2      3      4      5      6
    1        1.00   0.29   0.31   0.48   0.27   0.30
    2        0.29   1.00   0.49   0.39   0.95   0.31
    3        0.31   0.49   1.00   0.50   —      —
    4        0.48   0.39   0.50   1.00   0.42   0.39
    5        0.27   0.95   —      0.42   1.00   —
    6        0.30   0.31   —      0.39   —      1.00


The data quality assuror classifies each unlabeled test datum by identifying its nearest neighbor among all class centroids. Its performance thus depends crucially on the distance metric used to identify the nearest class centroid. In fact, a number of researchers have demonstrated that nearest neighbor classification can be greatly improved by learning an appropriate distance metric from labeled examples [65].

Table 3-4 shows the correlation coefficients of each pair of the first six performance metrics collected during the application execution, including load_five, pkts_in, cpu_system, load_fifteen, pkts_out, and bytes_out. Three applications are used in these experiments: SPECseis96 [29], PostMark [28] and NetPIPE [34].

The experiments show that there are correlations of various degrees between pairs of performance metrics. For example, NetPIPE's bytes_out metric is highly correlated with its pkts_in, pkts_out, and cpu_system metrics. In the cases where there are correlations between metrics, a distance metric which can take the correlation into account when determining the distance from the class centroid should be used. Therefore, the Mahalanobis distance is used in the training data selection process.

Feature selection [39][68] and classification techniques have been applied to many areas successfully, such as intrusion detection [69][40][42], text categorization [44], speech and image processing [62][63], and medical diagnosis [60][61].

The following works applied these techniques to analyze system performance. However, they differ from each other in the following aspects: the goals of feature selection, the features under study, and the implementation complexity.

Nickolayev et al. used statistical clustering techniques to identify the representative processors for parallel application performance tuning [50]. Only event tracing of the selected representative processors is then required.


Ahn et al. applied various statistical techniques to extract the important performance counter metrics for application performance analysis [51]. Their prototype can support parallel applications' performance analysis by collecting and aggregating local data. It requires annotation of application source code as well as appropriate operating system and library support to collect process information, which is based on hardware counters.

Cohen et al. [52] studied the correlation between component performance metrics and SLO violations in Internet server platforms. There are some similarities between their work and ours in terms of the level of performance metrics under study and the type of classifier used. However, our study differs from theirs in the following ways. First, our study focuses on application classification (CPU-intensive, I/O- and paging-intensive, and network-intensive) for resource scheduling. Their study focused on performance anomaly detection (SLO violation and compliance). Second, our prototype targets support for online classification. It addressed the training data qualification problem to adapt the feature selection to changing workloads. However, online training data selection problems were not the focus of [52]. Third, in our prototype, virtual machines were used to host application executions and summarize applications' resource usage. The prototype supports a wide range of applications, such as scientific programs and business online transaction systems. [52] studied web applications in three-tier client/server systems.

In addition to [52], Aguilera et al. [70] and Magpie [71] also studied performance analysis of distributed systems. However, they considered message-level traces of system activities instead of system-level performance metrics. Both of them treated components of distributed systems as black boxes. Therefore, their approaches do not require application and middleware modifications.




The integration of multiple predictors promises higher prediction accuracy than the accuracy that can be obtained with a single predictor. The challenge is how to select the best predictor at any given moment. Traditionally, multiple predictors are run in parallel and the one that generates the best result is selected for prediction. In this chapter, we propose a novel approach for predictor integration based on the learning of historical predictions. Compared with the traditional approach, it does not require running all the predictors simultaneously. Instead, it uses classification algorithms such as k-Nearest Neighbor (k-NN) and Bayesian classification and the dimension reduction technique of Principal Component Analysis (PCA) to forecast the best predictor for the workload under study based on the learning of historical predictions. Then only the forecasted best predictor is run for prediction.

Grid computing [72] enables entities to create a Virtual Organization (VO) to share their computational resources such as CPU time, memory, network bandwidth, and disk bandwidth. Predicting the dynamic resource availability is critical to adaptive resource scheduling. However, determining the most appropriate resource prediction model a priori is difficult due to the multi-dimensionality and variability of system resource usage. First, the applications may exercise the use of different types of resources during their executions. Some resource usages such as CPU load may be relatively smooth whereas others such as network bandwidth are burstier. It is hard to find a single prediction model which works best for all types of resources. Second, different applications may have different resource usage patterns. The best prediction model for a specific resource of one machine may not work best for another machine. Third, the resource performance fluctuates dynamically due to the contention created by competing applications. Indeed, in the absence of a perfect prediction model, the best predictor for any particular resource may change over time.


Our experimental results based on the analysis of a set of virtual machine trace data show:

1. The best prediction model is workload specific. In the absence of a perfect prediction model, it is hard to find a single predictor which works best across virtual machines which have different resource usage patterns.

2. The best prediction model is resource specific. It is hard to find a single predictor which works best across different resource types.

3. The best prediction model for a specific type of resource of a given VM trace varies as a function of time. The LARPredictor can adapt the predictor selection to the change of the resource consumption pattern.

4. In the experiments with a set of trace data, the LARPredictor outperformed the observed single best predictor in the pool for 44.23% of the traces and outperformed the cumulative-MSE based prediction model used in the Network Weather Service system


[73] for 66.67% of the traces. It has the potential to consistently outperform any single predictor for variable workloads and achieve 18.63% lower MSE than the model used in the NWS.

The rest of the chapter is organized as follows: Section 4.2 gives an overview of related work. Section 4.4 describes the linear time series prediction models used to construct the LARPredictor and Section 4.5 describes the learning techniques used for predictor selection. Section 4.6 details the workflow of the learning-aided adaptive resource predictor. Section 4.7 discusses the experimental results. Section 4.8 summarizes the work and describes future directions.

Time series prediction techniques have been applied successfully in many domains [74], including biomedical signal processing [75] and geoscience [76]. In this work, we focus on time series modeling for computer resource performance prediction.

In [77] and [78], Dinda et al. conducted extensive studies of the statistical properties and the predictions of host load. Their work indicates that CPU load is strongly correlated over time, which implies that history-based load prediction schemes are feasible. They evaluated the predictive power of a set of linear models including autoregression (AR), moving average (MA), autoregression integrated moving average (ARIMA), autoregression fractionally integrated moving average (ARFIMA), and window-mean models. Their results show that the AR model is the best in terms of high prediction accuracy and low overhead among the models they studied. Based on their conclusion, the AR model is included in our predictor pool to leverage its performance.

To improve the prediction accuracy, various adaptive techniques have been exploited by the research community. In [4], Yang et al. developed a tendency-based prediction model that predicts the next value according to the tendency of the time series change. Some increment/decrement value is added to/subtracted from the current measurement, based on the current measurement and some other dynamic information, to predict the next value


[79]. In addition, in [80], Liang et al. proposed a multi-resource prediction model that uses both the autocorrelation of the CPU load and the cross correlation between the CPU load and free memory to achieve higher CPU load prediction accuracy. Vazhkudai et al. [81][82] used linear regression to predict the data transfer time from network bandwidth or disk throughput.

The Network Weather Service (NWS) [73] performs prediction of both network throughput and latency for host machines distributed over different geographic distances. Both the NWS and the LARPredictor use the mix-of-experts approach to select the best predictor at any given moment. However, they differ from each other in the way of best predictor selection. The prediction model used in the NWS system runs a set of predictors in parallel to track their prediction accuracies. A cumulative error measurement, the Mean Square Error (MSE), is calculated for each predictor. The one that generates the lowest prediction error for the known measurements is chosen to make a forecast of future measurement values. Section 4.6 shows that the LARPredictor only uses parallel prediction during the training phase. In the testing phase, it uses the PCA and k-NN classifier to forecast the best predictor for the next value based on the learning of historical prediction performances. Only the forecasted best predictor is run to predict the next value.

The mix-of-experts approach has been applied to the text recognition and categorization area. The combination of multiple classifiers has been proved to be able to increase the recognition rate in difficult problems when compared with a single classifier [83]. Different combination strategies such as weighted voting and probability-based voting, and dimensionality reduction based on concept indexing, are introduced in [84].


Figure 4-1. Virtual machine resource usage prediction prototype. The monitor agent, which is installed in the Virtual Machine Monitor (VMM), collects the VM resource performance data and stores them in the round-robin VM Performance Database. The profiler extracts the performance data of a given time frame for the VM indicated by VM ID and device ID. The LARPredictor selects the best prediction model based on learning of historical predictions, predicts the resource performance for time t+1, and stores the prediction results in the prediction database. The prediction results can be used to support the resource manager in performing dynamic VM resource allocation. The Performance Quality Assuror (QA) audits the LARPredictor's performance and orders re-training for the predictor if the performance drops below a predefined threshold.

Our virtual machine resource prediction prototype, illustrated in Figure 4-1, models how the VM performance data are collected and used to predict the value for a future time to support resource allocation decision-making.

A performance monitoring agent is installed in the Virtual Machine Monitor (VMM) to collect the performance data of the guest VMs. In our implementation, VMware's ESX virtual machines are used to host the application executions and the vmkusage tool [85] of ESX is used to monitor and collect the performance data of the VM guests and host machines.


Table 2-1 shows the list of performance features under study in this work.

The profiler retrieves the VM performance data, which are identified by vmID, deviceID, and a time window, from the round-robin performance database. The data of each VM device's performance metric form a time series (x_{t-m+1}, ..., x_t) with an identical interval, where m is the data retrieval window size. The retrieved performance data with the corresponding timestamps are stored in the prediction database. The tuple [vmID, deviceID, timeStamp, metricName] forms the combinational primary key of the database. Figure 4-2 shows the XML schema of the database and sample database records of virtual machines such as VM1, which has one CPU, two Network Interface Cards (NICs), and two virtual hard disks.

The LARPredictor takes the time series performance data (y_{t-m}, ..., y_{t-1}) as inputs, selects the best prediction model based on the learning of historical prediction results, and predicts the resource performance ŷ_t of the future time. The detailed description of the LARPredictor's workflow is given in Section 4.6. The predicted results are stored in the prediction DB and can be used to support the resource manager's dynamic VM provisioning decision-making.

The Prediction Quality Assuror (QA) is responsible for monitoring the LARPredictor's performance in terms of MSE. It periodically audits the prediction performance by calculating the average MSE of historical prediction data stored in the prediction DB. When the average MSE of the data in the audit window exceeds a predefined threshold, it directs the LARPredictor to re-train the predictors and the classifier using recent performance data stored in the database.


Figure 4-2. Sample XML schema of the VM performance DB.

A time series can be modeled as the output of a linear filter driven by white noise,

    Z_t = μ + \sum_{i=0}^{∞} ψ_i a_{t-i},

where {Z_t} denotes the observed time series, {a_t} denotes an unobserved white noise series, and {ψ_i} denotes the weights. In this thesis, performance snapshots of virtual machine resources including CPU, memory, disk, and network bandwidth are taken periodically to form the time series {Z_t} under study.


Time series analysis provides a systematic framework for modeling such series [86]. Time series analysis techniques have been widely applied to forecasting in many areas such as economic forecasting, sales forecasting, stock market analysis, communication traffic control, and workload projection. In this work, simple time series models, such as LAST, sliding-window average (SW_AVG), and autoregressive (AR), are used to construct the LARPredictor to support online prediction. However, the LARPredictor prototype may be generally used with other prediction models studied in [78][73][4].

LAST model: The LAST model uses the current measurement as the prediction of the next value:

    Ẑ_{t+1} = Z_t

SW_AVG model: The sliding-window average model predicts the future value by taking the average over a fixed-length history:

    Ẑ_{t+1} = (1/m) \sum_{i=0}^{m-1} Z_{t-i}

AR model: The current value of the series Z_t is a linear combination of the p latest past values of itself plus a term a_t, which incorporates everything new in the series at time t that is not explained by the past values:

    Z_t = \sum_{i=1}^{p} φ_i Z_{t-i} + a_t

The Yule-Walker technique is used in the AR model fitting in this work.
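The three predictors are simple enough to sketch directly. The following is a minimal Python sketch (numpy assumed; the window size m and AR order p are illustrative defaults, and the AR coefficients are obtained by solving the Yule-Walker equations on the sample autocovariances of a mean-centered series):

    import numpy as np

    def predict_last(z):
        return z[-1]

    def predict_sw_avg(z, m=5):
        return float(np.mean(z[-m:]))

    def fit_ar(z, p=3):
        """Yule-Walker fit: solve R phi = r for the AR(p) coefficients."""
        z = np.asarray(z, dtype=float) - np.mean(z)
        n = len(z)
        acov = np.array([np.dot(z[:n - k], z[k:]) / n for k in range(p + 1)])
        R = np.array([[acov[abs(i - j)] for j in range(p)] for i in range(p)])
        return np.linalg.solve(R, acov[1:])

    def predict_ar(z, phi):
        p = len(phi)
        zm = np.mean(z)
        # phi[0] multiplies the most recent (lag-1) centered value
        return float(zm + np.dot(phi, (np.asarray(z[-p:], float) - zm)[::-1]))

A caveat worth noting: the Yule-Walker system is singular for a constant series, so in practice a fitted AR model is only meaningful when the trace has nonzero variance.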


In this work, an adaptive prediction model built on the LAST, AR, and SW_AVG predictors is proposed to predict the VM resource performance.

The prediction performance is measured in mean squared error (MSE) [87], which is defined as the average squared difference between independent observations and predictions from the fitted equation for the corresponding values of the independent variables:

    MSE(θ̂) = E[ (θ̂ - θ)^2 ]        (4-5)

where θ̂ is the estimator of a parameter θ in a statistical model.

There are two types of classifiers: nonparametric and parametric. The parametric classifier exploits prior information to model the feature space. When the assumed model is correct, parametric classifiers outperform nonparametric ones. In contrast, the nonparametric classifiers do not make such an assumption and are more robust. However, the nonparametric classifiers tend to suffer from the curse of dimensionality, which means that the demanded number of samples grows exponentially with the dimensionality of the feature space. In this section, we introduce a nonparametric classifier, the k-NN classifier, and a parametric classifier, the Bayesian classifier, which are used for best predictor selection in the LARPredictor.


The k-NN classifier assigns an unlabeled sample to the class that appears most frequently among its k closest training samples. Since the features under study, such as CPU percentage and network received bytes/sec, have different units of measure, all features are normalized to have zero mean and unit variance [88]. In this work, "closest" is determined by the Euclidean distance (Equation 4-6):

    d(x, y) = ( \sum_{i=1}^{n} (x_i - y_i)^2 )^{1/2}        (4-6)

As a nonparametric method, the k-NN classifier can be applied to different time series without modification. To address the problem associated with high dimensionality, various dimension reduction techniques can be used in the data preprocessing.
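A minimal Python sketch of this classifier follows (numpy assumed; train_x is a 2-D array of already-normalized training vectors and train_y holds their class labels):

    import numpy as np

    def knn_predict(train_x, train_y, query, k=3):
        """Classify `query` by majority vote of its k nearest neighbors,
        using the Euclidean distance of Equation 4-6."""
        dists = np.linalg.norm(train_x - query, axis=1)
        nearest = np.argsort(dists)[:k]
        labels, counts = np.unique(np.asarray(train_y)[nearest],
                                   return_counts=True)
        return labels[np.argmax(counts)]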


The Bayesian classifier assigns a sample x to the class ω_j with the highest posterior probability. The posterior probabilities P(ω_j | x) can be computed from the class-conditional densities p(x | ω_j) by Bayes formula:

    P(ω_j | x) = p(x | ω_j) P(ω_j) / p(x),

where, in this case of c categories,

    p(x) = \sum_{j=1}^{c} p(x | ω_j) P(ω_j).

In addition, Bayes formula can be expressed informally in English by saying that

    posterior = (likelihood × prior) / evidence.

The multivariate normal density has been applied successfully to a number of classification problems. In this work the feature vector can be modeled as a multivariate normal random variable. The general multivariate normal density in d dimensions is written as

    p(x) = 1 / ( (2π)^{d/2} |Σ|^{1/2} ) exp( -1/2 (x - μ)^T Σ^{-1} (x - μ) ),

where x is a d-component column vector, μ is the d-component mean vector, Σ is the d-by-d covariance matrix, and |Σ| and Σ^{-1} are its determinant and inverse, respectively. Further, we let (x - μ)^T denote the transpose of (x - μ).

The minimization of the probability of error can be achieved by use of the discriminant functions


    g_i(x) = -1/2 (x - μ_i)^T Σ_i^{-1} (x - μ_i) - (d/2) ln 2π - 1/2 ln |Σ_i| + ln P(ω_i).

The resulting classification is performed by evaluating the discriminant functions. When the workloads have similar statistical properties, the Bayesian classifier derived from one workload trace can be applied to another directly. In case of a highly variable workload, re-training of the classifier is necessary.

Principal Component Analysis (PCA) [22][88], also called the Karhunen-Loeve transform, is a linear transformation representing data in a least-squares sense. The principal components of a set of data in a d-dimensional space provide a sequence of best linear approximations to the data, so that the data can be represented in a lower-dimensional subspace with minimal loss of information.
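The Gaussian discriminant classification described above can be sketched as follows (a minimal Python sketch, numpy assumed; the per-class means, covariance matrices, and priors are hypothetical inputs estimated from training data):

    import numpy as np

    def discriminant(x, mu, cov, prior):
        """g_i(x) for a Gaussian class model, following the equation above."""
        d = len(mu)
        diff = x - mu
        inv = np.linalg.inv(cov)
        return (-0.5 * diff @ inv @ diff
                - 0.5 * d * np.log(2 * np.pi)
                - 0.5 * np.log(np.linalg.det(cov))
                + np.log(prior))

    def bayes_classify(x, params):
        """params maps class label -> (mean, covariance, prior);
        the sample is assigned to the class with the largest g_i(x)."""
        return max(params, key=lambda c: discriminant(x, *params[c]))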

The workflow of the LARPredictor is illustrated in Figure 4-3. The prediction consists of two phases: a training phase and a testing phase. During the training phase, the best predictors for each set of training data are identified using the traditional mix-of-experts approach. During the testing phase, the classifier forecasts the best predictor for the test data based on the knowledge gained from the training data and historical prediction performance. Then only the selected best predictor is run to predict the resource performance. Both phases include the data pre-processing and the Principal Component Analysis (PCA) process.

The features under study in this work, as shown in Table 2-1, include CPU, memory, network bandwidth, and disk I/O usages. Figure 4-4 illustrates how the features are processed to form the prediction database. Since the features have different units of measure, a data pre-processor was used to normalize the input data to zero mean and unit variance. The normalized data are framed according to the prediction window size to feed the PCA processor.

The LAST and SW_AVG models do not involve any unknown parameters. They can be used for predictions directly. The parametric prediction models such as the AR model, which contain unknown parameters, require model fitting. The model fitting is a process to estimate the unknown parameters of the models; the Yule-Walker equation [86] is used in the AR model fitting in this work.


Figure 4-3. Learning-aided adaptive resource predictor workflow. The input data are normalized and framed with the prediction window size m. The Principal Component Analysis (PCA) is used to reduce the dimension of the input data from the window size m to n (n < m).

Figure 4-4. Learning-aided adaptive resource predictor data flow. First, the u training data X_{1×u} are normalized to X'_{1×u} and subsequently framed to X'_{(u-m+1)×m} according to the predictor order m. The PCA processor is used to reduce the dimension of each set of training data from m to n before prediction. Then the predictors are run in parallel with the inputs X''_{(u-m+1)×n} and the one that gives the smallest MSE is identified as the best predictor to be associated with the corresponding training data in the prediction database. The dimension reduction of the testing data is similar to the training data's and is not shown here.

For window-based prediction models, such as SW_AVG and AR, the PCA algorithm is applied to reduce the input data dimension. The naive mix-of-experts approach is used to identify the best predictor p_i for each set of pre-processed training data (e.g., (x'_i, x'_{i+1}, ..., x'_{i+m-1})). All prediction models are run in parallel with the training data and the one which generates the least MSE of prediction is identified as the best predictor p_i, which is a class label taking values in {LAST, AR, SW_AVG}, to be associated with the training data. The u pairs of PCA-processed training data and the corresponding best predictors [(x''_1, p_1), ..., (x''_u, p_u)] form the training data of the classifiers.

As a non-parametric classifier, the k-NN classifier does not have an obvious training phase. The major task of its training phase is to label the training data with class definitions. As a parametric classifier, the Bayesian classifier uses the training data to derive its unknown parameters, which are the mean and covariance matrix of the training data of each class, to form the classification model.
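A minimal Python sketch of the framing and training-phase labeling follows (the `predictors` argument maps names to one-step predictors such as those sketched in Section 4.4; the normalization and PCA steps are omitted for brevity):

    def frame(series, m):
        """Slice a series into sliding windows of length m and the
        observation that follows each window."""
        windows = [series[i:i + m] for i in range(len(series) - m)]
        targets = series[m:]
        return windows, targets

    def label_best_predictors(windows, targets, predictors):
        """Run every predictor on each window and keep the name of the
        one with the smallest squared error, as described in Figure 4-4."""
        labels = []
        for w, truth in zip(windows, targets):
            err = {name: (p(w) - truth) ** 2
                   for name, p in predictors.items()}
            labels.append(min(err, key=err.get))
        return labels

The window/label pairs produced here are exactly the [(x''_i, p_i)] pairs that train the classifiers.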


In the testing phase of the LARPredictor that is based on the k-NN classifier, the Euclidean distances between the PCA-processed test data y''_{t-n}, y''_{t-n+1}, ..., y''_{t-1} and all training data X''_{(u-m+1)×n} in the reduced n-dimensional feature space are calculated, and the k (k = 3 in our implementation) training data which have the shortest distances to the testing data are identified. The majority vote of the k nearest neighbors' best predictors is chosen as the best predictor to predict ŷ'_t based on (y'_{t-m}, y'_{t-m+1}, ..., y'_{t-1}) in case of the AR model or the SW_AVG model, and ŷ'_t = y'_{t-1} in case of the LAST model. The prediction performance can be obtained by comparing the predicted value ŷ'_t with the normalized observed value y'_t.

In the testing phase of the LARPredictor that is based on the Bayesian classifier, test data are preprocessed the same as for the k-NN classifier. The PCA-processed test data y''_{t-n}, ..., y''_{t-1} are plugged into the discriminant function (4-12) derived in Section 4.5.2. The parameters in the discriminant function for each class, the mean vector and covariance matrix, are obtained during the training phase. Then, each test datum is classified as the class of the largest discriminant function.

The testing phase differs from the training phase in that it does not require running multiple predictors in parallel to identify the one which is best suited to the data and gives the smallest MSE. Instead, it forecasts the best predictor by learning from historical predictions. The reasoning here is that these nearest neighbors' workload characteristics are closest to the testing data's, and the predictor that works best for these neighbors should also work best for the testing data.
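Putting the pieces together, a sketch of the k-NN based testing phase (reusing the `knn_predict` sketch from Section 4.5.1 and the predictor functions from Section 4.4; train_windows is assumed to be a numpy array of labeled training frames):

    def lar_forecast(window, train_windows, train_labels, predictors, k=3):
        """Forecast the best predictor for `window` by k-NN majority
        vote, then run only that predictor to produce the next value."""
        best = knn_predict(train_windows, train_labels, window, k)
        return best, predictors[best](window)

Only one predictor executes per forecast, which is the source of the runtime savings over the parallel mix-of-experts approach.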


The traces used in this study were collected from five virtual machines (VM1-VM5). These virtual machines were hosted by a physical machine with an Intel(R) Xeon(TM) 2.0 GHz CPU, 4 GB memory, and a 36 GB SCSI disk. VMware ESX Server 2.5.2 was running on the physical host. The vmkusage tool was run on the ESX server to collect the resource performance data of the guest virtual machines every minute and store them in a round-robin database. The profiler was used to extract the data with given VM ID, Device ID, performance metric, starting and ending timestamps, and intervals. In this experiment, the performance data of a 24-hour period with 5-minute intervals were extracted for VM2, VM3, VM4, and VM5. The data of a 7-day period with 30-minute intervals were extracted for VM1. The data of a given VM ID, Device ID, and performance metric form a time series under study. The time series data were normalized to zero mean and unit variance.


Figure 4-5 shows the predictor selections for the CPU fifteen-minute load average during a 12-hour period with a sampling interval of 5 minutes. The top plot shows the observed best predictor obtained by running the three prediction models in parallel. The middle plot shows the predictor selection of the LARPredictor and the bottom plot shows the cumulative-MSE based predictor selection used in the NWS. Similarly, the predictor selection results for the trace data of other resources are shown as follows: network packets in per second in Figure 4-6, total amount of swap memory in Figure 4-7, and total disk space in Figure 4-8.

These experimental results show that the best prediction model for a specific type of resource of a given trace varies as a function of time. In the experiment, the LARPredictor can better adapt the predictor selection to the changing workload than the cumulative-MSE based approach presented in the NWS. The LARPredictor's average best-predictor forecasting accuracy over all the performance traces of the five virtual machines is 55.98%, which is 20.18% higher than the accuracy of 46.58% achieved by the cumulative-MSE based predictor used in the NWS for the workloads studied.

Section 4.7.2.1 shows the prediction accuracy of the k-NN based LARPredictor and all the predictors in the pool. Section 4.7.2.2 compares the prediction accuracy and execution time of the k-NN based LARPredictor and the Bayesian based LARPredictor. In addition, Section 4.7.2.3 benchmarks the performance of the LARPredictors and the cumulative-MSE based prediction model used in the NWS.

In the experiments, ten-fold cross validation was performed for each set of time series data. A timestamp was randomly chosen to divide the performance data of a virtual machine into two parts: 50% of the data was used to train the LARPredictor and the other 50% was used as a test set to measure the prediction performance by calculating its prediction MSE.


Figure 4-5. Best predictor selection for trace VM2_load15. Predictor class: 1-LAST, 2-AR, 3-SW_AVG.

In the testing phase, the 3-NN classifier was used to forecast the best predictors of the testing data. First, for each set of testing data of the prediction window size, the PCA was applied to reduce the data dimension from m, which was 5 or 16, to n = 2 in this experiment.


Figure 4-6. Best predictor selection for trace VM2_PktIn. Predictor class: 1-LAST, 2-AR, 3-SW_AVG.

Then the Euclidean distances between the test data and all the training data in the reduced feature space were calculated. The three training data which had the shortest distances to the testing data were identified and the majority vote of their associated best predictors was forecasted to be the best predictor of the testing data. At last, the forecasted best predictor was run to predict the future value of the testing data. The MSE of each time series was calculated to measure the performance of the LARPredictor. Tables 4-1, 4-2, 4-3, 4-4, and 4-5 show the prediction performance of the LARPredictor with the current implementation (LAR) and the three prediction models including LAST, AR, and SW_AVG for all resource performance traces of the five virtual machines. Also shown in these tables is the computed MSE for a perfect LARPredictor (P-LAR).


Figure 4-7. Best predictor selection for trace VM2_Swap. Predictor class: 1-LAST, 2-AR, 3-SW_AVG.

The MSE of the P-LAR model shows the upper bound of the prediction accuracy that can be achieved by the LARPredictor. The MSE of the best predictor among LAR, LAST, AR, and SW_AVG is highlighted with italic bold numbers.

Table 4-6 shows the best predictor among LAST, AR, and SW_AVG for all the resource performance metrics and VM traces. The symbol "*" indicates the cases in which the LARPredictor achieved equal or higher prediction accuracy than the best of the three predictors. Overall, the AR model performed better than the LAST and the SW_AVG models.

The above experimental results show:


Figure 4-8. Best predictor selection for trace VM2_Disk. Predictor class: 1-LAST, 2-AR, 3-SW_AVG.

1. It is hard to find a single prediction model among LAST, AR, and SW_AVG that performs best for all types of resource performance data for a given VM trace. For example, for VM1's trace data shown in Table 4-1, each of the three models (LAST, AR, and SW_AVG) outperformed the other two for a subset of the performance metrics. In this experiment, only for the trace data of VM3 did the AR model work best across all metrics.

2. It is hard to find a single prediction model among the three that performs best consistently for a given type of resource across all the VM traces. In the experiment, only the AR model worked best for the CPU performance predictions.

3. The LARPredictor achieved better-than-expert performance using the mix-of-experts approach for 44.23% of the workload traces. It shows the potential for the LARPredictor to consistently outperform any single predictor for variable workloads.


Table 4-1. Normalized prediction MSE statistics for resources of VM1.

Figure 4-9 shows the prediction performance comparisons between the Bayesian-classifier based LARPredictor and the k-NN based LARPredictor for all the resources of VM1. The profile report of the Matlab program execution showed that the k-NN based LARPredictor cost 205.8 seconds of CPU time, with 193.5 seconds in the testing phase and 12.3 seconds in the training phase. It took the Bayesian-classifier based LARPredictor 132.1 seconds of CPU time.


Table 4-2. Normalized prediction MSE statistics for resources of VM2.

The experimental results show that the prediction accuracy in terms of normalized MSE of the Bayesian-classifier based LARPredictor is about 3.8% worse than the k-NN based one. However, it shortened the CPU time of the testing phase by 37.57%.

Figures 4-9, 4-10, 4-11, 4-12, and 4-13 show the prediction accuracy of the perfect LARPredictor that has 100% best-predictor forecasting accuracy (P-LARP), the k-NN and Bayesian based LARPredictors (KnnLARP and BaysLARP), the cumulative-MSE-of-all-history based predictor used in the NWS (Cum.MSE), and the cumulative-MSE based predictor computed over a sliding window of recent measurements.


Table 4-3. Normalized prediction MSE statistics for resources of VM3.

The experimental results show that, without running all the predictors in parallel all the time, for 66.67% of the traces the LARPredictor outperformed the cumulative-MSE based predictor used in the NWS. The perfect LARPredictor shows the potential to achieve 18.6% lower MSE on average than the cumulative-MSE based predictor.

The computational cost of the PCA grows with the data dimension d and the number of data sets W [89]. In the context of resource performance time series prediction, W = 1 and d is the prediction window size. The typically small input data size in this context makes the use of the PCA feasible. There also exist computationally less expensive methods [90] for finding only a few eigenvectors and eigenvalues of a large matrix; in our experiments, we use appropriate Matlab routines to realize these.


Table 4-4. Normalized prediction MSE statistics for resources of VM4.

Table 4-5. Normalized prediction MSE statistics for resources of VM5.


Table 4-6. Best predictors of all the trace data. The predictors shown in the table have the smallest MSE among all the three predictors (LAST, AR, and SW_AVG). The "*" symbol indicates that the LARPredictor outperforms the best predictor in the predictor pool.

The k-NN does not have an off-line learning phase. The "training phase" in k-NN is simply to index the N training data for later use. Therefore, the training complexity of k-NN is O(N) both in time and space. In the testing phase, the k nearest neighbors of a testing datum can be obtained in O(N) time by using a modified version of quicksort [91]. There are also fast algorithms for finding nearest neighbors [92][93].

Three simple time series models were used in this experiment to show the potential of using dynamic predictor selection based on learning to improve prediction accuracy. However, the LARPredictor prototype may be generally used with other, more sophisticated prediction models such as those studied in [78][73][4]. Generally, the more predictors in the pool and the more complex the predictors are, the more beneficial it is to use the LARPredictor, because the classification overhead can be better amortized by running only a single predictor at any given time.


Figure 4-9. Predictor performance comparison (VM1). 1-CPU_usedsec, 2-CPU_ready, 3-Mem_size, 4-Mem_swap, 5-NIC1_rx, 6-NIC1_tx, 7-NIC2_rx, 8-NIC2_tx, 9-VD1_read, 10-VD1_write, 11-VD2_read, 12-VD2_write.

In summary, this chapter presented the learning-aided adaptive resource predictor, in which supervised learning techniques such as the k-NN classifier and the Bayesian classifier are used to forecast the best predictor for the workload based on the learning of historical load characteristics and prediction performance. The principal component analysis technique has been applied to reduce the input data dimension of the classification process. Our experimental results with the traces of the full range of virtual machine resources including CPU, memory, network and disk show that the LARPredictor can effectively identify the best predictor for the workload and achieve prediction accuracies that are close to or even better than any single best predictor.


Figure 4-10. Predictor performance comparison (VM2). 1-CPU_usedsec, 2-CPU_ready, 3-Mem_size, 4-Mem_swap, 5-NIC1_rx, 6-NIC1_tx, 7-NIC2_rx, 8-NIC2_tx, 9-VD1_read, 10-VD1_write, 11-VD2_read, 12-VD2_write.


Figure 4-11. Predictor performance comparison (VM3). 1-CPU_usedsec, 2-CPU_ready, 3-Mem_size, 4-Mem_swap, 5-NIC1_rx, 6-NIC1_tx, 7-NIC2_rx, 8-NIC2_tx, 9-VD1_read, 10-VD1_write, 11-VD2_read, 12-VD2_write.


Figure 4-12. Predictor performance comparison (VM4). 1-CPU_usedsec, 2-CPU_ready, 3-Mem_size, 4-Mem_swap, 5-NIC1_rx, 6-NIC1_tx, 7-NIC2_rx, 8-NIC2_tx, 9-VD1_read, 10-VD1_write, 11-VD2_read, 12-VD2_write.


Figure 4-13. Predictor performance comparison (VM5). 1-CPU_usedsec, 2-CPU_ready, 3-Mem_size, 4-Mem_swap, 5-NIC1_rx, 6-NIC1_tx, 7-NIC2_rx, 8-NIC2_tx, 9-VD1_read, 10-VD1_write, 11-VD2_read, 12-VD2_write.


Profiling the execution phases of applications can help to optimize the utilization of the underlying resources. This chapter presents a novel system-level application-resource-demand phase analysis and prediction approach in support of on-demand resource provisioning. This approach explores the large-scale behavior of applications' resource consumption, followed by analysis using a set of algorithms based on clustering. The phase profile, which is learned from historical runs, is used to classify and predict future phase behavior. This process takes into consideration applications' resource consumption patterns, phase transition costs, and penalties associated with Service-Level Agreement (SLA) violations.

Recent years have seen increasing use of virtualization [94] of the application's execution environment both in academia and industry [11][16][95]. This is motivated by the idea of providing computing resources as a utility and charging the users for a specific usage. For example, in August 2006, Amazon launched the Beta version of its VM-based Elastic Compute Cloud (EC2) web service. EC2 allows users to rent virtual machines with specific configurations from Amazon and can support changes in resource configurations in the order of minutes. In systems that allow users to reserve and reconfigure resource allocations and charge based upon such allocations, users have an incentive to request no more than the amount of resources an application needs. A question which arises here is: how to adapt the resource provisioning to the changing workload?

In this chapter, we focus on modeling and analyzing long-running applications' phase behavior. The modeling is based on monitoring and learning of the applications' historical resource consumption patterns, which likely vary over time. Understanding such behavior is critical to optimizing resource scheduling. To self-optimize the configuration of an application's execution environment, the scheduler needs to anticipate the application's changing resource demands.


In this context, a phase is defined as a set of intervals within an application's execution that have similar system-level resource consumption behavior, regardless of temporal adjacency. This means that a phase may reappear many times as an application executes. Phase classification partitions a set of intervals into phases with similar behavior. In this chapter, we introduce an application resource-demand phase analysis and prediction prototype, which uses a combination of clustering and supervised learning techniques to investigate the following questions:

1) Is there phase behavior in the application's resource consumption patterns? If so, how many phases should be used to provide optimal resource provisioning?

2) Based on the observations of historical phase behaviors, what is the predicted next phase of the application's execution?

3) How do phase transition frequency and prediction accuracy affect resource allocation?

Answers to these questions can be used to decide the time and space allocation of resources.

To make optimization decisions, this prototype takes the application's resource consumption patterns, phase transition costs, and penalties associated with Service Level Agreement (SLA) violations into account. The prediction accuracy is fed back to guide future phase analysis. This prototype does not require any instrumentation of the application source code and can generally work with both physical and virtual machines which can provide monitoring of system-level performance metrics.

Our experimental results with the CPU and network performance traces of SPECseis96 and WorldCup98 access log replay show that:


1. Phase behavior exists in the applications' resource consumption patterns.

2. For applications with phase behavior, typically with a small number of phases, the savings gained from phase-based resource reservation can outweigh the costs associated with the increased number of re-provisionings and the penalties caused by mispredictions.

3. The phase prediction accuracy decreases as the number of phases increases. With the current prototype, an average of above 90% phase prediction accuracy can be achieved for the CPU and network performance features where four phases are considered.

The rest of this chapter is organized as follows: Section 5.2 presents the application phase analysis and prediction model. Sections 5.3 and 5.4 detail the algorithms used for phase analysis and prediction. Section 5.5 presents experimental results. Section 5.6 discusses related work. Section 5.7 draws conclusions and discusses future work.

The prototype, illustrated in Figure 5-1, models how the application VM's performance data are collected and analyzed to construct the corresponding application's phase profile, and how the profile is used to predict its next phase. In addition, it shows how process quality indicators, such as phase prediction accuracy, are monitored and used as feedback signals to tune the system performance (such as application response time) towards the goal defined in the SLA.

A performance monitoring agent is used to collect the performance data of the application VM, which serves as the application container. The monitoring agent can be implemented in various ways. In this work, Ganglia [54], a distributed monitoring system, and the vmkusage tool [85] provided by VMware ESX server are used to monitor the application containers.


Figure 5-1. Application resource demand phase analysis and prediction prototype. The phase analyzer analyzes the performance data collected by the monitoring agent to find the optimal number of phases n ∈ [1, m]. The output phase profile is stored in the application phase database (DB) and is used as training data for the phase predictor. The predictor predicts the next phase of the application resource usage based on the learning of its historical phase behaviors. The predicted phase can be used to support the application resource manager's (ARM's) decisions regarding resource provisioning. The auditor monitors and evaluates the performance of the analyzer and predictor and orders re-training of the phase predictor with the updated workload profile when the performance measurements drop below a predefined threshold.

The collected performance data are stored in the performance database.

The phase analyzer retrieves the time-series VM performance data, which are identified by vmID, FeatureID, and a time window (t_s, t_e), from the performance database. Then it performs phase analysis using algorithms based on clustering to check whether there is phase behavior in the application's resource consumption patterns. If so, it continues to find out how many phases in a numeric range are best in terms of providing the minimal resource reservation costs. The output phase profile, which consists of the phase characterization produced by the clustering algorithms, is described in Section 5.3.


The phase profile is used as training data for the phase predictor. In the presence of phase behavior, the phase predictor can perform on-line prediction of the next phase of the application's resource usage based on the learning of historical phase behaviors, as shown in Section 5.4. The predicted phase information can be used to support the application resource manager's decisions regarding resource re-provisioning requests to the resource scheduler.

The auditor monitors and evaluates the health of the phase analysis and prediction process by performing quality control of each component. Clustering quality can be measured by the similarity and compactness of the clusters using various internal indices introduced in [96]. The phase predictor's performance is measured by its prediction accuracy. The application response time is used as an external signal for total quality control and is checked against the Quality of Service (QoS) defined in the SLA. Local performance tuning is triggered when the auditor observes that the component-level service quality drops below a predefined threshold. For example, when the real-time workload varies to a degree which makes it statistically significantly different from the training workload, the phase prediction accuracy may drop. Upon detection, the auditor can order a phase analysis based on recent workload to update the phase profile and subsequently order a re-training of the phase predictor. If the re-training still cannot improve the total quality of service to a satisfactory level, the resource reservation strategy falls back from the phase-based reservation to a conservative strategy, which reserves the largest amount of resources the user is willing to pay for during the whole application run. Automated and adaptive threshold setting is discussed in detail in [67].
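The auditor's control loop can be outlined as follows (a minimal Python sketch; the threshold values and the names of the corrective actions are hypothetical stand-ins for the analysis, re-training, and fallback steps described above):

    def audit(recent_accuracy, response_time, sla_limit,
              accuracy_threshold=0.8):
        """Return the corrective action suggested by the quality signals."""
        if recent_accuracy < accuracy_threshold:
            # phase profile no longer matches the live workload:
            # re-run phase analysis and re-train the predictor
            return "reanalyze_and_retrain"
        if response_time > sla_limit:
            # quality could not be restored: reserve conservatively
            return "fallback_conservative"
        return "ok"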


At a high level, the problem of clustering is defined as follows: given a set U of n samples u_1, u_2, ..., u_n, we would like to partition U into k subsets U_1, U_2, ..., U_k, such that the samples assigned to each subset are more similar to each other than the samples assigned to different subsets. Here, we assume that two samples are similar if they correspond to the same phase.

A typical clustering activity involves the following steps [97]:

(1) Pattern representation, which is used to obtain an appropriate set of features to use in clustering. It optionally consists of feature extraction and/or selection. Feature selection is the process of identifying the most effective subset of the original features to use in clustering. Feature extraction is the use of one or more transformations of the input features to produce new salient features.

In the context of resource demand phase analysis, the features under study are the system-level resource performance metrics as shown in Table 5-1. For one-dimensional clustering, which is the case in this work, the feature selection is as simple as choosing the performance metric which is instructive to the allocation of the corresponding system resource. For clustering based on multiple performance metrics, feature extraction techniques such as Principal Component Analysis (PCA) may be used to transform the input performance metrics to a lower-dimensional space to reduce the computing intensity of subsequent clustering and improve the clustering quality.

(2) Definition of a pattern proximity measure appropriate to the data domain. The pattern proximity is usually measured by a distance function defined on pairs of patterns. In this work, the most popular metric for continuous features, the Euclidean distance, is used.


(3) Clustering or grouping: The clustering can be performed in a number of ways [97]. The output clustering can be hard (a partition of the data into groups) or fuzzy (where each pattern has a variable degree of membership in each of the output clusters). A hard clustering can be obtained from a fuzzy partition by thresholding the membership values. In this work, one of the most popular iterative clustering methods, the k-means algorithm, as detailed in Section 5.3.3, is used.

The following terms are commonly used in the clustering literature [97]:

- A pattern (or feature vector or observation) is a single data item used by the clustering algorithm. It typically consists of a vector of d measurements.

- The individual scalar components of a pattern are called features (or attributes).

- d is the dimensionality of the pattern or of the pattern space.

- A class refers to a state of nature that governs the pattern generation process in some cases. More concretely, a class can be viewed as a source of patterns whose distribution in feature space is governed by a probability density specific to the class. Clustering techniques attempt to group patterns so that the classes thereby obtained reflect the different pattern generation processes represented in the pattern set.

- A distance measure is a metric on the feature space used to quantify the similarity of patterns.


In the case of clustering in a multi-dimensional space, normalization of the continuous features can be used to remove the tendency of the largest-scaled feature to dominate the others. In addition, the Mahalanobis distance can be used to remove the distortion caused by the linear correlation among features, as discussed in Chapter 3.

The k-means algorithm works as follows [97]:

(1) Choose k cluster centers to coincide with k randomly chosen patterns inside the hypervolume containing the pattern set.

(2) Assign each pattern to the closest cluster center.

(3) Recompute the cluster centers using the current cluster memberships.

(4) If a convergence criterion is not met, go to step 2. Typical convergence criteria are: no (or minimal) reassignment of patterns to new cluster centers, or minimal decrease in squared error.

The algorithm has a time complexity of O(n), where n is the number of patterns, and a space complexity of O(k), where k is the number of clusters. The algorithm is order-independent; for a given initial seed set of cluster centers, it generates the same partition of the data irrespective of the order in which the patterns are presented to it. However, the algorithm is sensitive to the initial seed selection, and even in the best case it can produce only hyperspherical clusters.
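For concreteness, the following is a minimal one-dimensional k-means sketch in Python following steps (1)-(4) above, using the no-reassignment convergence criterion; it is an illustration, not the prototype's implementation.

    import random

    def kmeans(samples, k, max_iters=100, seed=0):
        """Minimal 1-D k-means; returns (centroids, labels), where labels[i]
        is the cluster (phase) index assigned to samples[i]."""
        rng = random.Random(seed)
        centroids = rng.sample(samples, k)       # step 1: random seed patterns
        labels = None
        for _ in range(max_iters):
            # Step 2: assign each pattern to the closest cluster center.
            new_labels = [min(range(k), key=lambda j: abs(x - centroids[j]))
                          for x in samples]
            if new_labels == labels:             # step 4: no reassignment
                break
            labels = new_labels
            # Step 3: recompute the centers from current memberships.
            for j in range(k):
                members = [x for x, l in zip(samples, labels) if l == j]
                if members:                      # keep the old center if emptied
                    centroids[j] = sum(members) / len(members)
        return centroids, labels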


The question of how many clusters are best has been studied experimentally in [96]. The best number of clusters in the context of the phase analysis discussed in this work is the one that gives the minimal total cost. The process to find the optimal number of clusters of the application workload is explained as follows.

Let u_n = u(t0 + nΔt) denote the resource usage sampled at time t = t0 + nΔt during the execution of an application. As shown in Section 5.3.3, when the clustering with input parameter k (i.e., the number of clusters) is performed for a resource usage set U = {u1, u2, ...}, the subset U_i of resource usages that belong to the ith phase can be written as

    U_i = { u_n ∈ U : u_n is assigned to cluster i },  i = 1, ..., k

and the total resource reservation R over the whole execution period can be written as

    R = Σ_{i=1}^{k} max(U_i) · |U_i|

where k is the number of clusters used by the clustering algorithm and the size |U_i| of the subset U_i is defined as its number of elements. Compared to the conservative reservation strategy, which reserves the global maximum amount of resources over the whole execution period, the phase-based reservation strategy can better adapt the resource reservation to the actual resource usage and reduce the resource reservation cost, as shown in Figure 5-2, which illustrates the difference between the two reservation strategies using a hypothetical workload.
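Under the per-phase maximum reservation assumed by the equation for R, the comparison between the two strategies can be sketched as follows (reusing the hypothetical kmeans helper above):

    def reservations(samples, k):
        """Return (phase_based, conservative) total reservations for a trace,
        reserving each phase's maximum observed usage for its duration."""
        _, labels = kmeans(samples, k)
        phase_based = 0.0
        for i in range(k):
            members = [x for x, l in zip(samples, labels) if l == i]
            if members:
                phase_based += max(members) * len(members)   # max(U_i) * |U_i|
        conservative = max(samples) * len(samples)           # global maximum
        return phase_based, conservative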


Figure 5-2. Resource allocation strategy comparison. The phase-based resource allocation strategy can adapt the time (t) and space (s) granularity of the allocation to the actual resource usage. It presents a cost reduction opportunity compared to the coarse-grained conservative strategy.

Taking the resource reservation and the phase transitions into account, the total cost of a k-phase reservation can be written as

    TC(k) = C1 · R(k) + C2 · TR(k)    (5-5)

where C1 and C2 denote the unit costs per resource usage and per transition, and TR(k) denotes the number of phase transitions when k phases are used. The best number of phases, k_best, should minimize the total cost. Therefore, k_best is derived as

    k_best = argmin_k TC(k)    (5-6)


where k is the number of phases. Taking both the phase transition and misprediction costs into account, the general total cost function is modified as

    TC'(k) = R(k) + C · TR(k) + Cp · P(k),  1 ≤ k ≤ K    (5-9)

where C is the transition factor, Cp denotes the discount factor for the misprediction penalty, which is the ratio of C3 to C1, and K is the maximum number of phases. The best number of phases under this general model, k'_best, is the value of k in [1, K] that minimizes TC'(k).
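The cost model and the phase-count selection translate into code roughly as follows; counting TR(k) as consecutive-sample phase changes and P(k) as the number of mispredicted samples are simplifying assumptions, and predict_phase stands in for a hypothetical one-step-ahead phase predictor (reservations and kmeans are the sketches above).

    def total_cost(samples, k, C, Cp, predict_phase):
        """Sketch of TC'(k) = R(k) + C*TR(k) + Cp*P(k) for one trace."""
        _, labels = kmeans(samples, k)
        R, _ = reservations(samples, k)                   # reserved resources
        TR = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
        P = sum(1 for n in range(1, len(samples))         # misprediction count
                if predict_phase(samples[:n]) != labels[n])
        return R + C * TR + Cp * P

    def best_k(samples, K, C, Cp, predict_phase):
        """k'_best: the number of phases minimizing the general total cost."""
        return min(range(1, K + 1),
                   key=lambda k: total_cost(samples, k, C, Cp, predict_phase))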


The application resource demand phase prediction workflow is shown in Figure 5-3. The prediction consists of two stages: a training stage and a testing stage. During the training stage, the number of clusters in the application resource usage, the corresponding cluster centroids, and the unknown parameters of the time-series prediction model of the resource usage are determined. During the testing stage, the one-step-ahead resource usage is predicted and classified as one of the clusters.

Both stages start from pattern representation and framing. In the pattern representation step, the collected performance data of the application VM are profiled to extract only the features which will be used for clustering and future resource provisioning. For example, in the one-dimensional case discussed in this thesis, the training data of a specific performance feature (X_{1×u}, see Table 5-1) are extracted, where u is the total number of input data. Then the extracted performance data X_{1×u} are framed with the prediction window size m to form the data X'_{(u−m+1)×m}.

The training stage mainly consists of two processes: prediction model fitting and phase behavior analysis. The algorithms defined in Sections 5.3.3 and 5.3.4 are used to find the number of phases k which gives the lowest total resource provisioning cost. The output phase profile is used to train the phase predictor. In addition, the unknown parameters of the resource predictor are estimated from the training data. In this thesis,


the resource predictor is based on the linear time-series models studied in [78]. However, this prototype can generally work with any other time-series prediction model. In the case of highly dynamic workloads, the Learning-Aided Resource Predictor (LARPredictor) developed in Chapter 4 can be used. The LARPredictor uses a mix-of-experts approach, which adaptively chooses the best prediction model from a pool of models based on learning of the correlations between the workload and the fitted prediction models of historical runs.

Similar to the training stage, the testing data Y_{1×v} are extracted and framed with the prediction window size m. The framed testing data Y'_{(v−m+1)×m} are used as input to the fitted resource predictor to predict the future resource usage Ŷ'_{1×v}. The phase predictor classifies the predicted resource usages Ŷ'_{1×v} into the phases P̂'_{1×v} based on the phase profile learned in the training stage. Similarly, the phase predictions for the actual resource usage Y_{1×v} are performed to generate P̂_{1×v}. Then the corresponding predicted phases P̂'_{1×v} (which are based on predicted resource usage) and P̂_{1×v} (which are based on actual resource usage) are compared to evaluate the phase prediction accuracy, which is defined as the ratio of the number of matched phase predictions over the total number of phase predictions.
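The framing and accuracy-evaluation steps can be sketched as follows; predictor.predict is a stand-in for whatever fitted one-step-ahead model is used, and the centroids come from the training-stage clustering.

    def frame(series, m):
        """Frame a length-u series into (u-m+1) sliding windows of size m."""
        return [series[i:i + m] for i in range(len(series) - m + 1)]

    def nearest_phase(value, centroids):
        """Classify a (predicted or actual) usage value into the closest phase."""
        return min(range(len(centroids)), key=lambda j: abs(value - centroids[j]))

    def phase_prediction_accuracy(test_series, m, centroids, predictor):
        """Ratio of matched phase predictions over all phase predictions."""
        windows = frame(test_series, m)
        matches, total = 0, 0
        for i, window in enumerate(windows[:-1]):
            predicted_usage = predictor.predict(window)   # one-step-ahead value
            actual_usage = test_series[i + m]             # value after the window
            if nearest_phase(predicted_usage, centroids) == \
               nearest_phase(actual_usage, centroids):
                matches += 1
            total += 1
        return matches / total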


Table 5-1. Performance feature list
  CPU System/User: percent CPU consumed in system/user mode
  Bytes In/Out: number of bytes per second into/out of the network
  IO BI/BO: blocks sent to/received from a block device (blocks/s)
  Swap In/Out: amount of memory swapped in/out from/to disk (kB/s)

The cost model of Section 5.3.4 can be used to find the best number of clusters for an application workload. The Ganglia monitoring daemon was used to collect the performance data of the application container. Table 5-1 shows the list of performance features under study in the experiments.

The application under study, the SPECseis96 benchmark also used in [53], was hosted by a VMware GSX virtual machine. The host server of the virtual machine was an Intel(R) Xeon(TM) dual-CPU 1.80GHz machine with 512KB cache and 1GB RAM. The Ganglia daemon was installed in the guest VM and run to collect the resource performance data once every five seconds (5 secs/interval) and store them in the performance database. During feature representation, the data were extracted based on the given VM ID, feature ID, and starting and ending timestamps to form the time series data under study. Then the subsequent phase analysis was performed for the 8,000 performance snapshots collected during the monitoring periods.

Figure 5-4A shows a sample set of training data of the CPU_user (%) of the VM, including the actual resource usage (Actual Rsc), the resources reserved based on the k-means clustering with k=3 (Rsvd Rsc), and those reserved based on the conservative reservation strategy (Consrv Rsc). Figure 5-4B shows a sample set of the corresponding testing data, including the actual resource usage (Actual Rsc) and the resource reservation based on the actual resource usage (Rsvd Rsc).


Figures 5-4C and 5-4D show that, with an increasing number of phases, two of the determinants in the cost model, the number of phase transitions TR(k) and the misprediction penalty P(k), increase monotonically. The other determinant of the cost model, the amount of reserved resources R(k), is shown by the lowest curve, with index C=0, in Figure 5-4E. It indicates that with an increasing number of phases, the total reserved resources of the training set decrease monotonically. This is because, with an increasing number of phases, the resource allocation can be performed at time scales of finer granularity. However, there is a diminishing return on the increased number of phases because of the increasing phase transition costs and misprediction penalties.

In the first analysis, we assume each resource reservation scheme to be clairvoyant, i.e., it reserves resources based on exact knowledge of future workload requirements. This assumption eliminates the impact of inaccuracies introduced by the phase predictor. In this case, Equation (5-6), which takes the resource reservation cost and the phase transition cost into account while deciding the optimal number of phases, can be applied, as shown in Figure 5-4E. In this figure, the total cost over the whole testing period is measured by CPU usage in percentage. The discount factor C denotes the CPU percentage that each phase transition costs: C = CPU(%) × TransitionDuration. For example, the bottom line of C=0 shows the case of no transition cost, which gives the lower bound of the total cost. As another instance, C=260% implies a 13-second transition period (2.6 intervals × 5 secs/interval) under the assumption of 100% CPU consumption during the transition period. When the discount factor C increases from 0 to 260, the best number of phases k_best, which provides the lowest total cost, decreases gradually from 10 to 2. The phase profile depicted in Figure 5-4E can be used to decide the number of phases that should be used in the phase-based resource reservation to minimize the total cost, given the available transition options. For example, VMware ESX supports dynamic reconfiguration of VM resource allocations, whose transition durations can be used to estimate C.


The impact of inaccuracies introduced by the phase predictor is shown in Figure 5-4F. In addition to the resource reservation costs and the phase transition costs, this experiment also took the phase misprediction penalty costs into account while calculating the total cost. For example, for each unit of down-sized mispredicted resource, a penalty of 8 times (Cp=8) the unit resource cost is imposed. Comparing Figure 5-4E to Figure 5-4F, we can see that adding the penalty into the cost model increases the final costs to the user for the same set of k and C, and potentially reduces the workload's best number of phases k'_best.

Finally, a total cost ratio is defined as the relative total cost using k phases, TC'(k), to the total cost of one phase, TC'(1). Intuitively, this ratio measures the cost savings achieved using the phase-based reservation strategy over the conservative one. Thus, the smaller the value of the ratio, the more efficient a phase-based reservation scheme is. Table 5-2 gives a sample total cost schedule (C=52 and Cp=8) for each of the eight performance features of SPECseis96. It shows that by changing the resource provisioning strategy from the conservative approach (k=1) to the phase-based provisioning (k=3), a 29.5% total cost reduction for CPU usage can be achieved. For spiky trace data such as disk I/O and memory usage, the total cost reduction can be as high as 49%.
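The total cost ratio reported in Table 5-2 can be computed directly from the cost sketch above; values below 1.0 indicate savings over the conservative, single-phase strategy.

    def total_cost_ratio(samples, k, C, Cp, predict_phase):
        """Ratio of TC'(k) to TC'(1); smaller means a more efficient
        phase-based reservation scheme."""
        return (total_cost(samples, k, C, Cp, predict_phase)
                / total_cost(samples, 1, C, Cp, predict_phase))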


Table 5-2. SPECseis96 total cost ratio schedule for the eight performance features

The workload used in this experiment was based on the 1998 World Cup trace [98]. The openly available trace, containing a log of requests to Web servers, was used as input to a client replay tool, which enabled us to exercise a realistic Web-based workload and collect system-level performance metrics using Ganglia in the same manner as was done for the SPECseis96 workload. For this study, we chose to replay the five-hour (from 22:00:01 Jun. 23 to 3:11:20 Jun. 24) log of the least-loaded server (server ID 101), which contained 130,000 web requests.

The phase analysis and prediction techniques can be used to characterize performance data collected not only from virtual machines but also from physical machines. During the experiment, a physical server with sixteen Intel(R) Xeon(TM) MP 3.00GHz CPUs and 32GB memory was used to execute the replay clients, which submitted requests based on the submission intervals, HTTP protocol types (1.0 or 1.1), and document sizes defined in the log file. A physical machine with an Intel(R) Pentium(R) 4 1.70GHz CPU and 512MB memory was used to host the Apache web server and a set of files which were created based on the file sizes described in the log.


Tools provided with the trace [98] were used to convert the binary log into the Common Log Format. A modified version of the Real-Time Web Log Replayer [99] was used to analyze and generate the files needed by the log replayer and to perform the replay.

Figures 5-5 and 5-6 show the phase characterization results of the performance features bytes_in and bytes_out of the web server. The interesting observation from panels A and B of these figures is that the number of phase transitions and the misprediction penalties do not always increase monotonically with the increasing number of phases. As a result, the phase profile shown in panel C argues that three-phase based resource provisioning gives the lowest total cost with the given C=[150k, 750k] and Cp=8. The result implies that the phase profile is highly workload dependent. The prototype presented in this thesis can help to construct and analyze the phase profile of the application resource consumption and decide the proper resource provisioning strategy.

The phase prediction experiments were performed following the workflow introduced in Section 5.4. A performance measurement, prediction accuracy, is defined as the ratio of the number of performance snapshots whose predicted phases match the observed phases to the total number of performance snapshots collected during the testing period.

Table 5-3 shows the phase prediction accuracies for the performance traces of the main resources consumed by the SPECseis96 and the WorldCup98 workloads. Generally, the phase prediction accuracy of each performance feature decreases with an increasing number of phases. This explains why the penalty curve rises monotonically with the increasing number of phases in Figure 5-4D. With the current implementation, an average of 95% accuracy can be achieved for the network performance traces of the WorldCup98 log


replay, and an average of 85% accuracy can be achieved for the CPU performance traces of SPECseis96 for the four-phase cases.

Table 5-3. Average phase prediction accuracy

Table 5-4. Performance feature list of VM traces
  CPU Ready: the percentage of time that the virtual machine was ready but could not get scheduled to run on a physical CPU
  CPU Used: the percentage of physical CPU resources used by a virtual CPU
  Mem Size: current amount of memory in bytes the virtual machine has
  Mem Swap: amount of swap space in bytes used by the virtual machine
  Net RX/TX: the number of packets and the MBytes per second transmitted and received by a NIC
  Disk RD/WR: the number of I/Os and KBytes per second read from and written to the disk

In addition to the above two applications, we also evaluated the prediction performance of the phase predictor using traces of a set of five virtual machines. These virtual machines were hosted by a physical machine with an Intel(R) Xeon(TM) 2.0GHz CPU, 4GB memory, and a 36GB SCSI disk. VMware ESX Server 2.5.2 was running on the physical host. The vmkusage tool was run on the ESX server to collect the resource performance data of the guest virtual machines every minute and store them in a round-robin database. The performance features under study in this experiment are shown in Table 5-4.


Table 5-5 shows the average phase prediction accuracies for each of the 12 performance features over all the five VMs. It shows that with an increasing number of phases, the phase prediction accuracy of each performance feature decreases monotonically. The prediction accuracies vary with the performance features under study. With the current implementation, an average of 83.25% accuracy can be achieved across the phase predictions of all twelve performance features for the two-phase cases.

The current prototype has the following limitations:

1. A clear mapping between resource consumption and response time is assumed for the application container. This might not always be true for all types of applications. More complex performance/queuing models may be needed to provide an accurate mapping in the case of complex applications.

2. A dedicated machine is assumed for the application container to collect the performance data. In case multiple applications co-exist on the same hosting machine, a more sophisticated method of data collection, for example aggregating performance data of the processes that belong to the same application, may be needed.


Table 5-5. Average phase prediction accuracy of the five VMs

3. In this work, one-dimensional phase analysis and prediction is performed. However, the prototype can generally work for multi-dimensional resource provisioning cases as well. For clustering in the multi-dimensional space, additional pattern representation techniques such as Principal Component Analysis (PCA) can be used to project the data to a lower-dimensional space to reduce the computing intensity. In addition, the transition factor C will represent the unit transition cost defined in the pricing schedule of the resource provider.

Developing prediction models for parallel and multi-tier applications is part of our future research.

Phase analysis has been studied in the computer architecture domain mainly for two purposes. First, phase information can be used to guide dynamic power and energy management [100][101]. Second, phase characterization that summarizes application behavior with representative execution regions can be used


to select representative simulation points and reduce simulation time [102][103]. Our purpose in studying the phase behavior is to support dynamic resource provisioning of the application containers.

In addition to the purpose of study, our approach differs from traditional program phase analysis in the following ways:

1) Performance metric under study: in the area of power management and simulation optimization for computer architecture research, the metrics used for workload characterization are typically Basic Block Vectors (BBV) [102][101], conditional branch counters [104], and instruction working sets [105]. In the context of application VM/container resource provisioning, the metrics under study are the system-level performance features, which are instructive to VM resource provisioning, such as those shown in Table 5-1.

2) Knowledge of the program code: while [102][101][104] at least require profiling of program binary codes, our approach requires neither instrumentation nor access to program codes.

3) This thesis answers the question "how many clusters are best" in the context of system-level resource provisioning.

In [106], Dhodapkar et al. compared the three dynamic program phase detection techniques discussed in [102], [104], and [105] using a variety of performance metrics, such as sensitivity, stability, performance variance, and the correlations between phase detection techniques.

In addition, other related work on resource provisioning includes the following. Urgaonkar et al. studied resource provisioning in a multi-tier web environment [107]. Wildstrom et al. developed a method to identify the best CPU and memory configuration from a pool of configurations for a specific workload [108]. Chase et al. proposed a hierarchical architecture that allocates virtual clusters to a group of applications [109]. Kusic et al. developed an optimization framework to decide the number of servers to allocate to


the hosted applications [110]. Tesauro et al. used a combination of reinforcement learning and a queuing model for system performance management [5].


Figure 5-3. Application resource demand phase prediction workflow. In the training stage, the u performance data X_{1×u} of the feature(s) used in the subsequent phase analysis are extracted (pattern representation) and framed with prediction window size m. The unknown parameters of the resource predictor are estimated during model fitting using the framed training data X'_{(u−m+1)×m}. In addition, the clustering algorithms introduced in Section 5.3 are used to construct the application phase profile, including the phase labels I_{1×u} for all the samples and the calculated cluster centroids C_{1×k}. In the testing stage, the phase predictor uses the knowledge learned from the phase profile to predict the future phases P̂'_{1×v} based on the predicted resource usage Ŷ'_{1×v}, and P̂_{1×v} based on the observed actual resource usage Y_{1×v}, and compares them to evaluate the phase prediction accuracy.


Figure 5-4. Phase analysis of SPECseis96 CPU_user (%). A) Sample training data. B) Sample testing data. C) Phase transitions. D) Misprediction penalties. E) Total cost without penalty. F) Total cost with penalty (Cp=8).






Figure 5-5. Phase analysis of WorldCup'98 bytes_in. A) Phase transitions. B) Misprediction penalties. C) Total cost with penalty (Cp=8).


Figure 5-6. Phase analysis of WorldCup'98 bytes_out. A) Phase transitions. B) Misprediction penalties. C) Total cost with penalty (Cp=8).


Self-management has drawn increasing attention in the last few years due to the increasing size and complexity of computing systems. A resource scheduler that can perform self-optimization and self-configuration can help to improve the system throughput and free system administrators from labor-intensive and error-prone tasks. However, it is challenging to equip a resource scheduler with such self-management capacities because of the dynamic nature of system performance and workloads.

In this dissertation, we propose to use machine learning techniques to assist system performance modeling and application workload characterization, which can provide support for on-demand resource scheduling. In addition, virtual machines are used as resource containers to host application executions for the ease of dynamic resource provisioning and load balancing.

The application classification framework presented in Chapter 2 used Principal Component Analysis (PCA) to reduce the dimension of the performance data space. Then the k-Nearest Neighbor (k-NN) algorithm is used to classify the data into different classes such as CPU-intensive, I/O-intensive, memory-intensive, and network-intensive. It does not require modifications of the application source code. Experiments with various benchmark applications suggest that with the application class knowledge, a scheduler can improve the system throughput by 22.11% on average by allocating applications of different classes to share the system resources.

The feature selection prototype presented in Chapter 3 uses a probabilistic model (Bayesian Network) to systematically select the representative performance features, which can provide optimal classification accuracy and adapt to changing workloads. It shows that autonomic feature selection enables classification without requiring expert knowledge in the selection of relevant low-level performance metrics. This approach requires neither application source code modification nor execution intervention.


In addition to the application resource demand modeling, Chapter 4 proposes a learning-based adaptive predictor, which can be used to predict resource availability. It uses the k-NN classifier and PCA to learn the relationship between workload characteristics and the best-suited predictor based on historical predictions, and to forecast the best predictor for the workload under study. Then only the selected best predictor is run to predict the next value of the performance metric, instead of running multiple predictors in parallel to identify the best one. The experimental results show that this learning-aided adaptive resource predictor can often outperform the single best predictor in the pool without a priori knowledge of which model best fits the data.

The application classification and feature selection techniques can be used to define the application resource consumption patterns at any given moment. The experimental results of the application classification suggest that allocating applications which have complementary resource consumption patterns to the same server can improve the system throughput.

In addition to one-step-ahead performance prediction, Chapter 5 studied the large-scale behavior of application resource consumption. Clustering-based algorithms have been explored to provide a mechanism to define and predict the phase behavior of the application resource usage to support on-demand resource allocation. The experimental results show that an average of above 90% phase prediction accuracy can be achieved for the four-phase cases of the benchmark workloads.


[1] J. Kephart and D. Chess, "The vision of autonomic computing," Computer, vol. 36, no. 1, pp. 41-50, 2003.

[2] Y. Yang and H. Casanova, "RUMR: Robust scheduling for divisible workloads," in Proc. 12th High-Performance Distributed Computing, Seattle, WA, June 22-24, 2003, pp. 114-125.

[3] J. M. Schopf and F. Berman, "Stochastic scheduling," in Proc. ACM/IEEE Conference on Supercomputing, Portland, OR, Nov. 14-19, 1999, p. 48.

[4] L. Yang, J. M. Schopf, and I. Foster, "Conservative scheduling: Using predicted variance to improve scheduling decisions in dynamic environments," in Proc. ACM/IEEE Conference on Supercomputing, Nov. 15-21, 2003, p. 31.

[5] G. Tesauro, N. Jong, R. Das, and M. Bennani, "A hybrid reinforcement learning approach to autonomic resource allocation," in Proc. IEEE International Conference on Autonomic Computing (ICAC'06), 2006, pp. 65-73.

[6] G. Tesauro, R. Das, W. Walsh, and J. Kephart, "Utility-function-driven resource allocation in autonomic systems," in Proc. Second International Conference on Autonomic Computing (ICAC'05), 2005, pp. 342-343.

[7] R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, Wiley-Interscience, New York, NY, Apr. 1991.

[8] J. O. Kephart, "Research challenges of autonomic computing," in Proc. 27th International Conference on Software Engineering (ICSE), May 2005, pp. 15-22.

[9] S. M. Weiss and C. A. Kulikowski, Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems, Morgan Kaufmann, San Mateo, CA, 1990.

[10] R. P. Goldberg, "Survey of virtual machine research," IEEE Computer Magazine, vol. 7, no. 6, pp. 34-45, June 1974.

[11] R. Figueiredo, P. Dinda, and J. Fortes, "A case for grid computing on virtual machines," in Proc. 23rd International Conference on Distributed Computing Systems, May 19-22, 2003, pp. 550-559.

[12] S. Pinter, Y. Aridor, S. Shultz, and S. Guenender, "Improving machine virtualization with 'hotplug memory'," in Proc. 17th International Symposium on Computer Architecture and High Performance Computing, 2005, pp. 168-175.

[13] C. Clark, K. Fraser, S. Hand, J. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield, "Live migration of virtual machines," in Proc. 2nd Symposium on Networked Systems Design & Implementation (NSDI'05), Boston, MA, 2005.


[14] "VMotion," http://www.vmware.com/products/vi/vc/vmotion.html.

[15] M. Zhao, J. Zhang, and R. Figueiredo, "Distributed file system support for virtual machines in grid computing," in Proc. 13th International Symposium on High Performance Distributed Computing, 2004, pp. 202-211.

[16] I. Krsul, A. Ganguly, J. Zhang, J. Fortes, and R. Figueiredo, "VMPlants: Providing and managing virtual machine execution environments for grid computing," in Proc. Supercomputing, Washington, DC, Nov. 6-12, 2004.

[17] J. Sugerman, G. Venkitachalam, and B. Lim, "Virtualizing I/O devices on VMware Workstation's hosted virtual machine monitor," in Proc. USENIX Annual Technical Conference, 2001.

[18] J. Dike, "A user-mode port of the Linux kernel," in Proc. 4th Annual Linux Showcase and Conference, USENIX Association, Atlanta, GA, Oct. 2000.

[19] A. Sundararaj and P. Dinda, "Towards virtual networks for virtual machine grid computing," in Proc. 3rd USENIX Virtual Machine Research and Technology Symposium, May 2004.

[20] M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny, "Checkpoint and migration of UNIX processes in the Condor distributed processing system," Tech. Rep. UW-CS-TR-1346, University of Wisconsin-Madison Computer Sciences Department, Apr. 1997.

[21] A. Barak, O. Laden, and Y. Yarom, "The NOW MOSIX and its preemptive process migration scheme," Bulletin of the IEEE Technical Committee on Operating Systems and Application Environments, vol. 7, no. 2, pp. 5-11, 1995.

[22] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, New York, NY, 2001.

[23] C. G. Atkeson, A. W. Moore, and S. Schaal, "Locally weighted learning," Artificial Intelligence Review, vol. 11, no. 1-5, pp. 11-73, 1997.

[24] S. Adabala, V. Chadha, P. Chawla, R. J. O. Figueiredo, J. A. B. Fortes, I. Krsul, A. M. Matsunaga, M. O. Tsugawa, J. Zhang, M. Zhao, L. Zhu, and X. Zhu, "From virtualized resources to virtual computing grids: the In-VIGO system," Future Generation Computer Systems, vol. 21, no. 6, pp. 896-909, 2005.

[25] L. Yu and H. Liu, "Efficient feature selection via analysis of relevance and redundancy," Journal of Machine Learning Research, vol. 5, pp. 1205-1224, Oct. 2004.

[26] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Trans. Inf. Theory, vol. 13, no. 1, pp. 21-27, Jan. 1967.


[27] M. L. Massie, B. N. Chun, and D. E. Culler, "The Ganglia distributed monitoring system: Design, implementation, and experience," Parallel Computing, vol. 30, no. 5-6, pp. 817-840, 2004.

[28] "NetApp," http://www.netapp.com/tech_library/3022.html.

[29] R. Eigenmann and S. Hassanzadeh, "Benchmarking with real industrial applications: the SPEC High-Performance Group," IEEE Computational Science and Engineering, vol. 3, no. 1, pp. 18-23, 1996.

[30] "Ettcp," http://sourceforge.net/projects/ettcp/.

[31] "SimpleScalar," http://www.cs.wisc.edu/mscalar/simplescalar.html.

[32] "CH3D," http://users.coastal.ufl.edu/pete/CH3D/ch3d.html.

[33] "Bonnie," http://www.textuality.com/bonnie/.

[34] Q. Snell, A. Mikler, and J. Gustafson, "NetPIPE: A network protocol independent performance evaluator," June 1996.

[35] "VMD," http://www.ks.uiuc.edu/Research/vmd/.

[36] "SPIM," http://www.cs.wisc.edu/larus/spim.html.

[37] "Reference of STREAM," http://www.cs.virginia.edu/stream/ref.html.

[38] "Autobench," http://www.xenoclast.org/autobench/.

[39] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3, pp. 1157-1182, Mar. 2003.

[40] Y. Liao and V. R. Vemuri, "Using text categorization techniques for intrusion detection," in 11th USENIX Security Symposium, San Francisco, CA, Aug. 5-9, 2002, pp. 51-59.

[41] A. K. Ghosh, A. Schwartzbard, and M. Schatz, "Learning program behavior profiles for intrusion detection," in Proc. Workshop on Intrusion Detection and Network Monitoring, Santa Clara, CA, Apr. 9-12, 1999, pp. 51-62.

[42] M. Almgren and E. Jonsson, "Using active learning in intrusion detection," in Proc. 17th IEEE Computer Security Foundations Workshop, June 28-30, 2004, pp. 88-98.

[43] S. C. Lee and D. V. Heinbuch, "Training a neural-network based intrusion detector to recognize novel attacks," IEEE Transactions on Systems, Man, and Cybernetics, Part A, vol. 31, no. 4, pp. 294-299, 2001.

[44] G. Forman, "An extensive empirical study of feature selection metrics for text classification," J. Mach. Learn. Res., vol. 3, pp. 1289-1305, 2003.


[45] N. H. Kapadia, J. A. B. Fortes, and C. E. Brodley, "Predictive application-performance modeling in a computational grid environment," in Proc. 8th IEEE International Symposium on High Performance Distributed Computing, Redondo Beach, CA, Aug. 3-6, 1999, p. 6.

[46] J. Basney and M. Livny, "Improving goodput by coscheduling CPU and network capacity," Int. J. High Perform. Comput. Appl., vol. 13, no. 3, pp. 220-230, Aug. 1999.

[47] R. Raman, M. Livny, and M. Solomon, "Policy driven heterogeneous resource co-allocation with gangmatching," in Proc. 12th IEEE International Symposium on High Performance Distributed Computing (HPDC'03), Seattle, WA, June 22-24, 2003, p. 80.

[48] S. Sodhi and J. Subhlok, "Skeleton based performance prediction on shared networks," in IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2004), 2004, pp. 723-730.

[49] V. Taylor, X. Wu, and R. Stevens, "Prophesy: an infrastructure for performance analysis and modeling of parallel and grid applications," SIGMETRICS Perform. Eval. Rev., vol. 30, no. 4, pp. 13-18, 2003.

[50] O. Y. Nickolayev, P. C. Roth, and D. A. Reed, "Real-time statistical clustering for event trace reduction," The International Journal of Supercomputer Applications and High Performance Computing, vol. 11, no. 2, pp. 144-159, Summer 1997.

[51] D. H. Ahn and J. S. Vetter, "Scalable analysis techniques for microprocessor performance counter metrics," in Proc. Supercomputing, Baltimore, MD, Nov. 16-22, 2002, pp. 1-16.

[52] I. Cohen, J. S. Chase, M. Goldszmidt, T. Kelly, and J. Symons, "Correlating instrumentation data to system states: A building block for automated diagnosis and control," in 6th USENIX Symposium on Operating Systems Design and Implementation, 2004, pp. 231-244.

[53] J. Zhang and R. Figueiredo, "Application classification through monitoring and learning of resource consumption patterns," in Proc. 20th International Parallel & Distributed Processing Symposium, Rhodes Island, Greece, Apr. 25-29, 2006.

[54] M. Massie, B. Chun, and D. Culler, The Ganglia Distributed Monitoring System: Design, Implementation, and Experience, Addison-Wesley, Reading, MA, 2003.

[55] S. Agarwala, C. Poellabauer, J. Kong, K. Schwan, and M. Wolf, "Resource-aware stream management with the customizable dproc distributed monitoring mechanisms," in Proc. 12th IEEE International Symposium on High Performance Distributed Computing, June 22-24, 2003, pp. 250-259.

[56] "HP," http://www.managementsoftware.hp.com.


[57] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering," IEEE Trans. Knowl. Data Eng., vol. 17, no. 4, pp. 491-502, Apr. 2005.

[58] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers, San Francisco, CA, 1988.

[59] T. Dean, K. Basye, R. Chekaluk, S. Hyun, M. Lejter, and M. Randazza, "Coping with uncertainty in a control system for navigation and exploration," in Proc. 8th National Conference on Artificial Intelligence, Boston, MA, July 29-Aug. 3, 1990, pp. 1010-1015.

[60] D. Heckerman, "Probabilistic similarity networks," Tech. Rep., Depts. of Computer Science and Medicine, Stanford University, 1990.

[61] D. J. Spiegelhalter, R. C. Franklin, and K. Bull, "Assessment criticism and improvement of imprecise subjective probabilities for a medical expert system," in Proc. Fifth Workshop on Uncertainty in Artificial Intelligence, 1989, pp. 335-342.

[62] E. Charniak and D. McDermott, Introduction to Artificial Intelligence, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1985.

[63] T. S. Levitt, J. Mullin, and T. O. Binford, "Model-based influence diagrams for machine vision," in Proc. 5th Workshop on Uncertainty in Artificial Intelligence, 1989, pp. 233-244.

[64] R. E. Neapolitan, Probabilistic Reasoning in Expert Systems: Theory and Algorithms, John Wiley & Sons, Inc., New York, NY, USA, 1990.

[65] K. Weinberger, J. Blitzer, and L. Saul, "Distance metric learning for large margin nearest neighbor classification," in Proc. 19th Annual Conference on Neural Information Processing Systems, Vancouver, Canada, Dec. 2005.

[66] R. Kohavi and F. Provost, "Glossary of terms," Machine Learning, vol. 30, pp. 271-274, 1998.

[67] B. Ziebart, D. Roth, R. Campbell, and A. Dey, "Automated and adaptive threshold setting: Enabling technology for autonomy and self-management," in Proc. 2nd International Conference on Autonomic Computing, June 13-16, 2005, pp. 204-215.

[68] P. Mitra, C. Murthy, and S. Pal, "Unsupervised feature selection using feature similarity," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, pp. 301-312, Mar. 2002.

[69] W. Lee, S. J. Stolfo, and K. W. Mok, "Adaptive intrusion detection: A data mining approach," Artificial Intelligence Review, vol. 14, no. 6, pp. 533-567, 2000.

[70] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen, "Performance debugging for distributed systems of black boxes," in Proc. 19th ACM Symposium on Operating Systems Principles, 2003.


[71] R. Isaacs and P. Barham, "Performance analysis in loosely-coupled distributed systems," in Proc. 7th CaberNet Radicals Workshop, Bertinoro, Italy, Oct. 2002.

[72] I. Foster, "The anatomy of the grid: enabling scalable virtual organizations," in Proc. 1st IEEE/ACM International Symposium on Cluster Computing and the Grid, 2001, pp. 6-7.

[73] R. Wolski, "Dynamically forecasting network performance using the Network Weather Service," Journal of Cluster Computing, 1998.

[74] I. Matsuba, H. Suyari, S. Weon, and D. Sato, "Practical chaos time series analysis with financial applications," in Proc. 5th International Conference on Signal Processing, Beijing, 2000, vol. 1, pp. 265-271.

[75] P. Magni and R. Bellazzi, "A stochastic model to assess the variability of blood glucose time series in diabetic patients self-monitoring," IEEE Trans. Biomed. Eng., vol. 53, no. 6, pp. 977-985, 2006.

[76] K. Didan and A. Huete, "Analysis of the global vegetation dynamic metrics using MODIS vegetation index and land cover products," in IEEE International Geoscience and Remote Sensing Symposium (IGARSS'04), 2004, vol. 3, pp. 2058-2061.

[77] P. Dinda, "The statistical properties of host load," Scientific Programming, no. 7:3-4, 1999.

[78] P. Dinda, "Host load prediction using linear models," Cluster Computing, vol. 3, no. 4, 2000.

[79] Y. Zhang, W. Sun, and Y. Inoguchi, "CPU load predictions on the computational grid," in Proc. 6th IEEE International Symposium on Cluster Computing and the Grid, May 2006, vol. 1, pp. 321-326.

[80] J. Liang, K. Nahrstedt, and Y. Zhou, "Adaptive multi-resource prediction in distributed resource sharing environment," in Proc. IEEE International Symposium on Cluster Computing and the Grid, 2004, pp. 293-300.

[81] S. Vazhkudai and J. Schopf, "Predicting sporadic grid data transfers," in Proc. International Symposium on High Performance Distributed Computing, 2002, pp. 188-196.

[82] S. Vazhkudai, J. Schopf, and I. Foster, "Using disk throughput data in predictions of end-to-end grid data transfers," in Proc. 3rd International Workshop on Grid Computing, Nov. 2002.


[83] S. Gunter and H. Bunke, "An evaluation of ensemble methods in handwritten word recognition based on feature selection," in Proc. 17th International Conference on Pattern Recognition, Aug. 2004, vol. 1, pp. 388-392.

[84] G. Jain, A. Ginwala, and Y. Aslandogan, "An approach to text classification using dimensionality reduction and combination of classifiers," in Proc. IEEE International Conference on Information Reuse and Integration, Nov. 2004, pp. 564-569.

[85] VMware white paper, "Comparing the MUI, VirtualCenter, and vmkusage."

[86] J. D. Cryer, Time Series Analysis, Duxbury Press, Boston, MA, 1986.

[87] J. O. Rawlings, S. G. Pantula, and D. A. Dickey, Applied Regression Analysis, Springer, 2001.

[88] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer, 2001.

[89] E. Bingham and H. Mannila, "Random projection in dimensionality reduction: applications to image and text data," in Knowledge Discovery and Data Mining, 2001, pp. 245-250.

[90] L. Sirovich and R. Everson, "Management and analysis of large scientific datasets," Int. Journal of Supercomputer Applications, vol. 6, no. 1, pp. 50-68, 1992.

[91] Y. Yang, J. Zhang, and B. Kisiel, "A scalability analysis of classifiers in text categorization," in ACM SIGIR'03, 2003, pp. 96-103.

[92] J. H. Friedman, F. Baskett, and L. Shustek, "An algorithm for finding nearest neighbors," IEEE Transactions on Computers, vol. C-24, no. 10, pp. 1000-1006, Oct. 1975.

[93] J. H. Friedman, J. L. Bentley, and R. Finkel, "An algorithm for finding best matches in logarithmic expected time," ACM Transactions on Mathematical Software, vol. 3, pp. 209-226, 1977.

[94] G. Banga, P. Druschel, and J. Mogul, "Resource containers: A new facility for resource management in server systems," in Proc. 3rd Symposium on Operating System Design and Implementation, New Orleans, LA, Feb. 1999.

[95] L. Ramakrishnan, L. Grit, A. Iamnitchi, D. Irwin, A. Yumerefendi, and J. Chase, "Towards a doctrine of containment: Grid hosting with adaptive resource control," in Proc. Supercomputing, Tampa, FL, Nov. 2006.

[96] R. Dubes, "How many clusters are best? - An experiment," Pattern Recognition, vol. 20, no. 6, pp. 645-663, Nov. 1987.

[97] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.


[98] "WorldCup98," http://ita.ee.lbl.gov/html/contrib/WorldCup.html.

[99] "LogReplayer," http://www.cs.virginia.edu/rz5b/software/logreplayer-manual.htm.

[100] C. Isci, A. Buyuktosunoglu, and M. Martonosi, "Long-term workload phases: duration predictions and applications to DVFS," IEEE Micro, vol. 25, no. 5, pp. 39-51, 2005.

[101] C. Isci and M. Martonosi, "Phase characterization for power: evaluating control-flow-based and event-counter-based techniques," in Proc. 12th International Symposium on High-Performance Computer Architecture, 2006, pp. 121-132.

[102] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically characterizing large scale program behavior," in Proc. 10th International Conference on Architectural Support for Programming Languages and Operating Systems, 2002, pp. 45-57.

[103] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi, "Pinpointing representative portions of large Intel Itanium programs with dynamic instrumentation," in Proc. 37th Annual International Symposium on Microarchitecture, 2004.

[104] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas, "Memory hierarchy reconfiguration for energy and performance in general purpose architectures," in Proc. 33rd Annual International Symposium on Microarchitecture, Dec. 2000, pp. 245-257.

[105] A. Dhodapkar and J. Smith, "Managing multi-configuration hardware via dynamic working set analysis," in Proc. 29th Annual International Symposium on Computer Architecture, Anchorage, AK, May 2002, pp. 233-244.

[106] A. Dhodapkar and J. Smith, "Comparing program phase detection techniques," in Proc. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003, pp. 217-227.

[107] B. Urgaonkar, P. Shenoy, A. Chandra, and P. Goyal, "Dynamic provisioning of multi-tier internet applications," in Proc. 2nd International Conference on Autonomic Computing, June 2005, pp. 217-228.

[108] J. Wildstrom, P. Stone, E. Witchel, R. J. Mooney, and M. Dahlin, "Towards self-configuring hardware for distributed computer systems," in Proc. 2nd International Conference on Autonomic Computing, June 2005, pp. 241-249.

[109] J. S. Chase, D. E. Irwin, L. E. Grit, J. D. Moore, and S. E. Sprenkle, "Dynamic virtual clusters in a grid site manager," in Proc. 12th IEEE International Symposium on High Performance Distributed Computing, June 2003, pp. 90-100.


[110] D. Kusic and N. Kandasamy, "Risk-aware limited lookahead control for dynamic resource provisioning in enterprise computing systems," in Proc. 3rd International Conference on Autonomic Computing, 2006, pp. 74-83.


Jian Zhang was born in Chengdu, China. She received her B.S. degree in 1995 from the University of Electronic Science and Technology of China, majoring in computer communication. She received her M.S. degree in 2001 from the University of Florida, majoring in electrical and computer engineering. Since 2002, she has been with the Advanced Computing and Information Systems Laboratory (ACIS) at the University of Florida, pursuing her Ph.D. degree. Her research interests include distributed systems, autonomic computing, virtualization technologies, and information systems.