University of Florida | Journal of Undergraduate Research | Volume 15, Issue 3 | Spring 2015

Performance Prediction for Cloud Systems using Hadoop

Rafael Moas*
Faculty Advisors: Dr. Tuba Yavuz** and Dr. Jose Fortes**
*Computer and Information Science Engineering Department, University of Florida
**Electrical and Computer Engineering Department, University of Florida
Magna Cum Laude, May 2015

In recent years, cloud computing has become an answer to industry demands for high-powered computation and algorithm-intensive workloads. Trends toward big data, cluster computation, and batch processing have given rise to more cloud computing platforms, and to more effort from the open source community to predict job performance through static program analysis. Currently, there is no way of predicting how long a cloud-based MapReduce job will take to execute. Our novel approach to performance prediction involves a machine learning algorithm that configures Hadoop jobs to run with optimized parameter configurations. This algorithm (iTree) leverages combinatorics, code coverage information, and performance metrics to achieve its goal. Our goal, then, is twofold: to predict the performance of cloud-based jobs and to find parameter configurations that yield the shortest makespans.

1. INTRODUCTION

Performance Prediction for Cloud Systems using Hadoop is part of a larger project: the Configuration and Resource Utilization Dependence Analyzer. Our high-level goal is to create a component-based resource utilization model that can help a cloud provider predict how well code will perform without having to run it. In this paper, we propose a machine learning approach to performance prediction of cloud-based jobs and resource utilization.

Hadoop

Our performance prediction case study uses Apache Hadoop as a platform for running jobs.
The Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers. We used a 4-node Hadoop cluster for our job testing. We chose Hadoop because it is strongly backed by an open source community and is highly configurable. It is compatible with many add-ons, including a cloud monitoring platform. Hadoop provides a reliable shared storage system, the Hadoop Distributed File System (HDFS), and an analysis system, MapReduce.

Machine Learning

For our research, we have utilized existing machine learning algorithms (iTree, J48) to create our own. For clarity, we define machine learning as any technique that automatically builds models describing the structure at the heart of a set of data. Ideally, these models can be used to predict properties of future data points, and people can use them to analyze the domain from which the data originates. Although machine learning algorithms improve their accuracy as they iterate, no learning algorithm can achieve an error rate of zero.

What We Propose

Our proposal is to design and implement a process for performance prediction using our own data structure, adapted from the University of Maryland Computer Science department's iTree. The tree grows as the machine learning algorithm iterates, with each growth cycle exploring new software configurability. The final iteration yields a set of test cases with optimized results for a desired goal.
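As background for the MapReduce jobs discussed above, the model Hadoop implements can be illustrated with a single-process sketch. This is not the Hadoop API (real jobs subclass Mapper and Reducer and read input splits from HDFS); it is a word-count analogy for the map, shuffle, and reduce phases:

```java
import java.util.*;
import java.util.stream.*;

// A simplified, in-memory model of the MapReduce flow Hadoop automates:
// map each input record to (key, value) pairs, shuffle the pairs by key,
// then reduce each key's values. Word count is the illustrative job here;
// Hadoop's distributed execution and HDFS I/O are deliberately omitted.
public class MapReduceSketch {

    // Map phase: one input line -> (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle + reduce phases: group pairs by key, then sum each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
        Map<String, Integer> reduced = new TreeMap<>();
        shuffled.forEach((k, vs) ->
                reduced.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return reduced;
    }

    public static void main(String[] args) {
        // prints {hadoop=2, jobs=1, runs=1, scales=1}
        System.out.println(run(List.of("hadoop runs jobs", "hadoop scales")));
    }
}
```

Hadoop parallelizes exactly these phases across the cluster; the parameter configurations our algorithm searches over control how that parallel execution is tuned.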
The goal-based algorithm determines success based on code coverage, performance metrics, or a combination of the two. The figure below shows the four high-level steps taken to achieve our goal of a desired parameter configuration. Although the second step is integral to the algorithm, it will not be detailed in this paper. Clustering is achieved using the Weka project.
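The clustering step can be sketched as follows. This is a hypothetical stand-in for Weka's clusterer: a greedy grouping of configurations by positional overlap, with an arbitrary similarity threshold. It shows only the idea of collapsing similar test cases so that one representative per cluster need be run:

```java
import java.util.*;

// Illustrative stand-in for the clustering step: group parameter
// configurations whose settings mostly agree. The real pipeline uses
// Weka; this greedy overlap grouping is an assumption for exposition.
public class ConfigClusterer {

    // Fraction of positions at which two equal-length configurations agree.
    static double similarity(String[] a, String[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) if (a[i].equals(b[i])) same++;
        return (double) same / a.length;
    }

    // Greedy clustering: each configuration joins the first cluster whose
    // representative (first member) is at least `threshold` similar,
    // otherwise it starts a new cluster.
    static List<List<String[]>> cluster(List<String[]> configs, double threshold) {
        List<List<String[]>> clusters = new ArrayList<>();
        for (String[] c : configs) {
            List<String[]> home = null;
            for (List<String[]> cl : clusters)
                if (similarity(cl.get(0), c) >= threshold) { home = cl; break; }
            if (home == null) { home = new ArrayList<>(); clusters.add(home); }
            home.add(c);
        }
        return clusters;
    }
}
```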
Figure 1. Process overview of the performance prediction algorithm

2. COMBINATORIAL INTERACTION TESTING

Our initial approach to instrumenting a machine learning algorithm focused on finding high-coverage configurations of parameters using static program analysis. A method called Combinatorial Interaction Testing (CIT) was used to generate test cases of varying parameter configurations. Specifically, an open source, C-based tool named Covering Arrays by Simulated Annealing (CASA) was used to generate arrays of configuration parameters. Covering arrays are relatively small in size and provide good coverage. We used this tool by incorporating it into a shell script which outputted groups of configuration files formatted with Hadoop properties in XML.

Figure 2. Preliminary Hadoop Configuration File Generator using CIT

We achieved this by providing relevant parameters and a strength value as inputs. In our implementation, each line in the covering array resulted in a new Hadoop configuration file. Using CIT alone to determine configurations of parameters created too many test cases and an excess of data. To refine our results, we added machine learning functionality to the method and clustered the data from the original set.

3. ORIGINAL ITREE

The Interaction Tree Discovery Algorithm (iTree) is an iterative learning algorithm that efficiently searches for a small set of configurations that closely approximates the active configuration space. iTree added machine learning functionality to existing methods used by CASA. Each iteration of iTree tests the system on samples of selected configurations, monitors the effects of those configurations, then applies machine learning to identify which combinations of settings are potentially responsible for the observed effects.
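The covering-array step from Section 2 can be sketched in miniature. This toy generator stands in for CASA (which uses simulated annealing and scales far better) and for our shell script's XML output: it brute-forces a 2-way (pairwise) covering array over a tiny parameter model, then formats a row as Hadoop-style XML properties. The parameter model and greedy selection here are illustrative assumptions only:

```java
import java.util.*;

// Toy 2-way covering-array generator. `levels[i]` is the number of values
// parameter i can take. Feasible only for tiny models; CASA replaces the
// full-product enumeration with simulated annealing in practice.
public class CoveringArraySketch {

    // Enumerate every full-product row (each row assigns a value index per parameter).
    static List<int[]> product(int[] levels) {
        List<int[]> rows = new ArrayList<>();
        rows.add(new int[levels.length]);
        for (int p = 0; p < levels.length; p++) {
            List<int[]> next = new ArrayList<>();
            for (int[] r : rows)
                for (int v = 0; v < levels[p]; v++) {
                    int[] copy = r.clone(); copy[p] = v; next.add(copy);
                }
            rows = next;
        }
        return rows;
    }

    // All (parameter=value, parameter=value) pairs a single row covers.
    static List<String> pairsOf(int[] r) {
        List<String> ps = new ArrayList<>();
        for (int i = 0; i < r.length; i++)
            for (int j = i + 1; j < r.length; j++)
                ps.add(i + "=" + r[i] + "," + j + "=" + r[j]);
        return ps;
    }

    // Greedy selection: repeatedly keep the row covering the most uncovered pairs.
    static List<int[]> pairwise(int[] levels) {
        Set<String> uncovered = new HashSet<>();
        for (int i = 0; i < levels.length; i++)
            for (int j = i + 1; j < levels.length; j++)
                for (int a = 0; a < levels[i]; a++)
                    for (int b = 0; b < levels[j]; b++)
                        uncovered.add(i + "=" + a + "," + j + "=" + b);
        List<int[]> all = product(levels), chosen = new ArrayList<>();
        while (!uncovered.isEmpty()) {
            int[] best = null; int bestGain = -1;
            for (int[] r : all) {
                int gain = 0;
                for (String p : pairsOf(r)) if (uncovered.contains(p)) gain++;
                if (gain > bestGain) { bestGain = gain; best = r; }
            }
            chosen.add(best);
            uncovered.removeAll(pairsOf(best));
        }
        return chosen;
    }

    // Format one covering-array row as Hadoop-style <property> XML entries.
    static String toXml(String[] names, String[][] values, int[] row) {
        StringBuilder sb = new StringBuilder("<configuration>\n");
        for (int i = 0; i < names.length; i++)
            sb.append("  <property><name>").append(names[i])
              .append("</name><value>").append(values[i][row[i]])
              .append("</value></property>\n");
        return sb.append("</configuration>\n").toString();
    }
}
```

For three binary parameters, the full product has 8 rows, while the pairwise covering array needs only 4 or 5 — the size reduction that makes CIT attractive, and that iTree then narrows further.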
While CIT yields a covering array of all n-way configuration combinations, iTree identifies smaller configuration sets to test. In other words, iTree narrows down our set of outputs by using CIT and iterative machine learning. The algorithm begins with an interaction tree containing just one node, true. Each node represents the proto-interaction that is the conjunction of settings along the path from the root node. At the beginning of each iteration, a heuristic is used to pick a leaf node to explore next. This heuristic chooses based on expected code coverage, and is important because exploring all nodes of the tree would be too expensive.

Figure 3. The interaction tree for a sample program

In implementation, each node is given a priority, which is used to find the best leaf node. iTree computes a slightly biased average coverage for all configurations consistent with the node's proto-interaction. This is the node priority. The algorithm then runs in a loop, iterating until developer-supplied stopping criteria are met. For the original iTree, an achieved percentage of code coverage was the stopping criterion. If the algorithm iterated for too long, a time-based stopping criterion would terminate the program.

Figure 4. Pseudo code for the interaction tree discovery algorithm
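The structure just described can be sketched in Java. This is a minimal sketch, assuming a plain average for node priority (the original iTree uses a slightly biased average) and a per-setting coverage table supplied by the caller; the expansion rule and field names are illustrative, not the published algorithm:

```java
import java.util.*;

// Skeletal sketch of the interaction tree: each node's proto-interaction
// is the conjunction of settings fixed on the path from the root, and
// node priority approximates the coverage of consistent configurations.
public class ITreeSketch {

    static class Node {
        final Map<String, String> protoInteraction; // settings fixed so far
        final List<Node> children = new ArrayList<>();
        double priority;                            // avg coverage of consistent samples
        Node(Map<String, String> pi) { protoInteraction = pi; }
        boolean isLeaf() { return children.isEmpty(); }
    }

    // Heuristic: explore the leaf whose sampled configurations covered the most code.
    static Node bestLeaf(Node root) {
        if (root.isLeaf()) return root;
        Node best = null;
        for (Node c : root.children) {
            Node leaf = bestLeaf(c);
            if (best == null || leaf.priority > best.priority) best = leaf;
        }
        return best;
    }

    // One growth cycle: fix one more setting under the chosen leaf, taking
    // observed coverage per setting from a caller-supplied table (a
    // simplification of running and profiling sampled configurations).
    static void expand(Node leaf, String param, List<String> values,
                       Map<String, Double> observedCoverage) {
        for (String v : values) {
            Map<String, String> pi = new HashMap<>(leaf.protoInteraction);
            pi.put(param, v);
            Node child = new Node(pi);
            child.priority = observedCoverage.getOrDefault(param + "=" + v, 0.0);
            leaf.children.add(child);
        }
    }
}
```

The outer loop alternates bestLeaf and expand until the developer-supplied stopping criteria (coverage achieved, or time expired) are met.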
We built off the original iTree (made by the University of Maryland Computer Science department) to make a Java version for ourselves. It aimed strictly to achieve maximum code coverage for a given set of configuration parameters. Our iTree is serializable: it is written to and read from hard memory at the beginning and end of each iteration.

4. INSTRUMENTED ITREE

Our next step involves instrumenting a goal-based iTree. While the ultimate goal is still performance prediction, the means changed. While originally we set out to discover high-coverage code configurations, we now aim for resource-efficient code configurations. The original iTree will be enhanced by adding the consideration of execution metrics data. The reason for this change in direction is the realization that there is no obvious correlation between code coverage and job makespan. We decided to keep the original implementation of iTree as an option, and even to create a hybrid algorithm which considers both resource metrics and code coverage as heuristic factors.

5. DISCOVERING METRICS

Instrumenting a new iTree that considers performance metrics requires a metrics collection tool. For this, we used Hadoop's metrics-gathering framework. Hadoop uses a source and sink configuration to transfer metrics data around the cluster, using two basic methods, getMetrics() and putMetrics(). The framework is designed to collect and dispatch per-process metrics to monitor the overall status of the Hadoop system. Producers register the metrics sources with the metrics system, while consumers register the sinks based on configuration options. The design is depicted below.

Figure 6. Hadoop metrics collection system

Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids.
It integrates well with Hadoop and offers multiple options for metrics collection. A conventional Ganglia configuration runs the Ganglia Monitoring Daemon (Gmond) on each node in the cluster to collect statistics and the Ganglia Meta Daemon (Gmetad) on one master node to collect data from the monitoring daemons and store it on the file system in a round-robin file format. The last Ganglia component is the front end, a PHP graph visualization that produces real-time feedback.

Figure 7. A conventional Ganglia configuration

For our purposes, we are not interested as much in the graph visualization as we are in the metrics files. We created a custom script to convert the files to XML format and extract the metrics, provided by default in 15-second intervals. We tested the monitoring system in an isolated environment and ran sample MapReduce programs. There are a multitude of resource metrics available for use. The next two sections detail what those are.

Generic Metrics

The generic metrics which come with Ganglia are categorized as Load, Memory, CPU, and Network. We are most interested in the Memory and CPU categories. The memory metrics recorded at intervals are memory buffers, cached memory, free memory, shared memory, and free swap space. The CPU metrics available include CPU aidle (the percent of CPU time spent idle since boot), CPU idle (when the processor is not being used by any program), CPU wio (wait I/O time, which can typically indicate a bottleneck), CPU nice, CPU user, and CPU system.

Hadoop-Specific Metrics

With further customization, Ganglia can track Hadoop-specific metrics. These include JobTracker, NameNode, DataNode, and TaskTracker metrics; each component has a daemon that exposes runtime metrics. JobTracker metrics include files created and deleted, and block operations. NameNode metrics include blocks read from, written to, and removed.
DataNode metrics include the number of jobs running, maps waiting, and reduces waiting. Finally, TaskTracker metrics show the number of tasks completed, maps running, and reduces running. For the sake of simplicity, we have postponed the addition of Hadoop-specific metrics to our iTree instrumentation.

6. IMPLEMENTATION

The addition of metrics collection gave us a refreshed approach to the iTree implementation. The improved algorithm is goal based. Depending on the goal, the inner workings of iTree will change; most importantly, the node priority calculation will change. We propose three iTree modes, which are modular enough to look the same at surface level, but function differently under the hood. The idea is to standardize how iTree appears for ease of use. Our three suggested modes are:

1. Coverage Maximizer
2. Metrics
   a. Resource Specific
   b. All Resources
3. Coverage and Metrics

The chart below shows how the three different modes would grow different iTrees and explore different clusters of test cases based on their goal.

Figure 8. Various modes for software configurations

The Coverage Maximizer mode works the same way as the iTree from Section 3. The Metrics mode incorporates the previously mentioned metrics, collected by Ganglia during algorithm iteration, to determine node priorities. As an example, a resource-specific metrics run may single out CPU-based resources. It may also single out a combination of CPU-based resources and memory-based resources. As another example, an all-resource metrics run will incorporate every metric available in developing each node priority. We intend to weight the metrics values based on user-determined importance; this way, more important metrics have greater influence on which nodes are explored.
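The goal-based node priority behind these modes can be sketched as follows. This is a minimal sketch assuming a simple linear blend of coverage and a weighted metric average; the mode names, score normalization, and weighting scheme are illustrative assumptions about the planned design, not the finished implementation:

```java
import java.util.*;

// Sketch of goal-based node priority. Metric scores are assumed to be
// normalized resource-efficiency values in [0,1] (higher = better), and
// metricWeights carries the user-determined importance of each metric.
public class GoalPriority {

    enum Mode { COVERAGE_MAXIMIZER, METRICS, COVERAGE_AND_METRICS }

    static double priority(Mode mode, double coverage,
                           Map<String, Double> metricScores,
                           Map<String, Double> metricWeights,
                           double coverageWeight) {
        // Weighted average of the metric scores (weight defaults to 1.0).
        double weighted = 0, total = 0;
        for (Map.Entry<String, Double> e : metricScores.entrySet()) {
            double w = metricWeights.getOrDefault(e.getKey(), 1.0);
            weighted += w * e.getValue();
            total += w;
        }
        double metricAvg = total == 0 ? 0 : weighted / total;
        switch (mode) {
            case COVERAGE_MAXIMIZER:
                return coverage;                       // Section 3 behavior
            case METRICS:
                return metricAvg;                      // resource efficiency only
            default:                                   // COVERAGE_AND_METRICS
                return coverageWeight * coverage + (1 - coverageWeight) * metricAvg;
        }
    }
}
```

With cpu_idle weighted 3x over mem_free, for example, a node scoring 0.8 on cpu_idle and 0.4 on mem_free gets a metric priority of 0.7 rather than the unweighted 0.6, steering exploration toward the metric the user cares about.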
The third mode, Coverage + Metrics, calculates a coverage priority and a metric priority, then weighs them to determine a total node priority. We predict that this hybrid mode will be most useful in identifying promising program configurations.

7. CONCLUSION

We have presented in detail a scalable model for performance prediction to support the testing of highly configurable systems. Our goal is to select a small test suite that will achieve a short makespan and efficient resource utilization. It is imperative that extensive testing be done after the completion of this machine learning algorithm, with ad hoc jobs designed to strain different steps in a MapReduce job. As our work continues, more light is shed on performance prediction techniques and accurate resource estimation. We believe that the completion of this project will make a large ripple in the field of cloud computing and the way big data jobs are processed.

8. ACKNOWLEDGMENTS

This research was supported by the University of Florida Electrical & Computer Engineering Department.

9. REFERENCES

Song, C., Porter, A., & Foster, J. (2011). iTree: Efficiently discovering high-coverage configurations using interaction trees. IEEE Transactions on Software Engineering, 251-265. Retrieved April 17, 2015, from http://www.cs.umd.edu/~csfalcon/iTree.pdf

Ganglia Monitoring System. (2014, December 12). Retrieved January 14, 2015, from http://ganglia.sourceforge.net/

What is Hadoop Metrics2? (2013, February 4). Retrieved March 1, 2015, from http://blog.cloudera.com/blog/2012/10/what-is-hadoop-metrics2/

Apache Hadoop. Retrieved November 26, 2014, from http://hadoop.apache.org/

CLOPE. (2012, July 1). Retrieved December 2, 2014, from http://weka.sourceforge.net/doc.packages/CLOPE/weka/clusterers/CLOPE.html