Improving Utilization and Availability of High-Performance Computing in Space


IMPROVING UTILIZATION AND AVAILABILITY OF HIGH-PERFORMANCE COMPUTING IN SPACE

By

RAJAGOPAL SUBRAMANIYAN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2006

Copyright 2006 by Rajagopal Subramaniyan

To my dad Mr. Subramanian and my mom Mrs. Vijayalakshmi

ACKNOWLEDGMENTS

I thank Dr. Alan George for having confidence in me and for allowing me to pursue this research. I also thank him for his support during this work. I thank Scott Studham at Oak Ridge National Laboratory for laying the foundation of this work and for providing great support and guidance. I express my gratitude to my parents, who have been with me during the ups and downs of my life. I thank them for their moral, emotional and financial support throughout my career. I would be nowhere without them. I thank my friends Giridhar and Yamini for being friendly, encouraging and very supportive in helping me to graduate. I also thank my friends Maakans, Anitha, Arun, Anand, Thanni, PD, and Kandy for making life at UF memorable. I thank my colleagues at the HCS Lab for their technical support and peer reviews. I especially thank Casey, Adam, Eric and Ian for being friendly and more than just lab mates, making my term at the lab a memorable one. Finally, I thank all the sponsors of this work. This work was supported by NSF, ORNL, and NASA/JPL.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 OPTIMIZATION OF CHECKPOINTING-RELATED INPUT/OUTPUT (PHASE I)
  2.1 Background on Technology Growth Trends
    2.1.1 CPU Processing Power
    2.1.2 Disk Capacity
    2.1.3 Disk Performance Growth
    2.1.4 Sustained Bandwidth to/from Disk
    2.1.5 I/O Performance
  2.2 Optimal Usage of Input/Output
    2.2.1 Checkpointing Alternatives
    2.2.2 Optimizing Checkpointing Frequency
  2.3 Analytical Modeling of Optimum Checkpointing Frequency
  2.4 Simulative Verification of Checkpointing Model
    2.4.1 System Model
    2.4.2 Node Model
    2.4.3 Multiple Nodes Model
    2.4.4 Network Model
    2.4.5 Central Memory Model
    2.4.6 Fault Generator Model
    2.4.7 Simulation Process
    2.4.8 Verification of Analytical Model
  2.5 Optimal Checkpointing Frequencies in Typical Systems
    2.5.1 Traditional Ground-based High-Performance Computing (HPC) Systems
    2.5.2 Space-based HPC Systems
  2.7 Conclusions and Future Research

3 EFFECTIVE TASK SCHEDULING (PHASE II)
  3.1 Introduction
  3.2 Background
    3.2.1 Representative Reconfigurable Computing (RC) Systems
    3.2.2 Representative Applications

  3.3 Task Scheduling
    3.3.1 Scope of the Scheduler
    3.3.2 Performance Model
    3.3.3 Scheduling Heuristics
  3.4 Simulative Analysis
    3.4.1 Simulation Setup
    3.4.2 Simulation Results
    3.4.3 Simulation Results on Small-scale Systems
  3.5 Conclusions and Future Research

4 A FAULT-TOLERANT MESSAGE PASSING INTERFACE (PHASE III)
  4.1 Introduction
  4.2 Background and Related Research
    4.2.1 Limitations of Message Passing Interface (MPI) Standard
    4.2.2 Fault-tolerant MPI Implementations for Traditional Clusters
  4.3 Design of Fault-tolerant Embedded MPI (FEMPI)
    4.3.1 FEMPI Architecture
    4.3.2 Point-to-Point Messaging (Unicast Communication)
    4.3.3 Collective Communication
    4.3.4 Failure Recovery
    4.3.5 Covering All MPI Function Call Categories
  4.4 Performance Analysis
    4.4.1 Experimental Setup
    4.4.2 Results and Analysis
      4.4.2.1 Point-to-point communication
      4.4.2.2 Collective communication
  4.5 Performance Analysis of Failure Recovery
    4.5.1 Non-fault-tolerant MPI variants
    4.5.2 Failure Recovery Timing in FEMPI
    4.5.3 Failure Recovery Timing: Comparative Analysis
  4.6 Parallel Application Experiments and Results
    4.6.1 LU Decomposition
    4.6.2 Failure-free Performance
    4.6.3 Failure Recovery Performance
  4.7 Conclusions and Future Research

5 CONCLUSIONS

LIST OF REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Disk performance growth over time
2-2 Growth rate of computing technologies
3-1 Characteristics of applications
3-2 Characteristics of RC systems
4-1 Baseline MPI functions for FEMPI in this phase of research
4-2 Barrier synchronization using FEMPI on prototype testbed

LIST OF FIGURES

1-1 Typical high-performance computing system
2-1 Supercomputer performance over time
2-2 Hard drive capacity over time
2-3 Performance of a single disk with respect to the amount of data stored
2-4 Growth of disk drive bandwidth over time
2-5 Number of disks required in future systems to maintain present performance levels
2-6 Job execution without checkpointing
2-7 Job execution with checkpointing
2-8 Execution modeled as a Markov process
2-9 Impact of I/O bandwidth on the system execution time and checkpointing overhead. A) System with TMTBF = 8 hours. B) System with TMTBF = 1 day
2-10 Impact of increasing memory capacity in systems on sustainable I/O bandwidth requirement. A) System with TMTBF = 8 hours. B) System with TMTBF = 1 day
2-11 System model
2-12 Node model
2-13 Multiple nodes model
2-14 Network model
2-15 Central memory model
2-16 Fault generator model
2-17 Optimum checkpointing interval based on simulations of systems with 5 TB memory. A) System with I/O bandwidth of 5 GB/s. B) System with I/O bandwidth of 50 GB/s
2-18 Numerical method solution for the analytical model
2-19 Optimum checkpointing interval based on simulations of systems with 75 TB memory. A) System with I/O bandwidth of 5 GB/s. B) System with I/O bandwidth of 50 GB/s

2-20 Optimum checkpointing interval for various systems. A) System with 5 TB memory. B) System with 30 TB memory. C) System with 75 TB memory
2-21 Optimum checkpointing interval for various systems in space. A) System with 512 MB memory. B) System with 2 GB memory. C) System with 8 GB memory
3-1 Task dependence using directed acyclic graphs for sample jobs
3-2 Scheduling of SPPM on various RC systems
3-3 Scheduling of UMT on various RC systems
3-4 Scheduling of heterogeneous tasks on various RC systems
3-5 Scheduling of various homogeneous tasks on a system with heterogeneous RC machines
3-6 Scheduling of heterogeneous tasks on a system with heterogeneous RC machines
3-7 Batch scheduling of various heterogeneous tasks on a system with homogeneous RC machines
3-8 Batch scheduling of heterogeneous tasks on a system with heterogeneous RC machines
3-9 Scheduling of homogeneous tasks on homogeneous systems
3-10 Scheduling of heterogeneous tasks on various RC systems
3-11 Scheduling of homogeneous tasks on heterogeneous systems
4-1 MPICH-V architecture (Courtesy: [50])
4-2 Starfish architecture (Courtesy: [51])
4-3 Egida architecture (Courtesy: [53])
4-4 Architecture of FEMPI
4-5 System configuration of the prototype testbed
4-6 Performance of point-to-point communication on a traditional cluster
4-7 Performance of point-to-point communication on prototype testbed
4-8 Performance of broadcast communication on a traditional cluster
4-9 Performance of barrier synchronization on a traditional cluster

4-10 Performance of gather and scatter communication on a traditional cluster
4-11 Performance of gather and scatter communication on prototype testbed
4-12 Data decomposition in parallel LUD
4-13 Parallel LUD algorithm
4-14 Failure-free execution time of parallel LUD application kernel with increasing system size
4-15 Recovery time from a failure with increasing system size for applications with small datasets
4-16 Recovery time from a failure with increasing system size for applications with large datasets

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

IMPROVING UTILIZATION AND AVAILABILITY OF HIGH-PERFORMANCE COMPUTING IN SPACE

By

Rajagopal Subramaniyan

December 2006

Chair: Alan D. George
Major Department: Electrical and Computer Engineering

Space missions involving science and defense ventures have ever-increasing demands for data returns from their resources in space. The traditional approach of data gathering, data compression and data transmission is no longer viable due to the vast amounts of data. Over the past few decades, there have been several research efforts to make high-performance computing (HPC) systems available in space. The idea has been to have enough on-board processing power to support the many space and earth exploration and experimentation satellites orbiting earth and/or exploring the solar system. Such efforts have led to small-scale supercomputers embedded in the spacecraft and, more recently, to the idea of using commercial-off-the-shelf (COTS) components to provide HPC in space. Susceptibility of COTS components to Single-Event Upsets (SEUs) is a concern, especially since space systems need to be self-healing and robust to survive the hostile environment. Fault-tolerant system functions need to be developed to manage the resources available and improve the availability of the HPC system in space. However, resources available to provide fault tolerance are fewer than in traditional HPC systems on earth. Several techniques exist in traditional HPC to provide fault tolerance and improve overall computation rate, but adapting these techniques for HPC in space is a challenge due to the

resource constraints. In this dissertation, this challenge is addressed by providing solutions to improve and complement HPC in space. Three techniques are introduced and investigated in three different phases of this dissertation to improve the effective utilization and availability of HPC in space. In the first phase, a new model to perform checkpointing at an optimal rate is developed to improve useful computation time. The results suggest the requirement of I/O capabilities much superior to those of present systems. While the performance of several common HPC scheduling heuristics that can be used for effective task scheduling to improve overall execution time is simulatively analyzed in the second phase, availability is improved by designing a new lightweight fault-tolerant message passing middleware in the third phase. Analyses of applications developed with the fault-tolerant middleware show that robustness of the systems in space can be significantly improved without degrading the performance. In summary, this dissertation provides novel methodologies to improve utilization and availability in space-based high-performance computing, thereby providing better and effective fault tolerance.

CHAPTER 1
INTRODUCTION

Space research has advanced leaps and bounds in the past few decades, with numerous satellites, probes and space shuttles launched to explore near-earth space and other astronomical bodies, including earth itself. More and more of our universe is being explored and, with permanent space stations already in orbit around earth, space exploration is continuing to grow. There has been a significant increase in the capability of instruments deployed in space. With such high-tech instruments in place, space missions involving science and defense ventures have ever-increasing demands for data returns from their corresponding resources in space. The traditional implementation approach of data gathering, data compression, and data transmission is no longer viable. The amount of data being generated is becoming too large to be transmitted via available downlink channels in reasonable time. An industry-proposed solution to reduce the demand on the downlink is to move processing onto the spacecraft [1]. The idea has been to have enough on-board processing power to support the many space and earth exploration and experimentation satellites orbiting earth and/or exploring the solar system. However, this approach is mired by the limited capabilities of today's on-board processors and the prohibitive cost of developing radiation-hardened high-performance electronics.

Microelectronics designed for environments with high levels of ionizing radiation, such as space, have special design challenges. A single charged particle of radiation can knock thousands of electrons loose, causing electronic noise, signal spikes, and, in the case of digital circuits, plainly incorrect results. The problem is particularly serious in the design of artificial satellites, spacecraft, military aircraft, nuclear power stations, and nuclear weapons. In order to ensure the proper operation of such systems, manufacturers of integrated circuits and sensors intended for the aerospace markets employ various methods of radiation hardening. The resulting systems are

said to be rad(iation)-hardened or rad-hard. Most rad-hard chips are based on their more mundane commercial equivalents, with manufacturing and design variations that reduce the susceptibility to radiation and electrical and magnetic interference. Typically, the hardened variants lag behind the cutting-edge commercial products by several technology generations due to the extensive development and testing required to produce a radiation-tolerant design [2].

As mentioned earlier, the approach of moving processing onto the spacecraft is hindered by the radiation effects influencing the technological capabilities and cost of high-performance electronics. This situation has encouraged researchers and industry to consider the use of commercial-off-the-shelf (COTS) components for on-board processing. Furthermore, the recent adoption of silicon-on-insulator (SOI) technology by COTS integrated foundries is resulting in devices with moderate space radiation tolerance [1]. Such an approach would make high-performance computing (HPC) possible in space. However, in spite of this progress, COTS components continue to be highly susceptible to Single-Event Upsets (SEUs) that are consequences of radiation. An SEU, as defined by the National Aeronautics and Space Administration (NASA), is a radiation-induced error in microelectronic circuits caused when charged particles (usually from the radiation belts or from cosmic rays) lose energy by ionizing the medium through which they pass, leaving behind a wake of electron-hole pairs [3]. A high rate of SEUs leads to many component failures, making HPC in space a significant challenge and certainly more complex than traditional HPC on the ground. Technology must be developed that capitalizes on the strengths of COTS devices to realize HPC in space while overcoming their susceptibility to SEUs without negating their benefits.

In general, computations are lost when there are failures and the full potential of the system is lost. Failures decrease the availability of the system. Availability gives a measure of the

readiness of usage of a system (e.g., a highly available system has a low probability of being down at any instant of time). Improving availability requires the system to be built with fault-tolerant features. But these features might take additional resources, reducing the effective utilization of the system. Effective utilization gives a measure of the time the system is used for actual computation. The impact of failures is worse in space environments compared to traditional HPC environments, as the computational resources are limited. The chances of failure are high in harsh environments, and the liberty in terms of resources available to provide fault tolerance is much less. HPC in space is not the same as traditional HPC on the ground due to several constraints, including harsh environments (e.g., high radiation, high rate of SEUs, high rate of component failures), limited resources attributed primarily to weight and volume constraints (e.g., processing speed, storage capacity, power availability), and cost and timing constraints. Additionally, space computing also requires automated processing and repair to recover from faults if and when they occur in the system.

The traditional approach to SEU or transient error mitigation in soft, radiation-tolerant hardware is redundant self-checking and comparison, either in hardware or in software. In some applications, size, weight, and power constraints of the mission may preclude the use of redundant hardware, and the time constraints of the mission may preclude the use of redundant computation in software. In such cases, an alternative approach must be found which can provide adequate SEU protection with a single-string or single-execution implementation. For example, Samson et al. compare the overhead of traditional redundant self-checking for SEU mitigation with an application-specific fault tolerance method [4]. It is reported that self-checking configurations consume twice the power, twice the weight, twice the size, and twice the cost, and provide roughly one half the reliability, of the single-board solution.

Several techniques exist for traditional HPC to provide fault tolerance and improve overall computation rate. Some of the solutions include periodic backup of data (e.g., checkpointing), exposing computation to fewer faults by reducing the overall execution time of applications (e.g., effective task scheduling) and reducing the impact of failure (e.g., fault-tolerant middleware). However, adapting these solutions for HPC in space is a challenge due to the resource constraints discussed earlier. In this dissertation, we address this challenge by investigating, developing and evaluating techniques to improve and complement HPC in space.

Figure 1-1. Typical high-performance computing system.

Figure 1-1 shows the various software agents involved in a typical HPC system either on ground or in space. The system has three types of nodes: a Controller Node, several Computing Nodes, and one or more Storage Nodes. The Controller Node is in charge of accepting tasks for

execution and scheduling the tasks on idle resources (i.e., Computing Nodes) and is generally radiation hardened in space systems. The Computing Nodes perform the actual execution of the application, while the Storage Node is responsible for data storage and backup. The data could be input, output or system states (required for restarts on failures). The application processes executing on the Computing Nodes communicate via the Message-passing Middleware. In this dissertation, we address fault tolerance by focusing on the shaded areas in Figure 1-1, namely checkpointing, scheduling and message passing middleware.

Three techniques are discussed in three different phases of this research with the overall goal of improving the effective utilization and availability of HPC in space. In Phase 1, the useful computation time of the system is improved (increased) by optimizing checkpointing-related defensive I/O. We model the optimum checkpointing frequency for applications in terms of the mean-time-between-failures (MTBF) of the system, the amount of memory checkpointed, and the sustainable I/O bandwidth of the system. Optimal checkpointing maximizes useful computation time without jeopardizing the applications due to failures, while minimizing the usage of resources. In Phase 2, the overall execution time of the application is improved (reduced) by simulatively analyzing scheduling heuristics for application task scheduling. We analyze techniques to effectively schedule tasks on parallel hardware reconfigurable resources that would be part of the HPC space system. Effective task scheduling reduces the overall execution time, thereby exposing the application to fewer faults while reducing the resource usage as well. In Phase 3, availability of the system is improved by designing a new lightweight fault-tolerant message-passing middleware. The fault-tolerant middleware reduces the impact of failures on the system. The recovery time of the system is improved, allowing unimpeded execution of

applications as much as possible. The techniques that we propose in this research are also applicable to traditional HPC on the ground but are critical for HPC in space.

This dissertation contains a discussion of the modeling of the checkpointing process and optimization of defensive I/O. The dissertation also describes the simulative analysis of dynamic scheduling heuristics for effective task scheduling, followed by the design and evaluation of a fault-tolerant message passing middleware. Finally, conclusions and directions for future research are provided.

CHAPTER 2
OPTIMIZATION OF CHECKPOINTING-RELATED I/O (PHASE I)

In general, computation of large-scale scientific applications can be divided into three phases: start-up, computation, and close-down, with I/O existing in all phases. A typical I/O pattern has the start-up phase dominated by reads, the close-down phase by writes, and the computation phase by both reads and writes. The primary questions of importance with relevance to I/O involved in the three phases are when, how much and how often? These questions can be addressed by segregating I/O into two portions: productive I/O and defensive I/O [5]. Productive I/O is the writing of data that the user needs for actual science, such as visualization dumps, traces of key scientific variables over time, etc. Defensive I/O is employed to manage a large application executed over a period of time much larger than the platform's MTBF. Defensive I/O is only used for restarting a job in the event of application failure, in order to retain the state of the computation and hence the progress since the last application failure. Thus, one would like to minimize the amount of resources devoted to defensive I/O and the computation lost due to platform failure. As the time spent on defensive I/O (backup mechanisms for fault tolerance) is reduced, the time spent on useful computations will increase. This philosophy applies to high-performance distributed computing in the majority of environments ranging from supercomputing platforms to small embedded cluster systems, although the impact varies depending on the system and other resource constraints.

The impact of productive I/O on I/O bandwidth requirements can be reduced by better storage techniques and, to some extent, through improved programming techniques. It has been observed that defensive I/O dominates productive I/O in large applications, with about 75% of the overall I/O being used for checkpointing, storing restart files, and other such similar techniques for failure recovery [5]. Hence, by optimizing the rate of defensive I/O, we can reduce the

overall I/O bandwidth requirement. Another advantage is that the optimizations used to control defensive I/O would be more generic and not specific to applications and platforms. However, reducing defensive I/O is a significant challenge.

Checkpointing of a system's memory to mitigate the impact of failures is a primary driver of the sustainable bandwidth to high-performance filesystems. Checkpointing refers to the process of saving program/application state, usually to stable storage (i.e., taking a snapshot of a running application for later use). Checkpointing forms the crux of rollback recovery, and hence of fault tolerance, debugging, data replication and process migration for high-performance applications. The amount of time an application will tolerate suspending calculations to perform a checkpoint is directly related to the failure rate of the system. Hence, the rate of checkpointing (how often) is primarily driven by the failure rate of the system. If the checkpointing rate is low, fewer resources are consumed but the chance of high computational loss (both time and data) is increased. If the checkpointing rate is high, resource consumption is greater but the chance of computational loss is reduced. It is important to strike a balance, and an optimum rate of checkpointing is required. Finding a balance is a difficult problem even in traditional ground-based HPC with fewer failures and more resources. The problem is aggravated for HPC with embedded cluster systems in harsh environments such as space, with more failures and fewer resources.

In this phase of the dissertation, we analytically model the process of checkpointing in terms of the MTBF of the system, the amount of memory checkpointed, the sustainable I/O bandwidth and the frequency of checkpointing. We identify the optimum frequency of checkpointing to be used on systems with given specifications, thereby making way for efficient use of available resources and gaining maximum performance from the system without compromising on the fault tolerance

aspects. The useful computation time is increased, thereby improving the effective system utilization. Further, we develop discrete-event models simulating the checkpointing process to verify the analytical model for optimum checkpointing. The simulation models are developed using the Mission-Level Designer (MLD) simulation tool from MLDesign Technologies Inc. [6]. In the simulation models, we use performance numbers that represent systems ranging from small cluster systems to large supercomputers.

The remainder of this chapter is organized as follows. Section 2.1 provides background on the growth trends of technologies to study the effectiveness of this research to improve I/O usage. Section 2.2 briefly highlights why checkpointing is the most common methodology of providing fault tolerance. In Section 2.3, the checkpointing process is analytically modeled to identify the optimum frequency of checkpointing to be used on systems with given specifications. Section 2.4 describes the simulation models that we develop to verify our analytical models, while the results derived from the analytical model are provided in Section 2.5. Section 2.6 provides conclusions for this chapter and directions for future research.

2.1 Background on Technology Growth Trends

In this section, we study the growth trend of technologies involved in HPC, highlighting the poor growth of sustainable I/O bandwidth. The poor growth of I/O bandwidth compared to other technologies substantiates our approach to reduce resource consumption and improve useful computation time by optimizing checkpointing-related defensive I/O. It is important to mention that although we have used performance numbers from traditional ground-based HPC and supercomputing platforms (due to lack of any standard representative platforms for HPC in space) to study the growth trends, the numbers and trends are representative of what would be coming for HPC in space. The performance of all the technologies, however, is relatively poorer for HPC in space due to radiation hardening.

Gordon Moore observed an exponential growth in the number of transistors per integrated circuit and predicted this trend would continue [7]. He made this famous observation in 1965, just four years after the first planar integrated circuit was discovered. This doubling of transistors every eighteen months, referred to as Moore's Law, has been maintained and still holds true. Similar to other exponentially growing systems, Moore's law can be expressed as a mathematical equation with a derived compound annual growth rate. Let x be a quantity growing exponentially, in this case the number of transistors per integrated circuit, with respect to time t as measured in years. Then

x(t) = x0 e^(kt)

where k is the compound annual growth rate, and the rate of change follows dx/dt = kx. For Moore's law, the compound annual growth rate can be established as kmoore = 0.46: when t = 1.5 and x = 2 x0, k = ln(2)/1.5 = 0.46. However, this compound annual growth rate is not the same for the other technologies involved in HPC. We briefly overview the growth of several technologies, including CPU processing power, disk capacity, disk performance, and sustained bandwidth to/from disks, in the remainder of this section.

2.1.1 CPU Processing Power

Modern HPC systems are made by tightly coupling multiple integrated circuits. In addition to being able to capitalize on the exponential growth observed in Moore's Law, these systems have also been able to increase the average number of processors in systems to achieve a peak performance that exceeds Moore's law. Figure 2-1 shows the LINPACK [8, 9] rating for the tenth most powerful computer in the world for several years as ranked by the Top500 [10] list. The compound annual growth rate can be established as kHPC = 0.58 (values of k for the technologies are calculated as shown for Moore's Law). We picked the tenth computer for no

special reason except that, in our opinion, the computers at the very top might have been custom tuned and hence may not provide the general trend. The benchmark used to establish the Top500 list highlights the combined performance of the processors used to build the supercomputer. It is important to realize that other technologies such as disk performance, memory performance, and networking performance are not fully represented in the Top500 benchmark [10]. Growing the relative performance of these other technologies is equally important to the CPU processing power when looking to establish a well-balanced system.

Figure 2-1. Supercomputer performance over time

2.1.2 Disk Capacity

Figure 2-2 shows the capacity of a 95mm 7,200RPM disk drive over time. Areal density of hard drives has grown at an impressive compound annual growth rate of kIO_cap = 0.62, and has accelerated to greater than a kIO_cap = 0.75 rate since 1999, as shown in the figure. We are now seeing the delivery of 120 GB/inch2 for magnetic disk technology, with demonstrations over 240 GB/inch2 routinely occurring in laboratories [11]. The first 1 TB (1,000 GB) hard disk drives are expected to ship in 2007 [11]. Heinze et al. [12] demonstrate the theoretical limit of physical media at approximately 256 TB/inch2 using a single atomic layer to form a two-dimensional

antiferromagnet. In the next five to ten years, perpendicular recording using patterned media will likely appear, further increasing recording densities [11]. In addition to continued evolutionary advancement, there are new disruptive storage technologies nearly ready to enter the marketplace, like Magnetic Random Access Memory (MRAM) [13] and Micro-Electro-Mechanical Systems (MEMS) [14]. Thus we see a good growth in hard drive capacities.

Figure 2-2. Hard drive capacity over time

2.1.3 Disk Performance Growth

Disk performance has not kept pace with the growth in disk capacities. More densely packed data means fewer disk actuators for a given amount of storage. While the compound annual growth rate of the areal density of magnetic disk recording has increased at an average of over 60 percent annually, the maximum number of random I/Os per second that a drive can deliver is improving at an annual compounded growth rate of less than kIO_perf_io = 0.20. Continual increases in capacity without corresponding performance improvements at the drive level create a performance imbalance that is defined by the ratio called access density. Access density is the ratio of performance, measured in I/Os per second, to the capacity of the drive, usually measured in gigabytes (access density = I/Os per second per gigabyte).

As seen in Table 2-1, access density has steadily declined while the capacity has increased substantially. Access density is becoming a significant factor in managing storage subsystem

performance, and the tradeoffs of using higher-capacity disks must be carefully evaluated, as lowering the cost-per-megabyte most often means lowering the performance. Future I/O-intensive applications will require higher access densities than are indicated by the current development roadmaps. Higher access densities may be achieved through lowering the capacity per actuator or dramatically increasing the I/Os-per-second capabilities of the drive. The latter is much harder to accomplish, particularly for random access applications where a seek (disk arm movement) is required.

Vendors are making small high-performance drives, such as a 15,000 RPM 73.4 GB drive from Seagate with an advertised seek time of 3.6 ms, leading to an access density of about 3.8 I/Os per second per GB. As of April 2006 these high-performance drives cost $3.3 per GB [15], a factor of 11 higher than the $0.36 per GB for low-cost commodity drives [16]. The factor of 9 increase in performance is offset by the factor of 9 increase in cost, leaving most architects to select commodity drives for the additional capacity.

Table 2-1. Disk performance growth over time
Year  Drive           Seek Time (ms)  I/Os per Sec  Capacity (GB)  I/Os per Sec per GB
1964  2314            112.5           8.9           0.029          306.50
1975  3350            36.7            27.2          0.317          86.00
1987  3380K           24.6            40.7          1.890          21.50
1996  3390-3          23.2            43.1          8.520          5.10
1998  Cheetah-18      18.0            55.6          18.200         3.10
2001  WD1000          8.9             112.4         100.000        1.10
2002  180GXP          8.5             117.6         180.000        0.70
2004  Deskstar 7K400  8.5             117.6         400.000        0.30
2005  Deskstar 7K500  8.5             117.6         500.000        0.24

2.1.4 Sustained Bandwidth to/from Disk

Sustained bandwidth from a disk relative to the capacity of the disk has also continued to decline. The sustained bandwidth of a disk is dependent upon the physical location of the data on the disk. Due to a fixed rotational speed, the closer to the center of the disk platter, the slower

the sustained read rate. Figure 2-3 shows the sustained transfer rate in MB/s for the 15,000 RPM, 37 GB disk drive from Seagate [17]. The first 5 gigabytes of data are transferred at a rate of 76 MB/s; meanwhile the final 3 GB of data only sustained 49 MB/s. Although vendors provide the peak performance number in general, the average and minimum sustained performance can be significantly lower. It is important to properly lay out the data on the hard drive to achieve consistent performance.

Figure 2-3. Performance of a single disk with respect to the amount of data stored

Figure 2-4 highlights the minimum, average and maximum performance for typical disk drives introduced between 1995 and 2004. The average sustainable bandwidth from a hard disk drive has grown at an annual compounded growth rate of kIO_perf_BW = 0.26 per year.

Figure 2-4. Growth of disk drive bandwidth over time
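The compound-annual-growth-rate framing used throughout this section lends itself to quick back-of-the-envelope comparisons. The short Python sketch below is illustrative only: it simply evaluates x(t) = x0 e^(kt) with the rate constants quoted above to convert a rate k into its equivalent doubling time and to compare how far CPU performance and random-I/O performance drift apart over a decade.

import math

def growth_factor(k, years):
    # x(t)/x0 = e^(k*t) for a compound annual growth rate k
    return math.exp(k * years)

def doubling_time(k):
    # years needed for a quantity growing at rate k to double
    return math.log(2) / k

K_MOORE, K_HPC, K_IO_PERF_IO, K_IO_PERF_BW = 0.46, 0.58, 0.20, 0.26

print(doubling_time(K_MOORE))          # ~1.5 years, i.e. Moore's law
print(doubling_time(K_IO_PERF_BW))     # ~2.7 years for sustained disk bandwidth
print(growth_factor(K_HPC, 10))        # ~330x supercomputer performance over a decade
print(growth_factor(K_IO_PERF_IO, 10)) # ~7.4x random-I/O rate over the same decade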

2.1.5 I/O Performance

The technologies pertinent to HPC discussed in this section thus far are all growing exponentially, although some are growing substantially slower than others. Table 2-2 summarizes the rate of change for these technologies.

Table 2-2. Growth rate of computing technologies
Technology                                     Symbol        Growth rate
Transistors per integrated circuit             kmoore        0.46
LINPACK on Top10 supercomputer                 kHPC          0.58
Capacity of hard drives                        kIO_cap       0.62
Cost per GB of storage                         kIO_cost      -0.83
Performance of hard drive in I/Os per second   kIO_perf_io   0.20
Performance of hard drive in bandwidth         kIO_perf_BW   0.26

2.2 Optimal Usage of I/O

The performance of disk drives, measured in both I/Os per second and sustained bandwidth, is not keeping up with other technology trends. Assuming current systems meet I/O performance needs, and for the balance of bandwidth per computational power to be maintained, we can use the formula

d(t) = e^((kHPC - kIO_perf) t)

to calculate the number of additional disk drives that will be needed in order to maintain that balance. In order to show the importance of efficient usage of I/O resources, in Figure 2-5 we show how many disks will be required to work in perfect parallel to maintain system balance in the coming years. Given these growth trends, we will need to use 46 times more disks in ten years in order to maintain the same balance of I/Os per second on a classical supercomputer. For example, in 2004 a typical 10 TeraFLOP supercomputer may be capable of 500,000 theoretical I/Os per second, or 50 I/Os per GigaFLOP. Systems of this size normally have disk drives on the order of around 5,000. Given historical growth rates, a comparable supercomputer would have 3.4 PetaFLOPs in 2014 and need to be capable of 170,000,000 theoretical I/Os per second to

keep the same balance. However, as disk performance is not growing at the same exponential rate as the computer performance, it would be necessary to have 46 times more disk drives, necessitating disk drives on the order of around 230,000. The gap might further widen in the years to follow, necessitating a fundamental change to the way we approach storage.

Figure 2-5. Number of disks required in future systems to maintain present performance levels

Based on the growth trend of the different technologies, it can be seen that I/O performance has fallen behind other technologies, and the criticality of efficient usage of I/O resources can be realized. In the harsh space environment that is prone to failures, and where the technologies are slower than their counterparts on the ground, optimal usage of I/O resources is even more critical. As mentioned earlier, checkpointing is a critical process that drives the need to improve the I/O bandwidth, with frequent and at times lengthy accesses to the disk. Hence, optimizing the rate of checkpointing will optimize the usage of I/O resources. An alternative would be improving the checkpointing process itself with new techniques that consume fewer resources.

2.2.1 Checkpointing Alternatives

Several strategies are used as alternatives to traditional disk-based checkpointing. Many researchers are working on diskless checkpointing (checkpoints are encoded and stored in a

distributed system instead of a central stable storage), for example [18], and suggest this strategy as an alternative to conventional disk-based checkpointing to reduce the I/O traffic due to checkpointing. The National Nuclear Security Administration (NNSA) issued a news release about the first full-system three-dimensional simulations of a nuclear weapon explosion (Crestone project), a significant achievement for Los Alamos National Laboratory and Science Applications International Corporation [19]. The simulation is essentially a 24-hour, seven-day-a-week job for more than seven months. For a highly important seven-month computation such as Crestone, the notion of checkpointing and a rolling computation go hand-in-hand. The small success stories of using diskless checkpointing fail in such cases of large applications. True disk-based incarnations are required over heavy I/O phases of the code. Moreover, diskless checkpointing is often application-dependent and not generic.

Other checkpointing alternatives include incremental checkpointing (i.e., checkpointing only the information that has changed since the previous checkpoint) and distributed checkpointing (i.e., individual nodes in the system checkpoint asynchronously, thereby reducing the simultaneous load on the network). Although these emergent schemes may have their advantages, more maturity is required for their use in large-scale, mission-critical applications. In the current scenario, checkpoints are mostly synchronous, with all the nodes writing checkpoints at the same time. Additionally, the checkpoints ideally involve the storage of the full system memory (at least 70% to 80% of the full memory in general) [19]. This scenario is quite common when schedulers such as Condor [20] are used or when applications checkpoint using libraries such as Libckpt [21]. The frequency of checkpointing can be decreased for productive I/O to surpass defensive I/O, but not without the risk of losing more computation due to system failures.

2.2.2 Optimizing Checkpointing Frequency

In this phase of the research, we model the overhead in a system due to checkpointing with respect to the MTBF, memory capacity and I/O bandwidth of the system. In so doing, we identify the optimum frequency of checkpointing to be used on systems with given specifications. Optimal checkpointing helps to make efficient use of the available I/O and gain the maximum performance out of the system without compromising on its fault tolerance aspects. There have been similar efforts earlier to model and identify optimum checkpointing frequency for distributed systems [22-25]. However, these efforts have not been simulatively or experimentally verified, and the approaches as yet are too theoretical to be practically implemented on systems.

It should be mentioned that optimizing the frequency of checkpointing is just one method of reducing the impact of defensive I/O on I/O bandwidth requirements. Other methods might include improvement of the storage system (high-performance storage system, high-performance interconnects, etc.), novel methods and algorithms for checkpointing, etc. There are claims that we can hide the impact of defensive I/O and work around this problem rather than tackling it. However, there are no recorded proofs to substantiate such claims.

2.3 Analytical Modeling of Optimum Checkpointing Frequency

The distribution of the total running time t of a job is determined by several parameters, including:
- Failure rate λ = 1/TMTBF, where TMTBF is the MTBF of the system
- Execution time without failure or checkpointing, Tx
- Execution time between checkpoints (checkpoint interval), Ti
- Time to perform a checkpoint (checkpointing time), Tc

- Operating system startup time, To
- Time to perform a restart (recovery time), Tr

Figures 2-6 and 2-7 show the various time parameters involved in the execution of a job without and with checkpointing, respectively. Without checkpointing, the system has to be failure-free for a duration of Tx for the computation to complete successfully. When a failure occurs, computation is restarted from its initial state after system recovery. With checkpointing, the system state is checkpointed periodically. When the system is recovered from a failure, computation is resumed from the latest stable checkpointed state.

Figure 2-6. Job execution without checkpointing

Figure 2-7. Job execution with checkpointing

The execution process with checkpointing and failures can be modeled as a Markov process, as shown in Figure 2-8, where nodes 0, 1, 2, ..., n represent stable checkpointed states and 0', 1', 2', ..., n' represent failed states. Let t1, t2, t3, ..., tn be the random variables for the time spent in each cycle between two checkpointed states. These random variables are identically distributed, so each is represented by t1 in general.
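Before the mean of this process is derived analytically, it can help to see it operationally. The Python sketch below is a minimal Monte Carlo simulation of the cycle just described: compute for Ti, write a checkpoint for Tc, and on a failure pay the startup and restart costs To and Tr before redoing the interval. The parameter values are assumed purely for illustration; a simulation of this kind is what the analytical model is later cross-checked against.

import random

def simulate_total_time(lam, n_cycles, ti, tc, to, tr):
    # One realization of the process in Figure 2-8: n_cycles checkpoint
    # intervals, each of which may have to be retried after failures.
    t = ti + tc            # T: compute interval plus checkpoint write
    t_prime = to + tr      # T': OS startup plus restart from the last checkpoint
    total = 0.0
    for _ in range(n_cycles):
        window = t         # first attempt needs a failure-free window of length T
        while True:
            time_to_failure = random.expovariate(lam)
            if time_to_failure >= window:
                total += window          # cycle reaches the next checkpointed state
                break
            total += time_to_failure     # failure part-way through the window
            window = t + t_prime         # retry: recover, then redo the whole interval
    return total

# Assumed example: MTBF = 24 h, Ti = 1 h, Tc = Tr = 1000 s, To = 60 s, 10 cycles.
lam = 1.0 / (24 * 3600)
runs = [simulate_total_time(lam, 10, 3600.0, 1000.0, 60.0, 1000.0) for _ in range(10000)]
print(sum(runs) / len(runs) / 3600.0)    # mean total running time in hours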

Figure 2-8. Execution modeled as a Markov process

The delays associated with each event in Figure 2-8 are as follows:
- a: T
- b: ≤ T
- c: T + T'
- d: ≤ T + T'
where T = Ti + Tc and T' = To + Tr.

The probabilities associated with each event in Figure 2-8 are as follows:
- p = e^(-λT)
- p' = e^(-λ(T+T'))
- q = prob(failure before T) = 1 - p
- q' = prob(failure before T + T') = 1 - p'

It should be noted that the total running time is a sum of the individual random variables representing the individual checkpointing cycles. However, the random variables are similar, and hence the mean total running time (t̄) can be given as the sum of the mean running times of the cycles:

t̄ = E(t) = E(t1) + E(t2) + E(t3) + ... + E(tn) = n E(t1)     (2.1)

The mean running time of each cycle can be found by multiplying the probabilities associated with each path in the Markov chain and the corresponding time delay. There are several paths in the Markov chain that the process can actually take. The process can move from state 0 to state 1 with probability p, and the delay associated with that transition is a. When there are failures in the system, the process does not directly move from state 0 to state 1. Instead, the process moves to state 0' with probability q and loops back in the same state with probability q'. The delays associated with these state transitions are represented by b and d respectively; b and d are exponential random variables with upper limits of T and T + T' respectively. With probability

p', the system moves from the failed state to the stable state 1, and the delay associated with the transition, represented by c, is equal to T + T', the upper limit of d.

E(t1) = p a + q p'(b + c) + q q' p'(b + c + d) + q q'^2 p'(b + c + 2d) + ...
      = p a + q (b + c + d q'/(1 - q'))     (2.2)

where a = T; b = 1/λ - T/(e^(λT) - 1); c = T + T'; and d = 1/λ - (T + T')/(e^(λ(T+T')) - 1). Substituting the corresponding time delays into Eq. 2.2, we get Eq. 2.3:

E(t1) = T e^(-λT) + (1 - e^(-λT)) [1/λ - T/(e^(λT) - 1) + T + T' + (e^(λ(T+T')) - 1)(1/λ - (T + T')/(e^(λ(T+T')) - 1))]     (2.3)

As can be seen from Eqs. 2.1 and 2.3, the expression for the mean total running time is complicated. To avoid further complexity, we followed a different method, as follows. The Laplace transform of a function f(x) is given by ψ_x(s) = ∫ f(x) e^(-sx) dx. We can find the Laplace transform of the pdf of the total running time. The negative of the derivative of the Laplace transform at s = 0 gives the mean total running time.

-dψ_t(s)/ds = -d/ds ∫ f(t) e^(-st) dt = ∫ t f(t) e^(-st) dt     (2.4)

-ψ_t'(0) = ∫ t f(t) dt = E[t] = t̄     (2.5)

The Laplace transform can be calculated by finding the Laplace transforms of the individual transitions in the Markov process and then combining them together. For example, the Laplace transform of the pdf of the time required for the transition from state 0 to state 1 without a failure is given by e^(-sT), and the transition happens with probability p. The transform of the pdf

of the time required for the transition from state 0 to state 0', which happens with probability q, is given by

ψ_b(s) = (1/(1 - e^(-λT))) ∫ from 0 to T of λ e^(-λt) e^(-st) dt = (λ/(s + λ)) (1 - e^(-(s+λ)T))/(1 - e^(-λT)).

The Laplace transform of the pdf of the time required for the transition from state 0' to state 1, which happens with probability p', is given by e^(-s(T+T')). If there is looping in state 0' due to repeated failures, the transform of the pdf of the time spent in each loop (transition probability q') is given by

ψ_d(s) = (1/(1 - e^(-λ(T+T')))) ∫ from 0 to T+T' of λ e^(-λt) e^(-st) dt = (λ/(s + λ)) (1 - e^(-(s+λ)(T+T')))/(1 - e^(-λ(T+T'))).

The Laplace transform for the process with failures (state transition from 0 to 1 via 0') is found by doing a weighted sum of the Laplace transforms of the pdfs of the individual random variables. The transform is given by

q ψ_b(s) Σ_{i=0..∞} p' e^(-s(T+T')) [q' ψ_d(s)]^i

where i denotes the number of loops in state 0'. Hence, the Laplace transform of the pdf for one cycle of the Markov process is given by

ψ_t1(s) = p e^(-sT) + q ψ_b(s) Σ_{i=0..∞} p' e^(-s(T+T')) [q' ψ_d(s)]^i

We obtain Eq. 2.6 from the above by summing the geometric series; differentiating Eq. 2.6 with respect to s and substituting s = 0, we get the mean total running time given by Eq. 2.7. We verified the validity of the expressions for the mean total running time given by Eqs. 2.4 and 2.7 by running a Monte Carlo simulation with 10000 iterations and cross-checking the results. We found that the time given by the expressions in the equations closely matched the simulation results. We will be using the expression in Eq. 2.7 for further development for simplicity reasons, both in terms of representation and computation. Since the cycles are independent and identically distributed, the transform of the total running time is

ψ_t(s) = ψ_t1(s) ψ_t2(s) ψ_t3(s) ... ψ_tn(s) = [ψ_t1(s)]^n


Substituting the individual transforms gives

Φ_t(s) = [ e^{-(λ+s)T} + (λ/(λ+s)) (1 - e^{-(λ+s)T}) e^{-(λ+s)(T+T')} / (1 - (λ/(λ+s)) (1 - e^{-(λ+s)(T+T')})) ]^n    (2.6)

E(t) = -Φ_t'(0) = (n/λ) e^{λ(T_o+T_r)} (e^{λ(T_i+T_c)} - 1)    (2.7)

We find the optimum checkpointing interval T_i_opt that gives the minimum total running time by differentiating the mean total running time with respect to T_i and equating to zero, as shown in Eq. 2.8. We set n equal to the ratio of T_x to T_i.

∂E(t)/∂T_i = (T_x e^{λ(T_o+T_r)} / (λ T_i^2)) [ λ T_i e^{λ(T_i+T_c)} - e^{λ(T_i+T_c)} + 1 ] = 0    (2.8)

Solving for T_i in Eq. 2.8, we get the optimum checkpointing interval as follows:

T_i_opt = (1 - e^{-λ(T_i_opt + T_c)}) / λ    (2.9)

Eq. 2.9 can be represented in the form β = 1 - α e^{-β}, where β = λ T_i_opt and α = e^{-λ T_c}. From this form of representation, it can be seen that Eq. 2.9 is a transcendental equation, and it is impossible to find a solution for β except by defining a new function; there is no analytical way to obtain a closed-form solution. However, it can be seen that β is bounded by the limit β < 1, which implies that T_i_opt is bounded by T_i_opt < 1/λ (i.e., T_i_opt < MTBF of the system). Also, since e^{-λ(T_i_opt + T_c)} ≤ e^{-λ T_c}, we have a lower bound on the optimum checkpointing interval of T_i_opt ≥ (1 - e^{-λ T_c})/λ. Hence, we can use numerical methods to solve the equation for the optimum checkpointing interval.
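Because the fixed-point form T_i = g(T_i) in Eq. 2.9 satisfies |g'| < 1, a simple fixed-point iteration converges between the bounds just derived. The following Python sketch is our own illustration (the function name, the choice of hours as the time unit, and the 5 TB / 5 GB/s example values are assumptions chosen to mirror the systems studied later), not part of the dissertation's tooling.

```python
import math

def optimal_checkpoint_interval(mtbf, t_c, tol=1e-9, max_iter=10000):
    """Fixed-point iteration on Eq. 2.9: T_i = (1 - exp(-lambda*(T_i + T_c)))/lambda.
    mtbf and t_c must be expressed in the same time unit (hours here)."""
    lam = 1.0 / mtbf                              # failure rate, lambda = 1/MTBF
    t_i = (1.0 - math.exp(-lam * t_c)) / lam      # start from the lower bound derived above
    for _ in range(max_iter):
        nxt = (1.0 - math.exp(-lam * (t_i + t_c))) / lam
        if abs(nxt - t_i) < tol:
            return nxt
        t_i = nxt
    return t_i

if __name__ == "__main__":
    t_c = 5 * 1000 / 5 / 3600.0                   # e.g., 5 TB over 5 GB/s, in hours
    for mtbf in (8, 16, 24):
        print(mtbf, round(optimal_checkpoint_interval(mtbf, t_c), 2))
```

For these example values the iteration returns roughly 1.9, 2.8, and 3.5 hours for MTBFs of 8, 16, and 24 hours, consistent with the intervals reported later in Section 2.4.8.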


Impact of Checkpointing Overhead on I/O Bandwidth

We modeled the optimum checkpointing frequency that provides the minimum overall running time for applications. With the model developed, we study in this section the impact of checkpointing overhead on the sustainable I/O bandwidth of systems. The representative performance numbers used in this study are typical of HPC and supercomputing systems on the ground. In our view, given a system with a specific memory capacity and MTBF, it would be useful to study the I/O bandwidth requirements of the system with respect to the overhead that is imposed by the checkpointing done in the system and the subsequent performance loss in terms of the total execution time of the application. Eq. 2.10 obtains this performance loss as a function of T_i and T_c.

F = E(t)/T_x = e^{λ(T_o+T_r)} (e^{λ(T_i_opt + T_c)} - 1) / (λ T_i_opt)    (2.10)

Let F denote the factor of increase in the total running time of the application due to checkpointing overhead and failures while running with the optimum checkpoint interval. F is given by the ratio of E(t) to T_x, as in Eq. 2.10. In Eq. 2.10, T_o can be considered negligible compared to the other times. Also, T_r can be considered equal to T_c, as both represent the time to move the same amount of data through the same I/O channel. T_c can be given by the ratio of memory capacity to I/O bandwidth. Given a value of F, we can solve for T_i_opt by solving a quadratic equation in e^{λ T_i}:

T_i_opt = (1/λ) ln( (K + sqrt(K^2 - 4M)) / 2 )    (2.11)

where K and M are constants determined by F and e^{λ T_c}.


Figures 2-9(a) and 2-9(b) show T_i_opt, and hence the impact of checkpointing on the overall system execution time, in systems with MTBFs of 8 hours and 24 hours respectively, for varying system memory capacities and I/O bandwidths. The MTBF values used in the figures are typical of recent high-performance systems [5]. The value of F is fixed at 1.2 in the figures. We pick 1.2 because F represents the impact of checkpointing on overall execution time, and lower values of F (as close to 1 as possible) are certainly desirable. It can be seen from the figures that for systems with low I/O bandwidth, optimum checkpointing is not even possible, implying that the allowable overhead (represented by F) is not achievable. As the I/O bandwidth of the system is increased, the optimum checkpointing interval also increases. Since the total execution time has been fixed (1.2 times the actual execution time), an increase in checkpoint interval means a decrease in the number of checkpoints (n) during the course of the execution. Fewer checkpoints imply less overhead on the system's I/O.

It can be seen from Figure 2-9 that sustainable I/O bandwidth is key to obtaining the optimum overall execution time. As the memory capacity of the system increases, so does the requirement for higher I/O bandwidth. For systems with higher memory capacities, it is impossible to find a solution for the optimum checkpointing interval that obtains the optimum overall execution time. For example, in a system with an MTBF of 8 hours and a memory capacity of 75 TB, there is no solution for the optimum checkpointing interval until the I/O bandwidth is increased to 29 GB/s. The impact is less in systems with higher MTBF values. For a system with similar memory capacity but an MTBF of 24 hours, there exists a solution starting with an I/O bandwidth of 10 GB/s. However, an important factor to note is that although solutions exist for the optimum checkpointing interval that gives the minimum total running time, the system might be bogged down by too many checkpoints if the checkpoint interval is low. For example, in a


system with a memory capacity of 75 TB and an MTBF of 24 hours, the optimum checkpointing interval is around 10 minutes. Performing large checkpoints at such a high frequency will certainly cause a great load on the system and is not desirable.

Figure 2-9. Impact of I/O bandwidth on the system execution time and checkpointing overhead. A) System with T_MTBF = 8 hours. B) System with T_MTBF = 1 day.

In certain scenarios or systems, the utility of the system within each cycle can be critical. In a given system, let R_1 denote the utility in a cycle (i.e., the ratio of time spent doing useful calculations to the overall time spent in a cycle):

R_1 = T_i / (T_i + T_c)

and when T_i = T_i_opt,


R_1 = (1 - e^{-λ(T_i_opt + T_c)}) / (1 - e^{-λ(T_i_opt + T_c)} + λ T_c)    (2.12)

Eq. 2.12 gives the utility in each cycle when checkpointing is performed at the optimum checkpointing interval. The checkpointing time T_c can again be given by the ratio of C_MEM (memory capacity) to IO_BW (sustainable I/O bandwidth). Figures 2-10(a) and 2-10(b) give the utility in a cycle for several memory capacities and I/O bandwidths, for systems with MTBFs of 8 hours and 24 hours respectively. The value of F is fixed at 1.2. The trend of the I/O bandwidth requirement is similar to that in Figure 2-9.

Figure 2-10. Impact of increasing memory capacity in systems on sustainable I/O bandwidth requirement. A) System with T_MTBF = 8 hours. B) System with T_MTBF = 1 day.
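The per-cycle utility of Eq. 2.12 is easy to evaluate once Eq. 2.9 is solved numerically. The Python sketch below is our own illustration (function names and units are assumptions); it combines the fixed-point solution of Eq. 2.9 with Eq. 2.12, taking T_c as memory capacity divided by sustainable I/O bandwidth.

```python
import math

def optimal_interval(mtbf, t_c, iters=2000):
    """Fixed-point solution of Eq. 2.9 (both arguments in hours)."""
    lam, t_i = 1.0 / mtbf, t_c
    for _ in range(iters):
        t_i = (1.0 - math.exp(-lam * (t_i + t_c))) / lam
    return t_i

def cycle_utility(mtbf_hours, memory_gb, io_gb_per_s):
    """Eq. 2.12: fraction of an optimal checkpoint cycle spent on useful work,
    with T_c taken as memory capacity divided by sustainable I/O bandwidth."""
    t_c = memory_gb / io_gb_per_s / 3600.0      # checkpoint time in hours
    t_opt = optimal_interval(mtbf_hours, t_c)
    return t_opt / (t_opt + t_c)

if __name__ == "__main__":
    # A 75 TB system with an 8-hour MTBF needs on the order of 150 GB/s before
    # roughly 90% of each cycle goes to useful computation (cf. the discussion below).
    print(round(cycle_utility(8, 75_000, 150), 2))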


It can be seen from the figures that, in most of the cases, the utility in a checkpoint cycle is much less than 90%, which implies that most of the time within a cycle is spent checkpointing and not doing useful calculations. Although not shown in the figure, it was found that for a system with an MTBF of 8 hours and a memory capacity of 75 TB, an I/O bandwidth of about 150 GB/s is required to utilize at least 90% of a checkpoint cycle on useful calculations. Such values are much higher than the I/O bandwidth available in present systems. The fact that the system cannot even provide useful computations in many cases shows the gravity of the situation.

We see from Figure 2-10 that the utility within a cycle almost saturates beyond a certain I/O bandwidth. The I/O bandwidth at which the percentage utility begins to flatten out is what is desirable for the system, both present and future. For example, an I/O bandwidth of around 10 GB/s would be desirable for a system with a memory capacity of 1 TB and an MTBF of 8 hours. For a similar system with an MTBF of 24 hours, even a lower I/O bandwidth (around 5 GB/s) would suffice. But as the memory capacity of the system increases, the desirable I/O bandwidth is dramatically higher.

2.4 Simulative Verification of Checkpointing Model

Simulation is a useful tool to observe the dynamic behavior of a system as its configuration and components change. As preliminary verification, Monte Carlo simulations were performed on the analytical model. The simulation and analytical results matched closely, verifying the correctness of the model mathematically. However, for more accurate verification of the model, we develop simulation models to mimic super and embedded computing environments using Mission-Level Designer (MLD), a discrete-event simulation platform developed by MLDesign Technologies Inc. Section 2.4.1 presents the system model used to gather results, and Sections 2.4.2-2.4.6 provide details about each major component model.
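Before describing the MLD component models, the Monte Carlo style check mentioned above can be illustrated with a short sketch. The following Python fragment is our own illustration (function names, units, and the example parameter values are assumptions); it replays the Markov-chain behavior of Figure 2-8 directly and compares the sample mean against Eq. 2.7.

```python
import math
import random

def simulated_total_time(t_x, t_i, t_c, t_rec, mtbf, trials=10_000):
    """Monte Carlo estimate of the total running time of an application needing
    t_x hours of computation, checkpointing every t_i hours, with exponential
    failures of rate 1/mtbf.  t_rec stands for T_o + T_r, the downtime plus
    restore time charged after every failure (all times in hours)."""
    lam = 1.0 / mtbf
    n_cycles = int(round(t_x / t_i))
    grand_total = 0.0
    for _ in range(trials):
        total = 0.0
        for _ in range(n_cycles):
            window = t_i + t_c                 # work + checkpoint must be failure-free
            while True:
                fail_at = random.expovariate(lam)
                if fail_at >= window:
                    total += window
                    break
                total += fail_at               # time lost to the failure
                window = t_rec + t_i + t_c     # redo the segment after recovery
        grand_total += total
    return grand_total / trials

def analytical_total_time(t_x, t_i, t_c, t_rec, mtbf):
    """Eq. 2.7: E(t) = (n/lambda) * exp(lambda*(T_o+T_r)) * (exp(lambda*(T_i+T_c)) - 1)."""
    lam, n = 1.0 / mtbf, t_x / t_i
    return (n / lam) * math.exp(lam * t_rec) * (math.exp(lam * (t_i + t_c)) - 1.0)

if __name__ == "__main__":
    params = dict(t_x=360.0, t_i=2.0, t_c=0.28, t_rec=0.28, mtbf=8.0)
    print(simulated_total_time(**params), analytical_total_time(**params))
```

For parameters in the ranges studied here the two estimates agree to within the Monte Carlo noise, which is the sense in which the analytical model was first cross-checked.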


2.4.1 System Model

Figure 2-11 displays the system model that we employed to conduct the simulation experiments. The system consists of four key component models: multiple nodes, network, central memory, and fault generator. Each component has associated parameters that are user-definable in order to model different system settings. The components and their parameters are discussed in detail in the subsequent sections.

Figure 2-11. System model

2.4.2 Node Model

A node is defined as a device that processes and checkpoints data, and is prone to failures. Figure 2-12 shows the MLDesigner node model. The behavior of the node is modeled such that it checkpoints its entire main memory at a specified time period. The time spent between checkpoints represents useful computation, communication, and productive I/O time used to complete a specific task (see Computation section in Figure 2-12). After a checkpoint has been successfully completed, the nodes can continue processing data.

The node model was designed to provide an abstract representation of a clustered processing element that runs a parallel job along with the other nodes in the system. If a single node in the system fails, all the nodes must recover from the last successful checkpoint to ensure data integrity. The statistics gathered at the node level (see Statistics section in Figure 2-12) include completed computation time and lost computation time due to a failure. The user can


define numerous parameters for the node model, including checkpoint interval, main memory size, and application completion time.

Figure 2-12. Node model

2.4.3 Multiple Nodes Model

The multiple nodes model, illustrated in Figure 2-13, uses a capability of the MLDesigner tool, dynamic instantiation, that allows a single block to represent multiple instances of a model. The node model described in Section 2.4.2 is dynamically instantiated in the multiple nodes model. The technique is used to ease the design and configuration procedure used to model large homogeneous systems. The main function of the multiple nodes model is to ensure global synchronous checkpoints and to collect statistics. The statistics gathered include completed checkpoint time and total checkpoint time lost. That is, it records the amount of time taken to successfully complete a checkpoint and also the amount of time lost when a failure occurs during a checkpoint.


Figure 2-13. Multiple nodes model

2.4.4 Network Model

The network model is a coarse-grained representation of a generic switch- or bus-based network. The model uses user-defined latency and effective bandwidth parameters to calculate network delay based on the size of the transfer. Figure 2-14 illustrates the network model.

Figure 2-14. Network model
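As a concrete illustration of this coarse-grained treatment (our own sketch, not code from the MLD model; the function name and the example numbers are assumptions), the delay charged for any transfer is simply a fixed latency plus the transfer size divided by the effective bandwidth:

```python
def transfer_delay(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Coarse-grained network delay: fixed latency plus size over effective bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

# e.g., a 4 GB per-node checkpoint over a 10 GB/s link with 5-microsecond latency
print(transfer_delay(4e9, 5e-6, 10e9))      # about 0.4 seconds
```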


2.4.5 Central Memory Model

The central memory model is another coarse-grained model that is used as a mechanism to represent the storing and restoring of checkpointed data. The model accepts each checkpoint request and sends a reply when the checkpoint is completed. When a node fails, each node is sent its last successful checkpoint after some user-definable recovery time, which represents the time needed for the system to respond to the failure. No actual data is stored in the central memory since the simulated system does not actually process real data. Also, the central memory is assumed to be large enough to hold all checkpoint data; therefore overflow is not considered. Figure 2-15 illustrates the central memory model.

Figure 2-15. Central memory model

2.4.6 Fault Generator Model

When a failure occurs in a component, the system must recover using a predefined procedure. This procedure starts by halting each working node, followed by the transmission of the last checkpoint by the central memory model. When the failed node recovers and all nodes receive their last checkpoint, the system begins processing once again. The fault generator


model shown in Figure 2-16 controls this process by creating a failure event at some random time, based on an exponential distribution with a user-definable mean time (i.e., MTBF), and orchestrating the recovery process of the components in the system. For example, when a failure occurs, the fault generator will pass a notice to the node models to let them know that they must recover. It also tracks the status of each node regarding its recovery state (e.g., recovered or recovering).

Figure 2-16. Fault generator model

2.4.7 Simulation Process

The total execution time of the application, the MTBF of the system, the sustainable I/O bandwidth, the amount of memory to be checkpointed, and the frequency of checkpointing are input parameters to the simulation model and are variable. Faults are generated in the system based on the MTBF value input. After every checkpointing period, data (equal to the size of memory specified) is checkpointed to the central memory. The time to checkpoint is dependent on the amount of checkpoint data and the I/O bandwidth values. While recovering from a failure, the nodes in the system collect the data from the central memory, and the transfer time is again dependent on the


data size and the I/O bandwidth value. If a failure occurs in the system during the data recovery process, the transfer is reinitiated and the process is repeated until successful.

2.4.8 Verification of Analytical Model

In order to verify the correctness of our analytical model, we simulated the checkpointing process in several systems with memory ranging from 5 TB to 75 TB and I/O bandwidth ranging from 5 GB/s to 50 GB/s, running applications with execution times ranging from 15 days to 6 months (180 days). For each combination of values for the above parameters, the system was simulated for MTBF values of 8 hours, 16 hours and 24 hours. The numbers listed above are typical of high-performance computers found in several research labs and production centers. In this section, we pick a few systems and show the simulation results to verify the analytical model for optimum checkpointing frequency.

Figure 2-17 shows the results from simulations of the system running applications with a fault-free execution time of 15 days (15 x 24 = 360 hours). For a given value of checkpointing interval (T_i), the y-axes in the figure give the application completion time (i.e., the total time for the application to complete with faults occurring in the system). Hence, the optimum checkpoint interval is the value for which the application completion time is the least with the given system resources. Based on the completion times given by Figure 2-17(a), for a system with 5 TB memory and 5 GB/s sustained I/O bandwidth, the optimum checkpoint intervals are around 2 hours, 3 hours and 4 hours respectively when the MTBF of the system is 8 hours, 16 hours and 24 hours. Similarly, for a system with 5 TB memory and 50 GB/s sustained I/O bandwidth, the optimum checkpoint intervals, as shown by Figure 2-17(b), are 0.5 hours, 1 hour and 1 hour respectively when the MTBF values are 8 hours, 16 hours and 24 hours.

In order to verify the correctness of the analytical model, Eq. 2.9 needs to be solved to find T_i_opt given the same system parameters used for simulation in Figure 2-17. We use the fixed


point numerical method [26] to solve Eq. 2.9. Such strategies are quite common in solving equations, as not all problems yield to direct analytical solutions [27]. Fixed point iteration is a method for finding roots of an equation f(x) = 0 if it can be cast into the form x = g(x), as is Eq. 2.9. A fixed point of the function g(x) is the solution of the equation x = g(x), and the solution is found using an iterative technique: a value is supplied for the unknown variable in the equation and the process is repeated until an answer is achieved.

Figure 2-17. Optimum checkpointing interval based on simulations of systems with 5 TB memory. A) System with I/O bandwidth of 5 GB/s. B) System with I/O bandwidth of 50 GB/s.

In order to show the validity of using a numerical method to solve our analytical model, we show the results from the fixed point iterative method that we used in Figure 2-18. The system used for illustration in the figure is the one with 5 TB of memory and 5 GB/s of I/O bandwidth. For Eq. 2.9 to have a solution (i.e., for a given checkpoint interval to be the optimum), the left-


hand side (LHS) must equal the right-hand side (RHS) of the equation. For a given value of checkpointing interval (T_i), the y-axis in Figure 2-18, referred to as the error in solution, gives the difference between the LHS and the RHS of Eq. 2.9. Hence, the optimum checkpoint interval is the one for which the error is zero. The checkpointing interval values used in our iterative technique are bound by the lower and upper limits discussed in Section 2.3. The optimum checkpoint intervals identified using the numerical method in Figure 2-18 for the system are 2 hours, 3 hours and 3.5 hours respectively when the MTBF is 8 hours, 16 hours, and 24 hours. The results of the analytical model are then cross-compared with the solutions from the simulation model shown in Figure 2-17. It can also be seen that the solutions are well below the upper bound (MTBF) derived in Section 2.3. Section 2.5 provides the results from the analytical model for a wide range of systems.

Solving our analytical model for a system with 5 TB of memory and 5 GB/s of I/O bandwidth, the optimum checkpointing intervals are around 2 hours, 3 hours and 3.5 hours respectively when the MTBF of the system is 8 hours, 16 hours and 24 hours. For a similar system with 5 TB of memory but 50 GB/s of I/O bandwidth, the optimum checkpointing intervals are around 0.5 hours, 0.75 hours and 1 hour respectively when the MTBF of the system is 8 hours, 16 hours and 24 hours. Based on the simulation results for optimum checkpoint intervals in Figure 2-17, it can be seen that the analytical results match those of simulation. According to both the analytical and simulation models, for a system with 5 TB of memory and 5 GB/s of I/O bandwidth, the optimum checkpointing intervals are around 2 hours, 3 hours and 4 hours respectively when the MTBF of the system is 8 hours, 16 hours and 24 hours. Likewise, for a system with 5 TB of memory but 50 GB/s of I/O bandwidth, the optimum checkpointing intervals are around 0.5


hours, 1 hour and 1 hour respectively when the MTBF of the system is 8 hours, 16 hours and 24 hours. The analytical model is thus verified by the simulation model.

Figure 2-18. Numerical method solution for the analytical model

Eq. 2.9 is independent of the total execution time of the application. In order to verify this behavior of the analytical model, we ran simulations for applications with larger execution times. Figure 2-19 shows the results from simulations of the system running applications with a fault-free execution time of 180 days (180 x 24 = 4320 hours). We also wanted to verify that the analytical model holds true for systems with parameters different from those used above. Hence the system shown in Figure 2-19 has 75 TB of memory, which is close to the high end of presently existing supercomputers.

Simulation results in Figure 2-19(a) show that for a system with 75 TB of memory and 5 GB/s of I/O bandwidth, the optimum checkpointing intervals are around 6 hours, 9 hours and 12 hours respectively when the MTBF of the system is 8 hours, 16 hours and 24 hours. For the same system, the analytical model also provides the same optimum checkpointing intervals, as can be seen from the figures in Section 2.5.


Simulation results in Figure 2-19(b) show that for a system with 75 TB of memory but 50 GB/s of I/O bandwidth, the optimum checkpointing intervals are around 2 hours, 3 hours and 4 hours respectively when the MTBF of the system is 8 hours, 16 hours and 24 hours. Analytical results for the same system match the simulation results. Thus, it can be seen from Figures 2-17 through 2-19 that the analytical results match closely with the simulation results, further verifying the correctness of the analytical model.

Figure 2-19. Optimum checkpointing interval based on simulations of systems with 75 TB memory. A) System with I/O bandwidth of 5 GB/s. B) System with I/O bandwidth of 50 GB/s.

2.5 Optimal Checkpointing Frequencies in Typical Systems

The results shown in Section 2.4 were primarily used to verify the analytical model. In this section, we show results derived from the analytical model for a wide range of systems with parameter values representing large supercomputers traditionally developed for HPC. The results provide insight about the checkpoint intervals that should be used in typical scenarios. In


order to analyze the applicability of our model to a broader spectrum of systems, we also studied small systems intended for high-performance embedded computing (HPEC) in space. It should be mentioned here that at present there are no existing systems for HPC in space. However, there are initiatives for developing such systems, such as the one at the High-performance Computing and Simulation Research Laboratory at UF in collaboration with NASA and Honeywell Inc. [1]. The values supplied for the parameters (resources) in our studies are inspired by the system at the University of Florida and hence fall in that range.

2.5.1 Traditional Ground-based HPC Systems

We show the optimal checkpointing frequency results for three different cases with parameter values representing traditional ground-based HPC systems in Figure 2-20. The optimal checkpointing frequencies are calculated using the same method discussed in the previous section (i.e., the fixed point iteration method). Figure 2-20(a) gives the results for a system at the lower end with 5 TB memory, while Figure 2-20(b) gives the results for a system at the other extreme with 75 TB memory. These higher-end resources are similar to what an application such as the Crestone project discussed earlier would solicit. The results for a midrange system with 30 TB memory are shown in Figure 2-20(c).

The optimum checkpointing interval for typical HPC systems is on the order of hours. An increase in MTBF implies a lower chance of faults, and hence it is not required to checkpoint as often. It can be seen from Figure 2-20 that as the MTBF of the system increases, the optimum checkpointing interval also increases. However, the difference is pronounced only in systems with low I/O bandwidth. As the I/O bandwidth of the system is increased, the difference in optimum checkpointing intervals for varying MTBF decreases. It can also be observed from the figure that for a system with a given memory capacity and MTBF, the


optimum checkpointing interval decreases as the I/O bandwidth of the system is increased. This trend implies that it is better to checkpoint at a higher frequency if the time to checkpoint lowers. This observation can also be ascertained by the fact that, for the same I/O bandwidth, the optimum checkpointing interval increases as the memory capacity of the system increases (implying that it is optimal to checkpoint at a lower frequency if the checkpointing time increases).

Figure 2-20. Optimum checkpointing interval for various systems. A) System with 5 TB memory. B) System with 30 TB memory. C) System with 75 TB memory.

2.5.2 Space-based HPC Systems

Figure 2-21 shows the optimal checkpointing frequencies for systems with parameter values in the range of HPC systems in space. Figure 2-21(a) gives the results for a system at the lower end with 512 MB memory, while Figure 2-21(b) gives the results for a system at the other extreme with 8 GB memory. The results for a midrange system with 2 GB memory are shown in


Figure 2-21(c). The optimum checkpointing interval for typical HPEC systems is on the order of minutes.

Figure 2-21. Optimum checkpointing interval for various systems in space. A) System with 512 MB memory. B) System with 2 GB memory. C) System with 8 GB memory.

It can be seen from Figure 2-21 that the behavior of the space-based HPEC systems is similar to that of the ground-based HPC systems shown in Figure 2-20, although HPEC systems have far fewer resources and a higher failure rate. For a given system, the optimal checkpoint interval increases as the MTBF value is increased. Likewise, for a system with a given MTBF and I/O bandwidth, the optimum checkpoint interval increases as the memory of the system increases. As the I/O bandwidth of the system is increased, with other system parameters kept constant, the optimum checkpoint interval decreases.
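The trends in Figures 2-20 and 2-21 can be regenerated from the analytical model alone. The sketch below is our own illustration (function names, units, and the particular memory/bandwidth grid are assumptions); the same routine covers the HPEC ranges of Figure 2-21 by supplying memory in megabytes, bandwidth in Mb/s, and MTBFs of minutes instead.

```python
import math

def optimal_interval_seconds(mtbf_s, t_c_s, iters=5000):
    """Fixed-point solution of Eq. 2.9 with all quantities in seconds."""
    lam, t_i = 1.0 / mtbf_s, t_c_s
    for _ in range(iters):
        t_i = (1.0 - math.exp(-lam * (t_i + t_c_s))) / lam
    return t_i

# Ground-based sweep in the spirit of Figure 2-20: memory in GB, bandwidth in GB/s,
# MTBF of 24 hours; checkpoint time T_c is memory capacity over I/O bandwidth.
for mem_gb in (5_000, 30_000, 75_000):
    for bw_gbps in (5, 25, 50):
        t_c = mem_gb / bw_gbps
        t_opt = optimal_interval_seconds(24 * 3600, t_c)
        print(f"{mem_gb // 1000:>3} TB  {bw_gbps:>2} GB/s  T_i_opt ~ {t_opt / 60:6.1f} min")
```

The printed intervals fall in the hours range for the 24-hour-MTBF ground systems, matching the qualitative behavior discussed above; exact values depend on the unit conventions assumed here.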


2.7 Conclusions and Future Research

In this phase of the research, the useful computation time of the system is improved (increased) by optimizing checkpointing-related defensive I/O. We studied the growth in performance of the various technologies involved in high-performance computing, highlighting the poor performance growth of disk I/O compared to other technologies. Presently, defensive I/O that is based on checkpointing is the primary driver of I/O bandwidth, rather than productive I/O that is based on processor performance. There are several research efforts to improve the process of checkpointing itself, but they are generally application-specific. To our knowledge, there have been no successful attempts to optimize the rate of checkpointing, which is our approach to reduce the overhead of fault tolerance. Such an approach is not application- or system-specific and is applicable to HPC systems both in space and on the ground. The primary contribution of this phase of research is the development of a new model of the checkpointing process that identifies the optimum rate of checkpointing for a given system. Checkpointing at an optimal rate reduces the risk of losing much computation on a failure while increasing the amount of useful computation time available.

We developed a mathematical model to represent the checkpointing process in large systems and analytically modeled the optimum computation time within a checkpoint cycle in terms of the time spent on checkpointing, the MTBF of the system, the amount of memory checkpointed, and the sustainable I/O bandwidth of the system. Optimal checkpointing maximizes useful computation time without jeopardizing the applications due to failures, while minimizing the usage of resources. The optimum computation time leads to minimal wastage of useful time by reducing the time spent on overhead tasks involving checkpointing and recovering from failures.


The model showed that the optimum checkpointing interval is independent of the duration of application execution and is only dependent on system resources. In order to see the impact of checkpointing overheads and the overhead to recover from failures on the I/O bandwidth required in the systems, we analyzed the overall execution time of applications including these overhead tasks, relative to the ideal execution time in a system without any checkpointing or failures. Results pertaining to the time spent on useful calculations between checkpoints suggest the need for a very high sustainable I/O bandwidth for future systems. A system with a projected MTBF of 1 day (24 hours) and a memory capacity of 75 TB, which may be typical for a system in the near future, would require a sustainable I/O bandwidth of about 53 GB/s to effectively use 90% of the time on useful computations. For a similar system with a projected MTBF of 8 hours, the required I/O bandwidth would be 150 GB/s for the same performance.

In order to verify our analytical model, we developed discrete-event simulation models to simulate the checkpointing process. In the simulation models, we used performance numbers that represent systems ranging from small embedded cluster systems (space-based) to large supercomputers (ground-based). The simulation results matched closely with those from our analytical model. Finally, we also showed the optimum checkpointing interval for a wide range of HPC and HPEC systems. The results derived from the analytical model showed that, irrespective of the duration of the application's fault-free execution, the optimum checkpointing interval is on the order of hours for HPC systems and on the order of minutes for HPEC systems. For a system with a given MTBF, the checkpointing time, determined by the ratio of memory capacity to I/O bandwidth, is the primary factor influencing the optimum checkpointing interval.


As future work, the model can be experimentally verified with real systems running scientific applications. When successfully verified, the model can be used to find the optimal checkpointing frequency for various HPC and HPEC systems based on their resource capabilities.


CHAPTER 3
EFFECTIVE TASK SCHEDULING (PHASE II)

Several techniques exist for traditional HPC to provide fault tolerance and improve overall computation rate. We discussed the option of periodic backup of data at an optimum frequency during the first phase of this dissertation. In this next phase of the dissertation, we focus on exposing computation to fewer faults by reducing the overall execution time of applications through effective task scheduling of applications for HPC systems in space. We propose to analyze techniques to effectively schedule tasks on parallel reconfigurable hardware resources that could be part of the HPC space system.

3.1 Introduction

Recently, the processing speed and efficiency requirements of HPC applications have driven research efforts in a different direction, toward systems augmented with customizable hardware such as Field-Programmable Gate Arrays (FPGAs). Application-Specific Integrated Circuits (ASICs) almost always outperform General-Purpose Processors (GPPs) in terms of raw performance for a given application, but tend to have a much higher development cost and cannot be adapted once developed. By contrast, RC offers flexibility at multiple levels by exposing the raw building blocks to form complex logic operations optimized for the task at hand. In addition, when a task is completed, the same hardware can be reconfigured to perform a completely different task. FPGA-based dynamically reconfigurable computing can be more cost-effective and flexible than ASIC solutions and simultaneously faster than GPP-based conventional computing. RC has gained in popularity recently, and there are numerous ongoing RC-related research efforts at various institutions, in addition to a large following in the HPC vendor community with products from Cray, SRC and SGI already deployed. Empowering clusters and supercomputers with customizable and dynamically reconfigurable hardware


resources is a potential solution to address the requirements of HPC applications in terms of enormity, processing speed and efficiency. The strengths of GPPs and FPGAs complement each other, and an efficient merger can produce a powerful combination. However, this area of research is still relatively new, and much research is required to bring the tools and runtime environment to a level of maturity seen in traditional HPC systems.

HPC systems in space might potentially involve such FPGA-based reconfigurable computing alongside traditional HPC based on GPPs. The reason is two-fold: first, FPGAs fit well with the embedded-systems mentality of space computing, and second, they improve the speedup in the execution of tasks. We followed a modeling-based approach in the first phase to improve the useful computation time. In this phase, we reduce the execution time of tasks, thereby exposing them to fewer faults. However, reducing execution time requires effective scheduling of tasks. Since FPGA-based parallel computing is a relatively new field of research, there are not many standardized and well-researched mechanisms for scheduling RC applications.

In this phase of the dissertation, we explore some initial steps toward the goal of effective task scheduling in space-based HPC systems enhanced with reconfigurable hardware accelerators. Developing a robust service for automated scheduling, execution, and management of tasks, including RC tasks, on network-based clusters, or any scalable distributed system, is a major challenge. Performance modeling, dynamic scheduling, and management of distributed parallel RC applications and systems have not been well studied to our knowledge. We take a first step towards the solution of automated job management by analyzing via simulation the performance of several traditional scheduling heuristics on the computing resources, as part of a future automated job management service for parallel RC systems such as [28]. We test these


heuristics by executing typical applications used in the evaluation of Department of Energy (DOE) systems on simulated RC resources in space. We also develop a performance model, used by our scheduling heuristics, to predict the overall execution time of tasks on RC resources. With this analysis, we progress towards our primary goal of fault tolerance. Additionally, studying the scheduling of tasks on RC processing elements has similarities to, and hence helps with, task scheduling on the traditional processing elements involved in HPC in space.

The rest of the chapter is organized as follows. A brief background on RC and information on the representative RC resources and applications used in our simulations is provided in the second section. In the third section, we develop the performance model used by our scheduling heuristics and describe the various heuristics that we compare in this phase of research. We discuss the simulation results and compare the different heuristics in the fourth section. Finally, the conclusions and contributions are presented in the fifth section.

3.2 Background

A general background of RC is not included here. The topic has been well presented by many sources, such as Compton and Hauck [29] and Enzler, Plessl and Platzner [30]. In this section, we provide background information on the RC resources and HPC applications used to drive our simulations.

3.2.1 Representative RC Systems

As part of this work, we studied several RC systems, including PCI-based boards in clusters of workstations, the Cray XD1 with a fast message-passing interconnect between processors and FPGAs, and SGI's RASC for the Altix 350, which features a global shared-memory system between all system components. The speedups that the RC systems provide in our simulations are based on empirical measurements of their characteristics and performance on sample applications. A brief description of the RC systems studied follows.


We studied three boards attached to host machines as peripheral components via a PCI slot: the BenNUEY, the RC1000, and the CPX2100. The BenNUEY [31] RC board from Nallatech hosts a Virtex-II 6000 FPGA and an attached BenBLUE-II daughter card with two Virtex-II 6000s, for a total of three Virtex-II 6000s. For a PCI-based board, the BenNUEY provides a fairly substantial amount of processing power. The architecture allows for DIME module expansion of up to six Virtex-II 8000s, in addition to the FPGA on the BenNUEY, in a tightly coupled package. The RC1000 from Celoxica [32] contains a Virtex 2000E FPGA. This PCI-based board includes a front-side memory interface that can provide significant storage and direct-memory-access capability while using minimal FPGA resources, in contrast with boards having off-chip memory behind the FPGA. The CPX2100 [33] from Tarari contains three Virtex-II 2000s, although only two are available for user designs. The third FPGA is used for memory management, PCI bus management, and other control processes, allowing the other two FPGAs to be used for user designs more efficiently.

SGI's Altix family, built upon Intel Itanium 2 processors and the Linux OS, scales up to thousands of processors via their NUMAflex global shared-memory architecture. The Altix line also employs the NUMAlink message-passing interconnect with 1 μs MPI short-message latency and 3.2 GB/s unidirectional bandwidth per link. SGI has integrated a Virtex-II 6000 FPGA, known as the Reconfigurable Application-Specific Computing (RASC) module [34], directly to the NUMAlink fabric, allowing the RASC hardware to integrate into systems with up to 512 processors and beyond. Cray has also created a system with RC capabilities in the XD1 line by integrating an FPGA directly onto each 2- or 4-way symmetric multiprocessor node's local RapidArray connection. Two system flavors exist today, one including Virtex-II Pros and another including Virtex-4 FPGAs. The FPGA considered in our simulations belongs to the Virtex-4 category.


3.2.2 Representative Applications

The representative applications that were used in our simulations were chosen to possess heterogeneous characteristics in terms of execution time of individual tasks, memory locality, computational intensity (ratio of computation to memory access), etc. The choice of applications and their performance numbers are based on the performance evaluation performed by Vetter et al. [35]. It should be mentioned that we do not consider the in-depth characteristics of the applications and only consider their comparative performance pattern. Also, the applications chosen are traditional HPC applications that have been well studied before and are not specific to space computing. The reason is that space applications on HPC and RC environments are not well studied yet, and there are no standard performance evaluations of these applications to our knowledge. We use seven benchmark applications (many of them are used on ASC(1) platforms of the Department of Energy), including sPPM, SMG2000, SPHOT, IRS, UMT, AZTEC, and SWEEP3D. The applications are truly scalable, scaling to thousands of processors, and some have executed on platforms using as many as 8,000 processors. A coarse-grained, distributed-memory model is used by all the applications. Below is a brief description of the applications as described by [35].

sPPM [36] solves a three-dimensional gas dynamics problem (compressible Euler equations) using a simplified Piecewise Parabolic Method (PPM). PPM is a finite volume technique in which each grid point uses the information at the four nearest neighbors along each spatial dimension to update the values of its variables.

SMG2000 [37] is a parallel semicoarsening multigrid solver for linear systems arising from finite difference, volume, or element discretizations of the diffusion equation on logically rectangular grids.

(1) The Advanced Simulation and Computing (ASC) Program, earlier known as the Accelerated Strategic Computing Initiative (ASCI), is the National Nuclear Security Administration (NNSA) collaborative program among DOE's several national laboratories.


SPHOT is a two-dimensional photon transport code to track particles through a logically rectangular 2-D mesh. Monte Carlo transport solves the Boltzmann transport equation by directly mimicking the behavior of photons.

Sweep3D [38] solves a three-dimensional, time-independent, particle transport equation on an orthogonal mesh using a multidimensional wavefront algorithm for deterministic particle transport simulation.

IRS [39] is an implicit radiation solver code to solve radiation transport by the flux-limited diffusion approximation using an implicit matrix solution.

UMT is a three-dimensional, deterministic, multigroup, photon transport code for unstructured meshes solving the first-order form of the steady-state Boltzmann transport equation.

AZTEC [40] is a parallel iterative library for solving linear systems.

3.3 Task Scheduling

Scheduling is the process of assigning tasks to required resources. Efficient task scheduling is a critical part of automated management and is imperative to maximize application performance and system utilization. Scheduling can be broadly classified as static scheduling, wherein tasks are assigned resources prior to the start of application execution, and dynamic scheduling, wherein tasks are assigned resources dynamically while applications are executing. In general, dynamic scheduling often implies that tasks are migrated between resources while executing, based on the availability and performance of the various resources at hand. Static scheduling typically simplifies runtime management requirements but often leads to inefficient resource utilization. Dynamic scheduling can improve overall application execution times but requires a more complicated runtime management mechanism, especially in a heterogeneous environment. Another form of scheduling, which blends the two previous schemes, schedules tasks as and when they arrive (no fixed static schedules) but without any dynamic task migration capabilities. We choose this intermediate approach for our simulations to provide a balance between efficiency and complexity.


Task scheduling can be further categorized into divisible workload scheduling and independent task scheduling. In divisible workload scheduling, the workload can be divided into arbitrarily sized chunks for scheduling. This type of division of workload enables flexible scheduling and hence can be more efficient. However, in general, it may be difficult to divide a workload into chunks of arbitrary size. By contrast, independent task scheduling schedules a workload that has already been divided into independent tasks, and this is the scheme we adopt for our simulations. Our simulations only focus on resource-level scheduling (i.e., a task scheduled to a resource owns the resource until completed, at which time other tasks may use the resource).

3.3.1 Scope of the Scheduler

Jobs represented by a task-data dependency graph such as a directed acyclic graph (DAG) are submitted to a Job Manager (JM) component of the automated management service. Figure 3-1 shows DAGs for two sample jobs.

Figure 3-1. Task dependence using directed acyclic graphs for sample jobs

The JM provides the scheduler with the task details (i.e., performance requirements and the order of execution of tasks). The tasks are scheduled in the same order as they are represented in the DAG. If multiple tasks can be executed simultaneously (i.e., if the tasks are in the same level


on the DAG), then the tasks can be scheduled in a batch mode. The communication dependencies between the tasks (that might be executing simultaneously) are not considered in the performance model that we use to schedule the tasks.

3.3.2 Performance Model

Tasks are scheduled on a set of machines (we use the terms machines and resources interchangeably) based on the performance of the tasks on the machines. Generally, in any scheduler, the job of predicting task performance is based on a model which, when supplied with the parameters representing both tasks and machines, gives the expected performance of the tasks on the machines. Many such performance models exist for traditional HPC systems, but few for RC systems. The models that exist are too complex for simulative studies of scheduling heuristics [41]. Hence, in this section, we develop a performance model to predict the overall execution time of a task on a machine given the characteristics of the task and the machine. We use this performance model for our simulations. Eqs. 3.1 through 3.5 give the performance model that we developed for our system.

OET = T_HCOMM + T_HCOMP + T_BCOMM + T_BCOMP    (3.1)

OET: overall execution time
T_HCOMM, T_HCOMP: communication and computation time involving the host machine
T_BCOMM, T_BCOMP: communication and computation time involving the RC board

T_BCOMP = T_CONF + T_BOARD    (3.2)
T_BCOMM = T_BCF + S_DATA · N_TRANSFER    (3.3)
T_HCOMP = T_HCOMP_NO    (3.4)
T_HCOMM = T_HCF + T_EXE + T_HDATA    (3.5)

T_CONF: board configuration time
T_BOARD: board execution time
T_BCF: configuration-file transfer time to the board from the host
S_DATA: average size per data transfer between board and host
N_TRANSFER: average number of data transfers per execution
T_HCOMP_NO: computation time in the host that is non-overlapping with computation in the board


T_HCF: configuration-file transfer time to the host
T_EXE: host executable transfer time to the host
T_HDATA: I/O data transfer time to the host

In general, a portion of the task executes on the GPP machine, which hosts the FPGA resource and controls the FPGA portion of the task. We call applications with such tasks dual-paradigm applications, one paradigm being computing with the GPP and the other the FPGA-based RC. The overall execution time of a dual-paradigm application's task can be broadly divided into computation time and communication time corresponding to the host machine and the RC board. Communication time corresponding to the host includes the time to move the host executable, the configuration file for the FPGA, and the I/O data to the host from their respective source (remote) locations. Computation time corresponding to the host involves modeling the host and has been extensively researched by many conventional-computing researchers. For our simulations, we are interested only in the RC part of the tasks. Hence we ignore the communication part of the host by assuming that the executables and I/O data are available locally in the host, and the computation part of the host by assuming that the host part of execution overlaps with the RC part (and hence is hidden).

The communication time corresponding to the RC board includes the time to move the configuration file and I/O data from the host to the RC board. The computation time corresponding to the RC board includes the time to configure the FPGA and the actual execution time of the task. The communication time (corresponding to both the host and the RC board) involved in the overall execution time is critical only when the data transfer time is very large. Such cases are common in computational grids or other such distributed computing environments over long-haul networks. Our assumption to neglect the communication time


corresponding to the host is made because we are only interested in systems that are not distributed over wide-area networks in this research.

3.3.3 Scheduling Heuristics

In general, scheduling decisions are affected by the relations between the tasks scheduled, the heterogeneity of the resources on which the tasks are scheduled, and the performance model that is used to predict the performance of the tasks on the available resources. The quality of the scheduler is greatly impacted by the performance model. We discussed the performance model developed for our simulations previously. Given a set of independent tasks and a set of available resources, our goal is to minimize the overall execution time of the entire task set. Based on the model, we can predict the time it takes for each task to execute on each resource. Assuming there are N tasks to be scheduled and there are M machines available, an N x M matrix can be created with each element in the matrix representing the performance of a task on the corresponding machine in terms of a certain metric. The metric could be anything required by the user, such as execution time, memory used, etc. However, it is not always possible to assign the best machine to each task due to resource contention. So we use heuristics to identify the machine to which each task should be assigned.

In this dissertation, we compare the performance of six such heuristics for the scheduling of tasks [42] on parallel RC systems. The heuristics compared in this work include opportunistic load balancing (OLB), minimum execution time (MET), minimum completion time (MCT), switching algorithm (SA), MIN-MIN, and MAX-MIN. These heuristics have been studied widely before for general-purpose systems but not for RC systems. The following is a brief description of each heuristic.

MCT: Assign each task to the machine with minimum completion time (machine available time + estimated time of computation).


MET: Only consider the expected execution time of each task on the machines and select the machine that provides the minimum execution time for the task.

SA: Switch between MCT and MET based on load imbalance (if load imbalance increases above a threshold, revert back to MCT from MET). The ratio of the completion time when scheduled using MCT to that when scheduled using MET is used to switch between the two heuristics.

OLB: Assign each task to the next available machine without considering the expected execution time on that machine. The performance of the machine for the task being scheduled is not considered.

MIN-MIN: First, select the best MCT machine for each task. Second, from all tasks, schedule the one with minimum completion time (send a task to the machine which is available earliest and executes the task fastest).

MAX-MIN: Same first step as MIN-MIN, but schedule the task with maximum completion time (tasks with long completion times are scheduled first on the best available machines and executed in parallel with other tasks, hence better load balancing).

The first four heuristics (i.e., MCT, MET, SA, and OLB) are categorized as inline scheduling heuristics because they are used for scheduling individual tasks as they arrive. The last two heuristics (i.e., MIN-MIN and MAX-MIN) are categorized as batch scheduling heuristics, as they are used to schedule multiple tasks simultaneously. The batch heuristics are commonly used to schedule massively parallel applications with many parallel tasks. A small code sketch of three of these heuristics is given after the following paragraph.

3.4 Simulative Analysis

The characteristics of the applications used in our simulations are given in Table 3-1, based on the performance evaluation performed by Vetter et al. [35]. The tests in [35] were run on an IBM SP system, located at Lawrence Livermore National Laboratory and composed of 68 IBM RS/6000 NightHawk-2 16-way SMP nodes using 375 MHz IBM 64-bit POWER3-II CPUs [43]. The POWER3 provides unit-stride hardware prefetching, and hence applications with higher memory locality benefit more on this architecture. The computational intensity gives the ratio of operations to the number of memory accesses (loads and stores).
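To make the heuristics of Section 3.3.3 concrete, the fragment below sketches MCT, MET, and MIN-MIN over an N x M matrix of predicted execution times (an "expected time to compute" view of the performance model). It is our own illustration: the function names, the etc matrix, and the simplification that all tasks are ready at time zero are assumptions, not part of the job-management service described here.

```python
def mct_schedule(etc):
    """Minimum Completion Time: send each arriving task to the machine whose
    ready time plus expected execution time is smallest.
    etc[t][m] is the expected execution time of task t on machine m."""
    num_machines = len(etc[0])
    ready = [0.0] * num_machines
    assignment = []
    for task_times in etc:
        m = min(range(num_machines), key=lambda j: ready[j] + task_times[j])
        ready[m] += task_times[m]
        assignment.append(m)
    return assignment, max(ready)               # (machine per task, makespan)

def met_schedule(etc):
    """Minimum Execution Time: ignore machine load and pick the fastest machine."""
    num_machines = len(etc[0])
    ready = [0.0] * num_machines
    assignment = []
    for task_times in etc:
        m = min(range(num_machines), key=lambda j: task_times[j])
        ready[m] += task_times[m]
        assignment.append(m)
    return assignment, max(ready)

def min_min_schedule(etc):
    """MIN-MIN batch heuristic: repeatedly commit the unscheduled task whose
    best achievable completion time is smallest."""
    num_machines = len(etc[0])
    ready = [0.0] * num_machines
    pending = set(range(len(etc)))
    assignment = {}
    while pending:
        _, t, m = min((ready[m] + etc[t][m], t, m)
                      for t in pending for m in range(num_machines))
        ready[m] += etc[t][m]
        assignment[t] = m
        pending.remove(t)
    return assignment, max(ready)
```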


Based on our study of the characteristics of the RC systems and the speedups that they provide for applications in general, we have summarized the characteristics of the RC systems used in our simulations in Table 3-2. It should be noted that the base speedup given in the table is relative, not absolute, speedup. For example, if a task's speedup is 1.3 times on the RC1000 compared to a GPP machine, then the speedup for the same task is 1.8 times on the Tarari. The speedup numbers are merely representative and may vary widely depending on the specific algorithm, but these values represent a form of general behavior for each system.

Table 3-1. Characteristics of applications
Application   Execution Time (sec)   Computational Intensity   Memory Locality (Spatial)
sPPM          19.0                   1.45                      VERY HIGH
SMG2000       4.0                    0.08                      LOW
SPHOT         14.0                   1.70                      HIGH
Sweep3D       3.5                    0.75                      MEDIUM
IRS           135.0                  0.45                      VERY HIGH
UMT           350.0                  1.50                      HIGH
Aztec         14.0                   0.45                      VERY HIGH

Table 3-2. Characteristics of RC systems
Board/System   Estimated Base Speedup   Config-file Size (KB)   Bus Speed
RC1000         1.3                      1200                    PCI (64, 66)
TARARI         1.8                      500                     PCI (64, 66)
NALLATECH      2.5                      2600                    PCI (64, 33)
SGI            2.8                      2733                    NUMAlink
CRAY           3.0                      2377                    RapidArray

We know the performance of the tasks on a GPP from Table 3-1. For the performance model that we use for task scheduling in our simulations, we require the (predicted) execution time of the tasks on the RC systems. We calculate this execution time as

RCExecutionTime = (BaseExecutionTime x MemoryLocality) / (BaseSpeedup x ComputationalIntensity),

where BaseExecutionTime, MemoryLocality and ComputationalIntensity are provided in Table 3-1, while BaseSpeedup is


provided in Table 3-2. We assume that the speedup of a task is higher on an RC system when the computational intensity is higher. If the memory locality of a task is higher, it implies that the POWER3's unit-stride hardware prefetching has benefited the task; the task will not be able to take advantage of this facility on an RC system, and hence the speedup will decrease. The numerical values used for MemoryLocality in the calculation of RCExecutionTime are as follows: 0.99 for VERY HIGH, 0.94 for HIGH, 0.87 for MEDIUM, and 0.80 for LOW.

3.4.1 Simulation Setup

Scheduling is mainly influenced by the execution time of the tasks on the specific RC resource. Hence the other parameters in our performance model are kept constant. Configuration-file management is a research area in itself that tries to identify the best ways to store a configuration file (location of storage, replication of heavily accessed files, etc.) so as to optimally retrieve files and configure the FPGAs [44]. Analyzing these and many such tradeoffs in detail would be good research for the future. In this research, we restrict our scope to the analysis of the scheduling heuristics alone. We assume that the configuration files specific to any application task are available on the machine hosting the RC resource. Only the time to move the configuration file from the host machine to the FPGA and configure the FPGA is considered in our simulations. The following describes the four different simulation setups investigated:

Homogeneous tasks on homogeneous machines: All the machines in the cluster are the same. The tasks that arrive for scheduling are the same.

Heterogeneous tasks on homogeneous machines: All the machines in the cluster are the same. The tasks that arrive are a mix of the seven tasks listed earlier. To maintain complete heterogeneity in the simulation, the tasks are chosen in a round-robin fashion.

Homogeneous tasks on heterogeneous machines: The cluster is composed of a mix of the RC resources listed. In a heterogeneous setup with 12 machines in the system, there are 3 RC1000s and Tararis each, and 2 Nallatechs, SGIs and Crays each. The tasks that arrive to be scheduled are the same.

PAGE 70

3.4.2 Simulation Results on Large-Scale Systems

In this section, we study the simulation results on large-scale systems typically found in ground-based HPC installations. The task arrival at the scheduler is modeled as a Poisson process. The total time for which tasks arrive is fixed at 5000 seconds. The mean inter-arrival time is fixed at 5 sec, 10 sec, and 20 sec to simulate the arrival of 1000 tasks, 500 tasks and 250 tasks, respectively. The number of machines in the system is fixed at 12 for all simulations. The average makespan (i.e., the time taken to complete all the tasks that arrive) and the average sharing penalty (i.e., the difference between the completion time of a task in the present simulation setup and its completion time when executed alone, without any other task in the system) are measured. However, only makespan is shown in this dissertation. The results are an average of five trials.
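For reference, the selection rules of the three inline heuristics compared in the results that follow can be written in a few lines. This is a minimal sketch assuming each arriving task is mapped immediately using per-machine predicted execution times and machine ready times; the function and variable names are ours and are not taken from the simulator source. (SA, the fourth inline heuristic, switches between MCT and MET depending on load.)

```c
#include <stddef.h>

/* exec_time[m]  : predicted execution time of the arriving task on machine m
 * ready_time[m] : time at which machine m finishes its already-assigned work */

/* MET: pick the machine with the smallest execution time, ignoring load. */
static size_t pick_met(const double *exec_time, size_t n)
{
    size_t best = 0;
    for (size_t m = 1; m < n; m++)
        if (exec_time[m] < exec_time[best]) best = m;
    return best;
}

/* OLB: pick the machine that becomes available earliest, ignoring speed. */
static size_t pick_olb(const double *ready_time, size_t n)
{
    size_t best = 0;
    for (size_t m = 1; m < n; m++)
        if (ready_time[m] < ready_time[best]) best = m;
    return best;
}

/* MCT: pick the machine giving the earliest completion time. */
static size_t pick_mct(const double *exec_time, const double *ready_time,
                       size_t n)
{
    size_t best = 0;
    for (size_t m = 1; m < n; m++)
        if (ready_time[m] + exec_time[m] < ready_time[best] + exec_time[best])
            best = m;
    return best;
}
```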

PAGE 71

Figure 3-2 shows the performance of inline scheduling heuristics when homogeneous tasks (all SPPM in this case) are scheduled on a homogeneous system. There is no difference in the makespan of the tasks for MCT and OLB as long as the tasks and machines are homogeneous. Since the tasks execute very fast, the makespan is driven only by the arrival rate of the tasks. Since there are 12 machines in the system, it will take at least 60 seconds on average, with the fastest arrival rate (mean of 5 sec), for a machine to be scheduled with the next task. But the SPPM task does not execute for 60 sec on any of the machines. Hence the difference in speedup between the machines does not affect the makespan of the tasks across the machines for MCT and OLB. However, for the MET heuristic, all the tasks can potentially be scheduled on the first machine, since MET schedules a task on the machine that provides the least execution time for the task. With a homogeneous system, each machine provides the same execution time and hence the first machine is always picked; hence the large performance difference between MET on one hand and MCT and OLB on the other. As machine speed increases, the difference between MET and the other two heuristics decreases.

[Figure 3-2 panels: SPPM on RC1000, Tarari, Nallatech, SGI, and Cray; average makespan (seconds) versus number of tasks (250, 500, 1000) for MCT, MET, OLB, and SA.]

Figure 3-2. Scheduling of SPPM on various RC systems

Figure 3-3 shows the performance of inline scheduling heuristics for the same setup as Figure 3-2 but for UMT, which has a longer execution time compared to SPPM. Both IRS and UMT have long execution times, but we show only the results for UMT here due to space restrictions. It can be seen from Figure 3-3 that for tasks with longer execution times, the average makespan is driven by the actual execution times and not by the arrival rate of tasks.

PAGE 72

Makespan increases with the number of tasks and decreases with faster machines in the system. Again, as the tasks and machines are homogeneous, MCT and OLB perform the same. MET performs poorly compared to MCT and OLB because, with MET, there is a possibility of only a few machines that perform well being loaded with tasks. With MCT and OLB, the tasks are well distributed among the machines in general.

[Figure 3-3 panels: UMT on RC1000, Tarari, Nallatech, SGI, and Cray; average makespan (seconds) versus number of tasks for MCT, MET, OLB, and SA.]

Figure 3-3. Scheduling of UMT on various RC systems

Figure 3-4 shows the performance of inline scheduling heuristics when heterogeneous tasks are scheduled on a system with homogeneous machines. Since the machines are homogeneous, MCT and OLB perform the same. The makespan for MCT and OLB does not

PAGE 73

differ between machines for the same reason previously stated (i.e., not enough load on the machines).

[Figure 3-4 panels: heterogeneous tasks on homogeneous RC1000, Tarari, Nallatech, SGI, and Cray machines; average makespan (seconds) versus number of tasks for MCT, MET, OLB, and SA.]

Figure 3-4. Scheduling of heterogeneous tasks on various RC systems

Figure 3-5 shows the performance of inline scheduling heuristics when homogeneous tasks are scheduled on a system with heterogeneous machines. There is no difference in performance between MCT and OLB for tasks that do not have long execution times. However, when the machines are highly loaded, as in the case of 1000 tasks of UMT and IRS, MCT performs slightly better than OLB. Thus we can say that for the difference in performance between MCT and OLB to be realized, the machine workload must be high. In our opinion, such heavy loading

PAGE 74

of machines is difficult to realize in practice, as it would require a large number of long-running tasks.

[Figure 3-5 panels: SPPM, SMG, SPHOT, SWEEP3D, IRS, and UMT on heterogeneous machines; average makespan (seconds) versus number of tasks for MCT, MET, OLB, and SA.]

Figure 3-5. Scheduling of various homogeneous tasks on a system with heterogeneous RC machines

Figure 3-6 shows the performance of inline scheduling heuristics when heterogeneous tasks are scheduled on a system with heterogeneous machines. There is no difference in performance between MCT and OLB. In order to find a performance difference between MCT and OLB, we also ran simulations for a system with 6 machines. Even for such a setup (results not shown in figure), the system load was not high enough to widely differentiate the

PAGE 75

performance between MCT and OLB (e.g., the makespan for 1000 tasks was 7186 sec for MCT and 7301 sec for OLB).

[Figure 3-6: heterogeneous tasks on heterogeneous machines; average makespan (seconds) versus number of tasks for MCT, MET, OLB, and SA.]

Figure 3-6. Scheduling of heterogeneous tasks on a system with heterogeneous RC machines

[Figure 3-7 panels: heterogeneous tasks on homogeneous RC1000, Tarari, Nallatech, SRC, and Cray machines; average makespan (seconds) versus number of tasks for MAX-MIN and MIN-MIN.]

Figure 3-7. Batch scheduling of various heterogeneous tasks on a system with homogeneous RC machines
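Figures 3-7 and 3-8 report results for the batch heuristics discussed next, which differ from the inline heuristics in that they consider a whole batch of queued tasks at once. For reference, a minimal sketch of MIN-MIN is given below; MAX-MIN is the same loop with the opposite selection in the inner step. The arrays, constants, and names are our own illustration and are not taken from the simulator source.

```c
#include <stdbool.h>

#define NTASKS    4
#define NMACHINES 3

/* MIN-MIN batch heuristic.  exec[t][m] is the predicted execution time of
 * task t on machine m (from the performance model above); ready[m] is the
 * current ready time of machine m.  MAX-MIN differs only in that, in each
 * round, the unscheduled task whose best completion time is LARGEST is
 * mapped first. */
void min_min(const double exec[NTASKS][NMACHINES],
             double ready[NMACHINES], int assignment[NTASKS])
{
    bool done[NTASKS] = { false };

    for (int round = 0; round < NTASKS; round++) {
        int best_t = -1, best_m = -1;
        double best_ct = 0.0;

        /* For every unscheduled task, find its minimum completion time,
         * then keep the task/machine pair with the smallest such value. */
        for (int t = 0; t < NTASKS; t++) {
            if (done[t]) continue;
            int m_min = 0;
            for (int m = 1; m < NMACHINES; m++)
                if (ready[m] + exec[t][m] < ready[m_min] + exec[t][m_min])
                    m_min = m;
            double ct = ready[m_min] + exec[t][m_min];
            if (best_t < 0 || ct < best_ct) {   /* MAX-MIN: ct > best_ct */
                best_t = t; best_m = m_min; best_ct = ct;
            }
        }

        assignment[best_t] = best_m;   /* map the task and update the load */
        ready[best_m]      = best_ct;
        done[best_t]       = true;
    }
}
```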

PAGE 76

The results shown until now were for inline scheduling heuristics. Figure 3-7 shows the performance of batch scheduling heuristics for heterogeneous tasks on a system with homogeneous machines. The results for setups in which homogeneous tasks are scheduled are not shown, because the MAX-MIN and MIN-MIN heuristics perform the same as long as the tasks are homogeneous. The results show that MAX-MIN performs better than MIN-MIN in terms of the average makespan. MAX-MIN also improves with the number of tasks being scheduled.

For heterogeneous tasks on a system with heterogeneous RC machines (shown in Figure 3-8), both batch scheduling heuristics show the same behavior seen in Figure 3-7. MAX-MIN performs better than MIN-MIN in all trials.

[Figure 3-8: heterogeneous tasks on heterogeneous machines; average makespan (seconds) versus number of tasks for MAX-MIN and MIN-MIN.]

Figure 3-8. Batch scheduling of heterogeneous tasks on a system with heterogeneous RC machines

3.4.3 Simulation Results on Small-Scale Systems

In the previous section, we studied the performance of the heuristics on relatively large systems executing on the order of several hundred tasks. However, space systems may be smaller, both in the size of the system and in the number of tasks executing on them. Hence, the performance of the heuristics is briefly studied for small-scale systems in this section to compare with the results presented for large-scale systems in the previous section.

The task arrival at the scheduler is again modeled as a Poisson process. The total time for which the tasks arrive is fixed at 200 seconds. The mean inter-arrival time is

PAGE 77

fixed at 5 sec, 10 sec, and 20 sec to simulate the arrival of 40 tasks, 20 tasks and 10 tasks, respectively. The number of machines in the system is fixed at 3 for all simulations, comprising an RC1000, a Tarari and an SRC. The average makespan (i.e., the time taken to complete all the tasks that arrive) is measured for the same simulation setups described in the previous section, namely homogeneous tasks on homogeneous machines, heterogeneous tasks on homogeneous machines, homogeneous tasks on heterogeneous machines, and heterogeneous tasks on heterogeneous machines.

Figure 3-9 shows the performance of inline scheduling heuristics when homogeneous tasks are scheduled on a homogeneous system. The performance of the heuristics in the small-scale systems was similar to that in large-scale systems, and hence only two cases, namely SPPM on RC1000 and UMT on SGI, are shown here. There is no difference in the makespan of the tasks for MCT and OLB as long as the tasks and machines are homogeneous. Since the tasks execute very fast, the makespan is driven only by the arrival rate of the tasks for the case of SPPM. However, for UMT with longer execution times, makespan increases with the number of tasks and decreases with faster machines in the system.

[Figure 3-9 panels: SPPM on RC1000 and UMT on SGI; average makespan (seconds) versus number of tasks (10, 20, 40) for MCT, MET, OLB, and SA.]

Figure 3-9. Scheduling of homogeneous tasks on homogeneous systems

Figure 3-10 shows the performance of inline scheduling heuristics when heterogeneous tasks are scheduled on a system with homogeneous machines. Since the machines are homogeneous, MCT and OLB perform the same. The makespan for MCT and OLB does not

PAGE 78

differ much between machines for the same reason previously stated (i.e., not enough load on the machines).

[Figure 3-10 panels: heterogeneous tasks on homogeneous RC1000 and SGI machines; average makespan (seconds) versus number of tasks for MCT, MET, OLB, and SA.]

Figure 3-10. Scheduling of heterogeneous tasks on various RC systems

Figure 3-11 shows the performance of inline scheduling heuristics when homogeneous tasks, namely SMG and IRS, are scheduled on a system with heterogeneous machines. The trend is similar to that in large-scale systems. There is no difference in performance between MCT and OLB since the machines are not overloaded with tasks.

[Figure 3-11 panels: SMG and IRS on heterogeneous machines; average makespan (seconds) versus number of tasks for MCT, MET, OLB, and SA.]

Figure 3-11. Scheduling of homogeneous tasks on heterogeneous systems

The performance of the heuristics was also studied in a heterogeneous system when heterogeneous tasks were scheduled, and the trend was similar to that seen in large-scale systems; hence those results are not shown here. In summary, the heuristics perform similarly in both large-scale and small-scale systems.

PAGE 79

3.5 Conclusions and Future Research

Recently, systems augmented with FPGAs, offering a fusion of traditional parallel and distributed machines with customizable and dynamically reconfigurable hardware, have emerged as a cost-effective alternative to traditional systems. However, providing the robust runtime environment for such systems to which HPC users have become accustomed has been fraught with numerous challenges. Dynamic scheduling of HPC applications in such parallel reconfigurable computing (RC) environments is one such challenge and, to our knowledge, has not been sufficiently studied.

In this phase of research, we aim to improve the overall execution time of applications by analyzing methodologies for effective task scheduling. Reducing the overall execution time of applications exposes them to fewer faults in the system. We analyzed the performance of several scheduling heuristics that can be used to schedule application tasks on HPC systems. Typical HPC applications and FPGA/RC resources were used as representative cases for our simulations. A performance model was also developed to predict the overall execution time of tasks on RC resources. The model was used by our scheduling heuristics to schedule the tasks. The performance model and the extensive simulative analysis of scheduling heuristics for FPGA-based RC applications and platforms are the first of their kind and constitute the primary research contribution of this phase.

Among the inline scheduling heuristics, MCT and OLB have similar performance unless the system is nearly fully utilized. When the system is very heavily loaded, MCT outperforms OLB; however, such high loads may not be realistic. In other cases, the performance of OLB equals that of MCT. MET performs poorly in all trials. SA, which switches between MCT and MET, performs better than MET but not as well as MCT and OLB. Among the batch scheduling heuristics, MAX-MIN outperforms MIN-MIN in terms of average makespan. Even

PAGE 80

with a small 12-machine system, the load on the system was low. With larger systems, the impact of light loading will further diminish the performance difference between the scheduling heuristics. In a smaller system with 3 machines, typical of what would be available in space systems, the performance trend was similar to that of the 12-machine system. Also, space systems are not going to be continuously and heavily loaded with tasks, and hence a simple scheduling heuristic such as OLB might suffice for scheduling tasks.

The intent of this phase of research is to analyze several scheduling heuristics and compare their performance to identify suitable candidates for use in space-based HPC systems. We have shown the performance results for average makespan. Based on the performance metric that is critical for a given environment, the corresponding scheduling heuristics can be used to schedule tasks. In the future, the knowledge gained so far in suggesting suitable scheduling heuristics for an experimental scheduler can be applied in the job management service intended for the space-based HPC system being developed at the HCS Research Laboratory at the University of Florida. Another interesting direction for future research would be to use typical parallel RC applications to study the validity of the simulative analysis by comparing it with the effectiveness of task scheduling (in terms of makespan) of the job management service.

PAGE 81

CHAPTER 4
A FAULT-TOLERANT MESSAGE PASSING INTERFACE (PHASE III)

Fault tolerance is a critical factor for HPC systems in space to meet the emerging high-availability and reliability requirements. Recovery from failure needs to be fast and automatic, while the impact of failures on the system as a whole should be minimal. In the previous phases of this dissertation, we addressed the issue of minimizing the impact of failures through indirect approaches, mechanisms that do not address direct recovery from faults. The indirect approaches certainly avoid computation loss, but in order to enable applications to meet high-availability and high-reliability requirements we need to consider other options. Some of these options include incorporating fault-tolerant features directly into the applications, developing specialized hardware that is fault-tolerant, making use of and enhancing the fault-tolerant features of the operating system, and developing application-independent middleware that provides fault-tolerant capabilities. Among these options, application-independent middleware is the least intrusive on the system and can support any general application, including legacy applications, that falls under the umbrella of the corresponding middleware model.

In this phase of the dissertation, we investigate, design, develop and evaluate a fault-tolerant, application-independent middleware for embedded cluster computing known as FEMPI (Fault-tolerant Embedded Message Passing Interface). We also present performance results with FEMPI on a traditional PC-based cluster and on a COTS-based, embedded cluster system prototype for satellite payload processing in development at Honeywell Inc. and the University of Florida for the Space Technology 8 (ST-8) mission of NASA's New Millennium Program. We take a direct approach to providing fault tolerance and improving the availability of the HPC system in space. FEMPI is a lightweight, fault-tolerant variant of the Message Passing Interface (MPI) standard.

PAGE 82

4.1 Introduction

Because of its widespread usage, MPI [45] has emerged as the de facto standard for development and execution of high-performance parallel applications. By its nature as a communication library facilitating user-level communication among a group of processes, the MPI library needs to maintain global awareness of the processes that collectively constitute a parallel application. An MPI library consequently emerges as a logical and suitable place to incorporate selected fault-tolerant features in order to enable legacy and new applications to meet the emerging high-availability and reliability requirements of HPC systems in space. However, fault tolerance is absent in both the MPI-1 and MPI-2 standards. To the best of our knowledge, no satisfactory products or research results offer an effective path to providing scalable computing applications with effective fault tolerance for cluster systems, and specifically for resource-constrained embedded systems such as those in space. There have been a few efforts to develop lightweight MPI [46], and more specifically MPI for embedded systems [47, 48], but without fault tolerance.

In this dissertation, we present the design and analyze the characteristics of FEMPI, a new lightweight, fault-tolerant message-passing middleware for clusters of embedded systems. The scope of this chapter is focused upon a presentation of the design of FEMPI and performance results with it on a COTS-based, embedded cluster system prototype for space. Considering the small scale of the embedded cluster, we also provide results from experiments on an Intel Xeon cluster to show the scalability of FEMPI. The experiments on the Xeon cluster highlight the compatibility of FEMPI across platforms and also permit performance comparisons with conventional MPI middleware. Finally, we also study the performance of FEMPI in comparison to conventional MPI variants with a real application.

Performance and fault tolerance are in general competing goals in HPC. Achieving a sufficient level of fault tolerance for a large set of practical environments with minimal or no

PAGE 83

83 impact on performance is a significant challenge. Providing fault-tolerant services in software inevitably adds extra processing overhead and increas es system resource usage. In this phase of the research, we focus on developing a messa ge-passing middleware architecture that can successfully address both fault tolerance and perf ormance and if conflicting then fault tolerance is given priority. We address fault-tolerance i ssues in both the MPI-1 an d MPI-2 standards, with attention to key application cl asses, fault-free overhead, and recovery strategies. When developed, the architecture would provide ke y new capabilities to parallel programs, programmers, and cluster systems, includi ng the enhancement of existing commercial applications based on MPI. The optimizations ta rgeted at the popular recovery mechanisms, for key classes of applications, can be applied to any middlew are and hence would result in improving the performance of applications in general. The rest of this chapter is organized as follows. Section 4.2 provides background on several existing fault-tolerant MPI implementations for traditional general-purpose HPC systems. Section 4.3 discusses the architect ure and design of FEMPI. Failu re-free performance results are presented in Section 4.4 while failure recovery perf ormance is analyzed in Section 4.5. Performance of FEMPI when used in a real appli cation is discussed in Section 4.6. Section 4.7 concludes the paper and summarizes insight and directions for future research. The final section provides the summary and research scope. 4.2 Background and Related Research In this section, we provide an overview of MPI and its inherent limits relative to fault tolerance. Also included is a brief survey and summary of existing t ools with features for bringing fault tolerance to MPI, albeit primarily targeting conv entional, resource-plentiful HPC systems and their applications instead of em bedded, mission-critical systems that are the emphasis of our work.

PAGE 84

4.2.1 Limitations of MPI Standard

The MPI Forum released the first MPI standard, MPI-1, in 1995, with drafts provided over the previous 18 months. The main goals of the standard in this release were high performance and portability. Achieving reliability typically requires additional resources and methodologies; this additional utilization conflicts with the main goal of high performance and explains the MPI Forum's decision to provide only limited reliability measures. Emphasis on high performance thus led to a static process model with limited error handling. The success of an MPI application is guaranteed only when all constituent processes finish successfully. Failure or crash of one or more processes leads to a default application termination procedure, which is in fact required standard behavior. Consequently, current designs and implementations of MPI suffer inadequacies in various aspects of providing reliability.

Fault Model: MPI assumes a reliable communication layer. The standard does not provide methods to deal with node failures, process failures, and lost messages. MPI also limits the faults recognized in the system to incorrect parameters in function calls and resource errors. This coverage of faults is incomplete and insufficient for large-scale parallel systems and mission-critical systems that are affected by transient faults such as SEUs.

Fault Detection: Fault detection is not defined by MPI. The default mode of operation of MPI treats all errors as fatal and terminates the entire parallel job. MPI provides for limited fault notification in the form of return codes from the MPI functions. However, critical faults, such as process crashes, may preempt functions from returning these return codes to the caller. The ability to continue execution after the return of certain error codes is completely ambiguous and left as an implementation-specific property, and hence not portable.

Fault Recovery: MPI provides users with functions to register error-handling callback functions. These callback functions are invoked by the MPI implementation in the event of an error in MPI functions. Callback functions are registered on a per-communicator basis (a communicator is a communication context defined in MPI for groups of processes) and per-function error handlers are not allowed. Callback functions provide limited capability and flexibility and cannot be invoked in case of process crashes and hangs.
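As a concrete illustration of this limited mechanism, the short example below registers a per-communicator error handler using the standard MPI calls (MPI-2 names; MPI-1 used MPI_Errhandler_create and MPI_Errhandler_set). The handler body is our own; the point is only that such callbacks fire when the library itself reports an error, and cannot be invoked when a peer process crashes or hangs.

```c
#include <mpi.h>
#include <stdio.h>

/* Called by the MPI library when an error is reported on the communicator. */
static void report_error(MPI_Comm *comm, int *errcode, ...)
{
    char msg[MPI_MAX_ERROR_STRING];
    int len;
    (void)comm;
    MPI_Error_string(*errcode, msg, &len);
    fprintf(stderr, "MPI error reported: %s\n", msg);
    /* Application-specific cleanup could go here; many errors remain
     * unrecoverable because the state of MPI after an error is undefined. */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Default behavior is MPI_ERRORS_ARE_FATAL: any error aborts the job.
     * Attach a callback to MPI_COMM_WORLD instead. */
    MPI_Errhandler eh;
    MPI_Comm_create_errhandler(report_error, &eh);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);

    /* ... application code: errors in MPI calls on MPI_COMM_WORLD now
     * invoke report_error() instead of terminating the job ... */

    MPI_Errhandler_free(&eh);
    MPI_Finalize();
    return 0;
}
```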

PAGE 85

The MPI Forum released the MPI-2 standard in 1998, after a multi-year standards process. MPI-2 consists of extensions in the areas of process creation and management, one-sided communication, extended collective operations, and parallel I/O. A significant contribution of MPI-2 is Dynamic Process Management (DPM), which allows user programs to create and terminate additional groups of processes on demand. DPM may be used to compensate for the loss of a process, while MPI I/O can be used for checkpointing the state of the applications. However, the lack of failure detection precludes the potential for added reliability.
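To make the DPM idea concrete, the fragment below shows how a surviving group of processes could re-spawn one replacement worker with MPI_Comm_spawn. The helper name and the "worker" executable are placeholders of our own; error handling is omitted for brevity.

```c
#include <mpi.h>

/* Re-spawn one replacement worker process from the group of survivors.
 * Returns the MPI error code from MPI_Comm_spawn; the new processes are
 * reachable through the returned inter-communicator. */
int respawn_worker(MPI_Comm survivors, MPI_Comm *intercomm)
{
    char cmd[] = "worker";        /* hypothetical worker executable */
    int  errcode;

    return MPI_Comm_spawn(cmd,
                          MPI_ARGV_NULL,
                          1,                  /* spawn one replacement    */
                          MPI_INFO_NULL,
                          0,                  /* root rank in 'survivors' */
                          survivors,
                          intercomm,          /* new inter-communicator   */
                          &errcode);
}
```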

PAGE 86

4.2.2 Fault-Tolerant MPI Implementations for Traditional Clusters

Several research efforts have been undertaken to make MPI more reliable for traditional HPC systems. This section introduces some of these efforts and analyzes their approaches to providing a reliable MPI middleware.

CoCheck [49], developed in Munich, Germany, is a checkpointing environment for parallel applications and one of the earliest efforts to make MPI more reliable. CoCheck was primarily targeted at process migration, load balancing, and stalling long-running applications for later resumption. CoCheck extends single-process checkpoint mechanisms to a distributed message-passing application. Unlike most checkpointing middleware, CoCheck sits on top of the message-passing system and provides checkpointing transparently to the application. CoCheck incurs a large overhead by checkpointing the entire process state, and it requires a centralized coordinator. Recovery of a dead process is achieved by a recovery function run at the user level. The status of inconsistent internal data structures in the message-passing middleware is not addressed. Thus, CoCheck provides only coarse reliability measures for MPI.

MPICH-V [50] from the University of South Paris, France, is an MPI environment that is based upon uncoordinated checkpointing/rollback and distributed message logging. The architecture, as shown in Figure 4-1, relies on channel memories and checkpoint servers. It is assumed that on a failure a node is no longer reachable and that the computations by the failed node will have no impact on the eventual results. Channel memories are special nodes acting as middlemen through which all communications pass and hence are logged. The dispatcher is a coordinating node that schedules tasks to computing nodes and coordinates resources as well. On a failure, the coordinator detects the failure and, with the help of the channel memories and checkpoint schedulers, the task is rescheduled on fault-free nodes. Although MPICH-V suffers from single points of failure, it is suitable for master/worker types of MPI applications.

[Figure 4-1 components: MPICH-V1 and MPICH-V2 architectures showing channel memories, event logger, coordinator/dispatcher, checkpoint servers, checkpoint scheduler, computing nodes, and the network.]

Figure 4-1. MPICH-V architecture (Courtesy: [50])

Starfish [51] from the Technion, Israel, is an environment for executing dynamic MPI-2 programs. Figure 4-2 illustrates the architecture of Starfish. Every node executes a Starfish daemon, and many such daemons form a process group using the Ensemble group communication toolkit [52].

PAGE 87

87 Figure 4-2. Starfish arch itecture (Courtesy: [51]) The daemons are responsible for interacti ng with clients, spawning MPI programs and tracking and recovering from failures along with group membership. Starfish uses an event model that requires the processes and components to register to listen on events. The event bus that provides a fast data path supplies messages to reflect cluster change s and process failures. Each application process includes a group comm unication handler module to interact with the daemon, an application module th at includes the user supplied MPI code, a checkpoint/restart module, an MPI module and a virtual network inte rface (VNI). The architecture allows for any checkpoint/restart protocol implementation. Li kewise, the architecture can be ported to any network by providing a thin layer inside the VNI pertaining to the pa rticular network. Egida [53] from the University of Texas, Au stin, is an extensib le toolkit to support transparent rollback recovery. Egida as shown in Figure 4-3 is built ar ound a library of objects that implement a set of functionalities that are th e core of all log-based rollback recovery. Any arbitrary rollback protocol can be specified and Egida can synthesize an implementation. Egida also allows for the coexistence of multiple im plementations. Egida has been ported to MPICH [54], a widely used and free implementation of MP I. Egida shares some of its drawbacks with

PAGE 88

88 CoCheck. Egida checkpoints the state of both processes and messages and may lead to large overheads in some cases. Figure 4-3. Egida archit ecture (Courtesy: [53]) Implicit FT-MPI [55] from the University of Cyprus takes an approach similar to CoCheck but is a more stripped-down version. Implicit FT -MPI targets only the mast er/slave. A separate observer process to which the master node sends a ll the messages is responsible for coordination of all the processes and also re cords the status of all the pro cesses. The observer is also responsible for creation of new pr ocesses on failure of slave nodes and, if the master node fails, the observer takes up the role of the master itself. Implicit FT-MPI is very simple in terms of features available and suffers from single point of failures.

PAGE 89

89 LAM/MPI [56], an implementation of MP I from Indiana University and Ohio Supercomputing Center also has some fault-tolerant features built into it. Special functions such as lamgrow and lamshrink are used for dynamic addition and deletion of hosts. A fail-stop model is assumed and, whenever a node fails, it is detected as dead and the resource manager removes the node from the host lists. All the surviving hosts are notif ied asynchronously and the MPI library invalidates all the communicators that include the dead node. Pending communication requests are marked as errors. Si nce attempts to use invalid communicators raise errors, applications can detect these erro rs and free the invalid communicators and new communicators can be creat ed if necessary. FT-MPI [57] from University of Tennessee, Kn oxville, attempts to provide fault-tolerance in MPI by extending the MPI process states and communicator states from the simple {valid, invalid} as specified by the standard to a larg er number of states. A communicator is an important scoping and addressing data structure defined in the MPI standard that defines a communication context, and a set of processes in the context. Th e range of communicator states specified by FT-MPI helps the application with the ability to decide how to alter the communicator, its state and the be havior of the communication be tween intermediate states on occurrence of a failure. In case of a failure, FT-MPI returns a handle back to the application. MPI communicator functions specified in the MPI2 standard are used to shrink, expand or rebuild communicators. FT-MPI provides for gr aceful degradation of applications but has no support for transparent recovery from faults. Th e design concept of FEMPI is also largely based upon FT-MPI, although there are signi ficant differences given that FEMPI is designed to target embedded, resource-limited, and mission-critical systems where faults are more commonplace, such as payload processing in space.

PAGE 90

90 4.3 Design of FEMPI As described in the previous section, severa l efforts have been made to develop faulttolerant MPI implementation. However, all the designs described target conventional large-scale HPC systems with sufficient system resources to cover the additional overhead for fault tolerance. More importantly, very few among t hose designs have been successfully implemented and are mature enough for practical use. Also the designs are quite heavyweight primarily because fault tolerance is based on extensive me ssage logging and checkpointing. Many of the designs are also based on centralized coordinato rs that can incur severe overhead and become a bottleneck in the system. An HPC system in sp ace requires a reliable an d lightweight design of fault-tolerant MPI. Among those that exist, FT-MPI from University of Tennessee [57] was viewed to be the one with the least overhead and is the most mature design in terms of realization. However, FT-MPI is built atop a me tacomputing system called Harness that can be too heavy for embedded systems to handle. For the design of FEMPI, we try to avoid de sign and performance pitf alls of existing HPC tools for MPI but leverage useful ideas from these tools. FEMPI is a lightweight design in that it does not depend upon any message logging, speciali zed checkpointing, cent ralized coordination or other large middleware systems. Our desi gn of FEMPI resembles FT-MPI. However, the recovery mechanism in FEMPI is different in that it is completely distributed and does not require any reconstruction of communicators. On the ot her hand, FT-MPI requires the reconstruction of communicators on a failure for which the system enters an election state and a voting-based leader selection is performed. Th e leader is responsible for distributing the new communicator to all th e nodes in the system.

PAGE 91

4.3.1 FEMPI Architecture

Fault tolerance is provided through three stages: detection of a fault, notification of the fault, and recovery from the fault. In order to reduce development time and to ensure reliability, FEMPI is built atop a commercial high-availability (HA) middleware called SelfReliant (SR) from GoAhead Inc., whose services are used to provide detection and notification capabilities. The primary functions of the HA middleware are resource monitoring, fault detection, fault diagnosis, fault recovery, fault reporting, cluster configuration, event logging, and distributed messaging. SR is based on a small, reliable, cross-platform kernel that provides the foundation for all standard services and its extensions. The kernel also provides a portability layer limiting user dependencies on the underlying operating system and hardware.

SR allows processes to heartbeat through certain fault handlers and hence has the potential to detect the failure of processes and nodes, enabled by the Availability and Cluster Management Services (AMS and CMS) shown in Figure 4-4. CMS manages the physical nodes or instances of SR, while AMS manages the logical representation of these and other resources in the availability system model. The fault notification service for application processes is developed as an extension to SR. Application heartbeats are managed locally within each node by the service, and only health-state changes are reported externally by a lightweight watchdog process. In the system, the watchdog processes are managed in a hierarchical manner by a lead watchdog process executing on a controller node. State transition notifications from any node may be observed by agents executing on other nodes by subscribing to the appropriate multicast group within the reliable messaging service. Also, SR is responsible for discovering, incorporating, and monitoring the nodes within the cluster along with their associated network interfaces. SR also guarantees reliable communication, via network-interface failover capabilities and in-order delivery of messages between the nodes in the system, through its Distributed Messaging Service

PAGE 92

(DMS). It should be mentioned here that a lightweight version of SR, with just enough services for cluster management, is used for our prototype. Moreover, FEMPI only uses minimal services of SR, such as failure detection and DMS. Hence, FEMPI with SR places less stress on the embedded cluster than FT-MPI with Harness, which is intended for large clusters.

AMS: Availability Management Service; CMS: Cluster Management Service

Figure 4-4. Architecture of FEMPI

Figure 4-4 shows the architecture of FEMPI. The application is required to register with the Control Agent in each node, which in turn is responsible for updating the health status of the application in that node to a central Control Process. The Control Process is similar to a system manager, scheduling applications to the various data processing nodes. In the ST-8 mission, the Control Process will execute on a radiation-hardened single-board computer that is also responsible for interacting with the main controller for the entire spacecraft. Although such system controllers are highly reliable components, they can be deployed in a redundant fashion, with cold or hot sparing, for highly critical or long-term missions.

With regard to MPI applications, failures can be broadly classified as process failures (individual processes of an MPI application crash) and network failures (communication failure

PAGE 93

between two MPI processes). FEMPI ensures reliable communication (reducing the chances of network failures) by carrying all low-level communication over DMS. A process failure in conventional MPI implementations causes the entire application to crash, whereas FEMPI avoids application-wide crashes when individual processes fail. MPI Restore, a component of FEMPI, resides on the System Controller and communicates with the Control Process to update the status of nodes. On initialization by the application using the MPI_Init function call, the FEMPI runtime environment attached to the application process subscribes itself to a Health Status channel. The same procedure is performed for all the MPI processes of the application executing on all the nodes. It is through this channel that updates about the status of nodes are received from MPI Restore. On a failure, MPI Restore notifies all other MPI processes regarding the failure via DMS. The status of senders and receivers (of messages) is checked in FEMPI before communication, to avoid trying to establish communication with failed processes. If the communication partner (sender or receiver) fails after the status check and before communication, then a timeout-based recovery is used to recover out of the MPI function call.
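The sketch below illustrates this pre-communication status check and timeout-based recovery. It is a hypothetical illustration: all function names, the peer-state table, and the error codes are our own and are not taken from the FEMPI source; the peer-state table is assumed to be kept current by the failure notifications that MPI Restore publishes over DMS.

```c
#include <mpi.h>

enum peer_state { PEER_HEALTHY, PEER_FAILED };

/* Assumed helpers (not real FEMPI APIs): a lookup into the locally cached
 * peer-state table, and a DMS-based transfer with a bounded timeout. */
extern enum peer_state fempi_peer_state(int rank);
extern int fempi_dms_send(const void *buf, int count, MPI_Datatype type,
                          int dest, int tag, double timeout_sec);

#define FEMPI_ERR_PEER_FAILED  (-1)
#define FEMPI_ERR_TIMEOUT      (-2)

int fempi_send_sketch(const void *buf, int count, MPI_Datatype type,
                      int dest, int tag)
{
    /* Check the receiver's health first, so we never attempt to talk to a
     * process already known to have failed. */
    if (fempi_peer_state(dest) == PEER_FAILED)
        return FEMPI_ERR_PEER_FAILED;

    /* If the peer fails between the check and the transfer, the bounded
     * timeout lets the call return instead of blocking forever. */
    if (fempi_dms_send(buf, count, type, dest, tag, 10.0) != 0)
        return FEMPI_ERR_TIMEOUT;

    return MPI_SUCCESS;
}
```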

PAGE 94

With the initial design of FEMPI as described in this dissertation, we focus only on selected baseline functions of the MPI standard. The first version of FEMPI includes the 19 baseline functions shown in Table 4-1, of which four are setup, three are point-to-point messaging, five are collective communication, and seven are custom data-definition calls. The function calls were selected based upon profiling of popular and commonly used space science kernels such as Fast Fourier Transform, LU Decomposition, etc. With the ability to develop these common application kernels, the baseline functions are sufficient to support the desired applications for the ST-8 mission. Providing just certain baseline functions (along with the ability to develop the other functions defined in the MPI standard) in order to support the development of the desired applications is common during the design of a new MPI implementation [46-48, 58]. In our initial version of FEMPI, we also focus only on blocking and synchronous communications, which is the dominant mode in many MPI applications. Blocking and synchronous communications require the corresponding calls (e.g., send and receive) to be posted at both the sender and receiver processes for either process to continue further execution.

Table 4-1. Baseline MPI functions for FEMPI in this phase of research
Function                  MPI Call         Type        Purpose
Initialization            MPI_Init         Setup       Prepares system for message-passing functions.
Communication Rank        MPI_Comm_rank    Setup       Provides a unique node number for each node.
Communication Size        MPI_Comm_size    Setup       Provides the number of nodes in the system.
Finalize                  MPI_Finalize     Setup       Disables communication services.
Send                      MPI_Send         P2P         Send data to a matching receive.
Receive                   MPI_Recv         P2P         Receive data from a matching send.
Send-Receive              MPI_Sendrecv     P2P         Simultaneous send and receive between two nodes (i.e., send and receive using the same buffer).
Barrier Synchronization   MPI_Barrier      Collective  Synchronize all the nodes together.
Broadcast                 MPI_Bcast        Collective  Root node sends the same data to all other nodes.
Gather                    MPI_Gather       Collective  Each node sends a separate block of data to the root node, providing an all-to-one scheme.
Scatter                   MPI_Scatter      Collective  Root node sends a different set of data to each node, providing a one-to-all scheme.
All-to-All                MPI_Allgather    Collective  All nodes share their data with all other nodes in the system.
Datatype                  MPI_Type_*       Custom      Seven functions to define custom data types useful for complicated calculations.

4.3.2 Point-to-Point Messaging (Unicast Communication)

The basic communication mechanism of MPI, referred to as point-to-point messaging, is the transmittal of data between a pair of processes, one sending and the other receiving. A set of send and receive functions allows the communication of data of a specific datatype with an associated tag. The tag allows selectivity of messages at the receiving end: one can receive on a particular tag, or one can wild-card this quantity, allowing reception of messages with any tag. Message selectivity on the source process of the message is also provided.
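To make the baseline set concrete, the short program below exchanges one integer between ranks 0 and 1 using only calls from Table 4-1. It is ordinary MPI code, shown purely as an illustration of the supported subset; since FEMPI implements these baseline calls, such a program should also compile against it.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {
            value = 42;
            /* Blocking send to rank 1 with tag 7. */
            MPI_Send(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receive only a message with tag 7 from rank 0; source and
             * tag selectivity are resolved by the library. */
            MPI_Recv(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, &status);
            printf("rank 1 received %d\n", value);
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);   /* synchronize before shutdown */
    MPI_Finalize();
    return 0;
}
```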

PAGE 95

95 DMS, the messaging service of SR, operates by managing distributed virtual multicast groups with publish and subscribe mechanisms over primary and secondary networks. The publish and subscribe mechanisms are used by FEMPI for provisioning MPI point-to-point messaging. On initialization, each FEMPI (applica tion) process is registered to a data channel wherein messages can be published or received. E ach process registers with its unique identity (MPI process rank). The message to be transm itted from a process is published on the data channel with additional information to indicate the identity of the target receiver. The identity being unique to each process, DMS filters the message to be relayed only to the target process and is not broadcasted to the other processes. 4.3.3 Collective Communication A collective operation is executed by havi ng all processes in the group call the communication routine, with matching arguments Collective communications transmit data among all processes in a group specified by an intracommunicator object. One exception however is the MPI_Barrier function that serves to synchr onize processes without passing data. In FEMPI, the publish and subscribe mechanis m of DMS is used directly for collective routines such as broadcast or all-to-all co mmunication that involves the transmission of a message from a single node to all other nodes. However other routines th at involve more than one process transmitting messages are designed as variations of point-to-point messaging (sequence of multiple point-to-point operations). 4.3.4 Failure Recovery Presently, two different recovery modes have been developed in FEMPI, namely IGNORE and RECOVER. In the IGNORE mode of rec overy, when a FEMPI f unction is called to communicate with a failed process, th e communication is not performed and MPI_PROC_NULL (meaning the process location is empty) is returned to the application. Basically, the failed

PAGE 96

96 process is ignored and computati on proceeds without any effect of the failure. The application can either re-spawn the process through the control process or pr oceed with one less process. The MPI communicators are left unchanged. IGNORE mode is useful for applications that can execute with reduced number of processes while th e failed process is bei ng recovered, especially if the recovery procedure is of a long duration. With the RECOVER mode of recovery, the failed process has to be re-spawned back to its healthy state either on the same or a different processing node. When a FEMPI function is calle d to communicate with a failed process, the function call is halted until the process is successfully re-spawned. When the process is restored, the call is completed and the control is returned back to the application. Here again, the MPI communicators are left changed. RECOVER mode is useful for appl ications that cannot execute with any less number of nodes than when they started execution. In future, we plan to develop a third mode of recovery, namely REMOVE. With the REMOVE mode of recovery, when a process fails, it is removed from the process list. The MPI communicator is altered (shrunk) to reflect the ch ange that the system has one less process. REMOVE mode would be useful for cases when any process or processes have irrecoverably failed while running applications that cannot pro ceed with any failed process in the system. However, this mode would requ ire the coordination of all the MPI processes via a MPI control process to consistently update the communicator. 4.3.5 Covering All MPI Function Call Categories MPI function calls can be broadly classified into four different categories based on the locality of impact of a failure in the sy stem. The categories include point-to-point communication calls with communication between specific processes, collective communication calls with communication between a group of processes, process-specif ic calls that are local to a single process, and group-specific calls that are collec tive in nature but do not involve explicit

PAGE 97

97 message passing. The function calls developed for the initial version of FEMPI in this dissertation, also listed in Ta ble 4-1, have samples in all these categories (except for the nonblocking version of point-to-po int communication) indicating the ability for FEMPI to be extended to the other function call s specified by the MPI standard. For point-to-point communication calls, the impact of a process failure only extends to the partner (communication pa rtner) process. In FEMPI, the failure is handled by waiting for the partner process to recover or by returning an error to the calling routine. FEMPI calls that fall in this category include MPI_Send, MPI_Recv and MPI_Sendrecv. The locality of impact of a failure during collective communication calls depe nd on whether the failure was in a root or nonroot process. Several collective routines such as broadcast and gather ha ve a single originating or receiving process. Such processes are called the root. On failure of a non-root process, the impact is only on the root which is the solitary process communicating with non-roots. Hence all other processes can complete successfully. The r oot deals with the failure similar to point-topoint communication calls. On the other hand, if the root fails, all non-roots are impacted. The non-root processes wait for the root to recover or return an error to th e calling routing. FEMPI calls that fall in this category include MPI_Bcast, MPI_Gather, MPI_Scatter, MPI_Alltoall and MPI_Barrier. In the case of process-specific calls, failure s do not impact any other process and hence do not mandate any fault handling in the other non-fau lty processes. FEMPI calls in this category include MPI_Type_*, MPI_Comm_rank and MPI_Comm_size. However, recovery of the faulty process is still required. Gr oup-specific calls are similar to collective communication calls. Although there is no explic it communication of application me ssages, these calls involve the exchange of several control messages. The func tions in this category can be considered as a

PAGE 98

98 collective communication calls with all the pro cesses involved being roots. FEMPI function calls in this category include MPI_Init and MPI_Finalize. 4.4 Performance Analysis In this section, we discus s the performance of FEMPI based on experiments conducted on both a prototype system and a ground-based cluster system. Beyond providing results for scalability analysis, experiments on the groundbased cluster system demonstrate FEMPIs flexibility and the ease with which space scientists may develop their applications on an equivalent platform. The next section descri bes the setup for these systems, followed by a discussion of the performance of FEMPI on failure-free systems. We discuss the performance of FEMPI in systems with failures in Section 4-5. 4.4.1 Experimental Setup For the current research phase of the NASA s New Millennium Project, a prototype system has been designed to mirror when possible and emul ate when necessary the features of a typical satellite system. The system to be launched in 2009 has been developed at Honeywell Inc. and the University of Florida. The prototype hardwa re shown in Figure 4-5 consists of a collection of single-board computers, some augmented w ith FPGA coprocessors, a power supply and reset controller for performing power-off resets on a pe r-node basis, redundant Ethernet switches, and a development workstation acti ng as satellite controller. Six Orion Technologies COTS Single-Board Computers (SBCs) are used to mirror the specified data processor boards to be featured in the flight experi ment (four) and also to emulate the functionality of the radiation-hardened comp onents (two) currently under development. Each board is comprised of a 650MHz IBM 750fx PowerPC, 128MB of high-speed DDR SDRAM, dual Fast Ethernet interfaces, dual PCI mezzanine card slots, and 256MB of flash memory, and runs MontaVista Linux. Other operating systems may be used in future (e.g. one of several real-

PAGE 99

time variants), but Linux provides a rich set of development tools from which to leverage. Ethernet is the prevalent network for processing clusters due to its low cost and relatively high performance, and its packet-switched technology provides distinct advantages for the system over the bus-based approaches currently in use by space systems. Of the six boards, one is used as the primary system controller node. Another SBC is used to emulate both the backup controller (which takes over the control functions when the primary controller encounters a failure) and a central data store (used for storing input and output data of applications along with application checkpoints). The data store is currently implemented as a 40 GB PMC hard drive, while the flight system will likely include a radiation-hardened solid-state storage device. The remaining four SBCs are used as data processing nodes, which we will exercise to show the performance of FEMPI.

[Figure 4-5 components: four data processors (two with FPGAs), primary and backup system controllers, mass data store, primary and secondary Ethernet switches, power and reset controller, cPCI chassis, and a Linux workstation acting as the spacecraft command and control processor.]

Figure 4-5. System configuration of the prototype testbed

In order to test the scalability of FEMPI and see how it will perform on future systems beyond the six-node testbed developed to emulate the exact system to be flown in 2009, experiments were conducted on a 16-node cluster. The cluster consists of traditional server machines, each with a 2.4GHz Intel Xeon processor running Redhat Linux 9.0 and 1GB of DDR RAM, connected via a Gigabit Ethernet network. As previously described, the fact that the

PAGE 100

100 underlying PowerPC processors in the testbed and the Xeons in the ground-based cluster have vastly different architectures and in struction sets (e.g. they each us e different endian standards) is masked by the fact that the operating system and reliable messaging middleware provide abstract interfaces to the underlying hardware. These abst ract interfaces provide a means to ensure portability and these experiment s demonstrate that por ting applications de veloped using FEMPI from a ground-based cluster to the embedded sp ace system is as easy as recompiling on the different platform. 4.4.2 Results and Analysis In this section, we present the performance results of FEMPI on failure-free systems where we report the general performance of FEMPI. The re lative performance of FEMPI in comparison with conventional and commonly us ed MPI variants such as MPICH[54] and LAM/MPI[56] are also discussed in this section. 4.4.2.1 Point-to-point communication Figure 4-6 shows performance of FEMPIs point-to-point comm unication on the Xeon platform. The performance of the message pass ing middleware is report ed in terms of the application-level throughput. A specified amount of data is sent from one node (sender) to another (receiver) and time taken to finish the communication completely is measured. Throughput is measured as the ratio of the size of data transferred and the time taken to transfer the data. In order to avoid any transient errors, the results reported are averages of 100 trials. For one-sided communications using MPI_Send and MPI_Recv, the maximum throughput using FEMPI reaches about 590 Mbps on the Xeon cluster with 1 Gbps links. This maximum throughput value is comparable to the th roughput provided by the conventional MPI implementations (i.e., MPICH and LAM/MPI). It can be seen that the throughput with MPICH saturates at approximately 430 Mbps while that for LAM/MPI saturates at approximately 700

PAGE 101

Mbps. However, both MPICH and LAM/MPI perform better for smaller data sizes because of data buffering. Presently, FEMPI does not implement buffering of data. It is true that LAM/MPI performs better than FEMPI for all data sizes; however, it must be noted that we get the additional benefit of fault tolerance with FEMPI.

[Figure 4-6: throughput (Mbps) versus payload size for Xeon point-to-point communication with LAM/MPI, MPICH, and FEMPI.]

Figure 4-6. Performance of point-to-point communication on a traditional cluster
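The throughput numbers reported here are obtained by timing a complete transfer of a fixed payload, dividing payload size by elapsed time, and averaging over 100 trials. The sketch below is our own minimal measurement loop in that spirit, not the exact harness used for these experiments; the acknowledgment message adds a small round-trip component to each timed transfer.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define TRIALS 100

/* Run with two processes: rank 0 sends 'bytes' of data, rank 1 replies
 * with a one-integer acknowledgment, and the average throughput over
 * TRIALS transfers is printed at rank 0. */
int main(int argc, char **argv)
{
    int rank, ack = 0;
    long bytes = (argc > 1) ? atol(argv[1]) : 32768;
    char *buf = malloc(bytes);
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double total = 0.0;
    for (int i = 0; i < TRIALS; i++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        if (rank == 0) {
            MPI_Send(buf, (int)bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&ack, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(buf, (int)bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&ack, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        }
        total += MPI_Wtime() - t0;
    }

    if (rank == 0)
        printf("%ld bytes: %.1f Mbps\n", bytes,
               (8.0 * bytes * TRIALS) / (total * 1e6));

    free(buf);
    MPI_Finalize();
    return 0;
}
```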


Figure 4-7. Performance of point-to-point communication on the prototype testbed (throughput in Mbps versus payload size in KB for the MPI_Send/MPI_Recv and MPI_Sendrecv patterns with FEMPI on the embedded cluster)

4.4.2.2 Collective communication

Collective communication involves the transfer of messages among multiple processes in the system. Figure 4-8 shows the performance of FEMPI's broadcast communication on the Xeon cluster, compared with the conventional MPI variants on three nodes. The throughput trend for broadcast is similar to that for point-to-point communication. The throughput of FEMPI saturates at approximately 590 Mbps, while LAM/MPI saturates at 710 Mbps and MPICH at 490 Mbps. For larger data sizes, FEMPI performs better than MPICH, although it remains below LAM/MPI. Since broadcast is a collective call that generally involves more than two nodes, we also studied the performance in systems with up to 16 nodes; however, there was little variation in the throughput results. This behavior is attributed to the fact that the broadcast is dominated by the root node, and the non-root nodes do not have a major impact on performance. Similar to the Xeon cluster, the performance of FEMPI's broadcast communication on the prototype testbed (not shown here) resembled the trend of its point-to-point communication.


Figure 4-8. Performance of broadcast communication on a traditional cluster (throughput in Mbps versus payload size in KB for LAM/MPI, MPICH, and FEMPI on the Xeon cluster)

Figure 4-9 gives the time to synchronize nodes with FEMPI, MPICH, and LAM/MPI using the barrier synchronization function MPI_Barrier defined by the MPI standard. The figure shows that, in most cases, the synchronization time of FEMPI is lower than that of LAM/MPI and MPICH on the Xeon cluster. The synchronization time is on the order of tens of milliseconds for all three MPI variants in smaller systems. However, as the system size grows beyond 8 nodes, the synchronization time for MPICH increases beyond 100 ms; this abrupt increase can be attributed to MPICH's tree-structured barrier design. A similar increase is seen for LAM/MPI when the system size grows beyond 4 nodes, although beyond that point the synchronization time for LAM/MPI is almost independent of system size. For both MPICH and FEMPI, the synchronization time increases with system size, but the rate of increase is lower for FEMPI. Synchronization in FEMPI does not rely on any tree structure; it is distributed and depends only on explicit synchronization messages broadcast using DMS.
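The synchronization times in Figure 4-9 and Table 4-2 reflect measurements of the kind sketched below. The loop count is arbitrary and the reduction used to pick the slowest rank is illustrative; whether FEMPI provides MPI_Reduce is an assumption, and this is not the instrumentation code actually used in these experiments.

```c
/* Illustrative barrier-timing loop: average the cost of MPI_Barrier over
 * many calls and take the slowest rank as the synchronization time. */
#include <mpi.h>
#include <stdio.h>

#define ITERS 1000

void time_barrier(void)
{
    int rank;
    double start, local, slowest;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    local = (MPI_Wtime() - start) / ITERS;  /* per-barrier cost on this rank */

    MPI_Reduce(&local, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("barrier synchronization time = %.3f ms\n", slowest * 1000.0);
}
```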


Figure 4-9. Performance of barrier synchronization on a traditional cluster (synchronization time in ms versus number of nodes for MPICH, LAM/MPI, and FEMPI on the Xeon cluster)

Table 4-2 gives the synchronization time of nodes with FEMPI on the prototype testbed. As expected, the synchronization time is higher on the embedded system, which can be attributed to its slower processors and network.

Table 4-2. Barrier synchronization using FEMPI on the prototype testbed
No. of nodes    Synchronization time (ms)
2               23.5
3               30.0
4               36.5

Figures 4-10 and 4-11 show the performance of other collective communication calls, MPI_Gather and MPI_Scatter, on the Xeon cluster and on the prototype testbed, respectively. We report only the performance of FEMPI here and do not compare against LAM/MPI and MPICH. As mentioned earlier, both of these MPI variants have data-buffering capabilities that are presently absent in FEMPI (provision for buffering increases the complexity of providing fault tolerance). The ability to buffer data boosts the performance of these calls, as transactions with one node can be processed while simultaneously sending or receiving data to or from another node. Hence, it would be unfair to compare the performance of FEMPI's gather and scatter calls with that of LAM/MPI and MPICH.
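The gather and scatter experiments reported next exercise the calls roughly as in the sketch below, where the root distributes one block to each rank and later collects the updated blocks. The buffer sizes and the choice of root are arbitrary, and the sketch assumes the standard MPI_Scatter and MPI_Gather signatures rather than reproducing the FEMPI test code.

```c
/* Illustrative scatter/compute/gather pattern with rank 0 as the root. */
#include <mpi.h>
#include <stdlib.h>

void scatter_compute_gather(int block)      /* block = elements per rank */
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *full  = NULL;
    double *local = malloc(block * sizeof(double));

    if (rank == 0)                          /* only the root holds the full array */
        full = calloc((size_t)block * size, sizeof(double));

    MPI_Scatter(full, block, MPI_DOUBLE,
                local, block, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int i = 0; i < block; i++)         /* per-rank work on the local block */
        local[i] += 1.0;

    MPI_Gather(local, block, MPI_DOUBLE,
               full, block, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(local);
    if (rank == 0)
        free(full);
}
```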


Figure 4-10. Performance of gather and scatter communication on a traditional cluster (throughput in Mbps versus payload size in KB for gather and scatter on 2 and 3 nodes of the Xeon cluster)

Figure 4-11. Performance of gather and scatter communication on the prototype testbed (throughput in Mbps versus payload size in KB for gather and scatter on 2 and 3 nodes of the embedded cluster)

Figures 4-10 and 4-11 show that the performance of MPI_Gather and MPI_Scatter is poor compared to broadcast communication. In gather and scatter communication, the root node communicates individually with each client to transfer data (the data being unique to each client), whereas with broadcast the root publishes the same data to all clients in a single transfer. The MPI_Gather function gathers data at a single root node from all the other nodes in the system. Local data at the root node is copied (via memcpy) from the send buffer to the receive buffer, while data received from the non-root nodes is placed directly in the receive buffer.


MPI_Scatter, on the other hand, transfers data from the root node to all the non-root nodes. The throughput is low because the root node serializes the receives (for gather) and the sends (for scatter) with the client nodes that are ready to send or receive their data. Once communication has been initiated by the root with a client, all transactions with that client are finished before moving on to another client.

So far, we have reported the general performance of FEMPI on failure-free systems. The results show that FEMPI performs reasonably well for both point-to-point and collective communication in comparison with conventional, non-fault-tolerant MPI variants. In addition, and more importantly, FEMPI provides fault tolerance and helps applications complete execution despite faults in the system. In the next section, we study the performance benefits provided by FEMPI when used in systems with faults.

4.5 Performance Analysis of Failure Recovery

In the previous section, we discussed the performance of FEMPI on a failure-free system. In this section, we qualitatively study the performance of FEMPI on systems with failures. We analytically derive the time required to recover from failures with FEMPI and with traditional non-fault-tolerant MPI variants, and we then quantitatively analyze the recovery times using an actual application kernel in Section 4.6.

It is assumed that the system provides checkpointing facilities that the application can use, so that it does not have to restart from the beginning every time a failure is encountered. Checkpointing refers to the process of saving program or application state, usually to stable storage (i.e., taking a snapshot of a running application for later use). Checkpointing forms the crux of rollback recovery, and hence of fault tolerance, debugging, data replication, and process migration for high-performance applications. As in our prototype testbed, the stable storage is generally centralized and hence could become a bottleneck. However, centralized designs are preferred because they are much simpler to realize and because it is easier to coordinate the multiple processes of a parallel application that try to access the storage simultaneously.
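A minimal sketch of the kind of per-process checkpoint write assumed throughout this analysis is shown below. The stable-storage path, the file layout, and the write-then-rename idiom are illustrative assumptions and do not describe the checkpointing service of the prototype testbed.

```c
/* Illustrative checkpoint write: save the iteration count and working data
 * to stable storage, renaming so a partial write is never taken as valid.
 * The "/stable" mount point and file naming are hypothetical. */
#include <stdio.h>

int save_checkpoint(int rank, int iteration, const double *data, size_t n)
{
    char tmp[256], final_path[256];
    snprintf(tmp,        sizeof(tmp),        "/stable/ckpt_rank%d.tmp", rank);
    snprintf(final_path, sizeof(final_path), "/stable/ckpt_rank%d.dat", rank);

    FILE *fp = fopen(tmp, "wb");
    if (!fp) return -1;

    fwrite(&iteration, sizeof(int), 1, fp);    /* header: where to resume */
    fwrite(&n, sizeof(size_t), 1, fp);         /* header: payload length  */
    fwrite(data, sizeof(double), n, fp);       /* application state       */
    fclose(fp);

    return rename(tmp, final_path);            /* atomic replace on POSIX */
}
```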


4.5.1 Non-fault-tolerant MPI Variants

We first discuss the timing elements involved in the recovery of an application executing with an MPI variant that does not inherently support fault tolerance. The steps involved in recovering an application from a failure are:

i. Detect the occurrence of a failure on a node.
ii. Forcefully stop the application processes on all other nodes executing the application.
iii. Redeploy the application processes (assuming that the failure has been rectified).
iv. All processes read checkpoint data from the stable storage and resume execution from the previously checkpointed stable state.
v. Re-perform the computation done after the previous checkpoint and before the failure occurred (to reach the same state as before the failure).

The total time to recover from a failure (i.e., the total effective computation time lost due to the failure and the subsequent recovery) can be calculated as:

T_FR = T_FD + T_SA + T_RA + T_RC + T_CL    (4.1)

where
T_FR : total time to recover from a failure
T_FD : time to detect the failure
T_SA : time to forcefully stop application processes on all nodes
T_RA : time to redeploy application processes on all nodes
T_RC : time for all application processes to read the checkpoint from stable storage
T_CL : computation time lost since the previous checkpoint (maximum value over all processes)


4.5.2 Failure Recovery Timing in FEMPI

We now discuss the timing elements involved in the recovery of an application executing with FEMPI. The steps involved in recovering an application from a failure are:

i. Detect the occurrence of a failure on a node.
ii. Inform the other nodes in the system about the failure.
iii. Redeploy only the application process that failed, on a healthy node.
iv. The recovered process alone reads the checkpoint data from the stable storage and resumes execution from the previously checkpointed stable state.
v. The recovered process alone re-performs the computation done after the previous checkpoint and before the failure occurred (to reach the same state as before the failure).

The total time to recover from a failure (i.e., the total effective computation time lost due to the failure and the subsequent recovery) can be calculated as:

T_FFR = T_FFD + T_FFI + T_FRA + T_FRC + T_FCL    (4.2)

where
T_FFR : total time to recover from a failure
T_FFD : time to detect the failure
T_FFI : time to inform application processes on all nodes about the failure
T_FRA : time to redeploy the application process on one node
T_FRC : time for the restored process to read the checkpoint from stable storage
T_FCL : computation time lost by the recovered process since the previous checkpoint

4.5.3 Failure Recovery Timing: Comparative Analysis

In the previous sections, we identified the timing elements involved in the recovery of an application in the event of a failure. In this section, we compare the recovery timing of FEMPI with that of non-fault-tolerant MPI variants. The comparison shows the improvement in application execution time enabled by FEMPI, justifying the development of a fault-tolerant MPI.
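For ease of reference in the term-by-term comparison that follows, the two recovery-time expressions (4.1) and (4.2) can be placed side by side:

T_FR  = T_FD  + T_SA  + T_RA  + T_RC  + T_CL
T_FFR = T_FFD + T_FFI + T_FRA + T_FRC + T_FCL

Each element of the second expression is compared against the element in the same position of the first (detection time against detection time, failure notification against forced stop, and so on).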


Comparing the corresponding timing elements in (4.1) and (4.2), the failure recovery of FEMPI would be worse than that of non-fault-tolerant MPI implementations only if at least one of the following conditions were true:

T_FFI greater than T_SA: This condition cannot occur. T_FFI, the time to broadcast a single small message, is negligible, whereas T_SA involves more than that: sending a message to all processes and then gracefully stopping them takes longer than just sending a message.

T_FRA greater than T_RA: This condition cannot occur. Even in the worst case, T_FRA can only equal T_RA, which would imply that the time to deploy application processes does not depend on the number of processes.

T_FRC greater than T_RC: This condition cannot occur. Even in the worst case, T_FRC can only equal T_RC, which would imply that the time to read checkpoint data does not depend on the number of processes accessing the stable storage simultaneously.

T_FCL greater than T_CL: This condition cannot occur. Even in the worst case, T_FCL can only equal T_CL, which corresponds to the maximum computation time being lost at the failed process.

Each of the above equalities represents a worst-case scenario for FEMPI; even then, FEMPI's recovery time would merely equal that of non-fault-tolerant MPI implementations. In general, each timing element in FEMPI is better than the corresponding element in non-fault-tolerant MPI implementations, and hence the total recovery time is much shorter for FEMPI. Additionally, FEMPI's recovery method provides computation hiding: if the computation time remaining on a healthy process until its next checkpoint is longer than the time required for the failed process to recover and reach its subsequent checkpoint state, then the failure recovery has no impact on the healthy process. The computation hiding can be explained as follows:


T_FCH : time for computation that can be performed by the healthy processes while the failed process is recovering (maximum value over all processes)
T_FCN : time until the next checkpoint in the healthy processes, measured from the time the failure occurred in another process (maximum value over all processes)
T_FCNR : time until the next checkpoint in the recovered process, measured from its restarted state

If (T_FCN - T_FCH) > T_FCNR, then the effective recovery time is (T_FFR - T_FCH). In general, computation might stall at the healthy processes while the failed process is recovering, increasing the total application execution time by an amount equal to the recovery time; this is the case for non-fault-tolerant MPI implementations. With FEMPI, however, if the healthy processes can perform computation while the failed process is recovering, then the recovery has minimal impact on the overall execution time of the application, and the recovery time is effectively reduced. In summary, FEMPI-based recovery is faster and more effective than recovery with non-fault-tolerant MPIs. We have studied the performance improvement analytically in this section; in the next, we study the improvement quantitatively using an application kernel as a case study.
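As a purely hypothetical illustration of the hiding condition above (the values below are not measurements from either testbed), suppose T_FFR = 120 ms, T_FCH = 90 ms, T_FCN = 500 ms, and T_FCNR = 300 ms. Then T_FCN - T_FCH = 410 ms, which exceeds T_FCNR = 300 ms, so the condition holds and the recovery time perceived by the application is T_FFR - T_FCH = 30 ms rather than the full 120 ms.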


4.6 Parallel Application Experiments and Results

So far, we have discussed the raw performance of FEMPI in comparison with other MPI implementations. However, the final target for such a communication service is use by applications. Several space science applications have been proposed for architectures similar to ours. LU decomposition (LUD) is a typical kernel that is part of several space applications, such as Space-Time Adaptive Processing, and would benefit from a service such as FEMPI. In this section, we report the performance of FEMPI when used in a real application, LUD, in terms of execution time and failure recovery time. As mentioned earlier, it is not possible to execute applications using MPICH on the prototype testbed due to resource constraints; hence, in order to compare the performance of FEMPI with MPICH, we restrict our experiments to the Xeon cluster. We also study scalability by executing the application on systems with up to 16 nodes.

4.6.1 LU Decomposition

In linear algebra, LUD is a matrix decomposition that writes a matrix as the product of two triangular matrices. The decomposition is used in numerical analysis to solve systems of linear equations or to find the inverse of a matrix. Let A be a square matrix. An LUD is a decomposition of the form A = LU, where L and U are lower and upper triangular matrices of the same size, respectively; that is, L has only zeros above the diagonal and U has only zeros below the diagonal. We can solve the equation Ax = b for x, where A is an n x n matrix, x an n-dimensional column vector of unknowns, and b an n-dimensional column vector, by factoring A into the product of a unit lower triangular matrix L and an upper triangular matrix U, provided such factors exist. We can then write Ax = b as LUx = b. To obtain the solution, we first solve Ly = b for y and then solve Ux = y for x.
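As a small concrete instance of the factorization and the two triangular solves just described (the matrix is chosen purely for illustration):

\[
A \;=\; \begin{pmatrix} 4 & 3 \\ 6 & 3 \end{pmatrix}
  \;=\; \begin{pmatrix} 1 & 0 \\ 1.5 & 1 \end{pmatrix}
        \begin{pmatrix} 4 & 3 \\ 0 & -1.5 \end{pmatrix} \;=\; LU .
\]

For b = (10, 12)^T, forward substitution on Ly = b gives y = (10, -3)^T, and back substitution on Ux = y gives x = (1, 2)^T, which indeed satisfies Ax = b.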


Figure 4-12. Data decomposition in parallel LUD

We have parallelized the LUD algorithm with row-wise cyclic striping and partial pivoting, similar to ScaLAPACK, a standard high-performance linear algebra package [59]. The input matrix is divided row-wise among the processing nodes, with each node operating on an equal number of rows. Figure 4-12 shows the decomposition of a 6 x 6 matrix onto 3 processors: processor P1 gets rows 1 and 4, P2 gets rows 2 and 5, and P3 gets rows 3 and 6. During each iteration, the matrix can be logically divided into three sub-matrices X, Y, and Z, as shown in Figure 4-13. Each iteration consists of three steps: (1) the column of nodes that holds X factorizes its part and distributes the pivot information along its corresponding rows; (2) each column of nodes exchanges the remaining part of the pivot rows, the decomposition is performed on the actual values of Y, and the result is broadcast; (3) all nodes update Z in parallel. The next iteration then continues to complete the factorization of Z.

Figure 4-13. Parallel LUD algorithm (the already-factored portions L* and U* border the active sub-matrices X, Y, and Z)

4.6.2 Failure-free Performance

The execution times of the parallel LUD application on failure-free systems with node counts ranging from 2 to 16 are shown in Figure 4-14. The execution times reported do not include the time to start the processes on the participating nodes. FEMPI uses job-management daemons (part of the job management system developed for the prototype testbed) that reside permanently on each node to start the processes, so startup is very fast. MPICH, on the other hand, has a process on one central node remotely log in to every node to start the processes, which takes much longer. Hence, for a fair comparison, we do not include the startup time in our execution-time results.


Figure 4-14. Failure-free execution time of the parallel LUD application kernel with increasing system size (execution time in seconds versus number of nodes for MPICH and FEMPI on the Xeon cluster)

The input matrix is of size 4160 x 4160. Each element in the matrix is of type double (8 bytes), so the input matrix occupies 138.44 MB. As the number of nodes in the system increases, the execution time decreases for both FEMPI and MPICH. With the matrix, and hence the communication messages, being large, MPICH cannot use its buffering capability because the data size exceeds the size of its buffer. With both FEMPI and MPICH unbuffered, the figure shows that the application executes faster with FEMPI; the lower execution time highlights FEMPI's better performance. However, as the system size increases, the sub-matrix on each node, and hence the size of the communication messages, decreases, which explains the narrowing performance difference between FEMPI and MPICH.

To compare FEMPI against MPICH for shorter-running applications with smaller communication messages, we also executed the application on an input matrix of size 12 x 12. Although the execution time with FEMPI was larger than with MPICH (which can use its buffering in this case), the difference was minimal. For example, in a 2-node system the execution time with FEMPI was 8 ms versus 4 ms with MPICH; likewise, in a 3-node system it was 6 ms with FEMPI versus 3 ms with MPICH. In summary, in failure-free systems the performance of FEMPI is comparable to MPICH for short applications and better than MPICH for long-running applications.


4.6.3 Failure Recovery Performance

In the previous section, we studied the performance of FEMPI when used in an actual application on a failure-free system. In this section, we study the performance in systems with failures in terms of the failure recovery time. The parallel LUD kernel uses the RECOVER mode of FEMPI for recovery from failures, which requires the failed process to be re-spawned back to its healthy state either on the same or on a different processing node. When a FEMPI function is called to communicate with a failed process, the function call is halted until the process is successfully re-spawned; when the process is restored, the communication call completes and control returns to the application. With MPICH, on the other hand, when a process fails, all the other processes are stopped and the entire application is restarted. The restarting processes, i.e., the failed process in the case of FEMPI and all the processes in the case of MPICH, restart from the previous stable checkpoint state stored in the stable storage. As part of the computation, all processes checkpoint their states regularly.

Figures 4-15 and 4-16 compare the recovery time (the time from when the failure occurred to when the application is back to the same state as before the failure) for the parallel LUD application with FEMPI and MPICH for varying system sizes. Figure 4-15 shows the recovery time for a small matrix (12 x 12), while Figure 4-16 shows the recovery time for a large matrix (4160 x 4160). We show the results for two versions of MPICH, with rsh and without rsh: MPICH with rsh includes the time to start up the processes (via remote login), while MPICH without rsh does not. For the MPICH-without-rsh case, we assume that MPICH uses job-management daemons (as FEMPI does); it is presented for a fair comparison with FEMPI.
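The restart path assumed for both recovery schemes, namely reading the last stable checkpoint and resuming from the recorded iteration, can be sketched as follows. The file layout matches the hypothetical write-side sketch in Section 4.5 and is an assumption rather than the testbed's actual checkpointing interface.

```c
/* Illustrative checkpoint read: restore iteration count and working data
 * for this rank; on failure the caller starts from iteration 0.
 * The "/stable" mount point and file naming are hypothetical. */
#include <stdio.h>

int load_checkpoint(int rank, int *iteration, double *data, size_t n)
{
    char path[256];
    snprintf(path, sizeof(path), "/stable/ckpt_rank%d.dat", rank);

    FILE *fp = fopen(path, "rb");
    if (!fp) return -1;                     /* no checkpoint: start fresh  */

    size_t stored = 0;
    if (fread(iteration, sizeof(int), 1, fp) != 1 ||
        fread(&stored, sizeof(size_t), 1, fp) != 1 ||
        stored != n ||
        fread(data, sizeof(double), n, fp) != n) {
        fclose(fp);
        return -1;                          /* corrupt or mismatched state */
    }

    fclose(fp);
    return 0;
}
```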


Figure 4-15. Recovery time from a failure with increasing system size for applications with small datasets (recovery time in ms versus number of nodes for MPICH with rsh, MPICH without rsh, and FEMPI on the Xeon cluster)

For both large and small applications, the recovery time of FEMPI is better than that of MPICH, thereby improving the availability of the system. MPICH with rsh performs the worst because of the remote-login overhead. MPICH without rsh is comparable to FEMPI for smaller systems, but its performance worsens as the system size grows. For MPICH, the impact of all processes having to be restarted and all of them accessing the checkpoint storage simultaneously becomes pronounced as the number of processes increases, and it is even more pronounced for larger applications, where the amount of data to be read from the stable storage is large. For a 12-node system running the smaller application in Figure 4-15, the recovery time for FEMPI is 108 ms, versus 383 ms for MPICH without rsh and 3,443 ms for MPICH with rsh. For a 16-node system running the larger application in Figure 4-16, the recovery time for FEMPI is 128 ms, versus 803 ms for MPICH without rsh and 4,983 ms for MPICH with rsh.

It should be noted that the recovery times reported here are for recovery from a single failure. If there are multiple failures during the course of the application execution, which is expected in failure-prone space systems, the overall time spent recovering from failures would be equal to the time to recover from one failure multiplied by the number of failures. In summary, the results show that FEMPI performs better than MPICH in terms of execution time in failure-free systems and in terms of recovery time in systems with failures.


Figure 4-16. Recovery time from a failure with increasing system size for applications with large datasets (recovery time in ms versus number of nodes for MPICH with rsh, MPICH without rsh, and FEMPI on the Xeon cluster)

4.7 Conclusions and Future Research

In this phase of the research, we have taken a direct approach to providing fault tolerance and improving the availability of the HPC system in space by investigating and proposing a new lightweight fault-tolerant message-passing middleware. We presented the design, and analyzed the basic characteristics, of FEMPI, a new fault-tolerant, lightweight, message-passing middleware for embedded systems. This fault-tolerant middleware reduces the impact of failures on the system and is particularly important for supporting applications in harsh environments with mission-critical requirements. The lightweight fault-tolerant MPI for low-resource environments is, to our knowledge, the first of its kind, and its design is the primary research contribution of this phase of the research. The recovery time of the system is improved, allowing execution of applications to proceed as unimpeded as possible. A key focus of the message-passing middleware architecture has been the tradeoff between performance and fault tolerance, with emphasis on the latter but forethought of the former.


Results from experiments on the performance of FEMPI on an embedded cluster system with PowerPCs and on a traditional Xeon cluster were presented. Both point-to-point and collective communication calls performed well, providing a maximum throughput of around 590 Mbps on the Xeon cluster with a 1 Gbps network and 31 Mbps on the cluster of PowerPC-based embedded systems with a 100 Mbps network, both roughly comparable to conventional, non-fault-tolerant MPI middleware. FEMPI also achieves relatively low synchronization times. We also presented results of using FEMPI with an application kernel on the Xeon cluster in terms of execution time, failure recovery time, and scalability. FEMPI performs better than MPICH in terms of all the above-mentioned performance metrics for large (long-running) applications, while the performance is comparable for small (short-running) applications.

As future work, FEMPI will be tested with several other case studies, including prominent space applications, with and without injected faults, to provide additional insight into the benefits and costs of fault tolerance in this environment. Based on the requirements of these applications, more function calls will also be developed. Analysis of the impact of adding buffered and asynchronous communication calls on fault tolerance, along with the REMOVE mode of recovery, is also promising future work.


CHAPTER 5
CONCLUSIONS

With the rapid advancement of space research over the past few decades, the demand of space missions for data returns from their resources in space has increased greatly. As the traditional approach of data collection and transmission is no longer viable, there have been several research efforts to make HPC systems available in space in order to provide sufficient onboard processing power. Such efforts have led to the idea of using COTS components to provide HPC in space. The susceptibility of COTS components to SEUs necessitates the development of fault-tolerant system functions to manage the available resources and improve the availability of the system in space. Several techniques exist in traditional HPC to provide fault tolerance and improve the overall computation rate, but adapting these techniques for HPC in space is a challenge due to the resource constraints.

In this dissertation, we address this challenge by investigating approaches to improve and complement HPC in space. Three techniques are investigated in the three phases of this dissertation to improve the effective utilization and availability of HPC in space. In Phase 1, we improve the useful computation time of the system by optimizing checkpointing-related defensive I/O. In Phase 2, we improve the overall execution time of applications by simulatively analyzing scheduling heuristics for application task scheduling. Phase 3 improves the availability of the system through the design of a new lightweight, fault-tolerant, message-passing middleware.

As part of the first phase of this research, the useful computation time of the system is increased by optimizing checkpointing-related defensive I/O. We studied the growth in performance of the various technologies involved in high-performance computing, highlighting the poor performance growth of disk I/O compared to other technologies. Presently, defensive I/O, which is based on checkpointing, is the primary driver of I/O bandwidth, rather than productive I/O, which is based on processor performance.


There have been several research efforts to improve the checkpointing process itself, but they are generally application-specific. To the best of the author's knowledge, there have been no successful attempts to optimize the rate of checkpointing so as to reduce the overhead of fault-tolerant mechanisms. Such an approach is neither application- nor system-specific and is applicable to HPC systems both in space and on the ground. The primary research contribution of this phase is the development of a new model of the checkpointing process that identifies the optimum rate of checkpointing for a given system. Checkpointing at an optimal rate reduces the risk of losing much computation on a failure while increasing the amount of useful computation time available.

We developed a mathematical model to represent the checkpointing process in large systems and analytically modeled the optimum computation time within a checkpoint cycle in terms of the time spent on checkpointing, the MTBF of the system, the amount of memory checkpointed, and the sustainable I/O bandwidth of the system. Optimal checkpointing maximizes useful computation time without jeopardizing the applications due to failures, while minimizing the usage of resources. The optimum computation time leads to minimal wastage of useful time by reducing the time spent on overhead tasks involved in checkpointing and in recovering from failures. In order to see the impact of checkpointing overheads and failure-recovery overheads on the I/O bandwidth required in these systems, we analyzed the overall execution time of applications including these overhead tasks, relative to the ideal execution time in a system without any checkpointing or failures.

In order to verify our analytical model, we developed discrete-event simulation models to simulate the checkpointing process in HPC systems. In the simulation models, we used performance numbers representative of systems ranging from small embedded cluster systems (space-based) to large supercomputers (ground-based).


The simulation results matched closely with those from our analytical model, verifying its correctness. As future work, the model can be experimentally verified with real systems running scientific applications. Once successfully verified, the model can be used to find the optimal checkpointing frequency for various HPC and HPEC systems based on their resource capabilities.

As part of the second phase of this research, the overall execution time of applications is decreased by analyzing methodologies for effective task scheduling. Reducing the overall execution time of applications exposes them to fewer faults in the system. Recently, systems augmented with FPGAs, offering a fusion of traditional parallel and distributed machines with customizable and dynamically reconfigurable hardware, have emerged as a cost-effective alternative to traditional systems. Dynamic scheduling of large-scale HPC applications in such parallel reconfigurable computing (RC) environments is one such challenge and, to our knowledge, has not been sufficiently studied.

We analyzed the performance of several scheduling heuristics that can be used to schedule application tasks on parallel RC systems that could potentially be part of HPC systems in space. Typical HPC applications and FPGA/RC resources were used as representative cases for our simulations. A performance model was also developed to predict the overall execution time of tasks on RC resources; this model was used by our scheduling heuristics to schedule the tasks. Based on the performance metric that is critical for a given environment, the corresponding scheduling heuristic can be used to schedule tasks. The performance model and the extensive simulative analysis of scheduling heuristics for FPGA-based RC applications and platforms are the first of their kind and are the primary research contribution of this phase.


The intent of this phase of the research is to analyze several scheduling heuristics and compare their performance to identify suitable candidates for use in space-based HPC systems. We have shown the performance results for average makespan; however, based on the performance metric critical for a given environment, the corresponding scheduling heuristic can be used to schedule tasks. In the future, the knowledge gained so far in suggesting suitable scheduling heuristics for an experimental scheduler can be applied in the job management service intended for the space-based HPC system being developed at the HCS Research Laboratory at the University of Florida.

As part of the third phase of this research, we take a direct approach to providing fault tolerance and improving the availability of the HPC system in space by designing a new lightweight fault-tolerant message-passing middleware. We designed an application-independent fault-tolerant message-passing middleware called FEMPI, a lightweight fault-tolerant design and implementation of the MPI standard. The lightweight fault-tolerant MPI for low-resource environments is, to our knowledge, the first of its kind, and its design is the primary research contribution of this phase of the research. The fault-tolerant middleware reduces the impact of failures on the system and is particularly important for supporting applications in harsh environments with mission-critical requirements. The recovery time of the system is improved, allowing execution of applications to proceed as unimpeded as possible. A key focus of the message-passing middleware architecture has been the tradeoff between performance and fault tolerance, with emphasis on the latter but forethought of the former.


In this phase, we focused on designing a message-passing middleware architecture that successfully addresses the trade-offs between performance and fault tolerance. Several fault-tolerant MPI research efforts were surveyed and their qualitative performance was comparatively analyzed. With the knowledge base gained from this study, and based on previous MPI development experience, we proposed a new framework for fault-tolerant MPI as a comprehensive solution to address fault tolerance in MPI and thereby improve the productivity and throughput of HPC systems in space. The architecture is unique, flexible, and independent of any particular MPI implementation, which would allow the design and architecture to be used by any message-passing middleware.

The primary goal of this phase of the research is to design a lightweight fault-tolerant middleware and provide for graceful degradation of applications while minimizing recovery times. We designed FEMPI, and results from experiments on the performance of FEMPI on an embedded cluster system with PowerPCs and on a traditional Xeon cluster were presented. We also presented the results of using FEMPI with an application kernel on the Xeon cluster in terms of execution time, failure recovery time, and scalability. As future work, FEMPI will be tested with several other case studies, including prominent space applications, with and without injected faults, to provide additional insight into the benefits and costs of fault tolerance in this environment. Based on the requirements of these applications, more function calls can also be developed.

The space environment is harsh and prone to failures while being limited in resources. Low-overhead fault-tolerance methodologies that do not tax space-based HPC systems are a key requirement. In this dissertation, we have addressed solutions that adapt fault-tolerant techniques from traditional ground-based HPC systems to improve the effective utilization and availability of HPC in space.


This research provides novel techniques to improve the reliability of HPC in space with minimal effort to transition from HPC on the ground. The solutions are designed to be lightweight, avoiding strain on the system while maintaining the quality and reliability of the fault-tolerant schemes.


LIST OF REFERENCES

[1] J. Samson, J. Ramos, I. Troxel, R. Subramaniyan, A. Jacobs, J. Greco, G. Cieslewski, J. Curreri, M. Fischer, E. Grobelny, A. George, V. Aggarwal, M. Patel, and R. Some, "High-Performance, Dependable Multiprocessor," Proc. IEEE/AIAA Aerospace Conference, Big Sky, MT, March 4-11, 2006.
[2] W. J. Larson and J. R. Wertz, Space Mission Analysis and Design, 3rd Edition, Microcosm Press/Kluwer Academic Publishers, 1999.
[3] Scientific and Technical Information, http://www.sti.nasa.gov/products.html#pubtools (Accessed: October 2, 2006).
[4] J. Samson, L. Torre, P. Wiley, T. Stottlar, and J. Ring, "A Comparison of Algorithm-Based Fault Tolerance and Traditional Redundant Self-Checking for SEU Mitigation," Digital Avionics Systems Conference, Daytona Beach, FL, October 14-18, 2001.
[5] ASCI Purple Statement of Work, Lawrence Livermore National Laboratory, http://www.llnl.gov/asci/purple/Attachment_02_PurpleSOWV09.pdf (Accessed: October 2, 2006).
[6] G. Schocht, I. Troxel, K. Farhangian, P. Unger, D. Zinn, C. Mick, A. George, and H. Salzwedel, "System-Level Simulation Modeling with MLDesigner," 11th IEEE/ACM International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Orlando, FL, October 2003.
[7] "Cramming More Components onto Integrated Circuits," Electronics, Vol. 37, No. 8, April 19, 1965.
[8] LINPACK, http://www.netlib.org/linpack/ (Accessed: October 2, 2006).
[9] J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK Benchmark: Past, Present, and Future," Concurrency and Computation: Practice and Experience, Vol. 15, pp. 1-18, 2003.
[10] Top 500 Supercomputer Sites, http://www.top500.org/ (Accessed: October 2, 2006).
[11] "Hitachi Eyes 1 TB Desktop Drives," http://www.pcworld.com/news/article/0,aid,120279,00.asp (Accessed: October 2, 2006).
[12] S. Heinze, M. Bode, A. Kubetzka, O. Pietzsch, X. Nie, and S. Blugel, "Real-Space Imaging of Two-Dimensional Antiferromagnetism on the Atomic Scale," Science, Vol. 288, Issue 5472, pp. 1805-1808, June 9, 2000.
[13] MRAM-Info: Magnetic RAM News, Forum, Articles and More, http://www.mram-info.com/ (Accessed: October 2, 2006).
[14] What is MEMS Technology?, http://www.memsnet.org/mems/what-is.html (Accessed: October 2, 2006).
[15] .4GB 3.6MS/15000 (ULTRA 320 80PIN) 8192K 3.5"/HH, http://www.spartantech.com/product.asp?PID=ST373453LC&m1=pg (Accessed: October 2, 2006).


[16] Seagate Barracuda 7200.8 400GB 3.5" IDE Ultra ATA100 Hard Drive OEM, http://www.newegg.com/Product/Product.asp?Item=N82E16822148060 (Accessed: October 2, 2006).
[17] Cheetah 15K.3 ST336753LC, http://www.seagate.com/cda/products/discsales/marketing/detail/0,1081,552,00.html (Accessed: October 2, 2006).
[18] J. S. Plank, Y. Kim, and J. Dongarra, "Fault Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing," Journal of Parallel and Distributed Computing, Vol. 43, No. 2, pp. 125-138, September 1997.
[19] Los Alamos/Liv 3D Simulations, Publication of Los Alamos National Laboratory, Vol. 3, No. 6, April 4, 2002.
[20] D. Thain, T. Tannenbaum, and M. Livny, "Distributed Computing in Practice: The Condor Experience," Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pp. 323-356, February-April 2005.
[21] J. Plank, M. Beck, G. Kingsley, and K. Li, "Libckpt: Transparent Checkpointing under Unix," Usenix Winter 1995 Technical Conference, New Orleans, LA, January 1995.
[22] G. P. Kavanaugh and W. H. Sanders, "Performance Analysis of Two Time-Based Coordinated Checkpointing Protocols," Pacific Rim International Symposium on Fault-Tolerant Systems, Taipei, Taiwan, December 15-16, 1997.
[23] N. H. Vaidya, "A Case for Two-Level Distributed Recovery Schemes," ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, Ottawa, May 1995.
[24] N. H. Vaidya, "Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme," IEEE Transactions on Computers, Vol. 46, No. 8, pp. 942, August 1997.
[25] K. F. Wong and M. Franklin, "Checkpointing in Distributed Systems," Journal of Parallel & Distributed Systems, Vol. 35, No. 1, pp. 67, May 1996.
[26] Fixed Point Iteration, http://pathfinder.scar.utoronto.ca/~dyer/csca57/book_P/node34.html (Accessed: October 2, 2006).
[27] D. F. Stanat and S. F. Weiss, Systematic Programming, online book resources, http://www.cs.unc.edu/~weiss/COMP114/BOOK/BookChapters.html (Accessed: October 2, 2006).
[28] I. Troxel, A. Jacob, A. George, R. Subramaniyan, and M. Radlinski, "CARMA: A Comprehensive Management Framework for High-Performance Reconfigurable Computing," Proc. of International Conference on Military and Aerospace Programmable Logic Devices (MAPLD), Washington, DC, September 8-10, 2004.
[29] K. Compton and S. Hauck, "Reconfigurable Computing: A Survey of Systems and Software," ACM Computing Surveys, Vol. 34, No. 2, pp. 171-210, 2000.


[30] R. Enzler, C. Plessl, and M. Platzner, "System-level Performance Evaluation of Reconfigurable Processors," Journal of Microprocessors and Microsystems, Vol. 29, pp. 63, 2005.
[31] Nallatech, BenNUEY Reference Guide, Hardware Reference Guide, Glasgow, United Kingdom, 2004.
[32] Celoxica, Ltd., RC1000 Hardware Reference Manual, Hardware Reference Manual, Oxfordshire, United Kingdom, 2001.
[33] Tarari, Inc., High-Performance Computing Hardware Reference, Hardware Reference Manual, San Diego, CA, 2004.
[34] Silicon Graphics, Inc., Reconfigurable Application-Specific Computing User's Guide, User's Guide, Mountain View, CA, 2004.
[35] J. Vetter and A. Yoo, "An Empirical Performance Evaluation of Scalable Scientific Applications," Proc. of Supercomputing 2002, Baltimore, MD, November 16-22, 2002.
[36] A. A. Mirin, R. H. Cohen, B. C. Curtis, W. P. Dannevik, A. M. Dimits, M. A. Duchaineau, D. E. Eliason, D. R. Schikore, S. E. Anderson, D. H. Porter, P. R. Woodward, L. J. Shieh, and S. W. White, "Very High Resolution Simulation of Compressible Turbulence on the IBM-SP System," Proc. of ACM/IEEE Conference on Supercomputing, High Performance Networking and Computing Conference, Portland, OR, November 13-19, 1999.
[37] P. N. Brown, R. D. Falgout, and J. E. Jones, "Semicoarsening Multigrid on Distributed Memory Machines," SIAM Journal on Scientific Computing, Vol. 21, No. 5, pp. 1823-1834, 2000.
[38] A. Hoisie, O. Lubeck, H. Wasserman, F. Petrini, and H. Alme, "A General Predictive Performance Model for Wavefront Algorithms on Clusters of SMPs," Proc. of International Conference on Parallel Processing, Toronto, Canada, August 21-24, 2000.
[39] W. K. Anderson, W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith, "Achieving High Sustained Performance in an Unstructured Mesh CFD Application," Proc. of ACM/IEEE Conference on Supercomputing, High Performance Networking and Computing Conference, Portland, OR, November 13-19, 1999.
[40] R. S. Tuminaro, S. A. Hutchinson, and J. N. Shadid, "The Aztec Iterative Package," International Linear Algebra Iterative Workshop, Toulouse, France, June 10-13, 1996.
[41] M. Smith and G. Peterson, "Analytical Modeling for High Performance Reconfigurable Computers," Proc. SCS International Symposium on Performance Evaluation of Computer and Telecommunications Systems (SPECTS), San Diego, CA, July 14-19, 2002.
[42] M. Maheswaran, S. Ali, H. Siegel, D. Hensgen, and R. Freund, "Dynamic Matching and Scheduling of a Class of Independent Tasks onto Heterogeneous Computing Systems," Proc. of Eighth Heterogeneous Computing Workshop, San Juan, Puerto Rico, April 12, 1999.


[43] S. Vetter, S. Andersson, R. Bell, J. Hague, H. Holthoff, P. Mayes, J. Nakano, D. Shieh, and J. Tuccillo, RS/6000 Scientific and Technical Computing: POWER3 Introduction and Tuning Guide, IBM, 1998.
[44] A. Jacob, I. Troxel, and A. George, "Distributed Configuration Management for Reconfigurable Cluster Computing," Proc. of International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), Las Vegas, NV, June 21-24, 2004.
[45] Message Passing Interface Forum, "MPI: A Message-Passing Interface Standard," Technical Report CS-94-230, Computer Science Department, University of Tennessee, April 1, 1994.
[46] M. Dantas, "Evaluation of a Lightweight MPI Approach for Parallel Heterogeneous Workstation Cluster Environments," Proceedings of the Euro-Par Conference, Southampton, United Kingdom, September 1-4, 1998, pp. 397-400.
[47] A. Agbaria, D. Kang, and K. Singh, "LMPI: MPI for Heterogeneous Embedded Distributed Systems," Proceedings of the 12th International Conference on Parallel and Distributed Systems (ICPADS), Minneapolis, Minnesota, July 12-15, 2006, pp. 79-86.
[48] J. Kohout and A. George, "A High-Performance Communication Service for Parallel Computing on Distributed DSP Systems," Parallel Computing, Vol. 29, No. 7, July 2003, pp. 851-878.
[49] G. Stellner, "CoCheck: Checkpointing and Process Migration for MPI," Proceedings of the 10th International Parallel Processing Symposium (IPPS '96), Honolulu, Hawaii, pp. 526-531, 1996.
[50] G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and A. Selikhov, "MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes," SuperComputing 2002, Baltimore, MD, November 2002.
[51] A. Agbaria and R. Friedman, "Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations," 8th IEEE International Symposium on High Performance Distributed Computing, IEEE CS Press, Los Alamitos, CA, pp. 167-176, 1999.
[52] Ensemble project web site, http://dsl.cs.technion.ac.il/projects/Ensemble/ (Accessed: October 2, 2006).
[53] S. Rao, L. Alvisi, and H. Vin, "Egida: An Extensible Toolkit for Low-overhead Fault-Tolerance," Proceedings of IEEE International Conference on Fault-Tolerant Computing (FTCS), pp. 48-55, 1999.
[54] MPICH project web site, http://www-unix.mcs.anl.gov/mpi/mpich/ (Accessed: October 2, 2006).
[55] S. Louca, N. Neophytou, A. Lachanas, and P. Evripidou, "MPI-FT: Portable Fault Tolerance Scheme for MPI," Parallel Processing Letters, World Scientific Publishing Company, Vol. 10, No. 4, pp. 371-382, 2000.
[56] LAM/MPI project web site, http://www.lam-mpi.org/ (Accessed: October 2, 2006).


[57] G. Fagg and J. Dongarra, "FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World," Lecture Notes in Computer Science: Proceedings of EuroPVM-MPI 2000, Hungary, Vol. 1908, pp. 346-353, 2000.
[58] T. Kaiser, L. Brieger, and S. Healy, "MYMPI - MPI Programming in Python," Proceedings of the International Conference on Parallel and Distributed Processing and Applications, Las Vegas, Nevada, June 26-29, 2006.
[59] SuperLU, http://crd.lbl.gov/~xiaoye/SuperLU/#superlu_dist (Accessed: October 2, 2006).


BIOGRAPHICAL SKETCH

Rajagopal Subramaniyan was born in Tanjore, India. He completed his schooling at Carmel Garden Matriculation Higher Secondary School, Coimbatore, India. He then earned his bachelor's degree in electronics and communication engineering with a scholarship at Anna University, Chennai, one of the most prestigious engineering institutions in India. After completing his bachelor's degree with distinction, he joined the University of Florida to pursue his master's degree in electrical and computer engineering. In Spring 2001, he joined the High-performance Computation and Simulation (HCS) Research Laboratory as a graduate research assistant under the guidance of Dr. Alan George. He completed his master's degree in December 2002 and stayed on at the University of Florida to pursue his Ph.D. in computer engineering. His research at the HCS Research Lab forms the basis of this work. He will complete his Ph.D. in December 2006 and plans to take up a career in the computer engineering field.


4.6. Parallel Application Experiments and Results ....... ......... ...................................... 110
4 .6.1 L U D ecom position ................ .............. ................................................ 111
4.6.2 Failure-free Perform ance.................................... ....................... ............... 112
4.6.3 Failure Recovery Performance...... ......... ....... ..................... 114
4.7 Conclusions and Future R esearch........................................................ ............... 116

5 CONCLUSION S .................................. .. ........... .............. ......... ....... .. 118

L IST O F R E F E R E N C E S ....................................................... ................................................ 124

B IO G R A PH IC A L SK E T C H .................................................... ............................................. 129

















LIST OF TABLES

Table page

2-1  Disk performance growth over time.............................................................................25

2-2  Growth rate of computing technologies.......................................................................27

3-1  Characteristics of applications......................................................................................68

3-2  Characteristics of RC systems......................................................................................68

4-1  Baseline MPI functions for FEMPI in this phase of research......................................94

4-2  Barrier synchronization using FEMPI on prototype testbed......................................104









LIST OF FIGURES


Figure page

1-1  Typical high-performance computing system..............................................................16

2-1  Supercomputer performance over time........................................................................23

2-2  Hard drive capacity over time.......................................................................................24

2-3  Performance of a single disk with respect to the amount of data stored......................26

2-4  Growth of disk drive bandwidth over time...................................................................26

2-5  Number of disks required in future systems to maintain present performance levels..28

2-6  Job execution without checkpointing...........................................................................31

2-7  Job execution with checkpointing.................................................................................31

2-8  Execution modeled as a Markov process......................................................................32

2-9  Impact of I/O bandwidth on the system execution time and checkpointing overhead.
     A) System with T_MTBF = 8 hours. B) System with T_MTBF = 1 day........................38

2-10 Impact of increasing memory capacity in systems on sustainable I/O bandwidth
     requirement. A) System with T_MTBF = 8 hours. B) System with T_MTBF = 1 day...39

2-11 System model................................................................................................................41

2-12 Node model...................................................................................................................42

2-13 Multiple nodes model....................................................................................................43

2-14 Network model..............................................................................................................43

2-15 Central memory model..................................................................................................44

2-16 Fault generator model....................................................................................................45

2-17 Optimum checkpointing interval based on simulations of systems with 5 TB memory.
     A) System with I/O bandwidth of 5 GB/s. B) System with I/O bandwidth of 50 GB/s...47

2-18 Numerical method solution for the analytical model....................................................49

2-19 Optimum checkpointing interval based on simulations of systems with 75 TB memory.
     A) System with I/O bandwidth of 5 GB/s. B) System with I/O bandwidth of 50 GB/s...50

2-20 Optimum checkpointing interval for various systems. A) System with 5 TB memory.
     B) System with 30 TB memory. C) System with 75 TB memory...................................52

2-21 Optimum checkpointing interval for various systems in space. A) System with 512 MB
     memory. B) System with 2 GB memory. C) System with 8 GB memory.......................53

3-1  Task dependence using directed acyclic graphs for sample jobs...................................63

3-2  Scheduling of SPPM on various RC systems................................................................71

3-3  Scheduling of UMT on various RC systems..................................................................72

3-4  Scheduling of heterogeneous tasks on various RC systems..........................................73

3-5  Scheduling of various homogeneous tasks on a system with heterogeneous RC
     machines...........................................................................................................................74

3-6  Scheduling of heterogeneous tasks on a system with heterogeneous RC machines......75

3-7  Batch scheduling of various heterogeneous tasks on a system with homogeneous RC
     machines...........................................................................................................................75

3-8  Batch scheduling of heterogeneous tasks on a system with heterogeneous RC
     machines...........................................................................................................................76

3-9  Scheduling of homogeneous tasks on homogeneous systems.......................................77

3-10 Scheduling of heterogeneous tasks on various RC systems..........................................78

3-11 Scheduling of homogeneous tasks on heterogeneous systems......................................78

4-1  MPICH-V architecture (Courtesy: [50])........................................................................86

4-2  Starfish architecture (Courtesy: [51])............................................................................87

4-3  Egida architecture (Courtesy: [53])...............................................................................88

4-4  Architecture of FEMPI..................................................................................................92

4-5  System configuration of the prototype testbed..............................................................99

4-6  Performance of point-to-point communication on a traditional cluster......................101

4-7  Performance of point-to-point communication on prototype testbed.........................102

4-8  Performance of broadcast communication on a traditional cluster.............................103

4-9  Performance of barrier synchronization on a traditional cluster.................................104

4-10 Performance of gather and scatter communication on a traditional cluster...............105

4-11 Performance of gather and scatter communication on prototype testbed..................105

4-12 Data decomposition in parallel LUD..........................................................................111

4-13 Parallel LUD algorithm...............................................................................................112

4-14 Failure-free execution time of parallel LUD application kernel with increasing
     system size......................................................................................................................113

4-15 Recovery time from a failure with increasing system size for applications with small
     datasets............................................................................................................................115

4-16 Recovery time from a failure with increasing system size for applications with large
     datasets............................................................................................................................116









Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

IMPROVING UTILIZATION AND AVAILABILITY OF
HIGH-PERFORMANCE COMPUTING IN SPACE

By

Rajagopal Subramaniyan

December 2006

Chair: Alan D. George
Major Department: Electrical and Computer Engineering

Space missions involving science and defense ventures have ever-increasing demands for

data returns from their resources in space. The traditional approach of data gathering, data

compression and data transmission is no longer viable due to the vast amounts of data. Over the

past few decades, there have been several research efforts to make high-performance computing

(HPC) systems available in space. The idea has been to have enough "on-board" processing

power to support the many space and earth exploration and experimentation satellites orbiting

earth and/or exploring the solar system. Such efforts have led to small-scale supercomputers

embedded in the spacecraft and, more recently, to the idea of using commercial-off-the-shelf

(COTS) components to provide HPC in space. Susceptibility of COTS components to Single-

Event Upsets (SEUs) is a concern especially since space systems need to be self-healing and

robust to survive the hostile environment. Fault-tolerant system functions need to be developed

to manage the resources available and improve the availability of the HPC system in space.

However, the resources available to provide fault tolerance are fewer than in traditional HPC systems

on earth.

Several techniques exist in traditional HPC to provide fault tolerance and improve overall

computation rate, but adapting these techniques for HPC in space is a challenge due to the









resource constraints. In this dissertation, this challenge is addressed by providing solutions to

improve and complement HPC in space. Three techniques are introduced and investigated in

three different phases of this dissertation to improve the effective utilization and availability of

HPC in space. In the first phase, a new model to perform checkpointing at an optimal rate is

developed to improve useful computation time. The results suggest a requirement for I/O capabilities far superior to those of present systems. While the performance of several common HPC

scheduling heuristics that can be used for effective task scheduling to improve overall execution

time is simulatively analyzed in the second phase, availability is improved by designing a new

lightweight fault-tolerant message passing middleware in the third phase. Analyses of

applications developed with the fault-tolerant middleware show that robustness of the systems in

space can be significantly improved without degrading the performance. In summary, this

dissertation provides novel methodologies to improve utilization and availability in space-based

high-performance computing, thereby providing better and effective fault tolerance.









CHAPTER 1
INTRODUCTION

Space research has advanced by leaps and bounds in the past few decades, with numerous

satellites, probes and space shuttles launched to explore near-earth space and other astronomical

bodies including earth itself. More and more of our universe is being explored and, with

permanent space stations already in orbit around earth, space exploration is continuing to grow.

There has been a significant increase in the capability of instruments deployed in space. With

such high-tech instruments in place, space missions involving science and defense ventures have

ever-increasing demands for data returns from their corresponding resources in space. The

traditional implementation approach of data gathering, data compression, and data transmission

is no longer viable. The amount of data being generated is becoming too large to be transmitted

via available downlink channels in reasonable time. An industry-proposed solution to reduce the

demand on the downlink is to move processing onto the spacecraft [1]. The idea has been to

have enough on-board processing power to support the many space and earth exploration and

experimentation satellites orbiting earth and/or exploring the solar system. However, this

approach is mired by the limited capabilities of today's on-board processors and the prohibitive

cost of developing radiation-hardened high-performance electronics.

Microelectronics designed for environments with high levels of ionizing radiation such as

space have special design challenges. A single charged particle of radiation can knock thousands

of electrons loose, causing electronic noise, signal spikes, and in the case of digital circuits,

plainly incorrect results. The problem is particularly serious in the design of artificial satellites,

spacecraft, military aircraft, nuclear power stations, and nuclear weapons. In order to ensure the

proper operation of such systems, manufacturers of integrated circuits and sensors intended for

the aerospace markets employ various methods of radiation hardening. The resulting systems are









said to be rad(iation)-hardened or rad-hard. Most rad-hard chips are based on their more

mundane commercial equivalents, with manufacturing and design variations that reduce the

susceptibility to radiation and electrical and magnetic interference. Typically, the hardened

variants lag behind the cutting-edge commercial products by several technology generations due

to the extensive development and testing required to produce a radiation-tolerant design [2].

As mentioned earlier, the approach of moving processing onto the spacecraft is hindered

by the radiation effects influencing technological capabilities and cost of the high-performance

electronics. This situation has encouraged researchers and industry to consider the use of

commercial-off-the-shelf (COTS) components for on-board processing. Furthermore, the recent

adoption of silicon-on-insulator (SOI) technology by COTS integrated foundries is resulting in

devices with moderate space radiation tolerance [1]. Such an approach would make high-

performance computing (HPC) possible in space. However, in spite of this progress, COTS

components continue to be highly susceptible to Single-Event Upsets (SEU) that are

consequences of radiation. SEU as defined by National Aeronautics and Space Administration

(NASA) is "radiation-induced errors in microelectronic circuits caused when charged particles

(usually from the radiation belts or from cosmic rays) lose energy by ionizing the medium

through which they pass, leaving behind a wake of electron-hole pairs" [3]. A high rate of SEUs leads to many component failures, making HPC in space a significant challenge and certainly more complex than traditional HPC on the ground. Technology must be developed that capitalizes on

the strengths of COTS devices to realize HPC in space while overcoming their susceptibility to

SEUs without negating their benefits.

In general, computations are lost when there are failures and the full potential of the system

is lost. Failures decrease the availability of the system. Availability gives a measure of the









readiness of usage of a system (e.g., a highly available system has a low probability of being

down at any instant of time). Improving availability requires the system to be built with fault-

tolerant features. However, these features might consume additional resources, reducing the effective

utilization of the system. Effective utilization gives a measure of the time the system is used for

actual computation. The impact of failures is worse in space environments compared to

traditional HPC environments as the computational resources are limited. The chances of

failures are high in harsh environments, and there is far less latitude in the resources available to provide fault tolerance. HPC in space is not the same as traditional HPC on the ground due to

several constraints including harsh environments (e.g., high radiation, high rate of SEUs, high

rate of component failures), limited resources attributed primarily to weight and volume

constraints (e.g., processing speed, storage capacity, power availability), cost and timing

constraints. Additionally, space computing also requires automated processing and repair to

recover from faults if and when they occur in the system.

The traditional approach to SEU or transient error mitigation in soft, radiation-tolerant

hardware is redundant self-checking and comparison, either in hardware or in software. In some

applications, size, weight, and power constraints of the mission may preclude the use of

redundant hardware, and the time constraints of the mission may preclude the use of redundant

computation in software. In such cases, an alternative approach must be found which can

provide adequate SEU protection with a single-string or single-execution implementation. For

example, Samson et al. compare the overhead of traditional redundant self-checking for SEU

mitigation with an application-specific fault tolerance method [4]. It is reported that self-

checking configurations consume twice the power, twice the weight, twice the size, twice the

cost, and roughly one half the reliability of the single-board solution.












Several techniques exist for traditional HPC to provide fault tolerance and improve overall


computation rate. Some of the solutions include periodic backup of data (e.g., checkpointing),


exposing computation to fewer faults by reducing the overall execution time of applications (e.g.,


effective task scheduling) and reducing the impact of failure (e.g., fault-tolerant middleware).


However, adapting these solutions for HPC in space is a challenge due to the resource constraints


discussed earlier. In this dissertation, we address this challenge by investigating, developing and


evaluating techniques to improve and complement HPC in space.




Figure 1-1. Typical high-performance computing system. Shaded areas in each node are the components addressed in this dissertation to improve utilization and availability of HPC: checkpointing in Chapter 2, scheduling in Chapter 3, and message-passing middleware in Chapter 4.


Figure 1-1 shows the various software agents involved in a typical HPC system either on


ground or in space. The system has three types of nodes: a Controller Node, several Computing


Nodes, and one or more Storage Nodes. The Controller Node is in charge of accepting tasks for











execution and scheduling the tasks on idle resources (i.e., Computing Nodes) and is generally

radiation hardened in space systems. The Computing Nodes perform the actual execution of the

application while the Storage Node is responsible for data storage and backup. The data could

be input, output or system states (required for restarts on failures). The application processes

executing on the Computing Nodes communicate via the Message-passing Middleware. In this

dissertation, we address fault tolerance by focusing on the shaded areas in Figure 1-1, namely

checkpointing, scheduling and message passing middleware.

Three techniques are discussed in three different phases of this research with the overall

goal of improving the effective utilization and availability of HPC in space. In Phase 1, the

useful computation time of the system is improved (increased) by optimizing checkpointing-

related defensive I/O. We model the optimum checkpointing frequency for applications in terms

of the mean-time-between-failures (MTBF) of the system, amount of memory checkpointed, and

sustainable I/O bandwidth of the system. Optimal checkpointing maximizes useful computation

time without jeopardizing the applications due to failures while minimizing the usage of

resources. In Phase 2, the overall execution time of the application is improved (reduced) by

simulatively analyzing scheduling heuristics for application task scheduling. We analyze

techniques to effectively schedule tasks on parallel hardware reconfigurable resources that would

be part of the HPC space system. Effective task scheduling reduces the overall execution time

thereby exposing the application to fewer faults while reducing the resource usage as well. In

Phase 3, availability of the system is improved by designing a new lightweight fault-tolerant

message-passing middleware. The fault-tolerant middleware reduces the impact of failures on

the system. The recovery time of the system is improved allowing unimpeded execution of









applications as much as possible. The techniques that we propose in this research are also

applicable to traditional HPC on the ground but are critical for HPC in space.

This dissertation contains a discussion of the modeling of the checkpointing process and the

optimization of defensive I/O. The dissertation also describes the simulative analysis of dynamic

scheduling heuristics for effective task scheduling followed by the design and evaluation of a

fault-tolerant message passing middleware. Conclusions and directions for future research are

provided at the end.









CHAPTER 2
OPTIMIZATION OF CHECKPOINTING-RELATED I/O (PHASE I)

In general, computation of large-scale scientific applications can be divided into three

phases: start-up, computation, and close-down, with I/O existing in all phases. A typical I/O pattern has the start-up phase dominated by reads, the close-down phase by writes, and the computation phase by

both reads and writes. The primary questions of importance with relevance to I/O involved in

the three phases are when, how much and how often? These questions can be addressed by

segregating I/O into two portions: productive I/O and defensive I/O [5]. Productive I/O is the

writing of data that the user needs for actual science such as visualization dumps, traces of key

scientific variables over time, etc. Defensive I/O is employed to manage a large application

executed over a period of time much larger than the platform's MTBF. Defensive I/O is only

used for restarting a job in the event of application failure in order to retain the state of the

computation and hence the progress since the last application failure. Thus, one would like to

minimize the amount of resources devoted to defensive I/O and computation lost due to platform

failure. As the time spent on defensive I/O (backup mechanisms for fault tolerance) is reduced,

the time spent on useful computations will increase. This philosophy applies to high-

performance distributed computing in the majority of environments ranging from

supercomputing platforms to small embedded cluster systems, although the impact varies

depending on the system and other resource constraints.

The impact of productive I/O on I/O bandwidth requirements can be reduced by better

storage techniques and to some extent through improved programming techniques. It has been

observed that defensive I/O dominates productive I/O in large applications with about 75% of the

overall I/O being used for checkpointing, storing restart files, and other such similar techniques

for failure recovery [5]. Hence, by optimizing the rate of defensive I/O, we can reduce the









overall I/O bandwidth requirement. Another advantage is that the optimizations used to control

defensive I/O would be more generic and not specific to applications and platforms. However,

reducing defensive I/O is a significant challenge.

Checkpointing of a system's memory to mitigate the impact of failures is a primary driver

of the sustainable bandwidth to high-performance filesystems. Checkpointing refers to the

process of saving program/application state, usually to a stable storage (i.e., taking the snapshot

of a running application for later use). Checkpointing forms the crux of rollback recovery and

hence fault-tolerance, debugging, data replication and process migration for high-performance

applications. The amount of time an application will tolerate suspending calculations to perform

a checkpoint is directly related to the failure rate of the system. Hence, the rate of checkpointing

(how often) is primarily driven by failure rate of the system. If the checkpointing rate is low,

fewer resources are consumed but the chance of high computational loss (both time and data) is

increased. If the checkpointing rate is high, resource consumption is greater but the chance of

computational loss is reduced. It is important to strike a balance and an optimum rate of

checkpointing is required. Finding a balance is a difficult problem even in traditional ground-

based HPC with fewer failures and more resources. The problem is aggravated for HPC with

embedded cluster systems in harsh environments such as space, with more failures and fewer

resources.

In this phase of the dissertation, we analytically model the process of checkpointing in

terms of MTBF of the system, amount of memory checkpointed, sustainable I/O bandwidth and

frequency of checkpointing. We identify the optimum frequency of checkpointing to be used on

systems with given specifications, thereby making way for efficient use of available resources

and gaining maximum performance from the system without compromising on its fault-tolerance









aspects. The useful computation time is increased thereby improving the effective system

utilization. Further, we develop discrete-event models simulating the checkpointing process to

verify the analytical model for optimum checkpointing. The simulation models are developed

using the Mission-Level Designer (MLD) simulation tool from MLDesign Technologies Inc. [6].

In the simulation models, we use performance numbers that represent systems ranging from

small cluster systems to large supercomputers.

The remainder of this chapter is organized as follows. Section 2.1 provides background on

the growth trends of technologies to study the effectiveness of this research to improve I/O

usage. Section 2.2 briefly highlights why checkpointing is the most common methodology of

providing fault tolerance. In Section 2.3, the checkpointing process is analytically modeled to

identify the optimum frequency of checkpointing to be used on systems with given

specifications. Section 2.4 describes the simulation models that we develop to verify our

analytical models, while the results derived from the analytical model are provided in Section

2.5. Section 2.6 provides conclusions for this chapter and directions for future research.

2.1 Background on Technology Growth Trends

In this section, we study the growth trend of technologies involved in HPC, highlighting

the poor growth of sustainable I/O bandwidth. The poor growth of I/O bandwidth compared to

other technologies substantiates our approach to reduce resource consumption and improve

useful computation time by optimizing checkpointing-related defensive I/O. It is important to

mention that although we have used performance numbers from traditional ground-based HPC

and supercomputing platforms (due to lack of any standard representative platforms for HPC in

space) to study the growth trends, the numbers and trends are representative of what would be

coming for HPC in space. The performance of all the technologies however is relatively poorer

for HPC in space due to radiation hardening.









Gordon Moore observed an exponential growth in the number of transistors per integrated

circuit and predicted this trend would continue [7]. He made this famous observation in 1965,

just four years after the first planar integrated circuit was discovered. This doubling of

transistors every eighteen months, referred to as "Moore's Law", has been maintained and still

holds true. Similar to other exponentially growing systems, Moore's law can be expressed as a

mathematical equation with a derived compound annual growth rate. Let x be a quantity

growing exponentially, in this case the number of transistors per integrated circuit, with respect


to time t as measured in years. Then $x(t) = x_0 e^{kt}$, where k is the compound annual growth rate, and the rate of change follows

$$\frac{dx}{dt} = kx$$

For Moore's law, the compound annual growth rate can be established as k_moore = 0.46: when t = 1.5 and $x/x_0 = 2$, $k = \ln(2)/1.5 = 0.46$. However, this compound annual growth rate is not

the same for the other technologies involved in HPC. We briefly overview the growth of several

technologies including CPU processing power, disk capacity, disk performance, and sustained

bandwidth to/from disks in the remainder of this section.
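As a simple illustration of this growth-rate convention (the function name and values below are illustrative only and are not part of the original study), the compound annual growth rate k can be computed directly from an observed doubling time:

```python
import math

def compound_annual_growth_rate(doubling_time_years):
    """Rate k such that x(t) = x0 * exp(k * t) doubles every doubling_time_years."""
    return math.log(2.0) / doubling_time_years

# Moore's law: doubling every 1.5 years gives k = ln(2)/1.5, approximately 0.46 per year
print(compound_annual_growth_rate(1.5))
```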

2.1.1 CPU Processing Power

Modern HPC systems are made by tightly coupling multiple integrated circuits. In

addition to being able to capitalize on the exponential growth observed in Moore's Law, these

systems have also been able to increase the average number of processors in systems to have a

peak performance that exceeds Moore's law. Figure 2-1 shows the LINPACK [8, 9] rating for

the tenth most powerful computer in the world for several years as ranked by the Top500 [10]

list. The compound annual growth rate can be established as kHpc=0.58 (values of k for the

technologies are calculated as shown for Moore's Law). We picked the tenth computer for no










special reason except that in our opinion, the computers at the very top might have been custom

tuned and hence may not provide the general trend. The benchmark used to establish the Top500

list highlights the combined performance of the processors used to build the supercomputer. It is

important to realize that other technologies such as disk performance, memory performance, and

networking performance are not fully represented in the Top500 benchmark [10]. Growing the

relative performance of these other technologies is equally important to the CPU processing

power when looking to establish a well balanced system.




Figure 2-1. Supercomputer performance over time

2.1.2 Disk Capacity

Figure 2-2 shows the capacity of a 95mm 7,200RPM disk drive over time. Areal density

of hard drives has grown at an impressive compound annual growth rate of k_io_cap = 0.62, and has accelerated to a rate greater than k_io_cap = 0.75 since 1999, as shown in the figure. We are now

seeing the delivery of 120 GB/inch2 for magnetic disk technology, with demonstrations over 240

GB/inch2 routinely occurring in laboratories [11]. The first 1 TB (1,000GB) hard disk drives are

expected to ship in 2007 [11]. Heinze et al. [12] demonstrate the theoretical limit of physical

media at approximately 256 TB/inch2 using a single atomic layer to form a two-dimensional










antiferromagnet. In the next five to ten years, perpendicular recording on patterned media will likely appear, further increasing recording densities [11]. In addition to

continued evolutionary advancement, there are new disruptive storage technologies nearly ready

to enter the market place like Magnetic Random Access Memory (MRAM) [13] and Micro-

Electro-Mechanical-Systems (MEMS) [14]. Thus we see a good growth in hard drive capacities.




Figure 2-2. Hard drive capacity over time

2.1.3 Disk Performance Growth

Disk performance has not kept pace with the growth in disk capacities. More densely

packed data means fewer disk actuators for a given amount of storage. While the compound

annual growth rate of the areal density of magnetic disk recording has increased at an average of

over 60 percent annually, the maximum number of random I/Os per second that a drive can

deliver is improving at an annual compounded growth rate of less than k_io_perf_IO = 0.20. Continual

increases in capacity without corresponding performance improvements at the drive level create

a performance imbalance that is defined by the ratio called access density. Access density is the

ratio of performance, measured in I/Os per second, to the capacity of the drive, usually measured

in gigabytes (access density = I/Os per second per gigabyte).
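As a rough illustration of this metric (a sketch only; it assumes, consistently with the values in Table 2-1, that a drive's random-I/O rate can be approximated as 1000 divided by its average seek time in milliseconds), access density can be computed as follows:

```python
def access_density(seek_time_ms, capacity_gb):
    """Access density in I/Os per second per gigabyte, approximating the
    random-I/O rate as 1000 / seek_time_ms."""
    ios_per_sec = 1000.0 / seek_time_ms
    return ios_per_sec / capacity_gb

# Two rows of Table 2-1 for comparison
print(access_density(23.2, 8.52))   # 1996 3390-3: about 5.1
print(access_density(8.5, 500.0))   # 2005 Deskstar 7K500: about 0.24
```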

As seen in Table 2-1, access density has steadily declined while the capacity has increased

substantially. Access density is becoming a significant factor in managing storage subsystem









performance and the tradeoffs of using higher-capacity disks must be carefully evaluated as

lowering the cost-per-megabyte most often means lowering the performance. Future I/O-

intensive applications will require higher access densities than are indicated by the current

development roadmaps. Higher access densities may be achieved through lowering the capacity

per actuator or dramatically increasing the I/Os-per-second capabilities of the drive. The latter is

much harder to accomplish, particularly for random access applications where a seek (disk arm

movement) is required.

Vendors are making small high-performance drives such as a 15,000 RPM 73.4 GB drive

from Seagate with an advertised seek time of 3.6 ms leading to an access density of about 3.8

I/Os per GB. As of April 2006 these high-performance drives cost $3.30 per GB [15], roughly a factor of 9 higher than the $0.36 per GB for low-cost commodity drives [16]. The factor of 9 increase in

performance is offset by the factor of 9 increase in cost, leaving most architects to select

commodity drives for the additional capacity.

Table 2-1. Disk performance growth over time
Year Drive Seek Time (ms) I/Os per Sec Capacity (GB) I/Os per Sec per GB
1964 2314 112.5 8.9 0.029 306.50
1975 3350 36.7 27.2 0.317 86.00
1987 3380K 24.6 40.7 1.890 21.50
1996 3390-3 23.2 43.1 8.520 5.10
1998 Cheetah-18 18.0 55.6 18.200 3.10
2001 WD1000 8.9 112.4 100.000 1.10
2002 180GXP 8.5 117.6 180.000 0.70
2004 Deskstar 7K400 8.5 117.6 400.000 0.30
2005 Deskstar 7K500 8.5 117.6 500.000 0.24

2.1.4 Sustained Bandwidth to/from Disk

Sustained bandwidth from a disk relative to capacity of the disk has also continued to

decline. The sustained bandwidth of a disk is dependent upon the physical location of the data

on the disk. Due to the fixed rotational speed, the closer the data sits to the center of the disk platter, the slower










the sustained read rate. Figure 2-3 shows the sustained transfer rate in MB/s for the 15,000

RPM, 37 GB disk drive from Seagate [17]. The first 5 gigabytes of data are transferred at a rate

of 76 MB/s, while the final 3 GB of data sustain only 49 MB/s. Although vendors

provide the peak performance number in general, the average and minimum sustained

performance can be significantly lower. It is important to properly layout the data on the hard

drive to achieve consistent performance.



Figure 2-3. Performance of a single disk with respect to the amount of data stored

Figure 2-4 highlights the minimum, average and maximum performance for typical disk

drives introduced between 1995 and 2004. The average sustainable bandwidth from a hard disk

drive has grown at an annual compounded growth rate of k_io_perf_BW = 0.26 per year.


Figure 2-4. Growth of disk drive bandwidth over time









2.1.5 I/O Performance

The technologies pertinent to HPC discussed in this section thus far are all growing

exponentially although some are growing substantially slower than others. Table 2-2

summarizes the rate of change for these technologies.

Table 2-2. Growth rate of computing technologies
Technology                                        Symbol          Growth rate
Transistors per integrated circuit                k_moore         0.46
LINPACK of the #10 Top500 supercomputer           k_HPC           0.58
Capacity of hard drives                           k_io_cap        0.62
Cost per GB of storage                            k_io_cost       -0.83
Performance of hard drive in I/Os per second      k_io_perf_IO    0.20
Performance of hard drive in bandwidth            k_io_perf_BW    0.26

2.2 Optimal Usage of I/O

The performance of disk drives measured in both I/Os per second and sustained bandwidth

is not keeping up with other technology trends. Assuming current systems meet I/O performance needs, and for the balance of I/O bandwidth per unit of computational power to be maintained, we can use the formula $d(t) = e^{t(k_{HPC} - k_{io\_perf})}$ to calculate the factor of additional disk drives that will be needed in order to maintain that balance.
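As a minimal sketch of this calculation (using the illustrative growth-rate values from Table 2-2; the function name is not from the dissertation), d(t) can be evaluated as follows:

```python
import math

K_HPC = 0.58      # growth rate of supercomputer LINPACK performance (Table 2-2)
K_IO_PERF = 0.20  # growth rate of disk performance in I/Os per second (Table 2-2)

def disk_multiplier(years):
    """Factor of additional disks, d(t) = exp(t * (k_HPC - k_io_perf)), needed to
    keep I/Os per second per unit of compute constant."""
    return math.exp(years * (K_HPC - K_IO_PERF))

for t in (5, 10):
    print(t, disk_multiplier(t))
# At t = 10 years this gives roughly 45x, consistent with the ~46x increase cited in the text.
```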

In order to show the importance of efficient usage of I/O resources, in Figure 2-5 we show

how many disks will be required to work in perfect parallel to maintain system balance in the

coming years. Given these growth trends we will need to use 46 times more disks in ten years in

order to maintain the same balance of I/Os per second on a classical supercomputer. For

example in 2004, a typical 10 TeraFLOP supercomputer may be capable of 500,000 theoretical

I/Os per second, or 50 I/Os per GigaFLOP. Systems of this size normally have on the order of 5,000 disk drives. Given historical growth rates, a comparable supercomputer would have

3.4 PetaFLOPs in 2014 and need to be capable of 170,000,000 theoretical I/Os per second to










keep the same balance. However, as disk performance is not growing at the same exponential

rate as the computer performance, it would be necessary to have 46 times more disk drives,

necessitating on the order of 230,000 disk drives. The gap might widen further in the

years to follow necessitating a fundamental change to the way we approach storage.




Figure 2-5. Number of disks required in future systems to maintain present performance levels

Based on the growth trends of the different technologies, it can be seen that I/O performance has fallen behind the other technologies, and the criticality of efficient usage of I/O resources can be appreciated. In the harsh space environment, which is prone to failures and where the technologies are slower than their counterparts on the ground, optimal usage of I/O resources is

even more critical. As mentioned earlier, checkpointing is a critical process that drives the need

to improve the I/O bandwidth with frequent and at times lengthy accesses to the disk. Hence,

optimizing the rate of checkpointing will optimize the usage of I/O resources. An alternative

would be improving the checkpointing process itself with new techniques to consume fewer

resources.

2.2.1 Checkpointing Alternatives

Several strategies are used as alternatives to traditional disk-based checkpointing. Many

researchers are working on diskless checkpointing (checkpoints are encoded and stored in a









distributed system instead of a central stable storage), for example [18], and suggest this strategy

as an alternative to conventional disk-based checkpointing to reduce the I/O traffic due to

checkpointing. The National Nuclear Security Administration (NNSA) issued a news release

about the first full-system three-dimensional simulations of a nuclear weapon explosion

(Crestone project), a significant achievement for Los Alamos National Laboratory and Science

Applications International Corporation [19]. The simulation is essentially a 24-hour, seven-day-

a-week job for more than seven months. For a highly important seven-month computation such

as Crestone, the notion of checkpointing and a rolling computation go hand-in-hand. The small

successes of using diskless checkpointing do not carry over to such large applications. True disk-based checkpoints are required during the heavy I/O phases of the code. Moreover, diskless

checkpointing is often application-dependent and not generic.

Other checkpointing alternatives include incremental checkpointing (i.e., checkpointing

only the information that has changed since previous checkpoint) and distributed checkpointing

(i.e., individual nodes in the system checkpoint asynchronously thereby reducing the

simultaneous load on the network). Although these emergent schemes may have their

advantages, more maturity is required for their use in large-scale, mission-critical applications.

In the current scenario, checkpoints are mostly synchronous with all the nodes writing

checkpoints at the same time. Additionally, the checkpoints ideally involve the storage of the

full system memory (at least 70% to 80% of the full memory in general) [19]. This scenario is

quite common when schedulers such as Condor [20] are used or when applications checkpoint

using libraries such as Libckpt [21]. The frequency of checkpointing can be decreased for

productive I/O to surpass defensive I/O, but not without the risk of losing more computation due

to system failures.









2.2.2 Optimizing Checkpointing Frequency

In this phase of the research, we model the overhead in a system due to checkpointing with

respect to the MTBF, memory capacity and I/O bandwidth of the system. In so doing, we

identify the optimum frequency of checkpointing to be used on systems with given

specifications. Optimal checkpointing helps to make efficient use of the available I/O and gain

the maximum performance out of the system without compromising on its fault tolerance

aspects. There have been similar efforts earlier to model and identify optimum checkpointing

frequency for distributed systems [22-25]. However, these efforts have not been simulatively or

experimentally verified, and the approaches as yet are too theoretical to be practically

implemented on systems.

It should be mentioned that optimizing the frequency of checkpointing is just one method

of reducing the impact of defensive I/O on I/O bandwidth requirements. Other methods might

include improvement of the storage system (high-performance storage system, high-performance

interconnects, etc.), novel methods and algorithms for checkpointing, etc. There are claims that

we can hide the impact of defensive I/O and work around this problem rather than tackling it.

However, there are no recorded proofs to substantiate such claims.

2.3 Analytical Modeling of Optimum Checkpointing Frequency

Distribution of the total running time t of a job is determined by several parameters

including:

* Failure rate λ, where λ = 1/T_MTBF and T_MTBF is the MTBF of the system
* Execution time without failure or checkpointing, T_x
* Execution time between checkpoints (checkpoint interval), T_i
* Time to perform a checkpoint (checkpointing time), T_c
* Operating system startup time, T_o
* Time to perform a restart (recovery time), T_r

Figures 2-6 and 2-7 show the various time parameters involved in the execution of a job

without and with checkpointing respectively. Without checkpointing, the system has to be

failure free for a duration of Tx for the computation to complete successfully. When a failure

occurs, computation is restarted from its initial state after system recovery. With checkpointing,

the system state is checkpointed periodically. When the system is recovered from a failure,

computation is resumed from the latest stable checkpointed state on system recovery.

App. App. App. App. App.
start restart restart restart complete

I I I I I ........
1" 1" 1"

Failure Failure Failure

Figure 2-6. Job execution without checkpointing

App. App. App.
start resume complete

IT, II T, T T II IIT T, T

Failure Output

Figure 2-7. Job execution with checkpointing

The execution process with checkpointing and failures can be modeled as a Markov

process as shown in Figure 2-8, where nodes 0, 1, 2, ..., n represent stable checkpointed states and 0′, 1′, 2′, ..., n′ represent failed states. Let t1, t2, t3, ..., tn be the random variables for the time spent in each cycle between two checkpointed states. These random variables are identically distributed and are represented by t1 in general.










Figure 2-8. Execution modeled as a Markov process

The delays associated with each event in Figure 2-8 are as follows (using the shorthand T = T_i + T_c and T′ = T_o + T_r):

a: T (a failure-free cycle); b: the time to failure within a cycle, an exponential random variable truncated at T; c: T + T′ (recovery followed by a successful re-execution of the cycle); d: the time to failure during recovery and re-execution, an exponential random variable truncated at T + T′.

The probabilities associated with each event in Figure 2-8 are as follows:

$p = e^{-\lambda T}$, $p' = e^{-\lambda(T+T')}$, $q = 1 - p$, $q' = 1 - p'$
It should be noted that the total running time is a sum of individual random variables

representing individual checkpointing cycles. However, the random variables are similar and

hence the mean total running time ($\bar{t}$) can be given as the sum of the mean running times of the cycles:

$$\bar{t} = E(t) = E(t_1) + E(t_2) + E(t_3) + \cdots + E(t_n)$$

$$\bar{t} = n \times E(t_1) \quad (2.1)$$

The mean running time of each cycle can be found by multiplying the probabilities

associated with each path in the Markov chain and the corresponding time delay. There are

several paths in the Markov chain that the process can actually take. The process can move from

state 0 to state 1 with probability p and the delay associated with that transition is a. When there

are failures in the system, the process does not directly move from state 0 to state 1. Instead, the

process moves to state 0' with probability q and loops back in the same state with probability q'.

The delays associated with these state transitions are represented by b and d, respectively; b and d are exponential random variables with upper limits of T and T+T′, respectively. With probability









p′, the system moves from the failed state to the stable state 1, and the delay associated with the transition, represented by c, is equal to T+T′, the upper limit of d.

$$E(t_1) = p a + q p'(b+c) + q q' p'(b+c+d) + q q'^{2} p'(b+c+2d) + \cdots$$

$$E(t_1) = p a + q\left[(b+c) + \frac{q' d}{1-q'}\right] \quad (2.2)$$

where

$$a = T; \quad b = \frac{1}{\lambda} - \frac{T e^{-\lambda T}}{1-e^{-\lambda T}}; \quad c = T+T'; \quad d = \frac{1}{\lambda} - \frac{(T+T') e^{-\lambda(T+T')}}{1-e^{-\lambda(T+T')}}$$

Substituting the corresponding time delays into Eq. 2.2, we get Eq. 2.3:

$$E(t_1) = T e^{-\lambda T} + \left(1-e^{-\lambda T}\right)\left[\frac{1}{\lambda} - \frac{T e^{-\lambda T}}{1-e^{-\lambda T}} + (T+T')\right] + \left(1-e^{-\lambda T}\right) e^{\lambda(T+T')}\left[\frac{1-e^{-\lambda(T+T')}}{\lambda} - (T+T') e^{-\lambda(T+T')}\right] \quad (2.3)$$

As can be seen from Eqs. 2.1 and 2.3, the expression for the mean total running time is

complicated. To avoid further complexity, we followed a different method as follows. The

Laplace transform of a functionf(x) is given by A(s) = Jf(x)e-'dx. We can find the Laplace

transform of the pdf of the total running time. The negative of the derivative of the Laplace

transform at s=0 gives the mean total running time.


$$\tilde{t}\,'(s) = \frac{d}{ds}\int f(t)\, e^{-st}\, dt = \int (-t) f(t)\, e^{-st}\, dt \quad (2.4)$$

$$\tilde{t}\,'(0) = \int (-t) f(t)\, dt = -E[t], \quad \text{i.e.,}\ E[t] = -\tilde{t}\,'(0) \quad (2.5)$$

The Laplace transform can be calculated by finding the Laplace transforms of the individual transitions in the Markov process and then combining them together. For example, the Laplace transform on the pdf of the time required for the transition from state 0 to state 1 without a failure is given by $e^{-sT}$, and the transition happens with probability p. The transform on the pdf of the time required for the transition from state 0 to state 0′, which happens with probability q, is given by

$$\varphi(s,T) = \frac{1}{1-e^{-\lambda T}} \int_0^{T} \lambda e^{-\lambda t} e^{-st}\, dt = \frac{\lambda}{\lambda+s}\cdot\frac{1-e^{-(\lambda+s)T}}{1-e^{-\lambda T}}$$

The Laplace transform on the pdf of the time required for the transition from state 0′ to state 1, which happens with probability p′, is given by $e^{-s(T+T')}$. If there is looping in state 0′ due to repeated failures, the transform on the pdf of the time spent returning to state 0′ (transition probability q′) is given by

$$\varphi(s,T+T') = \frac{1}{1-e^{-\lambda(T+T')}} \int_0^{T+T'} \lambda e^{-\lambda t} e^{-st}\, dt = \frac{\lambda}{\lambda+s}\cdot\frac{1-e^{-(\lambda+s)(T+T')}}{1-e^{-\lambda(T+T')}}$$

The Laplace transform for the process with failures (state transition from 0 to 1 via 0′) is found by taking a weighted sum of the Laplace transforms on the pdfs of the individual random variables. The transform is given by $q\,\varphi(s,T)\, p' e^{-s(T+T')}\left[q'\,\varphi(s,T+T')\right]^{i}$, where i denotes the number of loops in state 0′. Hence, the Laplace transform of the pdf for one cycle of the Markov process is given by

$$\tilde{t}_1(s) = p\, e^{-sT} + q\,\varphi(s,T)\, p' e^{-s(T+T')} \sum_{i=0}^{\infty}\left[q'\,\varphi(s,T+T')\right]^{i}$$


We get Eq. 2.6 from the above; differentiating Eq. 2.6 with respect to s and substituting s = 0, we

get mean total running time given by Eq. 2.7. We verified the validity of the expressions for the

mean total running time given by Eqs. 2.4 and 2.7 by running a Monte Carlo simulation with

10000 iterations and cross checking the results. We found that the time given by the expressions

in the equations closely matched the simulation results. We will be using the expression in Eq.

2.7 for further development for simplicity reasons both in terms of representation and

computation.


$$\tilde{t}(s) = \tilde{t}_1(s)\times\tilde{t}_2(s)\times\tilde{t}_3(s)\times\cdots\times\tilde{t}_n(s) = \left[\tilde{t}_1(s)\right]^{n}$$










$$\tilde{t}_1(s) = e^{-(s+\lambda)T} + \frac{\lambda\left(1-e^{-(s+\lambda)T}\right) e^{-(s+\lambda)(T+T')}}{s + \lambda e^{-(s+\lambda)(T+T')}} \quad (2.6)$$

$$\bar{t} = -n\,\tilde{t}_1'(0) = \frac{n}{\lambda}\left(e^{\lambda(T+T')} - e^{\lambda T'}\right) = \frac{n}{\lambda}\, e^{\lambda(T_o+T_r)}\left(e^{\lambda(T_i+T_c)} - 1\right) \quad (2.7)$$
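A Monte Carlo cross-check of Eq. 2.7 can be sketched as follows (illustrative code only, with hypothetical parameter values; it assumes the rollback semantics described above, in which a failure during recovery and re-execution restarts the entire recovery attempt):

```python
import math, random

def simulate_total_time(t_x, t_i, t_c, t_r, t_o, mtbf, runs=10000):
    """Monte Carlo estimate of the mean total running time of a job of length t_x
    that checkpoints every t_i seconds, with exponentially distributed failures."""
    lam = 1.0 / mtbf
    n = int(t_x / t_i)                    # number of checkpoint cycles
    total = 0.0
    for _ in range(runs):
        elapsed = 0.0
        for _ in range(n):
            attempt = t_i + t_c           # first attempt: work plus checkpoint
            while True:
                fail = random.expovariate(lam)
                if fail >= attempt:       # attempt completes before the next failure
                    elapsed += attempt
                    break
                elapsed += fail           # time up to the failure is lost
                attempt = t_o + t_r + t_i + t_c   # retry includes restart and recovery
        total += elapsed
    return total / runs

# Illustrative parameters: one-week job, hourly checkpoints, 8-hour MTBF
t_x, t_i, t_c, t_r, t_o, mtbf = 7 * 24 * 3600, 3600, 600, 600, 0, 8 * 3600
lam = 1.0 / mtbf
analytic = (t_x / t_i) / lam * math.exp(lam * (t_o + t_r)) * (math.exp(lam * (t_i + t_c)) - 1)
print(simulate_total_time(t_x, t_i, t_c, t_r, t_o, mtbf), analytic)  # the two should closely agree
```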


We find the optimum checkpointing interval T_i_opt that gives the minimum total running time by differentiating the mean total running time with respect to T_i and equating it to zero, as shown in Eq. 2.8. We set n equal to the ratio of T_x and T_i.

$$\frac{\partial \bar{t}}{\partial T_i} = \frac{\partial}{\partial T_i}\left[\frac{T_x}{\lambda T_i}\, e^{\lambda(T_o+T_r)}\left(e^{\lambda(T_i+T_c)} - 1\right)\right] = 0 \quad (2.8)$$

Solving for T_i in Eq. 2.8, we get the optimum checkpointing interval as follows:

$$T_{i\_opt} = \frac{1 - e^{-\lambda(T_{i\_opt}+T_c)}}{\lambda} \quad (2.9)$$

Eq. 2.9 can be represented in the form $\alpha - 1 = -\beta e^{-\alpha}$, where $\alpha = \lambda T_{i\_opt}$ and $\beta = e^{-\lambda T_c}$. From this form of representation, it can be seen that Eq. 2.9 is a transcendental equation, and there is no analytic way to obtain a closed-form solution for $\alpha$ except by defining a new function. However, it can be seen on expansion of $e^{-\alpha}$ that $\alpha$ is bounded by the limit $\alpha < 1$, which implies that $T_{i\_opt}$ is bounded by the limit $T_{i\_opt} < 1/\lambda$ (i.e., $T_{i\_opt} <$ the MTBF of the system). Also, since $e^{-\alpha} < 1$, we have $1 - \alpha < \beta$, which gives a lower bound on the optimum checkpointing interval: $T_{i\_opt} > (1 - e^{-\lambda T_c})/\lambda$. Hence, we can use

numerical methods to solve the equation for optimum checkpointing interval.
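One simple numerical approach is fixed-point iteration directly on Eq. 2.9, since the right-hand side is a contraction in T_i; the sketch below (illustrative only, with hypothetical parameter values and function names) returns the optimum interval in seconds:

```python
import math

def optimal_checkpoint_interval(mtbf_s, t_c_s, tol=1e-9):
    """Solve Eq. 2.9, lambda * T_i = 1 - exp(-lambda * (T_i + T_c)), by fixed-point iteration."""
    lam = 1.0 / mtbf_s
    t_i = (1.0 - math.exp(-lam * t_c_s)) / lam   # start from the lower bound derived above
    while True:
        t_next = (1.0 - math.exp(-lam * (t_i + t_c_s))) / lam
        if abs(t_next - t_i) < tol:
            return t_next
        t_i = t_next

# Example: MTBF = 8 hours and a checkpoint that moves 5 TB over a 5 GB/s channel (T_c = 1000 s)
print(optimal_checkpoint_interval(8 * 3600, 5e12 / 5e9))
```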









Impact of Checkpointing Overhead on I/O Bandwidth

We modeled the optimum checkpointing frequency that provides the minimum overall

running time for applications. With the model developed, we study the impact of checkpointing

overhead on the sustainable I/O bandwidth of systems in this section. The representative

performance numbers used in this study are typical of HPC and supercomputing systems on the

ground.

In our view, given a system with a specific memory capacity and MTBF, it would be

useful to study the I/O bandwidth requirements of the system with respect to the overhead that is

imposed by the checkpointing done in the system and the subsequent performance loss in terms

of the total execution time of the application. Eq. 2.10 obtains this performance loss as a

function of λ, T_i, and T_c.

$$F = \frac{\bar{t}}{T_x} = \frac{e^{\lambda(T_o+T_r)}\left(e^{\lambda(T_{i\_opt}+T_c)} - 1\right)}{\lambda T_{i\_opt}} = \frac{e^{\lambda(T_o+T_r)}\left(e^{\lambda(T_{i\_opt}+T_c)} - 1\right)}{1 - e^{-\lambda(T_{i\_opt}+T_c)}} \quad (2.10)$$

Let F denote the factor of increase in the total running time of the application due to checkpointing overhead and failures while running with the optimum checkpoint interval. F is given by the ratio of $\bar{t}$ to T_x, as in Eq. 2.10.

In Eq. 2.10, T_o can be considered negligible compared to the other times. Also, T_r can be considered equal to T_c, as both represent the time to move the same amount of data through the same I/O channel. T_c can be given by the ratio of memory capacity to I/O bandwidth. Given a value of F, we can solve for T_i_opt by solving a quadratic equation in $e^{\lambda T_{i\_opt}}$.

$$T_{i\_opt} = \frac{1}{\lambda}\ln\!\left(\frac{K + \sqrt{K^2 - 4M}}{2}\right) \quad (2.11)$$

where $K = e^{-\lambda T_c} + F e^{-2\lambda T_c}$ and $M = F e^{-3\lambda T_c}$.
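A small sketch of this calculation (assuming, as above, that T_o is negligible and T_r = T_c = memory capacity divided by sustainable I/O bandwidth; function and variable names are illustrative only):

```python
import math

def t_i_opt_for_overhead(F, mtbf_s, mem_bytes, io_bw_bytes_per_s):
    """Optimum checkpoint interval from Eq. 2.11 for a target overhead factor F.
    Returns None when the target F is not achievable at this I/O bandwidth
    (the formula then yields a non-positive interval)."""
    lam = 1.0 / mtbf_s
    t_c = mem_bytes / io_bw_bytes_per_s
    K = math.exp(-lam * t_c) + F * math.exp(-2.0 * lam * t_c)
    M = F * math.exp(-3.0 * lam * t_c)
    disc = max(K * K - 4.0 * M, 0.0)              # guard against round-off
    t_i = math.log((K + math.sqrt(disc)) / 2.0) / lam
    return t_i if t_i > 0 else None

# F = 1.2, MTBF = 24 hours, 75 TB of memory, 10 GB/s sustained I/O:
# the optimum interval comes out on the order of ten minutes, as discussed below.
print(t_i_opt_for_overhead(1.2, 24 * 3600.0, 75e12, 10e9))
```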







Figures 2-9(a) and 2-9(b) show T_i_opt and hence the impact of checkpointing on the overall

system execution time in systems with MTBF of 8 hours and 24 hours respectively for varying

system memory capacities and I/O bandwidth. The MTBF values used in the figures are typical

in recent high-performance systems [5]. The value of F is fixed at 1.2 in the figures. We pick 1.2 because F represents the impact of checkpointing on overall execution time, and lower values of F (as close to 1 as possible) are certainly desirable. It can be seen from the figures that for

systems with low I/O bandwidth, optimum checkpointing is not even possible implying that the

allowable overhead (represented by F) is not achievable. As I/O bandwidth of the system is

increased, the optimum checkpointing interval also increases. Since the total execution time has

been fixed (1.2 times the actual execution time), an increase in checkpoint interval means

a decrease in the number of checkpoints (n) during the course of the execution. Fewer checkpoints imply less overhead on the system's I/O.

It can be seen from Figure 2-9 that sustainable I/O bandwidth is key to obtaining

optimum overall execution time. As memory capacity of the system increases, so does the

requirement of higher I/O bandwidth. For systems with higher memory capacities, it is

impossible to find a solution for optimum checkpointing interval to obtain the optimum overall

execution time. For example, in a system with a MTBF of 8 hours and memory capacity of 75

TB, there is no solution for the optimum checkpointing interval until the I/O bandwidth is

increased to 29 GB/s. The impact is less in systems with higher MTBF values. For a system

with similar memory capacity but a MTBF of 24 hours, there exists a solution starting with I/O

bandwidth of 10 GB/s. However, an important factor to note is that although solutions exist for

the optimum checkpointing interval that gives the minimum total running time, the system might

be bogged down by too many checkpoints if the checkpoint interval is low. For example, in a












system with a memory capacity of 75 TB and MTBF of 24 hours, the optimum checkpointing


interval is around 10 minutes. Performing large checkpoints at such a high frequency will


certainly cause a great load on the system and is not desirable.




[Figure 2-9 comprises two panels plotting the optimum checkpoint interval against I/O bandwidth (1 to 29 GB/s) for memory capacities of 1, 5, 10, 25, 50, and 75 TB.]

Figure 2-9. Impact of I/O bandwidth on the system execution time and checkpointing overhead.
A) System with T_MTBF = 8 hours. B) System with T_MTBF = 1 day.


In certain scenarios or systems, the utility of the system within each cycle can be critical.


In a given system, let R1 denote the utility in a cycle (i.e., the ratio of time spent doing useful


calculations to the overall time spent in a cycle).



R_1 = T_i / (T_i + T_c), and when T_i = T_i_opt,

    R_1 = (1 - e^{-λ(T_i_opt + T_c)}) / (1 - e^{-λ(T_i_opt + T_c)} + λ T_c)        (2.12)


Eq. 2.12 gives the utility in each cycle when checkpointing is performed at the
optimum checkpointing interval. The checkpointing time T_c can again be given by the ratio of
CMEM (memory capacity) to IOBW (sustainable I/O bandwidth). Figures 2-10(a) and 2-10(b)
give the utility in a cycle for several memory capacities and I/O bandwidths for systems with
MTBF of 8 hours and 24 hours, respectively. The value of F is fixed at 1.2. The trend of I/O
bandwidth requirement is similar to that in Figure 2-9.
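To illustrate how curves such as those in Figure 2-10 can be generated from Eqs. 2.9 and 2.12, the sketch below computes the per-cycle utility across a range of I/O bandwidths. The function name, the simple fixed-point loop, and the example parameter values are our own assumptions made for illustration, not part of the original study.

    import math

    def utility_at_optimum(mtbf_hours, mem_tb, io_gbps):
        """Per-cycle utility R_1 of Eq. 2.12 when checkpointing at the optimum interval
        given by Eq. 2.9; T_c is taken as memory capacity divided by I/O bandwidth."""
        lam = 1.0 / mtbf_hours
        t_c = (mem_tb * 1e12) / (io_gbps * 1e9) / 3600.0       # checkpoint time (hours)
        t_i = t_c                                              # fixed-point iteration for Eq. 2.9
        for _ in range(500):
            t_i = (1.0 - math.exp(-lam * (t_i + t_c))) / lam
        x = 1.0 - math.exp(-lam * (t_i + t_c))                 # equals lambda * T_i_opt
        return x / (x + lam * t_c)                             # Eq. 2.12

    # Example: 75 TB system with an 8-hour MTBF, as in the discussion that follows
    for bw_gbps in (10, 50, 100, 150):
        print(bw_gbps, "GB/s ->", round(utility_at_optimum(8, 75, bw_gbps), 3))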


[Figure 2-10 comprises two panels plotting the utility within a checkpoint cycle against I/O bandwidth (1 to 29 GB/s) for memory capacities of 1, 5, 10, 25, 50, and 75 TB.]

Figure 2-10. Impact of increasing memory capacity in systems on sustainable I/O bandwidth
requirement. A) System with T_MTBF = 8 hours. B) System with T_MTBF = 1 day.









It can be seen from the figures that, in most cases, the utility in a checkpoint cycle is
well below 90%, which implies that a large fraction of the time within a cycle is spent
checkpointing rather than doing useful calculations. Although not shown in the figure, it was
found that for a system with an MTBF of 8 hours and a memory capacity of 75 TB, an I/O
bandwidth of about 150 GB/s is required to spend at least 90% of a checkpoint cycle on useful
calculations. Such I/O bandwidths are much higher than what is available in present systems.
The fact that, in many cases, the system cannot devote even a large fraction of each cycle to
useful computation shows the gravity of the situation.

We see from Figure 2-10 that the utility within a cycle nearly saturates beyond a certain
I/O bandwidth. The I/O bandwidth at which the percentage utility begins to flatten out is the
desirable operating point for the system, both present and future. For example, an I/O bandwidth
of around 10 GB/s would be desirable for a system with a memory capacity of 1 TB and an MTBF
of 8 hours. For a similar system with an MTBF of 24 hours, an even lower I/O bandwidth (around
5 GB/s) would suffice. But as the memory capacity of the system increases, the desirable I/O
bandwidth becomes dramatically higher.

2.4 Simulative Verification of Checkpointing Model

Simulation is a useful tool to observe the dynamic behavior of a system as its configuration

and components change. As preliminary verification, Monte Carlo simulations were performed

on the analytical model. The simulation and analytical results matched closely, verifying the
mathematical correctness of the model. However, for more accurate verification of the model,
we develop simulation models to mimic supercomputing and embedded computing environments using

Mission-Level Designer (MLD), a discrete-event simulation platform developed by MLDesign

Technologies Inc. Section 2.4.1 presents the system model used to gather results, and Sections

2.4.2-2.4.6 provide details about each major component model.









2.4.1 System Model

Figure 2-11 displays the system model that we employed to conduct the simulation

experiments. The system consists of four key component models: multiple nodes, network,
central memory, and fault generator. Each component has associated parameters that are user-

definable in order to model different system settings. The components and their parameters are

discussed in detail in the subsequent sections.

Figure 2-11. System model










2.4.2 Node Model

A node is defined as a device that processes and checkpoints data, and is prone to failures.


Figure 2-12 shows the MLDesigner node model. The behavior of the node is modeled such that
it checkpoints its entire main memory at a specified time period. The time spent between


checkpoints represents useful computation, communication, and productive I/O time used to
complete a specific task (see Computation section in Figure 2-12). After a checkpoint has been

successfully completed, the nodes can continue processing data.

The node model was designed in order to provide an abstract representation of a clustered


processing element that runs a parallel job along with the other nodes in the system. If a single

node in the system fails, all the nodes must recover from the last successful checkpoint to ensure

data integrity. The statistics gathered at the node level (see Statistics section in Figure 2-12)

include completed computation time and lost computation time due to a failure. The user can










define numerous parameters for the node model including checkpoint interval, main memory

size, and application completion time.



Figure 2-12. Node model

2.4.3 Multiple Nodes Model

The multiple nodes model, illustrated in Figure 2-13, uses a capability of the MLDesigner

tool, dynamic instantiation, that allows a single block to represent multiple instances of a model.

The node model described in Section 2.4.2 is dynamically instantiated in the multiple nodes

model. The technique is used to ease the design and configuration procedure used to model large
homogeneous systems. The main function of the multiple nodes model is to ensure global

synchronous checkpoints and to collect statistics. The statistics gathered include completed

checkpoint time and total checkpoint time lost. That is, it records the amount of time taken to

successfully complete a checkpoint and also the amount of time lost when a failure occurs during

a checkpoint.


























Figure 2-13. Multiple nodes model

2.4.4 Network Model

The network model is a coarse-grained representation of a generic switch- or bus-based


network. The model uses user-defined latency and effective bandwidth parameters to calculate


network delay based on the size of the transfer. Figure 2-14 illustrates the network model.




Figure 2-14. Network model










2.4.5 Central Memory Model

The central memory model is another coarse-grained model that is used as a mechanism to

represent the storing and restoring of checkpointed data. The model accepts each checkpoint

request and sends a reply when the checkpoint is completed. When a node fails, each node is

sent its last successful checkpoint after some user-definable recovery time which represents the

time needed for the system to respond to the failure. No actual data is stored in the central

memory since the simulated system does not actually process real data. Also, the central

memory is assumed to be large enough to hold all checkpoint data, therefore overflow is not

considered. Figure 2-15 illustrates the central memory model.




Figure 2-15. Central memory model

2.4.6 Fault Generator Model

When a failure occurs in a component, the system must recover using a predefined

procedure. This procedure starts by halting each working node followed by the transmission of

the last checkpoint by the central memory model. When the failed node recovers and all nodes

receive their last checkpoint, the system begins processing once again. The fault generator










model shown in Figure 2-16 controls this process by creating a "failure event" at some random

time based on an exponential distribution with a user-definable time (i.e., MTBF) and

orchestrating the recovery process of the components in the system. For example, when a failure

occurs, the fault generator will pass a notice to the node models to let them know that they must

recover. It also tracks each node's recovery status (e.g., recovered or recovering).


Figure 2-16. Fault generator model

2.4.7 Simulation Process

Total execution time of the application, MTBF of the system, sustainable I/O bandwidth,

amount of memory to be checkpointed and frequency of checkpointing are input parameters to

the simulation model and are variable. Faults are generated in the system based on the MTBF

value input. After every checkpointing period, data (equal to the size of memory specified) is

checkpointed to the central memory. The time to checkpoint is dependent on the amount of

checkpoint data and the I/O bandwidth values. While recovering from a failure, the nodes in the

system collect the data from the central memory and the transfer time is again dependent on the









data size and the I/O bandwidth value. If a failure occurs in the system during the data recovery

process, the transfer is reinitiated and the process is repeated until successful.
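The simulation process described above can be approximated with a much simpler Monte Carlo sketch, shown below under stated assumptions (globally synchronous checkpoints, exponentially distributed failures, recovery retried until successful). It is only a rough stand-in for the MLDesigner models, and the function name and example parameter values are illustrative.

    import random

    def simulate_completion_time(work_hours, mtbf_hours, t_checkpoint, t_recover,
                                 t_interval, seed=1):
        """Wall-clock time needed to finish work_hours of computation when the system
        checkpoints every t_interval hours and fails with exponentially distributed
        inter-failure times (mean = MTBF). On a failure, work since the last checkpoint
        is lost and recovery is retried until it completes without a failure."""
        rng = random.Random(seed)
        clock, done = 0.0, 0.0
        next_failure = rng.expovariate(1.0 / mtbf_hours)
        while done < work_hours:
            segment = min(t_interval, work_hours - done)
            cycle = segment + t_checkpoint                     # compute, then checkpoint
            if clock + cycle <= next_failure:
                clock += cycle
                done += segment                                # checkpoint succeeded
            else:
                clock = next_failure                           # failure: lose this cycle
                while True:                                    # recover, retrying on failure
                    next_failure = clock + rng.expovariate(1.0 / mtbf_hours)
                    if clock + t_recover <= next_failure:
                        clock += t_recover
                        break
                    clock = next_failure
        return clock

    # Illustrative run: 360 hours of work, MTBF = 8 hours, 5 TB checkpointed at 5 GB/s
    # (about 0.28 hours per checkpoint and per recovery), checkpoint interval of 2 hours.
    print(simulate_completion_time(360.0, 8.0, 0.28, 0.28, 2.0))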

2.4.8 Verification of Analytical Model

In order to verify the correctness of our analytical model, we simulated the checkpointing

process in several systems with memory ranging from 5 TB to 75 TB and I/O bandwidth ranging

from 5 GB/s to 50 GB/s, running applications with execution times ranging from 15 days to 6

months (180 days). For each combination of values for the above parameters, the system was

simulated for MTBF values of 8 hours, 16 hours and 24 hours. The numbers listed above are

typical of high-performance computers found in several research labs and production centers. In

this section, we pick a few systems and show the simulation results to verify the analytical model

for optimum checkpointing frequency.

Figure 2-17 shows the results from simulations of the system running applications with

fault-free execution time of 15 days (15 × 24 = 360 hours). For a given value of checkpointing

interval (Ti), the y-axes in the figure give the application completion time (i.e., total time for the

application to complete with faults occurring in the system). Hence, the optimum checkpoint

interval is the value for which the application completion time is the least with the given system

resources. Based on the completion time given by Figure 2-17(a), for a system with 5 TB

memory and 5 GB/s sustained I/O bandwidth, the optimum checkpoint intervals are around 2

hours, 3 hours and 4 hours respectively when the MTBF of the system is 8 hours, 16 hours and

24 hours. Similarly, for a system with 5 TB memory and 50 GB/s sustained I/O bandwidth, the

optimum checkpoint intervals as shown by Figure 2-17(b) are 0.5 hours, 1 hour and 1 hour

respectively when the MTBF values are 8 hours, 16 hours and 24 hours.

In order to verify the correctness of the analytical model, Eq. 2.9 needs to be solved to find

T_i_opt given the same system parameters used for simulation in Figure 2-17. We use the fixed












point numerical method [26] to solve Eq. 2.9. Such strategies are quite common in solving


equations as not all problems yield to direct analytical solutions [27]. Fixed point iteration is a


method for finding the roots of an equation f(x) = 0 if it can be cast into the form x = g(x), as Eq.
2.9 can. A fixed point of the function g(x) is the solution of the equation x = g(x), and the solution is
found using an iterative technique: an initial value is supplied for the unknown variable, the function
is evaluated to produce an updated value, and the process is repeated until the answer converges.
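As a concrete illustration of this procedure, the sketch below applies fixed-point iteration to Eq. 2.9 rewritten as T_i = (1 - e^{-λ(T_i + T_c)}) / λ. It is a simplified stand-in for the solver used in this work (the function name and tolerances are our own), and the example parameters (5 TB of memory at 5 GB/s) mirror the system used for illustration in Figure 2-18.

    import math

    def solve_eq_2_9(mtbf_hours, t_c_hours, tol=1e-9, max_iter=1000):
        """Fixed-point iteration for Eq. 2.9 written as
        T_i = g(T_i) = (1 - exp(-lambda*(T_i + T_c))) / lambda.
        The derivative of g is exp(-lambda*(T_i + T_c)) < 1, so the iteration is a
        contraction and converges to the optimum checkpoint interval."""
        lam = 1.0 / mtbf_hours
        t_i = (1.0 - math.exp(-lam * t_c_hours)) / lam          # lower bound from Section 2.3
        for _ in range(max_iter):
            t_next = (1.0 - math.exp(-lam * (t_i + t_c_hours))) / lam
            if abs(t_next - t_i) < tol:
                return t_next
            t_i = t_next
        return t_i

    # Example: 5 TB of memory at 5 GB/s gives T_c of roughly 1000 s (about 0.28 hours)
    for mtbf in (8, 16, 24):
        print(mtbf, "hour MTBF ->", round(solve_eq_2_9(mtbf, 5e12 / 5e9 / 3600.0), 2), "hours")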





[Figure 2-17 comprises two panels plotting application completion time (hours) against checkpointing interval (hours) for MTBF values of 8, 16, and 24 hours.]

Figure 2-17. Optimum checkpointing interval based on simulations of systems with 5 TB
memory. A) System with I/O bandwidth of 5 GB/s. B) System with I/O bandwidth of 50 GB/s.


In order to show the validity of using a numerical method to solve our analytical model, we
show the results from the fixed-point iterative method that we used in Figure 2-18. The system


used for illustration in the figure is the one with 5 TB of memory and 5 GB/s of I/O bandwidth.


For Eq. 2.9 to have a solution (i.e., for a given checkpoint interval to be the optimum), the left-









hand side (LHS) must equal the right-hand side (RHS) of the equation. For a given value of
checkpointing interval (T_i), the y-axis in Figure 2-18, referred to as the error in the solution, gives the
difference between the LHS and the RHS of Eq. 2.9. Hence, the optimum checkpoint interval is

the one for which the error is zero. The checkpointing interval values used in our iterative

technique are bound by the lower and upper limits discussed in Section 2.3. The optimum

checkpoint intervals identified using the numerical method in Figure 2-18 for the system are 2

hours, 3 hours and 3.5 hours respectively when the MTBF is 8 hours, 16 hours, and 24 hours.

The results of the analytical model are then cross-compared with the solutions from the
simulation model shown in Figure 2-17. It can also be seen that the solutions are well below the

upper bound (MTBF) derived in Section 2.3. Section 2.5 provides the results from the

analytical model for a wide range of systems.

Solving our analytical model for a system with 5 TB of memory and 5 GB/s of I/O

bandwidth, the optimum checkpointing intervals are around 2 hours, 3 hours and 3.5 hours

respectively when the MTBF of the system is 8 hours, 16 hours and 24 hours. For a similar

system with 5 TB of memory but 50 GB/s of I/O bandwidth, the optimum checkpointing

intervals are around 0.5 hours, 0.75 hours and 1 hour respectively when the MTBF of the system

is 8 hours, 16 hours and 24 hours.

Based on the simulation results for optimum checkpoint intervals in Figure 2-17, it can be

seen that the analytical results match that of simulation. According to both analytical and

simulation models, for a system with 5 TB of memory and 5 GB/s of I/O bandwidth, the

optimum checkpointing intervals are around 2 hours, 3 hours and 4 hours respectively when the

MTBF of the system is 8 hours, 16 hours and 24 hours. Likewise for a system with 5 TB of

memory but 50 GB/s of I/O bandwidth, the optimum checkpointing intervals are around 0.5










hours, 1 hour and 1 hour respectively when the MTBF of the system is 8 hours, 16 hours and 24

hours. The analytical model is thus verified by the simulation model.


[Figure 2-18 plots the error in the solution of Eq. 2.9 against checkpointing interval (minutes) for a system with 5 TB memory and 5 GB/s I/O bandwidth, for MTBF values of 8, 16, and 24 hours.]


Figure 2-18. Numerical method solution for the analytical model

Eq. 2.9 is independent of the total execution time of the application. In order to verify this

behavior of the analytical model, we ran simulations for applications with larger execution times.

Figure 2-19 shows the results from simulations of the system running applications with a fault-free
execution time of 180 days (180 × 24 = 4320 hours). We also wanted to verify that the analytical

model holds true for systems with parameters different from those used. Hence the system

shown in Figure 2-19 has 75 TB of memory which is close to the high end of presently existing

supercomputers.

Simulation results in Figure 2-19(a) show that for a system with 75 TB of memory and 5

GB/s of I/O bandwidth, the optimum checkpointing intervals are around 6 hours, 9 hours and 12

hours respectively when the MTBF of the system is 8 hours, 16 hours and 24 hours. For the

same system, the analytical model also provides the same optimum checkpointing intervals as

can be seen from figures in Section 2.5.












Simulation results in Figure 2-19(b) show that for a system with 75 TB of memory but 50


GB/s of I/O bandwidth, the optimum checkpointing intervals are around 2 hours, 3 hours and 4


hours respectively when the MTBF of the system is 8 hours, 16 hours and 24 hours. Analytical


results for the same system match the simulation results. Thus, it can be seen from Figures 2-


17 through 2-19 that the analytical results match closely with the simulation results, further


verifying the correctness of the analytical model.


[Figure 2-19 comprises two panels plotting application completion time (hours) against checkpointing interval (hours) for MTBF values of 8, 16, and 24 hours.]

Figure 2-19. Optimum checkpointing interval based on simulations of systems with 75 TB
memory. A) System with I/O bandwidth of 5 GB/s. B) System with I/O bandwidth of 50 GB/s.


2.5 Optimal Checkpointing Frequencies in Typical Systems


The results shown in Section 2.4 were primarily used to verify the analytical model. In


this section, we show results derived from the analytical model for a wide range of systems with


parameter values representing large supercomputers traditionally developed for HPC. The


results provide insight about the checkpoint intervals that should be used in typical scenarios. In









order to analyze the applicability of our model to a broader spectrum of systems, we also studied
small systems intended for high-performance embedded computing (HPEC) in space.

It should be mentioned here that at present there are no existing systems for HPC in space.

However, there are initiatives for developing such systems, such as the one at the High-
performance Computing and Simulation Research Laboratory at UF in collaboration with NASA
and Honeywell Inc. [1]. The values supplied for the parameters (resources) in our studies are
inspired by the system at the University of Florida and hence fall in that range.

2.5.1 Traditional Ground-based HPC Systems

We show the optimal checkpointing frequency results for three different cases with

parameter values representing traditional ground-based HPC systems in Figure 2-20. The

optimal checkpointing frequencies are calculated using the same method discussed in the

previous section (i.e., fixed point iteration method). Figure 2-20(a) gives the results for a system

at the lower end with 5 TB memory while Figure 2-20(b) gives the results for a system at the

other extreme with 75 TB memory. These higher-end resources are similar to what an

application such as the Crestone project discussed earlier would solicit. The results for a

midrange system with 30 TB memory are shown in Figure 2-20(c). The optimum checkpointing
interval for typical HPC systems is on the order of hours.

An increase in MTBF implies a lower chance of faults, and hence checkpointing need not be
as frequent. It can be seen from Figure 2-20 that as the MTBF of the system increases, the
optimum checkpointing interval also increases. However, the difference is pronounced only in
systems with low I/O bandwidth. As the I/O bandwidth of the system is increased, the difference
in optimum checkpointing intervals for varying MTBF decreases. It can also be observed from
the figure that for a system with a given memory capacity and MTBF, the optimum checkpointing
interval decreases as the I/O bandwidth of the system is increased. This trend implies that it is
better to checkpoint at a higher frequency if the time to checkpoint decreases. This observation is
also supported by the fact that, for the same I/O bandwidth, the optimum checkpointing interval
increases as the memory capacity of the system increases (implying that it is optimal to checkpoint
at a lower frequency if the checkpointing time increases).


[Figure 2-20 comprises three panels plotting the optimum checkpointing interval against I/O bandwidth (5 to 50 GB/s) for MTBF values of 8, 16, and 24 hours.]

Figure 2-20. Optimum checkpointing interval for various systems. A) System with 5 TB
memory. B) System with 30 TB memory. C) System with 75 TB memory.


2.5.2 Space-based HPC Systems


Figure 2-21 shows the optimal checkpointing frequencies for systems with parameter


values in the range of HPC systems in space. Figure 2-21(a) gives the results for a system at the


lower end with 512 MB memory while Figure 2-21(b) gives the results for a system at the other


extreme with 8 GB memory. The results for a midrange system with 2 GB memory are shown in
Figure 2-21(c). The optimum checkpointing interval for typical HPEC systems is on the order of
minutes.


[Figure 2-21 comprises three panels plotting the optimum checkpointing interval against I/O bandwidth (100 to 3600 Mb/s) for MTBF values of 5, 30, and 60 minutes.]

Figure 2-21. Optimum checkpointing interval for various systems in space. A) System with 512
MB memory. B) System with 2 GB memory. C) System with 8 GB memory.


It can be seen from Figure 2-21 that the behavior of the space-based HPEC systems is
similar to that of the ground-based HPC systems shown in Figure 2-20, although HPEC systems
have far fewer resources and a higher failure rate. For a given system, the optimal checkpoint
interval increases as the MTBF value is increased. Likewise, for a system with a given MTBF and
I/O bandwidth, the optimum checkpoint interval increases as the memory of the system increases.
As the I/O bandwidth of the system is increased with other system parameters kept constant, the
optimum checkpoint interval decreases.









2.7 Conclusions and Future Research

In this phase of the research, the useful computation time of the system is improved

(increased) by optimizing checkpointing-related defensive I/O. We studied the growth in

performance of the various technologies involved in high-performance computing, highlighting

the poor performance growth of disk I/O compared to other technologies. Presently, defensive

I/O that is based on checkpointing is the primary driver of I/O bandwidth, rather than productive

I/O that is based on processor performance. There are several research efforts to improve the

process of checkpointing itself but they are generally application-specific. To our knowledge,

there have been no successful attempts to optimize the rate of checkpointing, which is our

approach to reduce the overhead of fault-tolerance. Such an approach is not application- or

system-specific and is applicable to HPC systems both in space and on the ground. The primary
contribution of this phase of research is the development of a new model of the checkpointing
process and, in so doing, the identification of the optimum rate of checkpointing for a given system.
Checkpointing at an optimal rate reduces the risk of losing much computation on a failure while
increasing the amount of useful computation time available.

We developed a mathematical model to represent the checkpointing process in large

systems and analytically modeled the optimum computation time within a checkpoint cycle in

terms of the time spent on checkpointing, the MTBF of the system, amount of memory

checkpointed and sustainable I/O bandwidth of the system. Optimal checkpointing maximizes

useful computation time without jeopardizing the applications due to failures while minimizing

the usage of resources. The optimum computation time leads to the minimal wastage of useful

time by reducing the time spent on overhead tasks involving checkpointing and recovering from

failures.









The model showed that the optimum checkpointing interval is independent of duration of

application execution and is only dependent on system resources. In order to see the impact of

checkpointing overheads and the overhead to recover from failures on the I/O bandwidth

required in the systems, we analyzed the overall execution time of applications including these

overhead tasks relative to the ideal execution time in a system without any checkpointing or

failures. Results pertaining to the time spent on useful calculations between checkpoints suggest

the need for a very high sustainable I/O bandwidth for future systems. A system with a projected

MTBF of 1 day (24 hours) and a memory capacity of 75 TB, which may be typical for a system

in the near future, would require a sustainable I/O bandwidth of about 53 GB/s to effectively use

90% of time on useful computations. For a similar system with a projected MTBF of 8 hours,

the required I/O bandwidth would be 150 GB/s for the same performance.

In order to verify our analytical model, we developed discrete-event simulation models to

simulate the checkpointing process. In the simulation models, we used performance numbers

that represent systems ranging from small embedded cluster systems (space-based) to large

supercomputers (ground-based). The simulation results matched closely with those from our

analytical model. Finally, we also showed the optimum checkpointing interval for a wide range

of HPC and HPEC systems. The results derived from the analytical model showed that

irrespective of the duration of the application's fault-free execution, the optimum checkpointing

interval is on the order of hours for HPC systems and on the order of minutes for HPEC systems.

For a system with a given MTBF, the checkpointing time determined by the ratio of memory

capacity and I/O bandwidth is the primary factor influencing the optimum checkpointing

interval.









As future work, the model can be experimentally verified with real systems running

scientific applications. When successfully verified, the model can be used to find the optimal

checkpointing frequency for various HPC and HPEC systems based on their resource

capabilities.









CHAPTER 3
EFFECTIVE TASK SCHEDULING (PHASE II)

Several techniques exist for traditional HPC to provide fault tolerance and improve overall

computation rate. We discussed the option of periodic backup of data at an optimum frequency

during the first phase of this dissertation. In this next phase of the dissertation, we focus on
exposing computation to fewer faults by reducing the overall execution time of applications through
effective task scheduling for HPC systems in space. We propose to analyze techniques to
effectively schedule tasks on parallel reconfigurable hardware resources that could be part of the
HPC space system.

3.1 Introduction

Recently, the processing speed and efficiency requirements of HPC applications have

driven research efforts in a different direction toward systems augmented with customizable

hardware such as Field-Programmable Gate Arrays (FPGAs). Application-Specific Integrated

Circuits (ASICs) almost always outperform General-Purpose Processors (GPPs) in terms of raw

performance for a given application but tend to have a much higher development cost and cannot

be adapted once developed. By contrast, RC offers flexibility at multiple levels by exposing the

raw building blocks to form complex logic operations optimized for the task at hand. In

addition, when a task is completed, the same hardware can be reconfigured to perform a

completely different task. FPGA-based dynamically reconfigurable computing can be more

cost-effective and flexible than ASIC solutions and simultaneously faster than GPP-based

conventional computing. RC has gained in popularity recently and there are numerous ongoing

RC-related research efforts at various institutions in addition to a large following in the HPC

vendor community with products from Cray, SRC and SGI already deployed. Empowering

clusters and supercomputers with customizable and dynamically reconfigurable hardware









resources is a potential solution to address the requirements of HPC applications in terms of

enormity, processing speed and efficiency. The strengths of GPPs and FPGAs complement each
other, and an efficient merger can produce a powerful combination. However, this area of

research is still relatively new and much research is required to bring the tools and runtime

environment to a level of maturity seen in traditional HPC systems.

HPC systems in space might potentially involve such FPGA-based reconfigurable

computing alongside traditional HPC based on GPPs. The reason is two-fold: the first one being

that FPGAs fit well with the embedded systems mentality of space computing and the second

being the improved speedup in the execution of tasks. We followed a modeling-based approach

in the first phase to improve the useful computation time. In this phase, we reduce the execution

time of tasks, thereby exposing them to a smaller number of faults. However, reducing execution
time requires effective scheduling of tasks. Since FPGA-based parallel computing is a relatively
new field of research, there are not many standardized and well-researched mechanisms for
scheduling RC applications.

In this phase of the dissertation, we explore some initial steps toward the goal of effective

task scheduling in space-based HPC systems enhanced with reconfigurable hardware

accelerators. Developing a robust service for automated scheduling, execution, and management

of tasks including RC tasks on network-based clusters, or any scalable distributed system, is a

major challenge. Performance modeling, dynamic scheduling, and management of distributed

parallel RC applications and systems have not been well studied to our knowledge. We take a

first step towards the solution of automated job management by analyzing via simulation the

performance of several traditional scheduling heuristics on the computing resources as part of a

future automated job management service for parallel RC systems such as [28]. We test these









heuristics by executing typical applications used in the evaluation of Department of Energy

(DOE) systems on simulated RC resources in space. We also develop a performance model to

predict the overall execution time of tasks on RC resources used by our scheduling heuristics to

schedule these tasks. With this analysis, we progress towards our primary goal of fault tolerance.

Additionally, studying the scheduling of tasks on RC processing elements has similarities to, and
hence helps inform, task scheduling on the traditional processing elements involved in HPC in space.

The rest of the chapter is organized as follows. A brief background on RC and information

on representative RC resources and applications used in our simulations is provided in the

second section. In the third section, we develop the performance model used for our scheduling

heuristics and the various heuristics that we compare in this phase of research. We discuss the

simulation results and compare the different heuristics in the fourth section. Finally, the

conclusions and contributions are presented in the fifth section.

3.2 Background

A general background of RC is not included here. The topic has been well presented by

many sources such as Compton and Hauck [29] and Enzler, Plessl and Platzner [30]. In this

section, we provide background information on the RC resources and HPC applications used to

drive our simulations.

3.2.1 Representative RC Systems

As part of this work, we studied several RC systems including PCI-based boards in clusters

of workstations, the Cray XD1 with a fast message-passing interconnect between processors and

FPGAs, and SGI's RASC for the Altix350 which features a global shared-memory system

between all system components. The speedups that the RC systems provide in our simulations

are based on empirical measurements of their characteristics and performance on sample

applications. A brief description of the RC systems studied follows.









We studied three boards attached to host machines as peripheral components via a PCI

slot: BenNUEY, RC1000, and CPX2100. The BenNUEY [31] RC board from Nallatech hosts a

Virtex-II 6000 FPGA and an attached BenBLUE-II daughter card with two Virtex-II 6000s for a

total of three Virtex-II 6000s. For a PCI-based board, the BenNUEY provides a fairly

substantial amount of processing power. The architecture allows for DIME module expansion of

up to six Virtex-II 8000s in addition to the FPGA on the BenNUEY in a tightly-coupled package.

The RC1000 from Celoxica [32] contains a Virtex 2000E FPGA. This PCI-based board includes

a front-side memory interface that can provide significant storage and direct-memory access

capability while using minimal FPGA resources in contrast with boards having off-chip memory

behind the FPGA. The CPX2100 [33] from Tarari contains three Virtex-II 2000s, although only

two are available for user designs. The third FPGA is used for memory management, PCI bus

management, and other control processes allowing the other two FPGAs to be used for user

designs more efficiently.

SGI's Altix family, built upon Intel Itanium 2 processors and the Linux OS, scales up to

thousands of processors via their NUMAflex global shared-memory architecture. The Altix line

also employs the NUMAlink message-passing interconnect with 1 μs MPI short-message latency

and 3.2 GB/s unidirectional bandwidth per link. SGI has integrated a Virtex-II 6000 FPGA,

known as the Reconfigurable Application-Specific Computing (RASC) module [34], directly to

the NUMAlink fabric, allowing the RASC hardware to integrate to systems up to 512 processors

and beyond. Cray has also created a system with RC capabilities in the XD1 line by integrating

an FPGA directly onto each 2- or 4-way symmetric multiprocessor node's local RapidArray

connection. Two system flavors exist today, one including Virtex-II Pros and another including

Virtex-4 FPGAs. The FPGA considered in our simulations belongs to the Virtex-4 category.









3.2.2 Representative Applications

The representative applications that were used in our simulations were chosen to possess

heterogeneous characteristics in terms of execution time of individual tasks, memory locality,

computational intensity (ratio of computation to memory access), etc. The choice of applications

and their performance numbers are based on the performance evaluation performed by Vetter et

al. [35]. It should be mentioned that we do not consider the in-depth characteristics of the

applications and only consider their comparative performance pattern. Also, the applications

chosen are traditional HPC applications that have been well studied before and are not specific to

space computing. The reason is that space applications on HPC and RC environments are not

well studied yet and there are no standard performance evaluations of these applications to our

knowledge. We use seven benchmark applications (many of them are used on ASC1 platforms

of the Department of Energy) including sPPM, SMG2000, SPHOT, IRS, UMT, AZTEC, and

SWEEP3D. The applications are truly scalable, scaling to thousands of processors, and some

have executed on platforms using as many as 8000 processors. A coarse-grained, distributed

memory model is used by all the applications. Below is a brief description of the applications as

described by [35].

* sPPM [36] solves three-dimensional gas dynamics problem (compressible Euler
equations) using a simplified Piecewise Parabolic Method (PPM). PPM is a finite volume
technique in which each grid point uses the information at the four nearest neighbors along
each spatial dimension to update the values of its variables.

* SMG2000 [37] is a parallel semicoarsening multigrid solver for linear systems arising
from finite difference, volume, or element discretizations of the diffusion equation on
logically rectangular grids.





1. The Advanced Simulation and Computing (ASC) Program, earlier known as the Accelerated Strategic Computing Initiative (ASCI), is
the National Nuclear Security Administration (NNSA) collaborative program among several of DOE's national laboratories.









* SPHOT is a two-dimensional photon transport code to track particles through a logically
rectangular 2-D mesh. Monte Carlo transport solves the Boltzmann transport equation by
directly mimicking the behavior of photons.

* Sweep3D [38] solves a three-dimensional, time-independent, particle transport equation on
an orthogonal mesh using a multidimensional wavefront algorithm for deterministic
particle transport simulation.

* IRS [39] is an implicit radiation solver code to solve radiation transport by the flux-limited
diffusion approximation using an implicit matrix solution.

* UMT is a three-dimensional, deterministic, multigroup, photon transport code for
unstructured meshes solving first-order form of the steady-state Boltzmann transport
equation.

* AZTEC [40] is a parallel iterative library for solving linear systems.

3.3 Task Scheduling

Scheduling is the process of assigning tasks to required resources. Efficient task

scheduling is a critical part of automated management and is imperative to maximize application

performance and system utilization. Scheduling can be broadly classified as static scheduling

wherein tasks are assigned resources prior to the start of application execution, and dynamic

scheduling wherein tasks are assigned resources dynamically while applications are executing.

In general, dynamic scheduling often implies that tasks are migrated between resources while

executing based on the availability and performance of the various resources at hand. Static

scheduling typically simplifies runtime management requirements but often leads to inefficient

resource utilization. Dynamic scheduling can improve overall application execution times but

requires a more complicated runtime management mechanism, especially in a heterogeneous

environment. Another form of scheduling which blends the two previous schemes schedules

tasks as and when they arrive (no fixed static schedules) but without any dynamic task migration

capabilities. We choose this intermediate approach for our simulations to provide the balance

between efficiency and complexity.










Task scheduling can be further categorized into divisible workload scheduling and

independent task scheduling. In divisible workload scheduling, the workload can be divided into

arbitrary sized chunks for scheduling. This type of division of workload enables flexible

scheduling and hence can be more efficient. However, in general, it may be difficult to divide a

workload into chunks of any arbitrary size. By contrast, independent task scheduling schedules

the workload that has been already divided into independent tasks and this is the scheme we

adopt for our simulations. Our simulations only focus on resource-level scheduling (i.e., a task

scheduled to a resource owns the resource until completed, at which time other tasks may use the

resource).

3.3.1 Scope of the Scheduler

Jobs represented by a task-data dependency graph such as a directed acyclic graph (DAG)

are submitted to a Job Manager (JM) component of the automated management service. Figure

3-1 shows DAGs for two sample jobs.

[Figure 3-1 shows directed acyclic graphs for two sample jobs: job J with tasks J1, J2, J3a, J3b, and J4, and job M with tasks M1, M2a, M2b, M2c, M3a, and M3b, with edges indicating task dependences between successive levels.]

Figure 3-1. Task dependence using directed acyclic graphs for sample jobs

The JM provides the scheduler with the task details (i.e., performance requirements and the

order of execution of tasks). The tasks are scheduled in the same order as they are represented in

the DAG. If multiple tasks can be executed simultaneously (i.e., if the tasks are in the same level









on the DAG), then the tasks can be scheduled in a batch mode. The communication

dependencies between the tasks (that might be executing simultaneously) are not considered in

the performance model that we use to schedule the tasks.

3.3.2 Performance Model

Tasks are scheduled on a set of machines (we use the terms machines and resources

interchangeably) based on the performance of the tasks on the machines. Generally, in any

scheduler, the job of predicting task performance is based on a model which when supplied with

the parameters representing both tasks and machines gives the expected performance of the tasks

on the machines. Many such performance models exist for traditional HPC systems but few for

RC systems. The models that exist are too complex for simulative studies of scheduling

heuristics [41]. Hence, in this section, we develop a performance model to predict the overall

execution time of a task on a machine given the characteristics of the task and the machine. We

use this performance model for our simulations. Eq. 3.1 through 3.5 give the performance model

that we develop for our system.

OET = T_HCOMM + T_HCOMP + T_BCOMM + T_BCOMP        (3.1)

OET: Overall execution time
T_HCOMM, T_HCOMP: Communication and computation time involving the host machine
T_BCOMM, T_BCOMP: Communication and computation time involving the RC board

T_BCOMP = T_CONF + T_BOARD        (3.2)
T_BCOMM = T_BCF + (S_DATA × N_TRANSFER)        (3.3)
T_HCOMP = T_HCOMP_NO        (3.4)
T_HCOMM = T_HCF + T_EXE + T_HDATA        (3.5)

T_CONF: Board configuration time
T_BOARD: Board execution time
T_BCF: Configuration file transfer time to the board from the host
S_DATA: Average size per data transfer between board and host
N_TRANSFER: Average number of data transfers per execution
T_HCOMP_NO: Computation time in the host that is non-overlapping with computation in the board
T_HCF: Configuration file transfer time to the host
T_EXE: Host execution file transfer time to the host
T_HDATA: I/O data transfer time to the host
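A minimal sketch of how Eq. 3.1 through 3.5 can be evaluated is given below. The function name, the conversion of transfer sizes to times through an assumed effective bus bandwidth, and all parameter values in the example call are our own assumptions for illustration; the host-side terms default to zero to reflect the simplifications described in the following paragraphs.

    def overall_execution_time(t_conf, t_board, config_kb, data_kb_per_transfer,
                               n_transfers, bus_mb_per_s,
                               t_host_nonoverlap=0.0, t_host_comm=0.0):
        """Evaluate the overall execution time of Eq. 3.1 for one task (seconds).
        Transfer sizes are converted to times with an assumed effective bus bandwidth;
        the host-side terms default to zero, matching the simplifications used in the
        simulations (local files, host computation overlapped with the RC board)."""
        size_to_time = lambda kb: (kb / 1024.0) / bus_mb_per_s          # KB -> seconds
        t_bcomp = t_conf + t_board                                      # Eq. 3.2
        t_bcomm = size_to_time(config_kb) + n_transfers * size_to_time(data_kb_per_transfer)  # Eq. 3.3
        t_hcomp = t_host_nonoverlap                                     # Eq. 3.4
        t_hcomm = t_host_comm                                           # Eq. 3.5
        return t_hcomm + t_hcomp + t_bcomm + t_bcomp                    # Eq. 3.1

    # Illustrative call for an RC1000-class board (1200 KB configuration file, Table 3-2)
    # over an assumed effective PCI bandwidth of 250 MB/s; all other values are made up.
    print(overall_execution_time(t_conf=0.05, t_board=10.0, config_kb=1200,
                                 data_kb_per_transfer=512, n_transfers=100,
                                 bus_mb_per_s=250.0))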


In general, a portion of the task executes on the GPP machine which hosts the FPGA

resource and controls the FPGA portion of the task. We call applications with such tasks dual-

paradigm applications, one paradigm being computing with the GPP and the other the FPGA-

based RC. The overall execution time of a dual-paradigm application's task can be broadly

divided into computation time and communication time corresponding to the host machine and

the RC board. Communication time corresponding to the host includes the time to move the host

executable, configuration file for the FGPA and I/O data to the host from their respective source

(remote) locations. Computation time corresponding to the host involves modeling the host and

has been extensively researched by many conventional computing researchers. For our

simulations, we are interested only in the RC part of the tasks. Hence we ignore the

communication part of the host by assuming that the executables and I/O data are available

locally in the host and the computation part of the host by assuming that the host part of

execution overlaps with the RC (hence hidden).

The communication time corresponding to the RC board includes the time to move the

configuration file and I/0 data from the host to the RC board. The computation time

corresponding to the RC board includes the time to configure the FPGA and the actual execution

time of the task. The communication time (corresponding to both the host and the RC board)

involved in the overall execution time is critical only when the data transfer time is very large.

Such cases are common in computational grids or other such distributed computing

environments over long-haul networks. Our assumption to neglect the communication time









corresponding to the host is made because we are only interested in systems that are not

distributed over wide-area networks in this research.

3.3.3 Scheduling Heuristics

In general, scheduling decisions are affected by the relations between the tasks scheduled,

heterogeneity of the resources on which the tasks are scheduled, and the performance model that

is used to predict the performance of the tasks on the available resources. The quality of

the scheduler is greatly impacted by the performance model. We discussed the performance model

developed for our simulations previously. Given a set of independent tasks and a set of available

resources, our goal is to minimize the overall execution time of the entire task set. Based on the

model, we can predict the time it takes for each task to execute on each resource. Assuming

there are N tasks to be scheduled and there are M machines available, an N x M matrix can be

created with each element in the matrix representing the performance of a task on the

corresponding machine in terms of a certain metric. The metric could be anything required by

the user such as execution time, memory used, etc. However, it is not always possible to assign

the best machine to each task due to resource contention. So we use heuristics to identify the

machine to which each task should be assigned.

In this dissertation, we compare the performance of six such heuristics for scheduling of

tasks [42] on parallel RC systems. The heuristics compared in this work include opportunistic

load balancing (OLB), minimum execution time (MET), minimum completion time (MCT),

switching algorithm (SA), MIN-MIN, and MAX-MIN. These heuristics have been studied

widely before for general-purpose systems but not RC systems. The following is a brief

description of each heuristic.

* MCT: Assign each task to the machine with minimum completion time (machine available
time + estimated time of computation).









* MET: Only consider the expected execution time of each task on the machines and select
the machine that provides the minimum execution time for the task.

* SA: Switch between MCT and MET based on load imbalance (if load imbalance increases
above a threshold, revert back to MCT from MET). The ratio of completion time when
scheduled using MCT to that when scheduled using MET is used to switch between the
two heuristics.

* OLB: Assign each task to the next available machine without considering the expected
execution time on that machine. The performance of the machine for the task being
scheduled is not considered.

* MIN-MIN: First, select a "best" MCT machine for each task. Second, from all tasks,
schedule the one with minimum completion time (send a task to the machine which is
available earliest and executes the task fastest).

* MAX-MIN: Same first step as MIN-MIN but schedule the task with maximum completion
time (tasks with long completion time are scheduled first on the best available machines
and executed in parallel with other tasks, hence better load-balancing).

The first four heuristics (i.e., MCT, MET, SA, and OLB) are categorized as inline

scheduling heuristics because they are used for scheduling individual tasks as they arrive. The

last two heuristics (i.e., MIN-MIN and MAX-MIN) are categorized as batch scheduling heuristics, as
they are used to schedule multiple tasks simultaneously. The batch heuristics are commonly used to
schedule massively parallel applications with many parallel tasks; a sketch of one inline heuristic
(MCT) and one batch heuristic (MIN-MIN) is given below.
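As referenced above, the following sketch illustrates how two of these heuristics can be realized on top of an N × M matrix of predicted execution times. It is a simplified illustration under assumed inputs (etc, the expected-time-to-compute matrix, and ready, the machine available times), not the simulator used in this work.

    def mct_schedule(etc, ready):
        """Inline MCT: assign each arriving task (row of etc) to the machine with the
        minimum completion time, i.e., machine ready time + predicted execution time."""
        schedule = []
        for task, times in enumerate(etc):
            m = min(range(len(ready)), key=lambda j: ready[j] + times[j])
            schedule.append((task, m))
            ready[m] += times[m]
        return schedule

    def min_min_schedule(etc, ready):
        """Batch MIN-MIN: repeatedly pick, over all unscheduled tasks, the (task, machine)
        pair with the smallest completion time, commit it, and update the machine."""
        unscheduled = set(range(len(etc)))
        schedule = []
        while unscheduled:
            task, m = min(((t, j) for t in unscheduled for j in range(len(ready))),
                          key=lambda tm: ready[tm[1]] + etc[tm[0]][tm[1]])
            schedule.append((task, m))
            ready[m] += etc[task][m]
            unscheduled.remove(task)
        return schedule

    # Toy example: etc[i][j] is the predicted execution time of task i on machine j.
    etc = [[4.0, 6.0], [3.0, 2.0], [8.0, 9.0], [1.0, 1.5]]
    print(mct_schedule(etc, [0.0, 0.0]))
    print(min_min_schedule(etc, [0.0, 0.0]))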

3.4 Simulative Analysis

The characteristics of the applications used in our simulations are given in Table 3-1 based

on the performance evaluation performed by Vetter et al. [35]. The tests in [35] were run on an

IBM SP system, located at Lawrence Livermore National Laboratory and composed of 68 IBM

RS/6000 NightHawk-2 16-way SMP nodes using 375 MHz IBM 64-bit POWER3-II CPUs [43].

The POWER3 provides unit-stride hardware prefetching and hence applications with higher

memory locality benefit more in this architecture. The computational intensity gives the ratio of

operations to number of memory accesses (loads and stores).










Based on our study of the characteristics of the RC systems and the speedups that they

provide for applications in general, we have summarized the characteristics of the RC systems

used in our simulations in Table 3-2. It should be noted that the base speedup given in the table

is relative and not absolute speedup. For example, if a task speedup is 1.3 times on the RC1000

compared to a GPP machine, then the speedup for the same task is 1.8 times on the Tarari. The

speedup numbers are merely representative and may vary widely depending on the specific

algorithm but these values represent a form of general behavior for each system.

Table 3-1. Characteristics of applications
Application   Execution Time (sec)   Computational Intensity   Memory Locality (Spatial)
sPPM          19.0                   1.45                       VERY HIGH
SMG2000       4.0                    0.08                       LOW
SPHOT         14.0                   1.70                       HIGH
Sweep3D       3.5                    0.75                       MEDIUM
IRS           135.0                  0.45                       VERY HIGH
UMT           350.0                  1.50                       HIGH
Aztec         14.0                   0.45                       VERY HIGH

Table 3-2. Characteristics of RC systems
Board/System   Estimated Base Speedup   Config File Size (KB)   Bus Speed
RC1000         1.3                      1200                    PCI (64, 66)
TARARI         1.8                      500                     PCI (64, 66)
NALLATECH      2.5                      2600                    PCI (64, 33)
SGI            2.8                      2733                    NUMAlink
CRAY           3.0                      2377                    RapidArray

We know the performance of the tasks on a GPP from Table 3-1. For the performance

model that we use for task scheduling in our simulations, we require the execution time

(predicted) of the tasks on the RC systems. We calculate this execution time as

    RCExecutionTime = (BaseExecutionTime × MemoryLocality) / (BaseSpeedup × ComputationalIntensity)

where BaseExecutionTime, MemoryLocality and ComputationalIntensity are provided in Table 3-1 while BaseSpeedup is









provided in Table 3-2. We assume that the speedup of a task is higher in an RC system when the

computational intensity is higher. If the memory locality of a task is higher, it implies that

POWER3's unit-stride hardware prefetching has benefited the task. The task will not be able to
take advantage of this facility in an RC system and hence the speedup will decrease. The numerical

values used for MemoryLocality in the calculation of RCExecutionTime are as follows: 0.99 for

VERY HIGH, 0.94 for HIGH, 0.87 for MEDIUM, and 0.80 for LOW.
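Using the formula above together with Tables 3-1 and 3-2, the predicted RC execution times used by the scheduler can be tabulated as in the sketch below. The dictionaries simply restate the two tables; the function name and the rounding in the example are our own.

    LOCALITY = {"VERY HIGH": 0.99, "HIGH": 0.94, "MEDIUM": 0.87, "LOW": 0.80}

    # (base execution time in seconds, computational intensity, spatial locality) from Table 3-1
    APPS = {
        "sPPM":    (19.0,  1.45, "VERY HIGH"),
        "SMG2000": (4.0,   0.08, "LOW"),
        "SPHOT":   (14.0,  1.70, "HIGH"),
        "Sweep3D": (3.5,   0.75, "MEDIUM"),
        "IRS":     (135.0, 0.45, "VERY HIGH"),
        "UMT":     (350.0, 1.50, "HIGH"),
        "Aztec":   (14.0,  0.45, "VERY HIGH"),
    }

    # Estimated base speedups from Table 3-2
    BOARDS = {"RC1000": 1.3, "Tarari": 1.8, "Nallatech": 2.5, "SGI": 2.8, "Cray": 3.0}

    def rc_execution_time(app, board):
        base_time, intensity, locality = APPS[app]
        return (base_time * LOCALITY[locality]) / (BOARDS[board] * intensity)

    # Example: predicted time for UMT on the Cray system, 350*0.94/(3.0*1.50), about 73 s
    print(round(rc_execution_time("UMT", "Cray"), 1))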

3.4.1 Simulation Setup

Scheduling is mainly influenced by the execution time of the tasks on the specific RC

resource. Hence the other parameters in our performance model are kept constant.

Configuration file management is a research area in itself that tries to identify the best ways to

store a configuration file, location of storage, replication of heavily accessed files, etc. to

optimally retrieve files and configure the FPGAs [44]. Analyzing these and many such tradeoffs

in detail would be good research for the future. In this research, we restrict our scope to the

analysis of the scheduling heuristics alone. We assume that the configuration files specific to

any application task are available on the machine hosting the RC resource. Only the time to

move the configuration file from the host machine to the FPGA and configure the FPGA is

considered in our simulations.

The following describes the four different simulation setups investigated:

* Homogeneous tasks on homogeneous machines: All the machines in the cluster are the
same. The tasks that arrive for scheduling are the same.

* Heterogeneous tasks on homogeneous machines: All the machines in the cluster are the
same. The tasks that arrive are a mix of the 7 tasks listed earlier. To maintain complete
heterogeneity in the simulation the tasks are chosen in a round-robin fashion.

* Homogeneous tasks on heterogeneous machines: The cluster is composed of a mix of the
RC resources listed. In a heterogeneous setup with 12 machines in the system, there are 3
RC1000s, 3 Tararis, and 2 each of Nallatechs, SGIs, and Crays. The tasks that arrive to be
scheduled are the same.









* Heterogeneous tasks on heterogeneous machines: The cluster is composed of a mix of the
RC resources as described in the previous setup. The tasks that arrive are also a mix of the
seven tasks listed above.

3.4.2 Simulation Results on Large-Scale Systems

In this section, we study the simulation results on large-scale systems typically found in

HPC systems on ground. The task arrival to the scheduler is modeled as a Poisson distribution.

The total time for which the tasks arrive is fixed at 5000 seconds. The mean value of the Poisson

distribution is fixed at 5 sec, 10 sec, and 20 sec to simulate the arrival of 1000 tasks, 500 tasks

and 250 tasks, respectively. The number of machines in the system is fixed at 12 for all

simulations. The average makespan (i.e., time taken to complete all the tasks that arrive) and

average sharing penalty (i.e., difference between the completion time for a task in the present

simulation setup and when the task is executed alone without any other task in the system) are

measured. However, only makespan is shown in this dissertation. The results are an average of

five trials.
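A short sketch of how the arrival process and the makespan metric can be generated is shown below. Interpreting the stated means as the mean inter-arrival times of a Poisson process is our reading of the setup; the earliest-start assignment and the fixed task execution time are stand-ins for the heuristics and predicted RC times discussed earlier, and the function names are our own.

    import random

    def generate_arrivals(mean_interarrival_s, horizon_s=5000.0, seed=7):
        """Poisson arrival process: exponential inter-arrival times with the given mean,
        generated until the 5000-second arrival window closes."""
        rng = random.Random(seed)
        t, arrivals = 0.0, []
        while True:
            t += rng.expovariate(1.0 / mean_interarrival_s)
            if t > horizon_s:
                return arrivals
            arrivals.append(t)

    def makespan(arrivals, exec_time_s, n_machines=12):
        """Makespan (time to complete all arrived tasks) under a simple earliest-start
        assignment of identical tasks to the 12 machines."""
        ready = [0.0] * n_machines
        finish = 0.0
        for t in arrivals:
            j = min(range(n_machines), key=lambda m: max(t, ready[m]))
            ready[j] = max(t, ready[j]) + exec_time_s
            finish = max(finish, ready[j])
        return finish

    # A mean inter-arrival time of 5 s over 5000 s yields roughly 1000 tasks, as in the text.
    arr = generate_arrivals(5.0)
    print(len(arr), "tasks, makespan =", round(makespan(arr, exec_time_s=14.0), 1), "s")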

Figure 3-2 shows the performance of inline scheduling heuristics when homogeneous tasks

(all SPPM in this case) are scheduled on a homogeneous system. There is no difference in the

makespan of the tasks for MCT and OLB as long as the tasks and machines are homogeneous.

Since the tasks execute very fast, the makespan is driven only by the arrival rate of the tasks.

Since there are 12 machines in the system, it will take at least 60 seconds on average, with the

fastest arrival rate (mean of 5 sec), for a machine to be scheduled with the next task. But the

SPPM task does not execute for 60 sec on any of the machines. Hence the difference in speedup

between the machines does not affect the makespan of the tasks across the machines for MCT

and OLB. However, for the MET heuristic, all the tasks can potentially be scheduled on the first

machine since MET schedules a task on a machine that provides the least execution time for the

task. With a homogeneous system, each machine provides the same execution time and hence













the first machine is always picked; hence the large performance difference between MET and the other two heuristics (MCT and OLB). As the machines' speed increases, the difference between MET and the other heuristics decreases.
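For reference, the three inline heuristics discussed in this paragraph (MET, MCT, and OLB) can be sketched as below; exec_time[m] is the model-predicted execution time of the arriving task on machine m, ready[m] is the time at which machine m becomes free, and the function and variable names are ours. The SA heuristic, which switches between MCT and MET, and all scheduler bookkeeping are omitted.

```c
/*
 * Illustrative sketch of the three inline heuristics (names and layout are
 * ours): exec_time[m] is the model-predicted execution time of the arriving
 * task on machine m, and ready[m] is the time at which machine m becomes free.
 */
#include <stdio.h>
#include <float.h>

/* MET: machine with the smallest execution time, ignoring current load. */
static int met_pick(const double exec_time[], int m)
{
    int best = 0;
    for (int i = 1; i < m; i++)
        if (exec_time[i] < exec_time[best]) best = i;
    return best;
}

/* OLB: earliest-available machine, ignoring execution time. */
static int olb_pick(const double ready[], int m)
{
    int best = 0;
    for (int i = 1; i < m; i++)
        if (ready[i] < ready[best]) best = i;
    return best;
}

/* MCT: machine giving the minimum completion time (ready + execution). */
static int mct_pick(const double ready[], const double exec_time[], int m)
{
    int best = 0;
    double best_ct = DBL_MAX;
    for (int i = 0; i < m; i++) {
        double ct = ready[i] + exec_time[i];
        if (ct < best_ct) { best_ct = ct; best = i; }
    }
    return best;
}

int main(void)
{
    double ready[3]     = { 0.0, 5.0, 12.0 };   /* hypothetical machine-ready times */
    double exec_time[3] = { 30.0, 20.0, 25.0 }; /* hypothetical execution times     */
    printf("MET -> %d, OLB -> %d, MCT -> %d\n",
           met_pick(exec_time, 3), olb_pick(ready, 3), mct_pick(ready, exec_time, 3));
    return 0;
}
```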


[Figure 3-2: bar charts of average makespan versus number of tasks (250, 500, 1000) for SPPM on homogeneous systems of each RC platform]

Figure 3-2. Scheduling of SPPM on various RC systems


Figure 3-3 shows the performance of inline scheduling heuristics for the same setup as


Figure 3-2 but for UMT, which has a longer execution time compared to SPPM. Both IRS and


UMT have long execution times but we have shown only the results for UMT here due to space


restrictions. It can be seen from Figure 3-3 that for tasks with longer execution times, the


average makespan is driven by the actual execution times and not by the arrival rate of tasks.












Makespan increases with the number of tasks and decreases with faster machines in the system.


Again, as the tasks and machines are homogeneous, MCT and OLB perform the same. MET


performs poorly compared to MCT and OLB because, with MET, there is a possibility of only a


few machines that perform well being loaded with tasks. With MCT and OLB, the tasks are well


distributed among the machines in general.


[Figure 3-3: bar charts of average makespan versus number of tasks (250, 500, 1000) for UMT on homogeneous systems of each RC platform (RC1000, Tarari, Nallatech, SGI, Cray)]

Figure 3-3. Scheduling of UMT on various RC systems


Figure 3-4 shows the performance of inline scheduling heuristics when heterogeneous


tasks are scheduled on a system with homogeneous machines. Since the machines are


homogeneous, MCT and OLB perform the same. The makespan for MCT and OLB does not












differ between machines for the same reason previously stated (i.e., not enough load on the


machines).


[Figure 3-4: bar charts of average makespan versus number of tasks (250, 500, 1000) for heterogeneous tasks on homogeneous systems of each RC platform (RC1000, Tarari, Nallatech, SGI, Cray)]

Figure 3-4. Scheduling of heterogeneous tasks on various RC systems


Figure 3-5 shows the performance of inline scheduling heuristics when homogeneous tasks


are scheduled on a system with heterogeneous machines. There is no difference in performance


between MCT and OLB for tasks that do not have long execution times. However, when the


machines are highly loaded, as in the case of 1000 tasks of UMT and IRS, MCT performs


slightly better than OLB. Thus we can say that for the difference in performance between MCT


and OLB to be realized, the machine workload must be high. In our opinion, such heavy loading











of machines is difficult to realize in reality as it would require a large number of long-running


tasks.


[Figure 3-5: bar charts of average makespan versus number of tasks (250, 500, 1000) for each homogeneous task type (SPPM, SMG, SPHOT, IRS, UMT, and others) on a heterogeneous set of RC machines]

Figure 3-5. Scheduling of various homogeneous tasks on a system with heterogeneous RC machines

In order to find a performance difference between MCT and OLB, we also ran simulations for a system with 6 machines. Even for such a setup (results not shown in figure), the system load was not high enough to widely differentiate the performance between MCT and OLB (e.g., the makespan for 1000 tasks was 7186 sec for MCT and 7301 sec for OLB).


[Figure 3-6: bar chart of average makespan versus number of tasks (250, 500, 1000) for heterogeneous tasks on a heterogeneous set of RC machines]

Figure 3-6. Scheduling of heterogeneous tasks on a system with heterogeneous RC machines


[Figure 3-7: bar charts of average makespan versus number of tasks (250, 500, 1000) for heterogeneous tasks batch-scheduled on homogeneous systems of each RC platform (RC1000, Tarari, Nallatech, SRC, Cray)]

Figure 3-7. Batch scheduling of various heterogeneous tasks on a system with homogeneous RC machines










The results shown until now were for inline scheduling heuristics. Figure 3-7 shows the

performance of batch scheduling heuristics for heterogeneous tasks on a system with

homogeneous machines. The results for setups when homogeneous tasks are scheduled are not

shown because MAX-MIN and MIN-MIN heuristics perform the same as long as the tasks are

homogeneous. The results show that MAX-MIN performs better than MIN-MIN in terms of the

average makespan. MAX-MIN also improves with the number of tasks being scheduled.

For heterogeneous tasks on a system with heterogeneous RC machines (shown in Figure 3-

8), both batch scheduling heuristics show the same behavior seen in Figure 3-7. MAX-MIN

performs better than MIN-MIN in all trials.
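A compact sketch of the two batch heuristics, in their usual textbook formulation, is given below: at each step every unscheduled task's minimum completion time over the machines is computed, and MIN-MIN commits the task with the smallest such value while MAX-MIN commits the task with the largest. The matrices, sizes, and names are illustrative placeholders, not the simulator used in this research.

```c
/*
 * Illustrative sketch of the MIN-MIN / MAX-MIN batch heuristics.
 * exec[t][m] holds model-predicted execution times; use_max_min selects
 * between the two heuristics. Values and names are placeholders.
 */
#include <stdio.h>
#include <float.h>

#define T 4   /* tasks in the batch (hypothetical) */
#define M 3   /* machines           (hypothetical) */

static void batch_schedule(double exec[T][M], double ready[M], int use_max_min)
{
    int done[T] = {0};
    for (int step = 0; step < T; step++) {
        int pick_t = -1, pick_m = -1;
        double pick_ct = use_max_min ? -1.0 : DBL_MAX;
        for (int t = 0; t < T; t++) {
            if (done[t]) continue;
            /* Minimum completion time of task t over all machines. */
            int best_m = 0;
            double best_ct = DBL_MAX;
            for (int m = 0; m < M; m++) {
                double ct = ready[m] + exec[t][m];
                if (ct < best_ct) { best_ct = ct; best_m = m; }
            }
            /* MIN-MIN keeps the smallest of these minima, MAX-MIN the largest. */
            if ((use_max_min && best_ct > pick_ct) || (!use_max_min && best_ct < pick_ct)) {
                pick_ct = best_ct; pick_t = t; pick_m = best_m;
            }
        }
        done[pick_t] = 1;
        ready[pick_m] = pick_ct;
        printf("task %d -> machine %d (completes at %.1f)\n", pick_t, pick_m, pick_ct);
    }
}

int main(void)
{
    double exec[T][M] = {{40,60,90},{10,15,20},{30,25,55},{80,70,65}};
    double ready[M] = {0, 0, 0};
    batch_schedule(exec, ready, 1);   /* 1 = MAX-MIN, 0 = MIN-MIN */
    return 0;
}
```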


[Figure 3-8: bar chart of average makespan versus number of tasks (250, 500, 1000) for heterogeneous tasks batch-scheduled on a heterogeneous set of RC machines]

Figure 3-8. Batch scheduling of heterogeneous tasks on a system with heterogeneous RC machines

3.4.3 Simulation Results on Small-Scale Systems

In the previous section, we studied the performance of the heuristics on relatively large

systems executing on the order of several hundred tasks. However, space systems are likely to be smaller, both in system size and in the number of tasks executing on them. Hence, the performance of the heuristics is briefly studied for small-scale systems in this section to

compare with the results presented for large-scale systems in the previous section.

The task arrival to the scheduler is again modeled as a Poisson distribution. The total time

for which the tasks arrive is fixed at 200 seconds. The mean value of the Poisson distribution is










fixed at 5 sec, 10 sec, and 20 sec to simulate the arrival of 40 tasks, 20 tasks and 10 tasks,

respectively. The number of machines in the system is fixed at 3 for all simulations, comprising an RC1000, a Tarari and an SRC. The average makespan (i.e., time taken to complete all the tasks

that arrive) is measured for the same simulation setups described in the previous section namely

homogeneous tasks on homogeneous machines, heterogeneous tasks on homogeneous machines,

homogeneous tasks on heterogeneous machines and heterogeneous tasks on heterogeneous

machines.

Figure 3-9 shows the performance of inline scheduling heuristics when homogeneous tasks

are scheduled on a homogeneous system. The performance of the heuristics in the small-scale

systems was similar to that in large-scale systems and hence only two cases, namely SPPM on RC1000 and UMT on SGI, are shown here. There is no difference in the makespan of the tasks

for MCT and OLB as long as the tasks and machines are homogeneous. Since the tasks execute

very fast, the makespan is driven only by the arrival rate of the tasks for the case of SPPM.

However, for UMT with longer execution times, makespan increases with the number of tasks

and decreases with faster machines in the system.

[Figure 3-9: bar charts of average makespan versus number of tasks (10, 20, 40) for SPPM on homogeneous RC1000 machines and UMT on homogeneous SGI machines]

Figure 3-9. Scheduling of homogeneous tasks on homogeneous systems

Figure 3-10 shows the performance of inline scheduling heuristics when heterogeneous

tasks are scheduled on a system with homogeneous machines. Since the machines are

homogeneous, MCT and OLB perform the same. The makespan for MCT and OLB does not











differ much between machines for the same reason previously stated (i.e., not enough load on the


machines).

[Figure 3-10: bar charts of average makespan versus number of tasks (10, 20, 40) for heterogeneous tasks on homogeneous RC1000 and SGI systems]

Figure 3-10. Scheduling of heterogeneous tasks on various RC systems

Figure 3-11 shows the performance of inline scheduling heuristics when homogeneous


tasks, namely SMG and IRS, are scheduled on a system with heterogeneous machines. The trend


is similar to that in large-scale systems. There is no difference in performance between MCT


and OLB since the machines are not overloaded with tasks.


[Figure 3-11: bar charts of average makespan versus number of tasks (10, 20, 40) for SMG and IRS on a heterogeneous set of RC machines]

Figure 3-11. Scheduling of homogeneous tasks on heterogeneous systems

The performance of the heuristics was also studied in a heterogeneous system when


heterogeneous tasks were scheduled and the trend was similar to that seen in large-scale systems


and hence those results are not shown here. In summary, the heuristics perform similarly in both


large-scale and small-scale systems.









3.5 Conclusions and Future Research

Recently, systems augmented with FPGAs offering a fusion of traditional parallel and

distributed machines with customizable and dynamically reconfigurable hardware have emerged

as a cost-effective alternative to traditional systems. However, providing a robust runtime

environment for such systems to which HPC users have become accustomed has been fraught

with numerous challenges. Dynamic scheduling of HPC applications in such parallel

reconfigurable computing (RC) environments is one such challenge and has not been sufficiently

studied to our knowledge.

In this phase of research, we sought to improve the overall execution time of applications by analyzing methodologies for effective task scheduling. Reducing the overall execution time of applications exposes them to fewer faults in the system. We analyzed the

performance of several scheduling heuristics that can be used to schedule application tasks on

HPC systems. Typical HPC applications and FPGA/RC resources were used as representative

cases for our simulations. A performance model was also developed to predict the overall

execution time of tasks on RC resources. The model was used by our scheduling heuristics to

schedule the tasks. The performance model and the extensive simulative analysis of scheduling heuristics for FPGA-based RC applications and platforms are the first of their kind and constitute the primary research contribution of this phase.

Among the inline scheduling heuristics, MCT and OLB have similar performance unless

the system is nearly fully utilized. When the system is very heavily loaded, MCT outperforms

OLB, however such high loads may not be realistic. In other cases, the performance of OLB

equals MCT. MET performs poorly in all trials. SA, which switches between MCT and MET, performs better than MET but not as well as MCT and OLB. Among the batch

scheduling heuristics, MAX-MIN outperforms MIN-MIN in terms of average makespan. Even









with a small 12-machine system, the load on the system was low. With larger systems, such low loading will further diminish the performance difference between the scheduling heuristics. In a smaller system with 3 machines, typical of what would be available in space systems, the performance trend was similar to that of the 12-machine system. Also, space systems are unlikely to be continuously heavily loaded with tasks, and hence a simple scheduling heuristic such as OLB might suffice for scheduling tasks.

The intent of this phase of research is to analyze several scheduling heuristics and compare

their performance to identify the suitable candidates for use in space-based HPC systems. We

have shown the performance results for average makespan. Based on the performance metric

that is critical for a given environment, the corresponding scheduling heuristics can be used to

schedule tasks. In the future, the knowledge gained so far in identifying suitable scheduling heuristics can be applied to an experimental scheduler in the job management service intended for the space-based HPC system being developed at the HCS Research Laboratory at the University of Florida. Another interesting direction for future research would be to use typical parallel RC applications to

study the validity of the simulative analysis by comparing with the effectiveness of task

scheduling (in terms of makespan) of the job management service.









CHAPTER 4
A FAULT-TOLERANT MESSAGE PASSING INTERFACE (PHASE III)

Fault tolerance is a critical factor for HPC systems in space to meet the emerging high-

availability and reliability requirements. Recovery from failure needs to be faster and automatic

while the impact of failures on the system as a whole should be minimal. In the previous phases

of this dissertation, we addressed the issue of minimizing the impact of failures through indirect

approaches, mechanisms that do not address direct recovery from faults. The indirect

approaches certainly avoid computation loss but in order to enable applications to meet high-

availability and high-reliability requirements we need to consider other options. Some of the

options include: incorporating fault-tolerant features directly into the applications, developing

specialized hardware that is fault-tolerant, making use of and enhancing the fault-tolerant

features of the operating system, and developing application-independent middleware that would

provide fault-tolerant capabilities. Among these options, developing application-independent

middleware has the minimal intrusion in the system and can support any general application

including legacy applications that fall into the umbrella of the corresponding middleware model.

In this phase of the dissertation, we investigate, design, develop and evaluate a fault-

tolerant, application-independent middleware for embedded cluster computing known as FEMPI

(Fault-tolerant Embedded Message Passing Interface). We also present performance results with

FEMPI on a traditional PC-based cluster and on a COTS-based, embedded cluster system

prototype for satellite payload processing in development at Honeywell Inc. and the University

of Florida for the Space Technology 8 (ST-8) mission of NASA's New Millennium Program.

We take a direct approach to provide fault-tolerance and improve the availability of the HPC

system in space. FEMPI is a lightweight fault-tolerant variant of the Message Passing Interface

(MPI) standard.









4.1 Introduction

Because of its widespread usage, MPI [45] has emerged as the de-facto standard for

development and execution of high-performance parallel applications. By its nature as a

communication library facilitating user-level communication among a group of processes, the

MPI library needs to maintain global awareness of the processes that collectively constitute a

parallel application. An MPI library consequently emerges as a logical and suitable place to

incorporate selected fault-tolerant features in order to enable legacy and new applications to meet

the emerging high-availability and reliability requirements of HPC systems in space. However,

fault tolerance is absent in both the MPI-1 and MPI-2 standards. To the best of our knowledge,

no satisfactory products or research results offer an effective path to scalable, fault-tolerant computing applications for cluster systems, and specifically for resource-constrained embedded systems such as those in space. However, there have been a few efforts

to develop lightweight MPI [46] and more specifically for embedded systems [47, 48] but

without fault-tolerance. In this dissertation, we present the design and analyze the characteristics

of FEMPI, a new lightweight, fault-tolerant message passing middleware for clusters of

embedded systems. The scope of this paper is focused upon a presentation of the design of

FEMPI and performance results with it on a COTS-based, embedded cluster system prototype

for space. Considering the small scale of the embedded cluster, we also provide results from

experiments on an Intel Xeon cluster to show scalability of FEMPI. The experiments on the

Xeon cluster highlight the compatibility of FEMPI across platforms and also permit performance

comparisons with conventional MPI middleware. Finally, we also study the performance of

FEMPI in comparison to conventional MPI variants with a real application.

Performance and fault-tolerance are in general competing goals in HPC. Achieving a

sufficient level of fault-tolerance for a large set of practical environments with minimal or no









impact on performance is a significant challenge. Providing fault-tolerant services in software

inevitably adds extra processing overhead and increases system resource usage. In this phase of

the research, we focus on developing a message-passing middleware architecture that can

successfully address both fault tolerance and performance; when the two conflict, fault tolerance is given priority. We address fault-tolerance issues in both the MPI-1 and MPI-2 standards, with

attention to key application classes, fault-free overhead, and recovery strategies. When

developed, the architecture would provide key new capabilities to parallel programs,

programmers, and cluster systems, including the enhancement of existing commercial

applications based on MPI. The optimizations targeted at the popular recovery mechanisms, for

key classes of applications, can be applied to any middleware and hence would result in

improving the performance of applications in general.

The rest of this chapter is organized as follows. Section 4.2 provides background on

several existing fault-tolerant MPI implementations for traditional general-purpose HPC systems.

Section 4.3 discusses the architecture and design of FEMPI. Failure-free performance results are

presented in Section 4.4 while failure recovery performance is analyzed in Section 4.5.

Performance of FEMPI when used in a real application is discussed in Section 4.6. Section 4.7

concludes the chapter and summarizes insights and directions for future research. The final section

provides the summary and research scope.

4.2 Background and Related Research

In this section, we provide an overview of MPI and its inherent limits relative to fault

tolerance. Also included is a brief survey and summary of existing tools with features for

bringing fault tolerance to MPI, albeit primarily targeting conventional, resource-plentiful HPC

systems and their applications instead of embedded, mission-critical systems that are the

emphasis of our work.









4.2.1 Limitations of MPI Standard

The MPI forum released the first MPI standard, MPI-1, in 1995, with drafts provided over

the previous 18 months. The main goals of the standard in this release were high performance

and portability. Achieving reliability typically includes utilization of additional resources and

methodologies. This additional utilization conflicts with the main goal of high performance and

supports the MPI Forum's decision for limited reliability measures. Emphasis on high

performance thus led to a static process model with limited error handling. The success of an

MPI application is guaranteed only when all constituent processes finish successfully. "Failure"

or "crash" of one or more processes leads to a default application termination procedure which is

in fact required standard behavior. Consequently, current designs and implementations of MPI suffer inadequacies in various aspects of providing reliability.

Fault Model: MPI assumes a reliable communication layer. The standard does not provide

methods to deal with node failures, process failures, and lost messages. MPI also limits faults

recognized in the system to incorrect parameters in function calls and resource errors. This

coverage of faults is incomplete and insufficient for high-scale parallel systems and mission-

critical systems that are affected by transient faults such as SEUs.

Fault Detection: Fault detection is not defined by MPI. The default mode of operation of MPI

treats all errors as fatal and terminates the entire parallel job. MPI provides for limited fault

notification in the form of return codes from the MPI functions. However, critical faults, such as

process crashes, may preempt functions from returning these return codes to the caller. The

ability to continue execution after the return of certain error codes is completely ambiguous, left as an implementation-specific property, and hence not portable.

Fault Recovery: MPI provides users with functions to register error-handling callback functions.

These callback functions are invoked by the MPI implementation in the event of an error in MPI









functions. Callback functions are registered on a per communicator (communication context

defined in MPI for groups of processes) basis and do not allow per-function based error handlers.

Callback functions provide limited capability and flexibility and cannot be invoked in case of

process crashes and hangs.
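For illustration, the sketch below shows how an application registers such a per-communicator callback using the standard MPI-2 error-handler calls; it assumes an MPI-2-compliant implementation, and the deliberately invalid destination rank is only a convenient way to trigger an error. As noted above, a callback of this kind is never invoked if the peer has crashed or hung.

```c
/*
 * Sketch of per-communicator error handling with the standard MPI-2 calls.
 * The callback replaces the default MPI_ERRORS_ARE_FATAL behavior; it cannot
 * help when a remote process has crashed or hung.
 */
#include <mpi.h>
#include <stdio.h>

static void report_error(MPI_Comm *comm, int *errcode, ...)
{
    char msg[MPI_MAX_ERROR_STRING];
    int len;
    (void)comm;
    MPI_Error_string(*errcode, msg, &len);
    fprintf(stderr, "MPI error reported by callback: %s\n", msg);
    /* The application decides whether and how to continue. */
}

int main(int argc, char **argv)
{
    MPI_Errhandler errh;

    MPI_Init(&argc, &argv);
    MPI_Comm_create_errhandler(report_error, &errh);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, errh);

    /* Trigger an error on purpose: an invalid destination rank
     * (assumes the job has far fewer than 12345 processes). */
    int dummy = 0;
    MPI_Send(&dummy, 1, MPI_INT, 12345, 0, MPI_COMM_WORLD);

    MPI_Errhandler_free(&errh);
    MPI_Finalize();
    return 0;
}
```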

The MPI Forum released the MPI-2 standard in 1998, after a multi-year standards process.

MPI-2 consists of extensions in the areas of process creation and management, one-sided

communications, extended collective operations, and parallel I/O. A significant contribution of

MPI-2 is Dynamic Process Management (DPM), which allows user programs to create and

terminate additional groups of processes on demand. DPM may be used to compensate for the

loss of a process while MPI I/O can be used for checkpointing the state of the applications.

However, the lack of failure detection precludes the potential for added reliability.

4.2.2 Fault-tolerant MPI Implementations for Traditional Clusters

Several research efforts have been undertaken to make MPI more reliable for traditional

HPC systems. This section introduces some of these efforts and analyzes their approaches in

providing a reliable MPI middleware.

CoCheck [49] from the Technical University of Munich, Germany, is a checkpointing environment

for parallel applications and is one of the earliest efforts to make MPI more reliable. CoCheck

was primarily targeted for process migration, load balancing, and stalling long-running

applications for later resumption. CoCheck extends the single process checkpoint mechanisms

to a distributed message-passing application. Unlike most checkpointing middleware, CoCheck

sits on top of the message passing system and provides checkpointing transparent to the

application. CoCheck incurs a large overhead by checkpointing entire process state and requires

a centralized coordinator. Recovery of a dead process is achieved by a recovery function run at










the user level. The status of inconsistent internal data structures in message-passing middleware

is not addressed. Thus, CoCheck provides coarse reliability measures for the MPI.

MPICH-V [50] from the University of Paris-South, France, is an MPI environment that is

based upon uncoordinated checkpointing/rollback and distributed message logging. The

architecture as shown in Figure 4-1 relies on channel memories and checkpoint servers. It is

assumed that on a failure, a node is no longer reachable and the computations by the failed node will have no impact on the eventual results. Channel memories are special nodes acting as middlemen through which all communications pass and hence are logged. The

dispatcher is a coordinating node that schedules tasks to computing nodes and coordinates

resources as well. On a failure, the coordinator detects the failure and, with the help of the

channel memories and checkpointing schedulers, the task is rescheduled on fault-free nodes.

Although MPICH-V suffers from single points of failure, it is suitable for master/worker types of

MPI applications.

[Figure 4-1: block diagrams of the MPICH-V1 and MPICH-V2 architectures, showing the dispatcher, channel memories/event loggers, checkpoint servers and schedulers, and the computing nodes with their communication daemons]

Figure 4-1. MPICH-V architecture (Courtesy: [50])

Starfish [51] from the Technion University, Israel, is an environment for executing

dynamic MPI-2 programs. Figure 4-2 illustrates the architecture of Starfish. Every node

executes a starfish daemon and many such daemons form a process group using the Ensemble

group communication toolkit [52].










[Figure 4-2: Starfish architecture, showing the per-machine Starfish daemon with its group communication (Ensemble) module, lightweight membership management module, and lightweight endpoint modules]

Figure 4-2. Starfish architecture (Courtesy: [51])

The daemons are responsible for interacting with clients, spawning MPI programs and

tracking and recovering from failures along with group membership. Starfish uses an event

model that requires the processes and components to register to listen on events. The event bus

that provides a fast data path supplies messages to reflect cluster changes and process failures.

Each application process includes a group communication handler module to interact with the

daemon, an application module that includes the user supplied MPI code, a checkpoint/restart

module, an MPI module and a virtual network interface (VNI). The architecture allows for any

checkpoint/restart protocol implementation. Likewise, the architecture can be ported to any

network by providing a thin layer inside the VNI pertaining to the particular network.

Egida [53] from the University of Texas, Austin, is an extensible toolkit to support

transparent rollback recovery. Egida as shown in Figure 4-3 is built around a library of objects

that implement a set of functionalities that are the core of all log-based rollback recovery. Any

arbitrary rollback protocol can be specified and Egida can synthesize an implementation. Egida

also allows for the coexistence of multiple implementations. Egida has been ported to MPICH

[54], a widely used and free implementation of MPI. Egida shares some of its drawbacks with













CoCheck. Egida checkpoints the state of both processes and messages and may lead to large


overheads in some cases.


[Figure 4-3: Egida architecture, showing its library of modules for piggybacking, checkpointing, event logging, determinants, output commit, failure detection, and stable/volatile storage beneath the API]

Figure 4-3. Egida architecture (Courtesy: [53])


Implicit FT-MPI [55] from the University of Cyprus takes an approach similar to CoCheck


but is a more stripped-down version. Implicit FT-MPI targets only the master/slave model. A separate


observer process to which the master node sends all the messages is responsible for coordination


of all the processes and also records the status of all the processes. The observer is also


responsible for creation of new processes on failure of slave nodes and, if the master node fails,


the observer takes up the role of the master itself. Implicit FT-MPI is very simple in terms of


features available and suffers from a single point of failure.









LAM/MPI [56], an implementation of MPI from Indiana University and Ohio

Supercomputing Center also has some fault-tolerant features built into it. Special functions such

as lamgrow and lamshrink are used for dynamic addition and deletion of hosts. A fail-stop

model is assumed and, whenever a node fails, it is detected as dead and the resource manager

removes the node from the host lists. All the surviving hosts are notified asynchronously and the

MPI library invalidates all the communicators that include the dead node. Pending

communication requests are marked as errors. Since attempts to use invalid communicators raise

errors, applications can detect these errors and free the invalid communicators and new

communicators can be created if necessary.

FT-MPI [57] from University of Tennessee, Knoxville, attempts to provide fault-tolerance

in MPI by extending the MPI process states and communicator states from the simple {valid,

invalid} as specified by the standard to a larger number of states. A communicator is an

important scoping and addressing data structure defined in the MPI standard that defines a

communication context, and a set of processes in the context. The range of communicator states

specified by FT-MPI gives the application the ability to decide how to alter the communicator, its state, and the behavior of communication between intermediate states when a failure occurs. In case of a failure, FT-MPI returns a handle back to the application.

MPI communicator functions specified in the MPI-2 standard are used to shrink, expand or

rebuild communicators. FT-MPI provides for graceful degradation of applications but has no

support for transparent recovery from faults. The design concept of FEMPI is also largely based

upon FT-MPI, although there are significant differences given that FEMPI is designed to target

embedded, resource-limited, and mission-critical systems where faults are more commonplace,

such as payload processing in space.









4.3 Design of FEMPI

As described in the previous section, several efforts have been made to develop fault-

tolerant MPI implementations. However, all the designs described target conventional large-scale

HPC systems with sufficient system resources to cover the additional overhead for fault

tolerance. More importantly, very few among those designs have been successfully implemented

and are mature enough for practical use. Also, the designs are quite heavyweight primarily

because fault tolerance is based on extensive message logging and checkpointing. Many of the

designs are also based on centralized coordinators that can incur severe overhead and become a

bottleneck in the system. An HPC system in space requires a reliable and lightweight design of

fault-tolerant MPI. Among those that exist, FT-MPI from University of Tennessee [57] was

viewed to be the one with the least overhead and is the most mature design in terms of

realization. However, FT-MPI is built atop a metacomputing system called Harness that can be

too heavy for embedded systems to handle.

For the design of FEMPI, we try to avoid design and performance pitfalls of existing HPC

tools for MPI but leverage useful ideas from these tools. FEMPI is a lightweight design in that it

does not depend upon any message logging, specialized checkpointing, centralized coordination

or other large middleware systems. Our design of FEMPI resembles FT-MPI. However, the

recovery mechanism in FEMPI is different in that it is completely distributed and does not

require any reconstruction of communicators. On the other hand, FT-MPI requires the

reconstruction of communicators on a failure for which the system enters an election state and a

voting-based leader selection is performed. The leader is responsible for distributing the new

communicator to all the nodes in the system.









4.3.1 FEMPI Architecture

Fault tolerance is provided through three stages including detection of a fault, notification

of the fault, and recovery from the fault. In order to reduce development time and to ensure

reliability, FEMPI is built atop a commercial high-availability (HA) middleware called Self-

Reliant (SR) from GoAhead Inc. whose services are used to provide detection and notification

capabilities. The primary functions of the HA middleware are resource monitoring, fault

detection, fault diagnosis, fault recovery, fault reporting, cluster configuration, event logging,

and distributed messaging. SR is based on a small, reliable, cross-platform kernel that provides

the foundation for all standard services, and its extensions. The kernel also provides a portability

layer limiting user dependencies on the underlying operating system and hardware.

SR allows processes to heartbeat through certain fault handlers and hence has the potential

to detect the failure of processes and nodes, enabled by the Availability and Cluster Management

Services (AMS and CMS) in Figure 4-4. CMS manages the physical nodes or instances of SR,

while AMS manages the logical representation of these and other resources in the availability

system model. The fault notification service for application processes is developed as an

extension to SR. Application heartbeats are managed locally within each node by the service and

only health state changes are reported externally by a lightweight watchdog process. In the

system, the watchdog processes are managed in a hierarchical manner by a lead watchdog

process executing on a controller node. State transition notifications from any node may be

observed by agents executing on other nodes by subscribing to the appropriate multicast group

within the reliable messaging service. Also, SR is responsible for discovering, incorporating,

and monitoring the nodes within the cluster along with their associated network interfaces. SR

also guarantees reliable communication via network interface failover capabilities and in-order

delivery of messages between the nodes in the system through its Distributed Messaging Service










(DMS). It should be mentioned here that a lightweight version of SR is used for our prototype

with just enough services for cluster management. Moreover, FEMPI only uses minimal

services of SR such as failure detection and DMS. Hence, FEMPI with SR would be less

stressful on the embedded cluster as opposed to FT-MPI with Harness which is intended for

large clusters.

[Figure 4-4: FEMPI architecture, showing the FEMPI runtime environment and control agent on each data processing node, the control process and MPI Restore on the system controller node, and the Self-Reliant services linking them (DMS for communication, AMS and CMS for health monitoring and failure notification). AMS: Availability Management Service; CMS: Cluster Management Service]

Figure 4-4. Architecture of FEMPI

Figure 4-4 shows the architecture of FEMPI. The application is required to register with

the Control Agent in each node, which in turn is responsible for updating the health status of the

application in that node to a central Control Process. The Control Process is similar to a system

manager, scheduling applications to various data processing nodes. In the ST-8 mission, the

Control Process will execute on a radiation-hardened, single-board computer that is also

responsible for interacting with the main controller for the entire spacecraft. Although such

system controllers are highly reliable components, they can be deployed in a redundant fashion

for highly critical or long-term missions with cold or hot sparing.

With regard to MPI applications, failures can be broadly classified as process failures

(individual processes of an MPI application crash) and network failures (communication failure









between two MPI processes). FEMPI ensures reliable communication (reducing the chances of

network failures) with all the low-level communication through DMS. A process failure in

conventional MPI implementations causes the entire application to crash, whereas FEMPI avoids

application-wide crashes when individual processes fail. MPI Restore, a component of FEMPI,

resides on the System Controller and communicates with the Control Process to update the status

of nodes. On initialization by the application using the MPI_Init function call, the FEMPI

runtime environment attached to the application process subscribes itself to a Health Status

channel. The same procedure is performed for all the MPI processes corresponding to the

application executing in all the nodes. It is through this channel that updates about the status of

nodes are received from MPI Restore. On a failure, MPI Restore notifies all other MPI processes

regarding the failure via DMS. The status of senders and receivers (of messages) is checked in

FEMPI before communication to avoid trying to establish communication with failed processes.

If the communication partner (sender or receiver) fails after the status check and before

communication, then a timeout-based recovery is used to recover out of the MPI function call.

With the initial design of FEMPI as described in this dissertation, we focus only on

selected baseline functions of the MPI standard. The first version of FEMPI includes 19 baseline

functions shown in Table 4-1, of which four are setup, three are point-to-point messaging, five

are collective communication, and seven are custom data-definition calls. The function calls

were selected based upon profiling of popular and commonly used space science kernels such as

Fast Fourier Transform, LU Decomposition, etc. With the ability to develop these common

application kernels, the baseline functions would be sufficient to support the desired applications

for the ST-8 mission. Provision of just certain baseline functions (along with ability to develop

the other functions defined in the MPI standard) in order to support the development of desired










applications is common during the design of a new MPI implementation [46-48, 58]. In our

initial version of FEMPI, we also focus only on blocking and synchronous communications,

which is the dominant mode in many MPI applications. Blocking and synchronous

communications require the corresponding calls (e.g. send and receive) to be posted at both the

sender and receiver processes for either process to continue further execution.

Table 4-1. Baseline MPI functions for FEMPI in this phase of research

Function            | MPI Call       | Type       | Purpose
Initialization      | MPI_Init       | Setup      | Prepares the system for message-passing functions.
Communication Rank  | MPI_Comm_rank  | Setup      | Provides a unique node number for each node.
Communication Size  | MPI_Comm_size  | Setup      | Provides the number of nodes in the system.
Finalize            | MPI_Finalize   | Setup      | Disables communication services.
Send                | MPI_Send       | P2P        | Sends data to a matching receive.
Receive             | MPI_Recv       | P2P        | Receives data from a matching send.
Send-Receive        | MPI_Sendrecv   | P2P        | Simultaneous send and receive between two nodes (i.e., send and receive using the same buffer).
Synchronization     | MPI_Barrier    | Collective | Synchronizes all the nodes together.
Broadcast           | MPI_Bcast      | Collective | "Root" node sends the same data to all other nodes.
Gather              | MPI_Gather     | Collective | Each node sends a separate block of data to the "root" node, providing an all-to-one scheme.
Scatter             | MPI_Scatter    | Collective | "Root" node sends a different set of data to each node, providing a one-to-all scheme.
All-to-All          | MPI_Alltoall   | Collective | All nodes share their data with all other nodes in the system.
Datatype            | MPI_Type_*     | Custom     | Seven functions to define custom data types useful for complicated calculations.


4.3.2 Point-to-Point Messaging (Unicast Communication)

The basic communication mechanism of MPI, referred to as 'point-to-point messaging' is

the transmittal of data between a pair of processes, one sending and the other receiving. A set of

send and receive functions allow the communication of data of a specific datatype with an

associated tag. The tag allows selectivity of messages at the receiving end: one can receive on a

particular tag, or one can wild-card this quantity, allowing reception of messages with any tag.

Message selectivity on the source process of the message is also provided.









DMS, the messaging service of SR, operates by managing distributed virtual multicast

groups with publish and subscribe mechanisms over primary and secondary networks. The

publish and subscribe mechanisms are used by FEMPI for provisioning MPI point-to-point

messaging. On initialization, each FEMPI (application) process is registered to a data channel

wherein messages can be published or received. Each process registers with its unique identity

(MPI process rank). The message to be transmitted from a process is published on the data

channel with additional information to indicate the identity of the target receiver. The identity

being unique to each process, DMS filters the message to be relayed only to the target process

and is not broadcasted to the other processes.

4.3.3 Collective Communication

A collective operation is executed by having all processes in the group call the

communication routine, with matching arguments. Collective communications transmit data

among all processes in a group specified by an intracommunicator object. One exception, however, is the MPI_Barrier function, which serves to synchronize processes without passing data.

In FEMPI, the publish and subscribe mechanism of DMS is used directly for collective

routines such as broadcast or all-to-all communication that involves the transmission of a

message from a single node to all other nodes. However other routines that involve more than

one process transmitting messages are designed as variations of point-to-point messaging

(sequence of multiple point-to-point operations).
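As a simple illustration of this "sequence of point-to-point operations" approach, the sketch below builds a naive gather out of ordinary blocking send and receive calls; it is written against standard MPI functions rather than FEMPI internals, and the function name and sizes are ours.

```c
/*
 * Illustration only: a gather built from point-to-point operations, written
 * with standard MPI calls (not FEMPI internals). Every non-root process sends
 * its block to the root, which copies its own block and receives the rest.
 */
#include <mpi.h>
#include <string.h>

static void naive_gather(const void *sendbuf, void *recvbuf, int count,
                         int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        memcpy((char *)recvbuf + (size_t)rank * count, sendbuf, (size_t)count);
        for (int src = 0; src < size; src++) {
            if (src == root) continue;
            MPI_Recv((char *)recvbuf + (size_t)src * count, count, MPI_BYTE,
                     src, 0, comm, MPI_STATUS_IGNORE);
        }
    } else {
        MPI_Send(sendbuf, count, MPI_BYTE, root, 0, comm);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char mine = (char)('A' + rank);
    char all[64];                       /* assumes at most 64 processes */
    naive_gather(&mine, all, 1, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```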

4.3.4 Failure Recovery

Presently, two different recovery modes have been developed in FEMPI, namely IGNORE

and RECOVER. In the IGNORE mode of recovery, when a FEMPI function is called to

communicate with a failed process, the communication is not performed and MPI_PROC_NULL

(meaning the process location is empty) is returned to the application. Basically, the failed









process is ignored and computation proceeds without any effect of the failure. The application

can either re-spawn the process through the control process or proceed with one less process.

The MPI communicators are left unchanged. IGNORE mode is useful for applications that can

execute with reduced number of processes while the failed process is being recovered, especially

if the recovery procedure is of a long duration. With the RECOVER mode of recovery, the

failed process has to be re-spawned back to its healthy state either on the same or a different

processing node. When a FEMPI function is called to communicate with a failed process, the

function call is halted until the process is successfully re-spawned. When the process is restored,

the call is completed and control is returned to the application. Here again, the MPI communicators are left unchanged. RECOVER mode is useful for applications that cannot execute with fewer processes than they started with.
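The sketch below illustrates, at the level of a single send, how the two modes could wrap a blocking operation. peer_is_alive() is a stand-in for the health-status information FEMPI receives over the Health Status channel (it is not a real FEMPI or Self-Reliant call), and the actual FEMPI implementation differs in its details.

```c
/*
 * Hypothetical sketch of the IGNORE and RECOVER modes around a blocking send.
 * peer_is_alive() is a placeholder for the health status FEMPI obtains from
 * MPI Restore; it is not a real FEMPI or Self-Reliant API call.
 */
#include <mpi.h>
#include <unistd.h>

typedef enum { FEMPI_IGNORE, FEMPI_RECOVER } recovery_mode;

/* Placeholder stub: always reports the peer as healthy. */
static int peer_is_alive(int rank) { (void)rank; return 1; }

static int fempi_send_with_recovery(const void *buf, int count, MPI_Datatype dt,
                                    int dest, int tag, MPI_Comm comm,
                                    recovery_mode mode)
{
    if (!peer_is_alive(dest)) {
        if (mode == FEMPI_IGNORE)
            return MPI_PROC_NULL;        /* skip the failed partner entirely   */
        while (!peer_is_alive(dest))     /* RECOVER: wait for the re-spawn     */
            sleep(1);
    }
    /* Partner is (believed) healthy: perform the blocking send as usual. */
    return MPI_Send(buf, count, dt, dest, tag, comm);
}
```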

In future, we plan to develop a third mode of recovery, namely REMOVE. With the

REMOVE mode of recovery, when a process fails, it is removed from the process list. The MPI

communicator is altered (shrunk) to reflect the change that the system has one less process.

REMOVE mode would be useful for cases when any process or processes have irrecoverably

failed while running applications that cannot proceed with any failed process in the system.

However, this mode would require the coordination of all the MPI processes via an MPI control

process to consistently update the communicator.

4.3.5 Covering All MPI Function Call Categories

MPI function calls can be broadly classified into four different categories based on the

locality of impact of a failure in the system. The categories include point-to-point

communication calls with communication between specific processes, collective communication

calls with communication between a group of processes, process-specific calls that are local to a

single process, and group-specific calls that are collective in nature but do not involve explicit









message passing. The function calls developed for the initial version of FEMPI in this

dissertation, also listed in Table 4-1, have samples in all these categories (except for the non-

blocking version of point-to-point communication) indicating the ability for FEMPI to be

extended to the other function calls specified by the MPI standard.

For point-to-point communication calls, the impact of a process failure only extends to the

partner (communication partner) process. In FEMPI, the failure is handled by waiting for the

partner process to recover or by returning an error to the calling routine. FEMPI calls that fall in

this category include MPI_Send, MPI_Recv, and MPI_Sendrecv. The locality of impact of a failure during collective communication calls depends on whether the failure was in a root or non-

root process. Several collective routines such as broadcast and gather have a single originating

or receiving process. Such processes are called the root. On failure of a non-root process, the

impact is only on the root which is the solitary process communicating with non-roots. Hence all

other processes can complete successfully. The root deals with the failure similar to point-to-

point communication calls. On the other hand, if the root fails, all non-roots are impacted. The

non-root processes wait for the root to recover or return an error to the calling routine. FEMPI

calls that fall in this category include MPI_Bcast, MPI_Gather, MPI_Scatter, MPI_Alltoall and MPI_Barrier.

In the case of process-specific calls, failures do not impact any other process and hence do

not mandate any fault handling in the other non-faulty processes. FEMPI calls in this category

include MPI_Type_*, MPI_Comm_rank and MPI_Comm_size. However, recovery of the faulty

process is still required. Group-specific calls are similar to collective communication calls.

Although there is no explicit communication of application messages, these calls involve the

exchange of several control messages. The functions in this category can be considered as









collective communication calls with all the processes involved being roots. FEMPI function

calls in this category include MPI_Init and MPI_Finalize.

4.4 Performance Analysis

In this section, we discuss the performance of FEMPI based on experiments conducted on

both a prototype system and a ground-based cluster system. Beyond providing results for

scalability analysis, experiments on the ground-based cluster system demonstrate FEMPI's

flexibility and the ease with which space scientists may develop their applications on an

equivalent platform. The next section describes the setup for these systems, followed by a

discussion of the performance of FEMPI on failure-free systems. We discuss the performance of

FEMPI in systems with failures in Section 4.5.

4.4.1 Experimental Setup

For the current research phase of NASA's New Millennium Program, a prototype system

has been designed to mirror when possible and emulate when necessary the features of a typical

satellite system. The system to be launched in 2009 has been developed at Honeywell Inc. and

the University of Florida. The prototype hardware shown in Figure 4-5 consists of a collection

of single-board computers, some augmented with FPGA coprocessors, a power supply and reset

controller for performing power-off resets on a per-node basis, redundant Ethernet switches, and

a development workstation acting as satellite controller.

Six Orion Technologies COTS Single-Board Computers (SBCs) are used to mirror the

specified data processor boards to be featured in the flight experiment (four) and also to emulate

the functionality of the radiation-hardened components (two) currently under development. Each

board comprises a 650MHz IBM 750fx PowerPC, 128MB of high-speed DDR SDRAM,

dual Fast Ethernet interfaces, dual PCI mezzanine card slots, and 256MB of flash memory, and

runs MontaVista Linux. Other operating systems may be used in future (e.g. one of several real-










time variants), but Linux provides a rich set of development tools from which to leverage.

Ethernet is the prevalent network for processing clusters due to its low cost and relatively high

performance, and the packet switched technology provides distinct advantages for the system

over bus-based approaches currently in use by space systems. Of the six boards, one is used as

the primary system controller node. Another SBC is used to emulate both the backup controller

(that takes over the control functions when the primary controller encounters a failure) and a

central data store (used for storing input and output data of applications along with application

checkpoints). The data store is currently implemented as a 40 GB PMC hard drive, while the

flight system will likely include a radiation-hardened solid-state storage device. The remaining

four SBCs are used as data processing nodes which we will exercise to show the performance of

FEMPI.

[Figure 4-5: prototype testbed configuration, showing the cPCI chassis containing four data processor boards, the backup system controller with mass data store, and the primary system controller; the power and reset controller; primary and secondary Ethernet switches; and a Linux workstation acting as the spacecraft command and control processor]

Figure 4-5. System configuration of the prototype testbed

In order to test the scalability of FEMPI and see how it will perform on future systems

beyond the six-node testbed developed to emulate the exact system to be flown in 2009,

experiments were conducted on a 16-node cluster. The cluster consists of traditional server

machines each with 2.4GHz Intel Xeon processors running Redhat Linux 9.0, 1GB of DDR

RAM and connected via Gigabit Ethernet network. As previously described, the fact that the









underlying PowerPC processors in the testbed and the Xeons in the ground-based cluster have

vastly different architectures and instruction sets (e.g. they each use different endian standards) is

masked by the fact that the operating system and reliable messaging middleware provide abstract

interfaces to the underlying hardware. These abstract interfaces provide a means to ensure

portability and these experiments demonstrate that porting applications developed using FEMPI

from a ground-based cluster to the embedded space system is as easy as recompiling on the

different platform.

4.4.2 Results and Analysis

In this section, we present the performance results of FEMPI on failure-free systems where

we report the general performance of FEMPI. The relative performance of FEMPI in

comparison with conventional and commonly used MPI variants such as MPICH [54] and LAM/MPI [56] is also discussed in this section.

4.4.2.1 Point-to-point communication

Figure 4-6 shows performance of FEMPI's point-to-point communication on the Xeon

platform. The performance of the message passing middleware is reported in terms of the

application-level throughput. A specified amount of data is sent from one node (sender) to

another (receiver) and time taken to finish the communication completely is measured.

Throughput is measured as the ratio of the size of data transferred and the time taken to transfer

the data. In order to avoid any transient errors, the results reported are averages of 100 trials.
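A minimal version of this measurement, written against standard MPI calls for illustration (the same code compiles against FEMPI's baseline functions), is sketched below; the message size, trial count, and timing at the sender only are simplifications of the actual experiment.

```c
/*
 * Sketch of the throughput measurement described above: rank 0 sends a buffer
 * to rank 1, elapsed time is taken with MPI_Wtime, and throughput is bytes
 * transferred divided by time, averaged over repeated trials.
 * Run with at least two ranks.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int size = 1 << 20;      /* 1 MB message (example size)        */
    const int trials = 100;        /* averaged, as in the experiments    */
    int rank;
    char *buf = malloc((size_t)size);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double start = MPI_Wtime();
    for (int i = 0; i < trials; i++) {
        if (rank == 0)
            MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)
        printf("throughput = %.1f Mbps\n", (8.0 * size * trials) / (elapsed * 1e6));

    free(buf);
    MPI_Finalize();
    return 0;
}
```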

For unidirectional communication using MPI_Send and MPI_Recv, the maximum throughput

using FEMPI reaches about 590 Mbps on the Xeon cluster with 1 Gbps links. This maximum

throughput value is comparable to the throughput provided by the conventional MPI

implementations (i.e., MPICH and LAM/MPI). It can be seen that the throughput with MPICH

saturates at approximately 430 Mbps while that for LAM/MPI saturates at approximately 700