## Citation

- Permanent Link: https://ufdc.ufl.edu/UF00100805/00001
## Material Information
- Title:
- A practical realization of parallel disks for a distributed parallel computing system
- Creator:
- Jin, Xiaoming (*Dissertant*)
- Rajasekaran, Sanguthevar (*Thesis advisor*)
- Dankel, Douglas D. (*Reviewer*)
- Fishwick, Paul (*Reviewer*)
- Place of Publication:
- Gainesville, Fla.
- Publisher:
- University of Florida
- Publication Date:
- 2000
- Copyright Date:
- 2000
- Language:
- English
## Subjects
- Subjects / Keywords:
- Algorithms ( jstor )
- Bytes ( jstor )
- Communications processors ( jstor )
- Computer memory ( jstor )
- Employee assistance programs ( jstor )
- Indexing ( jstor )
- Input data ( jstor )
- Input output ( jstor )
- Run time ( jstor )
- Sorting algorithms ( jstor )
- Computer and Information Science and Engineering thesis, M.S ( lcsh )
- Dissertations, Academic -- Computer and Information Science and Engineering -- UF ( lcsh )
- Electronic data processing -- Distributed processing ( lcsh )
- Parallel processing (Electronic computers) ( lcsh )
- Genre:
- bibliography ( marcgt )
- theses ( marcgt )
- government publication (state, provincial, territorial, dependent) ( marcgt )
- non-fiction ( marcgt )
## Notes
- Abstract:
- Several models of parallel sorting are found in the literature. Among these models, the parallel disk models are proposed to alleviate the I/O bottleneck when handling large amounts of data. These models have the general theme of assuming multiple disks. For instance, the Parallel Disk Systems (PDS) model assumes D disks which are disposed on a single computer. It is also assumed that a block of data from each of the D disks can be fetched into the main memory in one parallel I/O operation. In this thesis we present a new model for multiple disks and evaluate its performance. This model is called a Parallel Machine with Multiple Disks (PMD). A PMD model has multiple computers each of which is connected with one disk. A PMD model can be thought of as a realization of the PDS model. In this thesis, we also present a more practical model which is called Parallel Machines with multiple Files (PMF). A PMF model has multiple computers connected on a central file system. We investigate the sorting problem on this new model. Our analysis demonstrates the practicality of the PMF. We also present experimental confirmation of this assertion with data from our implementation.
- Thesis:
- Thesis (M.S.)--University of Florida, 2000.
- Bibliography:
- Includes bibliographical references (p. 39-40).
- System Details:
- System requirements: World Wide Web browser and PDF reader.
- System Details:
- Mode of access: World Wide Web.
- General Note:
- Title from first page of PDF file.
- General Note:
- Document formatted into pages; contains ix, 41 p.; also contains graphics.
- General Note:
- Vita.
- Statement of Responsibility:
- by Xiaoming Jin.
## Record Information
- Source Institution:
- University of Florida
- Holding Location:
- University of Florida
- Rights Management:
- All applicable rights reserved by the source institution and holding location.
- Resource Identifier:
- 50751030 ( OCLC )
- 002678727 ( AlephBibNum )
- ANE5954 ( NOTIS )
## Full Text

A PRACTICAL REALIZATION OF PARALLEL DISKS FOR A DISTRIBUTED PARALLEL COMPUTING SYSTEM

By XIAOMING JIN

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA 2000

To my son, my wife Ping Zhang, and my teacher Sanguthevar Rajasekaran.

ACKNOWLEDGMENTS

I would like to express my appreciation and gratitude to my advisor, Dr. Sanguthevar Rajasekaran, for his guidance during this study. It is because of his LMM sort algorithm that this work could be completed. I would also like to thank Drs. Doug Dankel and Paul Fishwick, the members of my supervisory committee, who also made this work possible. I owe much gratitude to my mother and to my wife, Ping Zhang. Without their taking care of the family, I would not have had the time to complete this work.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTERS

1 INTRODUCTION
1.1 Scope and Objective
1.2 Thesis Outline

2 SORTING ON THE PDS MODEL
2.1 The PDS Model
2.2 Sorting on the PDS Model

3 A PARALLEL MACHINE WITH DISKS (PMD)
3.1 The PMD Model
3.2 Sorting Algorithms
3.2.1 Algorithms for k-k Routing and k-k Sorting
3.2.2 The (l, m)-Merge Sort (LMM)
3.3 Sorting on the PMD Model
3.3.1 Sorting on the Mesh
3.3.2 Base Cases
3.3.3 The Sorting Algorithm
3.3.4 Sorting on a General PMD

4 PARALLEL MACHINE WITH MULTIPLE FILES (PMF)
4.1 Introduction of the PMF Model
4.2 Sorting on the PMF Model
4.3 Computing the Speedup
4.3.1 Using Quick Sort
4.3.2 Using Heap Sort

5 CONCLUSION
5.1 Major Results
5.2 Future Work

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

4.1. Quick Sort Results of Using 2 Processors
4.2. Quick Sort Results of Using 4 Processors
4.3. Quick Sort Results of Using 8 Processors
4.4. Heap Sort Results of Using 2 Processors
4.5. Heap Sort Results of Using 4 Processors
4.6. Heap Sort Results of Using 8 Processors

LIST OF FIGURES

2.1. Architecture of a PDS Model
3.1. A PMD Model of an n x n Mesh
4.1. Logical PMF Model
4.2. Unshuffle Result
4.3. Merging Result
4.4. Partition File into q Parts
4.5. Partition Parts into r Cells
4.6. Quick Sort Speed Up Chart Using 2 Processors
4.7. Quick Sort Speed Up Chart Using 4 Processors
4.8. Quick Sort Speed Up Chart Using 8 Processors
4.9. Heap Sort Speed Up Chart Using 2 Processors
4.10. Heap Sort Speed Up Chart Using 4 Processors
4.11. Heap Sort Speed Up Chart Using 8 Processors

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

A PRACTICAL REALIZATION OF PARALLEL DISKS FOR A DISTRIBUTED PARALLEL COMPUTING SYSTEM

By Xiaoming Jin

December 2000

Chair: Dr. Sanguthevar Rajasekaran
Major Department: Computer and Information Science and Engineering

Several models of parallel sorting are found in the literature. Among these models, the parallel disk models are proposed to alleviate the I/O bottleneck when handling large amounts of data. These models have the general theme of assuming multiple disks. For instance, the Parallel Disk Systems (PDS) model assumes D disks which are attached to a single computer. It is also assumed that a block of data from each of the D disks can be fetched into the main memory in one parallel I/O operation. In this thesis we present a new model for multiple disks and evaluate its performance. This model is called a Parallel Machine with Multiple Disks (PMD). A PMD model has multiple computers, each of which is connected with one disk. A PMD model can be thought of as a realization of the PDS model.
In this thesis, we also present a more practical model, called Parallel Machines with Multiple Files (PMF). A PMF model has multiple computers connected to a central file system. We investigate the sorting problem on this new model. Our analysis demonstrates the practicality of the PMF. We also present experimental confirmation of this assertion with data from our implementation.

CHAPTER 1
INTRODUCTION

1.1 Scope and Objective

Computing applications have advanced to a stage where voluminous data is the norm. The volume of data dictates the use of secondary storage devices such as disks. Even the use of just a single disk may not be sufficient to handle I/O operations efficiently. Thus researchers have introduced models with multiple disks. A model that has been studied extensively (and which is a refinement of prior models) is the Parallel Disk Systems (PDS) model [17]. In this model there is a single computer and D disks. In one parallel I/O, a block of data from each of the D disks can be brought into the main memory. A block consists of B records. If M is the internal memory size, then one usually requires that M ≥ 2DB. Algorithm designers have proposed algorithms for numerous fundamental problems on the PDS model. In the analysis of these algorithms they count only the I/O operations, since the local computations can be assumed to be very fast. The practical realization of this model is an important research issue. Models such as Hierarchical Memory Models (HMMs) [8,9] have been proposed in the literature to address this issue. Realizations of HMMs using PRAMs and hypercubes have been explored [9]. Sorting algorithms on these realizations have been investigated. In this thesis we propose a straightforward model called a Parallel Machine with Disks (PMD). A PMD can be thought of as a special case of the HMM. A PMD is nothing but a parallel machine where each processor has an associated disk. The parallel machine can be structured or unstructured.
If the parallel machine is structured, the underlying topology could be a mesh, a hypercube, a star graph, etc. Examples of unstructured parallel computers include SMPs, clusters of workstations (employing PVM or MPI), etc. In some sense, the PMD is nothing but a parallel machine on which we study out-of-core algorithms. In the PMD model we not only count the I/O operations but also the communication steps. One can think of a PMD as a realization of the PDS model. Given the connection between HMMs and PDSs, we can state that prior works have considered variants of the PMD where the underlying parallel machine is either a PRAM or a hypercube [9]. We begin the study of PMDs with the sorting problem. Sorting is an important problem of computing that finds applications in all walks of life. We analyze the performance of Rajasekaran's LMM sort algorithm [14] on the PMD (where the underlying parallel machine is a mesh, a hypercube, or a cluster of workstations). In particular, we present a new model which is called Parallel Machine with Multiple Files (PMF). A PMF model has multiple computers which are managed by a network file system. Since network file systems have become popular distributed file systems nowadays, this model can be more practical in real life. In this model, the input data is partitioned into several files which are stored in the file system. All computers can read and write data from these files. We show why this model has more appealing properties than other models. We compute the run times for sorting in this model. Also, we compute the speedups of using multiple computers versus using a single computer. These analyses demonstrate the practicality of the PMF.

1.2 Thesis Outline

This thesis consists of 5 chapters. In addition to this introduction, the rest of the thesis is organized as follows. In Chapter 2, we provide a summary of known algorithms for the PDS model. In Chapter 3, we present details of the PMD model.
To make our discussion concrete, we use the mesh as the topology of the underlying parallel machine. However, the discussion applies to any parallel machine. Also in this chapter, we state some routing and sorting algorithms which are applied on the PMD model. In particular, we give a detailed description of the LMM algorithm, which plays a vital role in both the PMD and PMF models. In Chapter 4, we present the more practical PMF model. We state its structure and give a detailed description of its implementation. Also, we show our experimental results for the PMF model. Chapter 5 concludes the thesis.

CHAPTER 2
SORTING ON THE PDS MODEL

In this chapter, we present an overview of sorting results on the PDS model, which has the structure of multiple disks with a single computer.

2.1 The PDS Model

Sorting has been studied well on the PDS model (see Fig. 2.1). A known lower bound for the number of I/O read steps for parallel disk sorting is Ω((N/(DB)) · log(N/B)/log(M/B)). Here N is the number of records to be sorted and M is the internal memory size. Also, B is the block size and D is the number of parallel disks used. There exist several asymptotically optimal algorithms that make O((N/(DB)) · log(N/B)/log(M/B)) I/O read steps (see e.g., references 10, 1, and 3).

Figure 2.1. Architecture of a PDS Model

One of the early papers on disk sorting was by Aggarwal and Vitter [2]. In the model they considered, each I/O operation results in the transfer of D blocks, each block having B records. A more realistic model was envisioned by Vitter and Shriver [17]. Several asymptotically optimal algorithms have been given for sorting on this model. Nodine and Vitter's optimal algorithm [8] involves solving certain matching problems. Aggarwal and Plaxton's optimal algorithm [1] is based on the Sharesort algorithm of Cypher and Plaxton. Vitter and Shriver gave an optimal randomized algorithm for disk sorting [17]. All of these results are highly nontrivial and theoretically interesting.
However, the underlying constants in their time bounds are high.

2.2 Sorting on the PDS Model

In practice, the simple disk-striped mergesort (DSM) is used [4], even though it is not asymptotically optimal. DSM has the advantages of simplicity and a small constant. The data accesses made by DSM are such that in any I/O operation, the same portions of the D disks are accessed. This has the effect of having a single disk which can transfer DB records in a single I/O operation. An M/(DB)-way mergesort is employed by this algorithm. To start with, initial runs are formed in one pass through the data. At the end the disks hold N/M runs, each of length M. Next, M/(DB) runs are merged at a time. Blocks of any run are uniformly striped across the disks so that in the future they can be accessed in parallel, utilizing the full bandwidth. Each phase of merging involves one pass through the data. There are log(N/M)/log(M/(DB)) phases, and hence the total number of passes made by DSM is 1 + log(N/M)/log(M/(DB)). In other words, the total number of I/O read operations performed by the algorithm is (N/(DB)) · (1 + log(N/M)/log(M/(DB))). The constant here is just 1. If one assumes that N is a polynomial in M and that B is small (which is readily satisfied in practice), the lower bound simply yields Ω(1) passes. All the above mentioned optimal algorithms make only O(1) passes. So, the challenge in the design of parallel disk sorting algorithms is in reducing this constant. If M = 2DB, the number of passes made by DSM is 1 + log(N/M), which indeed can be very high. Recently, research has been performed dealing with the practical aspects. Pai, Schaffer, and Varman [11] analyzed the average case performance of a simple merging algorithm, employing an approximate model of average case inputs. Barve, Grove, and Vitter [4] have presented a simple randomized algorithm (SRM) and analyzed its performance. The analysis involves the solution of certain occupancy problems.
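The DSM pass count just derived can be checked numerically. The following sketch (a hypothetical helper with illustrative parameter values, not part of the thesis) counts the read passes DSM makes for given N, M, D, and B:

```python
import math

def dsm_read_passes(N, M, D, B):
    """Total read passes made by disk-striped mergesort (DSM):
    one pass forms N/M initial runs of length M, then each merge
    phase does an M/(DB)-way merge, i.e. 1 + log(N/M)/log(M/(DB))."""
    R = M // (D * B)          # merge arity: runs merged at a time
    assert R >= 2, "DSM needs M >= 2DB"
    runs = math.ceil(N / M)   # runs remaining after the first pass
    passes = 1
    while runs > 1:           # each phase cuts the run count by factor R
        runs = math.ceil(runs / R)
        passes += 1
    return passes

# With M = 2DB the arity collapses to 2, giving 1 + log2(N/M) passes:
print(dsm_read_passes(N=2**20, M=2**10, D=4, B=2**7))  # -> 11
```

With a larger memory-to-bandwidth ratio the arity grows and the pass count drops sharply, which is exactly the effect the optimal algorithms exploit.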
The expected number Reads_SRM of I/O read operations made by their algorithm is such that

Reads_SRM ≤ (N/(DB)) + (N/(DB)) · (log(N/M)/log(kD)) · (log D/(k log log D) + log log log D/log log D + (1 + log k)/log log D + O(1)). (1)

The algorithm merges R = kD runs at a time, for some integer k. When R = Ω(D log D), the expected performance of their algorithm is optimal. However, in this case, the internal memory needed is Ω(BD log D). They have also compared SRM with DSM through simulations and shown that SRM performs better than DSM. Recently, Rajasekaran [14] has presented an algorithm (called (l,m)-merge sort (LMM)) which is asymptotically optimal under the assumptions that N is a polynomial in M and B is small. The algorithm is as simple as DSM. LMM makes a smaller number of passes through the data than DSM when D is large.

CHAPTER 3
A PARALLEL MACHINE WITH DISKS (PMD)

3.1 The PMD Model

In this section we give more details of the PMD model. A PMD is nothing but a parallel machine where each processor has a disk. Each processor has a core memory of size M. In one I/O operation, a block of B records can be brought into the core memory of each processor from its own disk. Thus there are a total of D = P disks in the PMD, where P is the number of processors. Records from one disk can be sent to another through the communication mechanism available for the parallel machine, after bringing the records into the main memory of the origin processor. It is conceivable that the communication time is considerable on the PMD. Thus it is essential to account not only for the I/O operations but also for the communication steps when analyzing any algorithm's run time on the PMD. The PMD can be thought of as a special case of the HMM [9]. Realizations of the HMM using PRAMs and hypercubes have already been studied [9]. The sorting problem on the PMD can be defined as follows. There are a total of N records to begin with, so that there are N/D records on each disk.
The problem is to rearrange the records such that they are in either ascending order or descending order, with N/D records ending up on each disk. It is assumed that the processors themselves have been ordered so that the smallest N/D records will be output on the first processor's disk, the next smallest N/D records will be output on the second processor's disk, and so on. This indexing scheme is in line with the usual indexing scheme used in a parallel machine. However, any other indexing scheme can also be used. To make our discussions concrete, we will use the mesh (see Fig. 3.1) as an example. Let the mesh be of size n x n. Then we have D = n² disks. An indexing scheme is called for in sorting on a mesh (see e.g., Rajasekaran [13]). Some popular indexing schemes are column major, row major, snake-like row, blockwise row-major, etc. For the algorithm to be presented in this thesis, any of these schemes can be employed.

Figure 3.1. A PMD Model of an n x n Mesh

The algorithm to be presented in this thesis employs as subroutines some randomized algorithms. We say a randomized algorithm uses Õ(f(n)) amount of any resource (such as time, space, etc.) if the amount of resource used is no more than cαf(n) with probability ≥ (1 − n^(−α)), where c is a constant and α ≥ 1 is a constant. We can also define other asymptotic functions, such as õ(·), in a similar fashion.

3.2 Sorting Algorithms

Parallel sorting algorithms have been widely studied due to their classical importance and fundamental nature, and many sorting algorithms have been presented. In this section, we discuss the problem of packet routing, which plays a vital role in the design of algorithms on any parallel machine. We present algorithms for k-k routing and k-k sorting. Also, we give a detailed explanation of the (l,m)-merge sort (LMM) algorithm, which is the central spirit of sorting on the PMD and PMF models.

3.2.1 Algorithms for k-k Routing and k-k Sorting

The packet routing problem can be defined as follows.
There is a packet of information to start with at each processor that is destined for some other processor. The problem is to send all the packets to their correct destinations as quickly as possible. In any interconnection network, one requires that at most one packet traverses any edge at any time. The problem of partial permutation routing refers to packet routing when at most one packet originates from any processor and at most one packet is destined for any processor. Packet routing problems have been explored thoroughly on interconnection networks (see e.g., Rajasekaran [13]). The problem of k-k routing is the problem of routing where at most k packets originate from any processor and at most k packets are destined for any processor. In the case of an n x n mesh, it is easy to prove a lower bound of kn/2 on the routing time for this problem based on bisection considerations. There are algorithms whose run times match this bound closely, as stated in the following Lemma. A proof of this Lemma can be found, e.g., in Kaufmann, Rajasekaran, and Sibeyn [5].

[Lemma 3.2.1.1] The k-k routing problem can be solved in kn/2 + õ(kn) time on an n x n mesh.

The problem of k-k sorting is defined as follows. There are k keys at each processor of a parallel machine. The problem is to rearrange the keys in either ascending or descending order according to some indexing scheme. On an n x n mesh, this problem also has a lower bound of kn/2 on the run time. The following Lemma promises a closely optimal algorithm for sorting (see e.g., Rajasekaran [13]).

[Lemma 3.2.1.2] k-k sorting can be solved on an n x n mesh in kn/2 + õ(kn) steps.

The above two Lemmas will be employed by our sorting algorithm on the mesh. It should be noted here that there exist deterministic algorithms (see e.g., [6]) for k-k routing and k-k sorting whose run times match those stated in Lemmas 3.2.1.1 and 3.2.1.2. However, we believe that the use of randomized algorithms will result in better performance in practice.
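The bisection argument behind the kn/2 lower bound can be sketched as follows. The function name and the reversal example below are illustrative, not from the thesis: only n horizontal edges cross the vertical cut of the mesh, so packets that must move left-to-right need at least (their count)/n steps.

```python
def bisection_lower_bound(n, k, dests):
    """Lower bound on k-k routing time on an n x n mesh from the
    bisection argument: count packets whose source is in the left
    half and whose destination is in the right half; only n edges
    cross the cut, each carrying one packet per step per direction."""
    crossing = sum(1 for (r, c), targets in dests.items()
                   for (tr, tc) in targets
                   if c < n // 2 <= tc)
    return crossing / n

# k-k reversal: every one of the k*n^2/2 packets on the left half
# must cross the cut, giving the kn/2 bound.
n, k = 4, 2
dests = {(r, c): [(n - 1 - r, n - 1 - c)] * k
         for r in range(n) for c in range(n)}
print(bisection_lower_bound(n, k, dests))  # -> 4.0, i.e. k*n/2
```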
3.2.2 The (l, m)-Merge Sort (LMM)

Many of the sorting algorithms that have been proposed for the PDS are based on merging. These algorithms start by forming runs each of length M. A run is nothing but a sorted subsequence. Forming these initial runs takes only one pass through the data (or equivalently, N/(DB) parallel I/O operations). After this, the algorithms will merge R runs at a time. Let a phase of merging refer to the task of scanning through the input once and performing R-way merging. Note that each phase of merging will reduce the number of remaining runs by a factor of R. For example, the DSM algorithm employs R = M/(DB). The various sorting algorithms differ in how each phase of merging is done. The (l,m)-merge sort algorithm of Rajasekaran [14] is also based on merging. It employs R = l, for some appropriate l. LMM is a generalization of the odd-even merge sort, the s²-way merge sort of Thompson and Kung [16], and the columnsort algorithm of Leighton [7].

[3.2.2.1 Algorithm Odd-Even Mergesort] The odd-even mergesort algorithm employs R = 2. It repeatedly merges two sequences at a time. To begin with there are n sorted runs each of length 1. From thereon the number of runs is decreased by a factor of 2 with each phase of merging. Two runs are merged using the odd-even merge algorithm that is described below.

1. Let U = u1, u2, ..., uq and V = v1, v2, ..., vq be the two sorted sequences to be merged. Unshuffle U into two parts, i.e., partition U into Uodd = u1, u3, ..., u(q-1) and Ueven = u2, u4, ..., uq. Similarly partition V into Vodd and Veven.

2. Now recursively merge Uodd with Vodd. Let X = x1, x2, ..., xq be the result. Also merge Ueven with Veven. Let Y = y1, y2, ..., yq be the result.

3. Shuffle X and Y, i.e., form the sequence Z = x1, y1, x2, y2, ..., xq, yq.

4. Perform one step of compare-exchange operations, i.e., sort successive subsequences of length two in Z. In other words, sort y1, x2; sort y2, x3; and so on. The resultant sequence is the merge of U and V.
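The four steps above can be sketched in Python. This is an illustrative in-memory version for equal, power-of-two run lengths, not the out-of-core algorithm itself:

```python
def odd_even_merge(U, V):
    """Merge two sorted runs with Batcher's odd-even merge, the R = 2
    member of the LMM family: unshuffle, merge recursively, shuffle,
    then one compare-exchange sweep cleans the short dirty sequence."""
    assert len(U) == len(V)
    if len(U) == 1:
        return [min(U[0], V[0]), max(U[0], V[0])]
    # Step 1: 2-way unshuffle each run into odd- and even-position parts.
    X = odd_even_merge(U[0::2], V[0::2])   # Step 2: merge the odd parts
    Y = odd_even_merge(U[1::2], V[1::2])   #         merge the even parts
    # Step 3: shuffle X and Y -> x1, y1, x2, y2, ...
    Z = [z for pair in zip(X, Y) for z in pair]
    # Step 4: compare-exchange y_i with x_{i+1}.
    for i in range(1, len(Z) - 1, 2):
        if Z[i] > Z[i + 1]:
            Z[i], Z[i + 1] = Z[i + 1], Z[i]
    return Z

print(odd_even_merge([1, 4, 6, 7], [2, 3, 5, 8]))  # -> [1, 2, ..., 8]
```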
The correctness of this algorithm can be established using the zero-one principle. The algorithm of Thompson and Kung [16] is a generalization of the above algorithm where R is taken to be s² for some appropriate function s of n. At any given time s² runs are merged using an algorithm similar to the above.

[3.2.2.2 Algorithm (l,m)-Merge] LMM is a generalization of the s²-way merge sort algorithm. It uses R = l. Each phase of merging thus reduces the number of runs by a factor of l. At any time, l runs are merged using the (l,m)-merge algorithm. This merging algorithm is similar to the odd-even merge except that in Step 1, the runs are m-way unshuffled (instead of 2-way unshuffled). In Step 3, m sequences are shuffled, and in Step 4, the local sorting is done differently. A detailed description of the merging algorithm follows.

1. Let the sequences to be merged be Ui = ui^1, ui^2, ..., ui^r, for 1 ≤ i ≤ l. If r is small, use a base case algorithm. Otherwise, unshuffle each Ui into m parts. In particular, partition Ui into Ui1, Ui2, ..., Uim, where Ui1 = ui^1, ui^(1+m), ...; Ui2 = ui^2, ui^(2+m), ...; and so on.

2. Recursively merge U1j, U2j, ..., Ulj, for 1 ≤ j ≤ m. Let the merged sequences be Xj = xj^1, xj^2, ..., xj^(lr/m), for 1 ≤ j ≤ m.

3. Shuffle X1, X2, ..., Xm, i.e., form the sequence Z = x1^1, x2^1, ..., xm^1, x1^2, x2^2, ..., xm^2, ..., x1^(lr/m), x2^(lr/m), ..., xm^(lr/m).

4. It can be shown that at this point the length of the "dirty sequence" (i.e., the unsorted portion) is no more than lm. But we do not know where the dirty sequence is located. We can clean up the dirty sequence in many different ways. One way is described below. Call the sequence of the first lm elements of Z Z1; the next lm elements Z2; and so on. In other words, Z is partitioned into Z1, Z2, ..., Z(r/m). Sort each one of the Zi's. Following this, merge Z1 and Z2; merge Z3 and Z4; etc. Finally merge Z2 and Z3; merge Z4 and Z5; and so on.

The above algorithm is not specific to any architecture. (The same can be said about any algorithm.)
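A minimal in-memory sketch of the four steps follows, assuming l equal-length runs with r divisible by m and a hypothetical base-case cutoff. It ignores all I/O and communication costs, which are the whole point of the model, and is illustrative only:

```python
def lm_merge(runs, m, base=8):
    """Sketch of (l,m)-merge on l sorted runs of equal length r.
    Follows the four steps above; not an I/O-efficient implementation."""
    l, r = len(runs), len(runs[0])
    if r <= base:                              # base case: merge directly
        return sorted(x for run in runs for x in run)
    # Step 1: m-way unshuffle each run U_i into parts U_i1 .. U_im.
    parts = [[run[j::m] for run in runs] for j in range(m)]
    # Step 2: recursively merge the j-th parts of all l runs.
    X = [lm_merge(p, m, base) for p in parts]
    # Step 3: shuffle X_1 .. X_m into Z.
    Z = [x for tup in zip(*X) for x in tup]
    # Step 4: clean up -- the dirty sequence has length at most l*m.
    w = l * m
    blocks = [sorted(Z[i:i + w]) for i in range(0, len(Z), w)]
    def merge_pair(a, b):
        both = sorted(a + b)
        return both[:len(a)], both[len(a):]
    for start in (0, 1):                       # merge Z1,Z2; Z3,Z4; then Z2,Z3; ...
        for i in range(start, len(blocks) - 1, 2):
            blocks[i], blocks[i + 1] = merge_pair(blocks[i], blocks[i + 1])
    return [x for b in blocks for x in b]
```

Setting m = 2 and base = 1 recovers the behavior of odd-even merge on two runs.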
An implementation of LMM on the PDS has been given in Rajasekaran [14]. The number of I/O operations needed in this implementation has been shown to be (N/(DB)) · [log(N/M)/log(min{√M, M/B}) + 1]². When N is a polynomial in M and M is a polynomial in B, this reduces to a constant number of passes through the data and hence LMM is optimal. In Rajasekaran [14] it has been demonstrated that LMM can be faster than DSM when D is large. Recent implementation results of Pearson [12] indicate that LMM is competitive in practice. Thus a natural choice of sorting algorithm for the PMD is LMM. In the next section we implement LMM on a PMD and analyze the resultant I/O and communication steps.

3.3 Sorting on the PMD Model

We begin by considering the sorting problem on the mesh. The result can be generalized to any parallel machine.

3.3.1 Sorting on the Mesh

Consider a PMD where the underlying machine is an n x n mesh. The number of disks is D = n². Each node in the mesh is a processor with a core memory of size M. In one I/O operation, a processor can bring a block of B records into its main memory. Thus the PMD as a whole can bring in DB records in one I/O operation, i.e., we can relate a PMD to a PDS whose main memory capacity is DM and that has D disks. Let the number of records to be sorted be N. To begin with, there are N/D records at each disk of the PMD. The goal is to rearrange the records in either ascending order or descending order such that each disk gets N/D records at the end. An indexing scheme has to be assumed. For the algorithm to be presented, any of the following schemes will be acceptable: row-major, column-major, snake-like row-major, snake-like column-major, blockwise row-major, blockwise column-major, blockwise snake-like row-major, and blockwise snake-like column-major. We assume the blockwise snake-like row-major order for the following presentations.
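For instance, the snake-like row-major scheme can be expressed as a simple rank function (an illustrative helper, not from the thesis):

```python
def snake_rank(i, j, n):
    """Rank of processor (i, j) in snake-like row-major order on an
    n x n mesh: even rows run left to right, odd rows right to left."""
    return i * n + (j if i % 2 == 0 else n - 1 - j)

# Ranks on a 3 x 3 mesh:
#   0 1 2
#   5 4 3
#   6 7 8
```

The blockwise variants apply the same idea to sub-blocks of the mesh rather than to single processors.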
The block size is N/D, i.e., the first (in the snake-like row-major order) processor will store the smallest N/D records, the second processor will store the next smallest N/D records, and so on. As one can easily see, the entire LMM algorithm consists of shuffling, unshuffling, and local sorting steps. We use the k-k routing and k-k sorting algorithms (Lemmas 3.2.1.1 and 3.2.1.2) to perform these steps. Typically, we bring records from the disks until the local memories are filled. Processing of these records is done using the k-k routing and k-k sorting algorithms. The queue length of the k-k sorting and k-k routing algorithms is k + õ(k), so we do not fill M completely. We only half-fill the local memories so as to run the randomized algorithms. Also, in order to overlap I/O with local computations, only half of this memory can be used to store operational data. We refer to this portion of the core memory as M, i.e., M is one-fourth of the core memory size available for each processor. To begin with, we form sorted runs each of length DM. The number of I/O operations performed is N/(DB), since each processor needs M/B I/O operations for each run. Also, the number of communication steps is O((N/D)·n). This is so because we perform N/(DM) k-k sorts (with k = M) and each such sort takes kn + õ(kn) steps. Since LMM is based on merging in phases, we have to specify how the runs in a phase are stored across the D disks. Let the disks as well as the runs be numbered from zero. We use the same scheme as the one given in Rajasekaran [14]. Each run will be striped across the disks. If R ≥ D, the starting disk for the i-th run is i mod D, i.e., the zeroth block of the i-th run will be on disk i mod D; its first block will be on disk (i+1) mod D; and so on. This will enable us to access, in one I/O read operation, one block each from D distinct runs and hence obtain perfect disk parallelism. If R < D, the starting disk for the i-th run is i·(D/R). (Assume without loss of generality that R divides D.)
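The block-to-disk mapping just described can be sketched as follows (hypothetical helper name; the R and D values in the example are illustrative):

```python
def block_disk(i, b, R, D):
    """Disk holding block b of run i under the striping scheme above.
    For R >= D, run i starts at disk i mod D; for R < D it starts at
    disk i*(D/R) (R assumed to divide D).  Consecutive blocks of a run
    go to consecutive disks, wrapping around mod D."""
    start = (i % D) if R >= D else i * (D // R)
    return (start + b) % D

# With R = 2 runs and D = 4 disks, one read fetches D/R = 2 blocks per run:
print([block_disk(0, b, 2, 4) for b in range(4)])  # run 0 -> [0, 1, 2, 3]
print([block_disk(1, b, 2, 4) for b in range(4)])  # run 1 -> [2, 3, 0, 1]
```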
Even now, we can obtain D/R blocks from each of the runs in one I/O operation and hence achieve perfect disk parallelism.

3.3.2 Base Cases

LMM is a recursive algorithm whose base cases are handled efficiently. We now discuss two base cases.

Base Case 1. Consider the problem of merging √(DM) runs each of length DM, when √(DM) ≤ DM/B. This merging is done using (ℓ,m)-merge with ℓ = m = √(DM). Let U1, U2, ..., U_√(DM) be the sequences to be merged. In Step 1, each Ui gets unshuffled into √(DM) parts, so that each part is of length √(DM). This unshuffling can be done in one pass through the data. Thus the number of I/O operations is N/(DB). The communication time is O((N/D)n).

Note. Throughout the algorithm, each pass through the data will involve N/(DB) I/O operations and O((N/D)n) communication steps. Also, we use T(u,v) to denote the number of read passes needed to merge u sequences of length v each.

In Step 2, we have √(DM) merges to do, each merge involving √(DM) sequences of length √(DM) each. Since there are only DM records in each merge, all the mergings can be done in one pass through the data. Steps 3 and 4 perform shuffling and cleaning up, respectively. The length of the dirty sequence is (√(DM))^2 = DM. These two steps can be combined and finished in one pass through the data (see [14] for details). Thus we get:

[Lemma 3.3.2.1] T(√(DM), DM) = 3, if √(DM) ≤ DM/B.

Base Case 2. This is the case of merging DM/B runs each of length DM, when DM/B < √(DM). This problem can be solved using (ℓ,m)-merge with ℓ = m = DM/B. In this case we can obtain:

[Lemma 3.3.2.2] T(DM/B, DM) = 3, if DM/B < √(DM).

3.3.3 The Sorting Algorithm

The LMM algorithm proceeds in one of two cases. In our implementation the two cases will be when √(DM) ≤ DM/B and when DM/B < √(DM). In either case, initial runs are formed in one pass, at the end of which sorted sequences of length DM each remain to be merged. When √(DM) ≤ DM/B, (ℓ,m)-merge is employed with ℓ = m = √(DM). Let K denote √(DM) and let N/(DM) = K^(2c); in other words, c = log(N/(DM))/log(DM). T(K^(2c), DM) can be expressed as follows: T(K^(2c), DM) = T(K, DM) + T(K, K·DM) + ...
+ T(K, K^(2c-1)·DM). (2)

The above relation basically means that there are K^(2c) sequences of length DM each to begin with; we merge K at a time to end up with K^(2c-1) sequences of length K·DM each; again merge K at a time to end up with K^(2c-2) sequences of length K^2·DM each; and so on. Finally there will be K sequences of length K^(2c-1)·DM each, which are merged. Each of these mergings is done using (ℓ,m)-merge with ℓ = m = √(DM). It can also be shown that T(K, K^i·DM) = 2i + T(K, DM) = 2i + 3; here the fact that T(K, DM) = 3 (cf. Lemma 3.3.2.1) has been used. Upon substituting this into Equation (2), we get

T(K^(2c), DM) = Σ_{i=0}^{2c-1} (2i + 3) = 4c^2 + 4c,

where c = log(N/(DM))/log(DM). Now, we have the following:

[Theorem 3.3.3.1] The number of read passes needed to sort N records is 1 + 4(log(N/(DM))/log(DM))^2 + 4 log(N/(DM))/log(DM), if √(DM) ≤ DM/B. This number of passes is no more than [log(N/(DM))/log(min{√(DM), DM/B}) + 1]^2. This means that the number of I/O read operations is no more than (N/(DB))[1 + 4(log(N/(DM))/log(DM))^2 + 4 log(N/(DM))/log(DM)]. The number of communication steps is no more than O((N/D)n[1 + 4(log(N/(DM))/log(DM))^2 + 4 log(N/(DM))/log(DM)]).

The second case to be considered is when DM/B < √(DM). Here (ℓ,m)-merge will be used with ℓ = m = DM/B. Let Q denote DM/B and let N/(DM) = Q^d; that is, d = log(N/(DM))/log(DM/B). As in Case 1 we get

T(Q^d, DM) = T(Q, DM) + T(Q, Q·DM) + ... + T(Q, Q^(d-1)·DM). (3)

Also, T(Q, Q^i·DM) = 2i + T(Q, DM) = 2i + 3; here the fact that T(Q, DM) = 3 (cf. Lemma 3.3.2.2) has been used. Equation (3) now becomes

T(Q^d, DM) = Σ_{i=0}^{d-1} (2i + 3) = d^2 + 2d,

where d = log(N/(DM))/log(DM/B).

[Theorem 3.3.3.2] The number of read passes needed to sort N records on the PMD is upper bounded by [log(N/(DM))/log(min{√(DM), DM/B}) + 1]^2, if DM/B < √(DM).

Theorems 3.3.3.1 and 3.3.3.2 readily yield:

[Theorem 3.3.3.3] We can sort N records in no more than [log(N/(DM))/log(min{√(DM), DM/B}) + 1]^2 read passes over the data. The total number of I/O read operations needed is no more than (N/(DB))[log(N/(DM))/log(min{√(DM), DM/B}) + 1]^2.
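The closed forms of the two cases can be checked numerically. The sketch below is our own helper (hypothetical names, natural logarithms), written against our reading of the two cases; note that in Case 1, 1 + 4c^2 + 4c = (2c + 1)^2, which matches the common bound when the minimum is √(DM):

```python
import math

def read_passes(N, D, M, B):
    """Read passes to sort N records on a PMD with D disks, operational
    memory M per processor, and block size B, per the two cases above."""
    DM = D * M
    if math.sqrt(DM) <= DM / B:               # Case 1: l = m = sqrt(DM)
        c = math.log(N / DM) / math.log(DM)
        return 1 + 4 * c * c + 4 * c          # equals (2c + 1)^2
    d = math.log(N / DM) / math.log(DM / B)   # Case 2: l = m = DM/B
    return (d + 1) ** 2                       # equals 1 + d^2 + 2d

def pass_bound(N, D, M, B):
    """Common upper bound on read passes in either case."""
    DM = D * M
    denom = math.log(min(math.sqrt(DM), DM / B))
    return (math.log(N / DM) / denom + 1) ** 2
```

For instance, with D = 4, M = 2^18 (so DM = 2^20), B = 2^10, and N = 2^40, we have c = 1 and both expressions evaluate to 9 passes.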
Also, the total number of communication steps needed is O((N/D)n[log(N/(DM))/log(min{√(DM), DM/B}) + 1]^2).

3.3.4 Sorting on a General PMD

In this section we consider a general PMD where the underlying parallel machine can be either structured (e.g., the mesh, the hypercube, etc.) or unstructured (e.g., an SMP, a cluster of workstations, etc.). We can apply LMM on a general PMD, in which case the number of I/O operations will remain the same, i.e., (N/(DB))[log(N/(DM))/log(min{√(DM), DM/B}) + 1]^2. As has become clear from our discussion on the mesh, we need mechanisms for k-k routing and k-k sorting. Let R_M and S_M denote the time needed for performing one M-M routing and one M-M sorting on the parallel machine, respectively. Then, in each pass through the data, the total communication time will be (N/(DM))(R_M + S_M), implying that the total communication time for the entire algorithm will be no more than (N/(DM))(R_M + S_M)[log(N/(DM))/log(min{√(DM), DM/B}) + 1]^2. Thus we get the following general theorem:

[Theorem 3.3.4.1] Sorting on a general PMD model can be performed in (N/(DB))[log(N/(DM))/log(min{√(DM), DM/B}) + 1]^2 I/O operations. The total communication time is no more than (N/(DM))(R_M + S_M)[log(N/(DM))/log(min{√(DM), DM/B}) + 1]^2.

CHAPTER 4
PARALLEL MACHINE WITH MULTIPLE FILES (PMF)

In this chapter, we present a more practical parallel computing model called the PMF. We show why this model has some advantages and report our experimental evaluation of it.

4.1 Introduction of the PMF Model

A PMF model is nothing but multiple computers managed under a network file system. The underlying parallel machine is a network of workstations. In this model, the input data are partitioned into several files which are stored as source data (see Fig. 4.1). Computers can read and write data from these files.

Figure 4.1. Logical PMF Model.

The sorting problem on the PMF can be defined as follows. There are a total of N records to be sorted, and there are P processors.
The problem is to sort the N records into descending or ascending order. We partition the input data into P parts and store them in P files, each of size N/P. We index the files and processors as F1, F2, ..., Fp and P1, P2, ..., Pp, respectively. After sorting, the smallest N/P records will be output in F1, the next smallest N/P records will be output in F2, and so on.

The LMM sort plays a vital role in the PMF model. The LMM algorithm consists of shuffling, unshuffling, and local sorting steps. In each step, the data are partitioned into many parts, and each processor can pick any part from the partition and do the sorting, unshuffling, shuffling, and cleaning. This property is essential in the PMF model because processors do not need to communicate to get data from each other. Since there is no communication between processors, the I/O time becomes critical. But in the PMF model, if we increase the number of processors, the I/O time will be reduced: using 2n processors will halve the I/O time compared with using n processors. The chief advantage of this parallel computing model is that it sorts data in perfect parallel, without any communication between the processors.

4.2 Sorting on the PMF Model

In this section, we show in detail how to implement LMM sort on the PMF model, and how to partition the data so that processors can read and write data in parallel without any communication between them. LMM sort is a recursive algorithm; here, we do not carry out the merge recursively, but proceed for just one level. Meanwhile, we use Quick sort or Heap sort for our local sorting. A detailed description of the sorting follows. The input data size is N, and there are q processors. We create q files, each of size N/q. Also suppose the memory size is M and that it is much less than N/q. The value of m can be chosen arbitrarily; we have varied the value of m to see its influence on the run time, but the run time remains nearly the same.
Here, for ease of implementation, we choose m to be a divisor of N/q such that q divides m. Without loss of generality, we suppose M is a divisor of N/q and (N/q)/M = r.

Step 1. We mark the files as F1, F2, ..., Fq and the processors as P1, P2, ..., Pq. First, processor Pi reads the first M records from Fi, sorts them, unshuffles them into m parts each of size a = M/m, and then puts them back in their original place (row 1) in Fi, for 1 ≤ i ≤ q. Then Pi reads the second M records from Fi, sorts them, unshuffles them into m parts, and puts them back in their original place (row 2) in Fi. This procedure is repeated r times. We end up with each file Fi holding r unshuffled sequences (see Fig. 4.2), for 1 ≤ i ≤ q. So after Step 1, there are a total of r*q unshuffled sequences.

Figure 4.2. Unshuffle Result. (Each of the r rows of Fi holds M records unshuffled into parts 1 through m, each part of size a = M/m.)

As we can see, the sorting and unshuffling are done in parallel. There is no communication between the processors, since each processor deals only with its corresponding file. Meanwhile, if we increase the number of processors, we also increase the number of files, which leaves less data in each file. So if the data size is fixed, increasing the number of processors reduces the number of I/O operations.

Step 2. In this step, we want to merge part 1 from all q files, part 2 from all q files, ..., and part m from all q files (see Fig. 4.2). This can also be done in parallel, but we need to create another q files to hold the merged results so that data can be read and output in perfect parallel. Let us call these additional q files D1, D2, ..., Dq. As mentioned above, m is chosen so that q divides m. Suppose k = m/q. The scheme is as follows.

First step: Processor j reads part j from file F(j mod q).
Processor j reads part j from file F((j+1) mod q). Processor j reads part j from file F((j+2) mod q). ... Processor j reads part j from file F((j+q-1) mod q).

Notice that from each file, processor j reads in r copies of part j, so there are a total of r*q copies of part j to merge. After processor j merges these r*q sequences, it outputs the merged result to the first place (row 1) of Dj (see Fig. 4.3). We also notice that the reading, merging, and output of data are all done in parallel: when processor j is reading data from file F(j mod q), processor (j+1) is reading data from file F((j+1) mod q); when processor j is reading data from F((j+1) mod q), processor (j+1) is reading data from file F((j+2) mod q). Meanwhile, different processors output merged data into different files.

Second step: Processor j reads part (q+j) from file F(j mod q). Processor j reads part (q+j) from file F((j+1) mod q). ... Processor j reads part (q+j) from file F((j+q-1) mod q). After reading in the r*q sequences, processor j merges them and puts the merged result in the second place of Dj (see Fig. 4.3).

kth step: Processor j reads part ((k-1)*q+j) from file F(j mod q). Processor j reads part ((k-1)*q+j) from file F((j+1) mod q). ... Processor j reads part ((k-1)*q+j) from file F((j+q-1) mod q). After reading in the r*q sequences, processor j merges them and puts the merged result in the kth place of Dj (see Fig. 4.3).

After these k merging steps, processor j will have generated k merged sequences, which are output in file Dj. The size of each merged sequence is easily computed as mergeSize = r*a*q = N/m, and we must make sure that mergeSize ≤ M (our choice of m must meet this requirement).

Figure 4.3.
Merging Result. (Row i of file Dj, for 1 ≤ i ≤ k, holds the merged result of part ((i-1)*q+j), of size mergeSize, produced by processor j.)

From the analysis above, we see that each merging step in Step 2 is done in parallel. We can also compute the I/O time for each processor. The logical I/O time is as follows: I/O time = k*(input time per step + output time per step) = k*(r*q + 1) = m*(r + 1/q). If the data size is fixed and we double the number of processors, it is clear from this equation that the I/O time is reduced by half.

Step 3. In this step, we shuffle the m merged sequences. We want to read, shuffle, and output the data in parallel. The scheme is as follows. We partition each file Dj (1 ≤ j ≤ q) into q parts (see Fig. 4.4).

Figure 4.4. Partition of a File into q Parts. (Each of the k rows of Dj is divided into parts 1 through q.)

Each processor is responsible for its own part, so processor j shuffles only part j. By partitioning the data in this way, the processors can read, shuffle, and output data in parallel.

First step: Processor j reads part j from file D(j mod q). Second step: Processor j reads part j from file D((j+1) mod q). ... qth step: Processor j reads part j from file D((j+q-1) mod q).

We notice that the processors read data in parallel: when processor j is reading part j from file D(j mod q), processor (j+1) is reading part (j+1) from file D((j+1) mod q); when processor j is reading part j from file D((j+1) mod q), processor (j+1) is reading part (j+1) from file D((j+2) mod q). After processor j reads its data, it performs the shuffling and then outputs the result to file Fj.

The above analysis is just for ease of understanding of how to shuffle the data in parallel. In actual practice, we also need to partition each of the q parts into r cells, with each cell holding data of size a, where r = (N/q)/M and a = M/m as mentioned in Step 1. This partition is necessary because if each processor j were to input part j from all q files at once, the size of the input data would exceed the memory size.
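The per-processor work of Steps 1 and 2 can be sketched in miniature as follows. This is our own illustrative fragment (hypothetical names, 0-based processor and file indices, Python lists standing in for files), not the thesis implementation:

```python
def step1_sort_unshuffle(chunk, m):
    """Step 1, one memory-load: sort M records, then unshuffle them into
    m parts of size a = M/m by dealing round-robin (sorted record i goes
    to part i mod m), so every part is itself sorted."""
    s = sorted(chunk)
    return [s[p::m] for p in range(m)]

def step2_read_order(j, step, q):
    """Step 2 schedule: in merge round `step` (0-based), processor j
    collects part number step*q + j, reading files F(j mod q),
    F((j+1) mod q), ..., F((j+q-1) mod q) in turn.  Returns the list of
    (part, file) pairs processor j accesses, in order."""
    part = step * q + j
    return [(part, (j + t) % q) for t in range(q)]
```

Because the q processors start their rotations at q distinct files, at any instant the processors touch q distinct files, which is why the reads never collide and no inter-processor communication is needed.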
Now we show the detailed implementation of Step 3. We partition each of the q parts into r cells (see Fig. 4.5).

Figure 4.5. Partition of Parts into r Cells. (In file Dj, each of the q parts of each row is divided into cells c1, c2, ..., cr.)

First step: Processor j reads c1 from part j of file D(j mod q). Processor j reads c1 from part j of file D((j+1) mod q). Processor j reads c1 from part j of file D((j+2) mod q). ... Processor j reads c1 from part j of file D((j+q-1) mod q). After processor j reads in the data, it shuffles them and then outputs the result to the first place of file Fj.

Second step: Processor j reads c2 from part j of file D(j mod q). Processor j reads c2 from part j of file D((j+1) mod q). Processor j reads c2 from part j of file D((j+2) mod q). ... Processor j reads c2 from part j of file D((j+q-1) mod q). After processor j reads in the data, it shuffles them and then outputs the result to the second place of file Fj.

rth step: Processor j reads cr from part j of file D(j mod q). Processor j reads cr from part j of file D((j+1) mod q). Processor j reads cr from part j of file D((j+2) mod q). ... Processor j reads cr from part j of file D((j+q-1) mod q). After processor j reads in the data, it shuffles them and then outputs the result to the rth place of file Fj.

After these r steps, the shuffling of the data is finished. From the analysis above, each processor reads, shuffles, and outputs data in parallel. Logical I/O time = r*(input time per step + output time per step) = r*(k*q + 1) = r*((m/q)*q + 1) = r*(m + 1). If we double the number of processors, r is halved, so doubling the number of processors also reduces the I/O time by half.

Step 4. In this step, we clean the dirty sequences. We know the size of the dirty sequence is no more than ℓ*m = (N/M)*m.
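The shuffle of Step 3 is the inverse of the round-robin unshuffle of Step 1, and its in-memory part can be sketched as follows (again an illustrative fragment with our own names, not the thesis code):

```python
def shuffle(parts):
    """Interleave m equal-length parts: output record i*m + p is record i
    of part p.  Applied to the merged runs, this inverts the round-robin
    unshuffle of Step 1, up to the short dirty sequences that Step 4
    cleans afterwards."""
    m, n = len(parts), len(parts[0])
    return [parts[p][i] for i in range(n) for p in range(m)]
```

For example, shuffling the parts [[0, 4], [1, 5], [2, 6], [3, 7]] produced by the earlier unshuffle restores the fully sorted sequence 0 through 7.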
The cleaning can be very simple: each processor j cleans the dirty sequences of file Fj. Since each processor cleans a different file, the cleaning proceeds in parallel easily. Logical I/O time = 2*(N/q)/(ℓ*m) = (2*M)/(q*m). If we double the number of processors, the I/O time is reduced by half.

After Step 4, the sorting is finished. As we can see from the above analysis, there is no communication between the processors; all the sorting procedures are done in parallel. Meanwhile, if we double the number of processors, the I/O time is reduced by half.

4.3 Computing the Speedup

In this section, we report our experimental evaluation of the PMF. We employ 2, 4, and 8 processors to sort data of different sizes and compute the real-time speedup by comparison with the result of using only 1 processor. In this experiment, we fix the memory size at 138240 bytes, and the value of m is 120. The data we generate are random floating-point numbers. We use two different sorting algorithms, Quick sort and Heap sort, for the local sorting.

4.3.1 Using Quick Sort

Here, we use Quick sort for our local sorting. The test results are shown in the following tables, and we also show charts of the data to illustrate the speedup.

Figure 4.6. Quick Sort Speedup Chart Using 2 Processors.

Table 4.1. Quick Sort Results Using 2 Processors
Input Data Size    1 Processor      2 Processors    Speedup
3317760 bytes      41.2 seconds     22.1 seconds    1.86
6635520 bytes      66.4 seconds     34.1 seconds    1.94
9953280 bytes      87.2 seconds     44.5 seconds    1.97
13271040 bytes     104.8 seconds    53.2 seconds    1.97

Figure 4.7. Quick Sort Speedup Chart Using 4 Processors.

Table 4.2. Quick Sort Results Using 4 Processors
Input Data Size    1 Processor      4 Processors    Speedup
3317760 bytes      41.2 seconds     13.4 seconds    3.07
6635520 bytes      66.4 seconds     21.6 seconds    3.07
9953280 bytes      87.2 seconds     27.2 seconds    3.20
13271040 bytes     104.8 seconds    33.1 seconds    3.16
Figure 4.8. Quick Sort Speedup Chart Using 8 Processors.

Table 4.3. Quick Sort Results Using 8 Processors
Input Data Size    1 Processor      8 Processors    Speedup
3317760 bytes      41.2 seconds     9.70 seconds    4.24
6635520 bytes      66.4 seconds     14.2 seconds    4.67
9953280 bytes      87.2 seconds     18.4 seconds    4.74
13271040 bytes     104.8 seconds    23.1 seconds    4.53

4.3.2 Using Heap Sort

Here, we use Heap sort for our local sorting. The test results are shown in the following tables, and we also show charts of the data to illustrate the speedup.

Figure 4.9. Heap Sort Speedup Chart Using 2 Processors.

Table 4.4. Heap Sort Results Using 2 Processors
Input Data Size    1 Processor      2 Processors    Speedup
3317760 bytes      44.6 seconds     23.2 seconds    1.92
6635520 bytes      68.1 seconds     35.8 seconds    1.90
9953280 bytes      90.2 seconds     46.8 seconds    1.92
13271040 bytes     113.1 seconds    57.9 seconds    1.95

Figure 4.10. Heap Sort Speedup Chart Using 4 Processors.

Table 4.5. Heap Sort Results Using 4 Processors
Input Data Size    1 Processor      4 Processors    Speedup
3317760 bytes      44.6 seconds     14.2 seconds    3.14
6635520 bytes      68.1 seconds     21.8 seconds    3.11
9953280 bytes      90.2 seconds     28.1 seconds    3.21
13271040 bytes     113.1 seconds    34.8 seconds    3.25

Figure 4.11. Heap Sort Speedup Chart Using 8 Processors.

Table 4.6. Heap Sort Results Using 8 Processors
Input Data Size    1 Processor      8 Processors    Speedup
3317760 bytes      44.6 seconds     9.79 seconds    4.56
6635520 bytes      68.1 seconds     14.2 seconds    4.79
9953280 bytes      90.2 seconds     18.9 seconds    4.76
13271040 bytes     113.1 seconds    23.5 seconds    4.81

CHAPTER 5
CONCLUSION

5.1 Major Results

Parallel sorting algorithms have been widely studied owing to their classical importance and fundamental nature.
Many sequential sorting algorithms have also been presented. Even so, many of these algorithms are impractical because of the large constants in their run times. In this thesis, we have investigated a straightforward model of computing with multiple disks (which can be thought of as a special case of the HMM). This model, the PMD, can be thought of as a realization of prior models such as the PDS. We have also presented a sorting algorithm for the PMD: we use the LMM sort algorithm on the PMD, but for the local steps we have to use the k-k routing and k-k sorting algorithms because of communication concerns. The I/O time and communication time were evaluated on the PMD, and from the analysis of the results, the communication time is still significant. In our research, reducing communication time and I/O time were the major concerns, and to that end we presented the PMF model. Its underlying parallel machine is a network of workstations. The data are partitioned and stored in several files which are managed by a network file system. The LMM sort algorithm plays a vital role in the PMF model. The LMM sort consists of sorting, shuffling, and unshuffling steps, which require partitioning the data into many parts. This property makes it possible for different processors to read and write different parts of the data from different files in parallel. Because of this, there is no communication between processors, and hence we overcome the communication overhead. Meanwhile, if we increase the number of processors, the I/O time also improves. Our experimental results for sorting indicate that we can get decent speedups in practice using the PMF model.

5.2 Future Work

From our test results, we can see that when we use two processors, the speedup is almost 2. But when we use four processors, the speedup is only close to 3, and when we use 8 processors, the speedup is only close to 5. The results are not as good as our analysis predicts.
In our research, we believe this is because of initial start-up and communication delays: the more processors we use, the more delays are incurred. Meanwhile, different processors may run at different speeds, and this also has a bad effect on the speedup. Our future research will address these problems.

REFERENCES

[1] A. Aggarwal and C. G. Plaxton, Optimal Parallel Sorting in Multi-Level Storage, Proc. Fifth Annual ACM Symposium on Discrete Algorithms, New York, 1994, pp. 659-668.

[2] A. Aggarwal and J. S. Vitter, The Input/Output Complexity of Sorting and Related Problems, Communications of the ACM, 31(9), 1988, pp. 1116-1127.

[3] L. Arge, The Buffer Tree: A New Technique for Optimal I/O-Algorithms, Proc. 4th International Workshop on Algorithms and Data Structures (WADS), New York, 1995, pp. 334-345.

[4] R. Barve, E. F. Grove, and J. S. Vitter, Simple Randomized Mergesort on Parallel Disks, Technical Report CS-1996-15, Department of Computer Science, Duke University, Durham, NC, October 1996.

[5] M. Kaufmann, S. Rajasekaran, and J. F. Sibeyn, Matching the Bisection Bound for Routing and Sorting on the Mesh, Proc. 4th Annual ACM Symposium on Parallel Algorithms and Architectures, New York, 1992, pp. 31-40.

[6] M. Kunde, Block Gossiping on Grids and Tori: Deterministic Sorting and Routing Match the Bisection Bound, Proc. First Annual European Symposium on Algorithms, Springer-Verlag Lecture Notes in Computer Science 726, New York, 1993, pp. 272-283.

[7] T. Leighton, Tight Bounds on the Complexity of Parallel Sorting, IEEE Transactions on Computers, C-34(4), 1985, pp. 344-354.

[8] M. H. Nodine and J. S. Vitter, Large Scale Sorting in Parallel Memories, Proc. Third Annual ACM Symposium on Parallel Algorithms and Architectures, New York, 1991, pp. 29-39.

[9] M. H. Nodine and J. S. Vitter, Deterministic Distribution Sort in Shared and Distributed Memory Multiprocessors, Proc. Fifth Annual ACM Symposium on Parallel Algorithms and Architectures, New York, 1993, pp.
120-129.

[10] M. H. Nodine and J. S. Vitter, Greed Sort: Optimal Deterministic Sorting on Parallel Disks, Journal of the ACM, 42(4), 1995, pp. 919-933.

[11] V. S. Pai, A. A. Schaffer, and P. J. Varman, Markov Analysis of Multiple-Disk Prefetching Strategies for External Merging, Theoretical Computer Science, 128(2), 1994, pp. 211-239.

[12] M. D. Pearson, Fast Out-of-Core Sorting on Parallel Disk Systems, Technical Report PCS-TR99-351, Department of Computer Science, Dartmouth College, Hanover, NH, June 1999, ftp://ftp.cs.dartmouth.edu/TR/TR99-351.ps.Z.

[13] S. Rajasekaran, Sorting and Selection on Interconnection Networks, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 21, 1995, pp. 275-296.

[14] S. Rajasekaran, A Framework for Simple Sorting Algorithms on Parallel Disk Systems, Proc. 10th Annual ACM Symposium on Parallel Algorithms and Architectures, New York, 1998, pp. 88-97.

[15] S. Rajasekaran, Selection Algorithms for the Parallel Disk Systems, Proc. International Conference on High Performance Computing, New York, 1998.

[16] C. D. Thompson and H. T. Kung, Sorting on a Mesh-Connected Parallel Computer, Communications of the ACM, 20(4), 1977, pp. 263-271.

[17] J. S. Vitter and E. A. M. Shriver, Algorithms for Parallel Memory I: Two-Level Memories, Algorithmica, 12(2-3), 1994, pp. 110-147.

BIOGRAPHICAL SKETCH

Xiaoming Jin was born in Nanjing, P. R. China. He is a Master of Science degree student in the Department of Computer and Information Science and Engineering at the University of Florida. He is also a Master of Science degree student in the Department of Mathematics at the University of Florida. He received his B.S. degree in mathematical science from Liberty University in Virginia.