A PRACTICAL REALIZATION
OF PARALLEL DISKS FOR A DISTRIBUTED PARALLEL
COMPUTING SYSTEM
By
XIAOMING JIN
A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
UNIVERSITY OF FLORIDA
2000
To my son, my wife Ping Zhang, my teacher Sanguthevar Rajasekaran
ACKNOWLEDGMENTS
I would like to express my appreciation and gratitude to my advisor, Dr.
Sanguthevar Rajasekaran, for his guidance during this study. Dr. Rajasekaran
has been very helpful to my work; it is because of his LMM sort algorithm
that this work could be completed. I would also like to thank Drs. Doug Dankel and Paul
Fishwick, the members of my supervisory committee, who also made this work possible.
I owe much gratitude to my mother and to my wife, Ping Zhang. Without their
taking care of the family, I would not have had the time to complete my work.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTERS
1 INTRODUCTION
1.1 Scope and Objective
1.2 Thesis Outline
2 SORTING ON THE PDS MODEL
2.1 The PDS Model
2.2 Sorting on The PDS Model
3 A PARALLEL MACHINE WITH DISKS (PMD)
3.1 The PMD Model
3.2 Sorting Algorithms
3.2.1 Algorithm of k-k Routing and k-k Sorting
3.2.2 The (l, m)-Merge Sort (LMM)
3.3 Sorting on the PMD Model
3.3.1 Sorting on the Mesh
3.3.2 Base Cases
3.3.3 The Sorting Algorithm
3.3.4 Sorting on a General PMD
4 PARALLEL MACHINE WITH MULTIPLE FILES (PMF)
4.1 Introduction of the PMF Model
4.2 Sorting on the PMF Model
4.3 Computing the Speed Up
4.3.1 Using Quick Sort
4.3.2 Using Heap Sort
5 CONCLUSION
5.1 Major Results
5.2 Future Work
REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES
Table
4.1. Quick Sort Results Of Using 2 Processors
4.2. Quick Sort Results Of Using 4 Processors
4.3. Quick Sort Results Of Using 8 Processors
4.4. Heap Sort Results Of Using 2 Processors
4.5. Heap Sort Results Of Using 4 Processors
4.6. Heap Sort Results Of Using 8 Processors
LIST OF FIGURES
Figure
2.1. Architecture of a PDS Model
3.1. A PMD Model of an n x n Mesh
4.1. Logical PMF Model
4.2. Unshuffle Result
4.3. Merging Result
4.4. Partition File into q Parts
4.5. Partition Parts into r Cells
4.6. Quick Sort Speed Up Chart Using 2 Processors
4.7. Quick Sort Speed Up Chart Using 4 Processors
4.8. Quick Sort Speed Up Chart Using 8 Processors
4.9. Heap Sort Speed Up Chart Using 2 Processors
4.10. Heap Sort Speed Up Chart Using 4 Processors
4.11. Heap Sort Speed Up Chart Using 8 Processors
Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science
A PRACTICAL REALIZATION
OF PARALLEL DISKS FOR A DISTRIBUTED PARALLEL
COMPUTING SYSTEM
By
Xiaoming Jin
December 2000
Chair: Dr. Sanguthevar Rajasekaran
Major Department: Computer and Information Science and Engineering
Several models of parallel sorting are found in the literature. Among these
models, the parallel disk models are proposed to alleviate the I/O bottleneck when
handling large amounts of data. These models have the general theme of assuming
multiple disks. For instance, the Parallel Disk Systems (PDS) model assumes D disks
attached to a single computer. It is also assumed that a block of data from
each of the D disks can be fetched into the main memory in one parallel I/O operation. In
this thesis we present a new model for multiple disks and evaluate its performance. This
model is called a Parallel Machine with Multiple Disks (PMD). A PMD model has
multiple computers each of which is connected with one disk. A PMD model can be
thought of as a realization of the PDS model. In this thesis, we also present a more
practical model called the Parallel Machine with Multiple Files (PMF). A PMF
model has multiple computers connected to a central file system. We investigate the
sorting problem on this new model. Our analysis demonstrates the practicality of the
PMF. We also present experimental confirmation of this assertion with data from our
implementation.
CHAPTER 1
INTRODUCTION
1.1 Scope and Objective
Computing applications have advanced to a stage where voluminous data is the
norm. The volume of data dictates the use of secondary storage devices such as disks.
Even the use of just a single disk may not be sufficient to handle I/O operations
efficiently. Thus researchers have introduced models with multiple disks.
A model that has been studied extensively (which is a refinement of prior models)
is the Parallel Disk Systems (PDS) model [17]. In this model there is a single computer
and D disks. In one parallel I/O, a block of data from each of the D disks can be brought
into the main memory. A block consists of B records. If M is the internal memory size,
then one usually requires that $M \ge 2DB$. Algorithm designers have proposed algorithms
for numerous fundamental problems on the PDS model. In the analysis of these
algorithms they counted only the I/O operations since the local computations can be
assumed to be very fast.
The practical realization of this model is an important research issue. Models such
as Hierarchical Memory Models (HMMs) [8,9] have been proposed in the literature to
address this issue. Realizations of HMMs using PRAMs and hypercubes have been
explored [9]. Sorting algorithms on these realizations have been investigated.
In this thesis we propose a straightforward model called a Parallel Machine with
Disks (PMD). A PMD can be thought of as a special case of the HMM. A PMD is
nothing but a parallel machine where each processor has an associated disk. The parallel
machine can be structured or unstructured. If the parallel machine is structured, the
underlying topology could be a mesh, a hypercube, a star graph, etc. Examples of
unstructured parallel computers include SMPs and clusters of workstations (employing PVM
or MPI). In some sense, the PMD is nothing but a parallel machine on which we study
out-of-core algorithms. In the PMD model we count not only the I/O operations but also
the communication steps. One can think of a PMD as a realization of the PDS model.
Given the connection between HMMs and PDSs, we can state that prior works have
considered variants of the PMD where the underlying parallel machine is either a PRAM
or a hypercube [9]. We begin the study of PMDs with the sorting problem. Sorting is an
important problem of computing that finds applications in all walks of life. We analyze
the performance of the LMM sort algorithm of Rajasekaran [14] on the PMD (where the
underlying parallel machine is a mesh, a hypercube, or a cluster of workstations).
In particular, we present a new model called the Parallel Machine with
Multiple Files (PMF). A PMF model has multiple computers which are managed by a
network file system. Because network file systems are now a popular form of
distributed file system, this model can be more practical in real life. In this model,
input data is partitioned into several files which are stored in the file system. All
computers can read data from and write data to these files. We show why this model has more
appealing properties than other models. We compute the run times for sorting in this
model. We also compute the speedups obtained by using multiple computers versus using a
single computer. These analyses demonstrate the practicality of the PMF.
1.2 Thesis Outline
This thesis consists of five chapters. In addition to this introduction, the rest of
the thesis is organized as follows.
In Chapter 2, we provide a summary of known algorithms for the PDS
model.
In Chapter 3, we present details of the PMD model. To make our discussion
concrete, we use the mesh as the topology of the underlying parallel machine. However,
the discussion applies to any parallel machine. Also in this chapter, we state some routing
and sorting algorithms which are applied on the PMD model. In particular, we give a detailed
description of the LMM algorithm, which plays a vital role in both the PMD and PMF
models.
In Chapter 4, we present a more practical model, the PMF model. We state its
structure and give a detailed description of its implementation. We also present our
experimental results for the PMF model.
Chapter 5 concludes the thesis.
CHAPTER 2
SORTING ON THE PDS MODEL
In this chapter, we present an overview of sorting results on the PDS model, which
has the structure of multiple disks attached to a single computer.
2.1 The PDS Model
Sorting has been studied well on the PDS model (see Fig. 2.1). A known lower
bound for the number of I/O read steps for parallel disk sorting is
$\Omega\left(\frac{N}{DB}\,\frac{\log(N/B)}{\log(M/B)}\right)$. Here N
is the number of records to be sorted and M is the internal memory size. Also, B is the
block size and D is the number of parallel disks used. There exist several asymptotically
optimal algorithms that make $O\left(\frac{N}{DB}\,\frac{\log(N/B)}{\log(M/B)}\right)$ I/O read steps
(see e.g., references 10, 1, and 3).
Figure 2.1. Architecture of a PDS Model
One of the early papers on disk sorting was by Aggarwal and Vitter [2]. In the
model they considered, each I/O operation results in the transfer of D blocks each block
having B records. A more realistic model was envisioned by Vitter and Shriver [17].
Several asymptotically optimal algorithms have been given for sorting on this model.
Nodine and Vitter's optimal algorithm [8] involves solving certain matching problems.
Aggarwal and Plaxton's optimal algorithm [1] is based on the Sharesort algorithm of
Cypher and Plaxton. Vitter and Shriver gave an optimal randomized algorithm for disk
sorting [17]. All of these results are highly nontrivial and theoretically interesting.
However, the underlying constants in their time bounds are high.
2.2 Sorting on The PDS Model
In practice, the simple disk-striped mergesort (DSM) is used [4], even though it is
not asymptotically optimal. DSM has the advantages of simplicity and a small constant.
The data accesses made by DSM are such that in any I/O operation, the same portions of the D
disks are accessed. This has the effect of having a single disk which can transfer DB
records in a single I/O operation. An $\frac{M}{DB}$-way mergesort is employed by this algorithm.
To start with, initial runs are formed in one pass through the data. At the end the disk has
N/M runs each of length M. Next, $\frac{M}{DB}$ runs are merged at a time. Blocks of any run are
uniformly striped across the disks so that in future they can be accessed in parallel
utilizing the full bandwidth.
Each phase of merging involves one pass through the data. There are
$\frac{\log(N/M)}{\log(M/DB)}$ phases, and hence the total number of passes made by DSM is
$1 + \frac{\log(N/M)}{\log(M/DB)}$. In other words,
the total number of I/O read operations performed by the algorithm is
$\frac{N}{DB}\left(1 + \frac{\log(N/M)}{\log(M/DB)}\right)$.
The constant here is just 1.
If one assumes that N is a polynomial in M and that B is small (which are readily
satisfied in practice), the lower bound simply yields $\Omega(1)$ passes. All the above
mentioned optimal algorithms make only $O(1)$ passes. So, the challenge in the design of
parallel disk sorting algorithms is in reducing this constant. If M = 2DB, the number of
passes made by DSM is 1 + log(N/M), which indeed can be very high.
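The pass count of DSM can be illustrated with a small calculation. The following is only a sketch: the function name `dsm_passes` and its parameters are hypothetical, and it assumes one pass for run formation followed by merging phases of fan-in M/(DB), with the number of runs rounded up to an integer after each phase.

```python
import math


def dsm_passes(n, m, d, b):
    """Count passes made by disk-striped mergesort (illustrative sketch).

    n: number of records, m: internal memory size (records),
    d: number of disks, b: block size (records).
    """
    fan_in = m // (d * b)          # M/(DB) runs are merged at a time
    if fan_in < 2:
        raise ValueError("need M >= 2DB so that at least two runs can be merged")
    runs = math.ceil(n / m)        # N/M initial runs, each of length M
    passes = 1                     # one pass to form the initial runs
    while runs > 1:
        runs = math.ceil(runs / fan_in)
        passes += 1                # each merging phase is one pass
    return passes


# With M = 2DB the fan-in is 2, so the count is 1 + log2(N/M).
print(dsm_passes(2**20, 2**10, 4, 2**7))
```

With N = 2^20, M = 2^10, D = 4 and B = 2^7, the fan-in is 2 and the function reports 11 passes, matching 1 + log(N/M).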
Recently, research has been performed dealing with the practical aspects. Pai,
Schaffer, and Varman [11] analyzed the average case performance of a simple merging
algorithm, employing an approximate model of average case inputs. Barve, Grove, and
Vitter [4] have presented a simple randomized algorithm (SRM) and analyzed its
performance. The analysis involves the solution of certain occupancy problems. The
expected number $Reads_{SRM}$ of I/O read operations made by their algorithm is such that
$$Reads_{SRM} \le \frac{N}{DB} + \frac{N}{DB}\,\frac{\log(N/M)}{\log kD}\,\frac{\log D}{k \log\log D}\left(1 + \frac{\log\log\log D}{\log\log D} + \frac{1+\log k}{\log\log D} + O(1)\right). \qquad (1)$$
The algorithm merges R = kD runs at a time, for some integer k. When $R = \Omega(D \log D)$,
the expected performance of their algorithm is optimal. However, in this case,
the internal memory needed is $\Omega(BD \log D)$. They have also compared SRM with DSM
through simulations and shown that SRM performs better than DSM. Recently,
Rajasekaran [14] has presented an algorithm (called (l,m)-merge sort (LMM)) which is
asymptotically optimal under the assumptions that N is a polynomial in M and B is small.
The algorithm is as simple as DSM. LMM makes fewer passes through the data
than DSM when D is large.
CHAPTER 3
A PARALLEL MACHINE WITH DISKS (PMD)
3.1 The PMD Model
In this section we give more details of the PMD model. A PMD is nothing but a
parallel machine where each processor has a disk. Each processor has a core memory of
size M. In one I/O operation, a block of B records can be brought into the core memory
of each processor from its own disk. Thus there are a total of D=P disks in the PMD,
where P is the number of processors. Records from one disk can be sent to another
through the communication mechanism available for the parallel machine after bringing
the records into the main memory of the origin processor. It is conceivable that the
communication time is considerable on the PMD. Thus it is essential to not only account
for the I/O operations but also for the communication steps, in analyzing any algorithm's
run time on the PMD.
A PMD can be thought of as a special case of the HMM [9]. Realizations of HMMs
using PRAMs and hypercubes have already been studied [9].
The sorting problem on the PMD can be defined as follows. There are a total of N
records to begin with, so that there are $\frac{N}{D}$ records in each disk. The problem is to
rearrange the records such that they are in either ascending order or descending order
with $\frac{N}{D}$ records ending up in each disk. It is assumed that the processors themselves have
been ordered so that the smallest $\frac{N}{D}$ records will be output in the first processor's disk, the
next smallest $\frac{N}{D}$ records will be output in the second processor's disk, and so on. This
indexing scheme is in line with the usual indexing scheme used in a parallel machine.
However, any other indexing scheme can also be used.
To make our discussions concrete, we will use the mesh (see Fig. 3.1) as an
example. Let the mesh be of size n x n. Then we have $D = n^2$ disks. An indexing scheme
is called for in sorting on a mesh (see e.g., Rajasekaran [13]). Some popular indexing
schemes are column-major, row-major, snake-like row-major, blockwise row-major, etc. For the
algorithm to be presented in this thesis, any of these schemes can be employed.
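As an illustration of one such indexing scheme, the snake-like row-major order can be sketched as follows. The function name and the zero-based coordinates are assumptions made for this example; even-numbered rows are numbered left to right and odd-numbered rows right to left.

```python
def snake_row_major_index(row, col, n):
    """Index of processor (row, col) of an n x n mesh in snake-like row-major order.

    Rows and columns are numbered from zero. Even rows run left to right,
    odd rows run right to left, so consecutive indices are always mesh neighbors.
    """
    if row % 2 == 0:
        return row * n + col
    return row * n + (n - 1 - col)


# Print the index of every processor of a 3 x 3 mesh, row by row.
for r in range(3):
    print([snake_row_major_index(r, c, 3) for c in range(3)])
```

In a sorted output, the processor with index i would hold the (i+1)-th smallest group of records.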
Figure 3.1. A PMD Model of an n x n Mesh
The algorithm to be presented in this thesis employs as subroutines some
randomized algorithms. We say a randomized algorithm uses $\tilde{O}(f(n))$ amount of any
resource (such as time, space, etc.) if the amount of resource used is no more than $c\alpha f(n)$
with probability $\ge (1 - n^{-\alpha})$, where c is a constant and $\alpha$ is a constant $\ge 1$. We can also
define other asymptotic functions in a similar fashion.
3.2 Sorting Algorithms
Parallel sorting algorithms have been widely studied because of the classical and
fundamental importance of sorting, and many such algorithms have been presented. In this
section, we discuss the problem of packet routing, which plays a vital role in the design of
algorithms on any parallel machine. We present algorithms for k-k routing and k-k
sorting. We also give a detailed explanation of the (l,m)-merge sort (LMM) algorithm, which is
central to sorting on the PMD and PMF models.
3.2.1 Algorithm of k-k Routing and k-k Sorting
The packet routing problem can be defined as follows. There is a packet of
information to start with at each processor that is destined for some other processor. The
problem is to send all the packets to their correct destinations as quickly as possible. In
any interconnection network, one requires that at most one packet traverses through any
edge at any time. The problem of partial permutation routing refers to packet routing
when at most one packet originates from any processor and at most one packet is destined
for any processor. Packet routing problems have been explored thoroughly on
interconnection networks (see e.g., Rajasekaran [13]).
The problem of k-k routing is the problem of routing where at most k packets
originate from any processor and at most k packets are destined for any processor.
In the case of an n x n mesh, it is easy to prove a lower bound of $\frac{kn}{2}$ on the
routing time for this problem based on bisection considerations. There are algorithms
whose run times match this bound closely, as stated in the following Lemma. A proof of
this Lemma can be found, e.g., in Kaufmann, Rajasekaran, and Sibeyn [5].
[Lemma 3.2.1.1]
The k-k routing problem can be solved in $\frac{kn}{2} + \tilde{o}(kn)$ time on an n x n mesh.
The problem of k-k sorting is defined as follows. There are k keys at each
processor of a parallel machine. The problem is to rearrange the keys in either ascending
or descending order according to some indexing scheme.
In an n x n mesh, this problem also has a lower bound of $\frac{kn}{2}$ on the run time. The
following Lemma gives a nearly optimal algorithm for k-k sorting (see e.g.,
Rajasekaran [13]).
[Lemma 3.2.1.2]
k-k sorting can be solved on an n x n mesh in $\frac{kn}{2} + \tilde{o}(kn)$ steps.
The above two Lemmas will be employed by our sorting algorithm on the mesh.
It should be noted here that there exist deterministic algorithms (see e.g., [6]) for k-k
routing and k-k sorting whose run times match those stated in Lemmas 3.2.1.1 and
3.2.1.2. However, we believe that the use of randomized algorithms will result in better
performance in practice.
3.2.2 The (l, m)-Merge Sort (LMM)
Many of the sorting algorithms that have been proposed for the PDS are based on
merging. These algorithms start by forming $\frac{N}{M}$ runs each of length M. A run is nothing
but a sorted subsequence. Forming these initial runs takes only one pass through the data
(or equivalently $\frac{N}{DB}$ parallel I/O operations). After this, the algorithms will merge R runs
at a time. Let a phase of merging refer to the task of scanning through the input once and
performing R-way merging. Note that each phase of merging will reduce the number of
remaining runs by a factor of R. For example, the DSM algorithm employs $R = \frac{M}{DB}$. The
various sorting algorithms differ in how each phase of mergings is done.
The (l,m)-merge sort algorithm of Rajasekaran [14] is also based on merging. It
employs R = l, for some appropriate l. The LMM is a generalization of the odd-even merge
sort, the $s^2$-way merge sort of Thompson and Kung [16], and the columnsort algorithm of
Leighton [7].
[3.2.2.1 Algorithm Odd-Even Mergesort]
The odd-even mergesort algorithm employs R=2. It repeatedly merges two
sequences at a time. To begin with there are n sorted runs each of length 1.
From thereon the number of runs is decreased by a factor of 2 with each
phase of mergings. Two runs are merged using the odd-even merge
algorithm that is described below.
1. Let $U = u_1, u_2, \ldots, u_q$ and $V = v_1, v_2, \ldots, v_q$ be the two sorted sequences
to be merged. Unshuffle U into two, i.e., partition U into two: $U_{odd} =
u_1, u_3, \ldots, u_{q-1}$ and $U_{even} = u_2, u_4, \ldots, u_q$. Similarly partition V into $V_{odd}$
and $V_{even}$.
2. Now recursively merge $U_{odd}$ with $V_{odd}$. Let $X = x_1, x_2, \ldots, x_q$ be the
result. Also merge $U_{even}$ with $V_{even}$. Let $Y = y_1, y_2, \ldots, y_q$ be the result.
3. Shuffle X and Y, i.e., form the sequence: $Z = x_1, y_1, x_2, y_2, \ldots, x_q, y_q$.
4. Perform one step of compare-exchange operation, i.e., sort
successive subsequences of length two in Z. In other words, sort $y_1$,
$x_2$; sort $y_2$, $x_3$; and so on. The resultant sequence is the merge of U and
V.
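The four steps of the odd-even merge can be sketched in a few lines. This is a minimal in-memory illustration, not the implementation used in this thesis; the function names are hypothetical, and it assumes both inputs are sorted, have the same length, and that this length is a power of two.

```python
def odd_even_merge(u, v):
    """Merge two sorted lists of equal power-of-two length by odd-even merge."""
    if len(u) == 1:
        return [min(u[0], v[0]), max(u[0], v[0])]
    # Step 1: unshuffle. u[0::2] holds u_1, u_3, ... and u[1::2] holds u_2, u_4, ...
    # Step 2: recursively merge the odd parts and the even parts.
    x = odd_even_merge(u[0::2], v[0::2])
    y = odd_even_merge(u[1::2], v[1::2])
    # Step 3: shuffle X and Y into Z = x_1, y_1, x_2, y_2, ...
    z = [item for pair in zip(x, y) for item in pair]
    # Step 4: one compare-exchange step on the pairs (y_1, x_2), (y_2, x_3), ...
    for i in range(1, len(z) - 1, 2):
        if z[i] > z[i + 1]:
            z[i], z[i + 1] = z[i + 1], z[i]
    return z


def odd_even_merge_sort(a):
    """Sort a list whose length is a power of two using R = 2 merging."""
    if len(a) <= 1:
        return list(a)
    half = len(a) // 2
    return odd_even_merge(odd_even_merge_sort(a[:half]),
                          odd_even_merge_sort(a[half:]))


print(odd_even_merge([1, 4, 6, 7], [2, 3, 5, 8]))
print(odd_even_merge_sort([5, 2, 7, 1, 8, 3, 6, 4]))
```

Both calls print the numbers 1 through 8 in ascending order.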
The correctness of this algorithm can be established using the zero-one principle.
The algorithm of Thompson and Kung [16] is a generalization of the above algorithm
where R is taken to be $s^2$ for some appropriate function s of n. At any given time $s^2$ runs
are merged using an algorithm similar to the above.
[3.2.2.2 Algorithm (l,m)-Merge]
LMM is a generalization of the $s^2$-way merge sort algorithm. It uses R = l.
Each phase of mergings thus reduces the number of runs by a factor of l.
At any time, l runs are merged using the (l,m)-merge algorithm. This
merging algorithm is similar to the odd-even merge except that in Step 1,
the runs are m-way unshuffled (instead of 2-way unshuffling). In Step 3,
m sequences are shuffled and also in Step 4, the local sorting is done
differently. A detailed description of the merging algorithm follows.
1. Let the sequences to be merged be $U_i = u_i^1, u_i^2, \ldots, u_i^r$, for $1 \le i \le l$. If r
is small use a base case algorithm. Otherwise, unshuffle each $U_i$
into m parts. In particular, partition $U_i$ into $U_i^1, U_i^2, \ldots, U_i^m$, where
$U_i^1 = u_i^1, u_i^{1+m}, \ldots$; $U_i^2 = u_i^2, u_i^{2+m}, \ldots$; and so on.
2. Recursively merge $U_1^j, U_2^j, \ldots, U_l^j$, for $1 \le j \le m$. Let the merged
sequences be $X_j = x_j^1, x_j^2, \ldots, x_j^{lr/m}$, for $1 \le j \le m$.
3. Shuffle $X_1, X_2, \ldots, X_m$, i.e., form the sequence
$Z = x_1^1, x_2^1, \ldots, x_m^1, x_1^2, x_2^2, \ldots, x_m^2, \ldots, x_1^{lr/m}, x_2^{lr/m}, \ldots, x_m^{lr/m}$.
4. It can be shown that at this point the length of the "dirty
sequence" (i.e., unsorted portion) is no more than lm. But we
don't know where the dirty sequence is located. We can clean up
the dirty sequence in many different ways. One way is described
below.
Call the sequence of the first lm elements of Z as $Z_1$; the next lm
elements as $Z_2$; and so on. In other words, Z is partitioned into $Z_1$,
$Z_2, \ldots, Z_{r/m}$. Sort each one of the $Z_i$'s. Following this, merge $Z_1$ and $Z_2$;
merge $Z_3$ and $Z_4$; etc. Finally merge $Z_2$ and $Z_3$; merge $Z_4$ and $Z_5$; and
so on.
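The (l,m)-merge steps above can be illustrated with a short in-memory sketch. It is not the disk-based implementation discussed in this thesis. The function name `lm_merge` is hypothetical, all runs are assumed to have the same length r, and the base case (an ordinary multiway merge via `heapq.merge`, used when r is at most m or not divisible by m) is a choice made only for this example.

```python
import heapq


def lm_merge(runs, m):
    """Merge l sorted runs of equal length r using (l,m)-merge (in-memory sketch)."""
    l = len(runs)
    r = len(runs[0])
    # Base case (assumption of this sketch): fall back to a plain multiway merge.
    if r <= m or r % m != 0:
        return list(heapq.merge(*runs))
    # Steps 1 and 2: m-way unshuffle each run, then recursively merge the j-th parts.
    merged_parts = [lm_merge([run[j::m] for run in runs], m) for j in range(m)]
    # Step 3: shuffle X_1, ..., X_m into Z.
    z = [merged_parts[j][t] for t in range(l * r // m) for j in range(m)]
    # Step 4: the dirty sequence has length at most l*m; clean it up.
    size = l * m
    chunks = [sorted(z[i:i + size]) for i in range(0, len(z), size)]
    for start in (0, 1):   # first merge (Z1,Z2), (Z3,Z4), ...; then (Z2,Z3), (Z4,Z5), ...
        for i in range(start, len(chunks) - 1, 2):
            both = list(heapq.merge(chunks[i], chunks[i + 1]))
            chunks[i], chunks[i + 1] = both[:size], both[size:]
    return [x for chunk in chunks for x in chunk]


runs = [[1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15], [4, 8, 12, 16]]
print(lm_merge(runs, 2))
```

With m = 2 the unshuffling and shuffling are the same as in the odd-even merge; only the number of runs merged at once and the cleanup step differ.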
The above algorithm is not specific to any architecture. (The same can be said
about any algorithm.) An implementation of LMM on the PDS has been given in
Rajasekaran [14]. The number of I/O operations needed in this implementation has been
shown to be $\frac{N}{DB}\left[\frac{\log(N/M)}{\log(\min\{\sqrt{M},\, M/B\})} + 1\right]^2$. When N is a polynomial in M and M is a polynomial
in B this reduces to a constant number of passes through the data and hence LMM is
optimal. In Rajasekaran [14] it has been demonstrated that LMM can be faster than
DSM when D is large. Recent implementation results of Pearson [12] indicate that LMM
is competitive in practice. Thus a natural choice of sorting algorithm for the PMD is LMM.
In the next section we implement LMM on a PMD and analyze the resultant I/O and
communication steps.
3.3 Sorting on the PMD Model
We begin by considering the sorting problem on the mesh. The result can be
generalized to any parallel machine.
3.3.1 Sorting on the Mesh
Consider a PMD where the underlying machine is an n x n mesh. The number of
disks is $D = n^2$. Each node in the mesh is a processor with a core memory of size M. In
one I/O operation, a processor can bring a block of B records into its main memory.
Thus the PMD as a whole can bring in DB records in one I/O operation, i.e., we can
relate a PMD with a PDS whose main memory capacity is DM and that has D disks.
Let the number of records to be sorted be N. To begin with, there are $\frac{N}{D}$ records
at each disk of the PMD. The goal is to rearrange the records in either ascending order or
descending order such that each disk gets $\frac{N}{D}$ records at the end. An indexing scheme has
to be assumed. For the algorithm to be presented any of the following schemes will be
acceptable: row-major, column-major, snake-like row-major, snake-like column-major,
blockwise row-major, blockwise column-major, blockwise snake-like row-major, and
blockwise snake-like column-major. We assume the blockwise snake-like row-major
order for the following presentations. The block size is $\frac{N}{D}$, i.e., the first (in the snake-like
row-major order) processor will store the smallest $\frac{N}{D}$ records, the second processor will
store the next smallest $\frac{N}{D}$ records, and so on.
As one can easily see, the entire LMM algorithm consists of shuffling,
unshuffling and local sorting steps. We use the k-k routing and k-k sorting algorithms
(Lemmas 3.2.1.1 and 3.2.1.2) to perform these steps. Typically, we bring records from
the disks until the local memories are filled. Processing on these records is done using
k-k routing and k-k sorting algorithms. The queue length of the k-k sorting and k-k routing
algorithms is $k + \tilde{o}(k)$, so we do not fill the core memory completely. We only half-fill the local
memories so as to run the randomized algorithms. Also, in order to overlap I/O with local
computations, only half of this memory can be used to store operational data. We refer to
this portion of the core memory as M, i.e., M is one-fourth of the core memory size
available for each processor.
To begin with we form $\frac{N}{DM}$ sorted runs each of length DM. The number of I/O
operations performed is $\frac{N}{DB}$, since each processor needs to perform $\frac{M}{B}$ I/O operations for each run. Also,
the number of communication steps is $\tilde{O}\left(\frac{N}{D}\,n\right)$. This is so because we perform $\frac{N}{DM}$
k-k sorts (with k = M), and each such sort takes $\frac{kn}{2} + \tilde{o}(kn)$ steps.
Since LMM is based on merging in phases, we have to specify how the runs in a
phase are stored across the D disks. Let the disks as well as the runs be numbered from
zero. We use the same scheme as the one given in Rajasekaran [14]. Each run will be
striped across the disks. If $R \ge D$, the starting disk for the i-th run is i mod D, i.e., the
zeroth block of the i-th run will be in disk i mod D; its first block will be in disk (i+1) mod
D; and so on. This will enable us to access, in one I/O read operation, one block each
from D distinct runs and hence obtain perfect disk parallelism. If R < D, the starting disk
for the i-th run is $\frac{iD}{R}$. (Assume without loss of generality that R divides D.) Even now, we
can obtain $\frac{D}{R}$ blocks from each of the runs in one I/O operation and hence achieve perfect
disk parallelism.
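The striping scheme can be illustrated with a small function that maps a block of a run to a disk. This is a sketch under stated assumptions: the function name is hypothetical, disks, runs and blocks are numbered from zero, and for R < D the starting disk of run i is taken to be i·D/R with R dividing D.

```python
def disk_of_block(run, block, num_runs, num_disks):
    """Disk that stores the given block of the given run under striping."""
    if num_runs >= num_disks:
        start = run % num_disks                   # starting disk is i mod D
    else:
        start = run * (num_disks // num_runs)     # starting disk is i*D/R (R divides D)
    return (start + block) % num_disks


# R = 8 runs on D = 4 disks: block 0 of runs 0..3 lies on four distinct disks.
print([disk_of_block(i, 0, 8, 4) for i in range(4)])
# R = 2 runs on D = 8 disks: blocks 0..3 of both runs lie on eight distinct disks.
print([disk_of_block(i, b, 2, 8) for i in range(2) for b in range(4)])
```

In both cases the blocks fetched together fall on distinct disks, which is what allows them to be read in a single parallel I/O operation.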
3.3.2 Base Cases
LMM is a recursive algorithm whose base cases are handled efficiently. We now
discuss two base cases.
Base Case 1. Consider the problem of merging $\sqrt{DM}$ runs each of length DM, when
$\frac{DM}{B} \ge \sqrt{DM}$. This merging is done using (l,m)-merge with $l = m = \sqrt{DM}$.
Let $U_1, U_2, \ldots, U_{\sqrt{DM}}$ be the sequences to be merged. In Step 1, each $U_i$
gets unshuffled into $\sqrt{DM}$ parts so that each part is of length $\sqrt{DM}$. This unshuffling can
be done in one pass through the data. Thus the number of I/O operations is $\frac{N}{DB}$. The
communication time is $\tilde{O}\left(\frac{N}{D}\,n\right)$.
Note. Throughout the algorithm, each pass through the data will involve $\frac{N}{DB}$ I/O
operations and $\tilde{O}\left(\frac{N}{D}\,n\right)$ communication steps. Also, we use T(u,v) to denote the number of
read passes needed to merge u sequences of length v each.
In Step 2, we have $\sqrt{DM}$ merges to do, each merge involving $\sqrt{DM}$
sequences of length $\sqrt{DM}$ each. Since there are only DM records in each merge, all the
mergings can be done in one pass through the data.
Steps 3 and 4 perform shuffling and cleaning up, respectively. The length of the
dirty sequence is $(\sqrt{DM})^2 = DM$. These two steps can be combined and finished in one
pass through the data (see [14] for details). Thus we get:
[Lemma 3.3.2.1]
$T(\sqrt{DM}, DM) = 3$, if $\frac{DM}{B} \ge \sqrt{DM}$.
Base Case 2. This is the case of merging $\frac{DM}{B}$ runs each of length DM, when
$\frac{DM}{B} < \sqrt{DM}$. This problem can be solved using (l,m)-merge with $l = m = \frac{DM}{B}$.
In this case we can obtain:
[Lemma 3.3.2.2]
$T\left(\frac{DM}{B}, DM\right) = 3$, if $\frac{DM}{B} < \sqrt{DM}$.
3.3.3 The Sorting Algorithm
LMM algorithm has been presented in two cases. In our implementation the two
cases will be when " >DM and when < DM In either case, initial runs are
formed in one pass at the end of which sorted sequences of length DM each remain
to be merged.
When ' > DM, (/,m)merge is employed with / = m = Let K denote
DM and let = K2. In other words, c = logM)
T(,) can be expressed as follows.
T(K2c, DM) = T(K, DM) + T(K,KDM) + .. + T(K, K2c1DM) (2)
The above relation basically means that there are KIc sequences of length DM
each to begin with; we merge K at a time to end up with K21 sequences of length KDM
each; again merge K at a time to end up with K2c2 sequences of length K2DM each; and
so on. Finally there will be K sequences of length K2c1 DM each which are merged. Each
of these mergings is done using (1,m)merge with / = m = DM.
It can also be shown that
T(K, K'DM) = 2i + T(K,DM) = 2i + 3.
The fact that T(K, DM) = 3 (c.f. Lemma 3.3.2.1) has been used.
Upon substituting this into Equation (2), we get
2c 1
T(K2c, DM) = (2i + 3) = 4c2 + 4c
i=0
where c = log(DM) Now, we have the following:
[Theorem 3.3.3.1]
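The number of read passes needed to sort $N$ records is $1 + 4\left(\frac{\log(N/DM)}{\log(DM)}\right)^2 + 4\,\frac{\log(N/DM)}{\log(DM)}$ if $\frac{DM}{B} \geq \sqrt{DM}$. This number of passes is no more than $\left[\frac{\log(N/DM)}{\log(\min\{\sqrt{DM},\, DM/B\})} + 1\right]^2$. This means that the number of I/O read operations is no more than $\frac{N}{DB}\left[1 + 4\left(\frac{\log(N/DM)}{\log(DM)}\right)^2 + 4\,\frac{\log(N/DM)}{\log(DM)}\right]$.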
The number of communication steps is no more than $O\left(\frac{N}{D}\, n \left[1 + 4\left(\frac{\log(N/DM)}{\log(DM)}\right)^2 + 4\,\frac{\log(N/DM)}{\log(DM)}\right]\right)$.
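The closed form used for this case can be checked numerically. The following Python sketch sums the per-level cost 2i + 3 over the 2c merge levels and compares the result with 4c^2 + 4c; adding the single run-formation pass gives (2c + 1)^2. The function names are ours and are not part of the thesis, and c is assumed to be a positive integer.

```python
def merge_passes_case1(c: int) -> int:
    """Sum the per-level cost 2i + 3 over the 2c merge levels of Equation (2)."""
    return sum(2 * i + 3 for i in range(2 * c))


def total_read_passes_case1(c: int) -> int:
    """Add the single run-formation pass to the merge passes."""
    return 1 + merge_passes_case1(c)


if __name__ == "__main__":
    for c in range(1, 6):
        # Closed forms stated in the text: 4c^2 + 4c and (2c + 1)^2.
        assert merge_passes_case1(c) == 4 * c * c + 4 * c
        assert total_read_passes_case1(c) == (2 * c + 1) ** 2
        print(c, merge_passes_case1(c), total_read_passes_case1(c))
```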
The second case to be considered is when $\frac{DM}{B} < \sqrt{DM}$. Here $(l,m)$-merge will be used with $l = m = \frac{DM}{B}$. Let $Q$ denote $\frac{DM}{B}$ and let $\frac{N}{DM} = Q^d$. That is, $d = \frac{\log(N/DM)}{\log(DM/B)}$. As in case 1, we get
$$T(Q^d, DM) = T(Q, DM) + T(Q, Q\,DM) + \cdots + T(Q, Q^{d-1} DM). \qquad (3)$$
Also, we get
$$T(Q, Q^i DM) = 2i + T(Q, DM) = 2i + 3.$$
Here the fact that $T(Q, DM) = 3$ (cf. Lemma 3.3.2.2) has been used.
Equation (3) now becomes
$$T(Q^d, DM) = \sum_{i=0}^{d-1} (2i + 3) = d^2 + 2d,$$
where $d = \frac{\log(N/DM)}{\log(DM/B)}$.
[Theorem 3.3.3.2]
The number of read passes needed to sort $N$ records on the PMD is upper bounded by $\left[\frac{\log(N/DM)}{\log(\min\{\sqrt{DM},\, DM/B\})} + 1\right]^2$, if $\frac{DM}{B} < \sqrt{DM}$.
Theorem 3.3.3.1 and Theorem 3.3.3.2 readily yield:
[Theorem 3.3.3.3]
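We can sort $N$ records in $\leq \left[\frac{\log(N/DM)}{\log(\min\{\sqrt{DM},\, DM/B\})} + 1\right]^2$ read passes over the data. The total number of I/O read operations needed is $\leq \frac{N}{DB}\left[\frac{\log(N/DM)}{\log(\min\{\sqrt{DM},\, DM/B\})} + 1\right]^2$. Also, the total number of communication steps needed is $O\left(\frac{N}{D}\, n \left[\frac{\log(N/DM)}{\log(\min\{\sqrt{DM},\, DM/B\})} + 1\right]^2\right)$.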
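To get a feel for the size of these bounds, the pass bound and the I/O bound can be evaluated directly. The Python sketch below is only an illustration: the function names and the sample values of N, D, M and B are hypothetical, and the symbols are taken exactly as they appear in the theorem.

```python
import math


def read_pass_bound(N: float, D: float, M: float, B: float) -> float:
    """Evaluate [log(N/DM) / log(min{sqrt(DM), DM/B}) + 1]^2."""
    base = min(math.sqrt(D * M), D * M / B)
    if N <= D * M or base <= 1:
        raise ValueError("this sketch assumes N > DM and min{sqrt(DM), DM/B} > 1")
    levels = math.log(N / (D * M)) / math.log(base)
    return (levels + 1) ** 2


def io_read_bound(N: float, D: float, M: float, B: float) -> float:
    """Evaluate (N / DB) times the pass bound."""
    return N / (D * B) * read_pass_bound(N, D, M, B)


if __name__ == "__main__":
    N, D, M = 4096 * 4096, 4, 1024      # hypothetical sizes, so DM = 4096
    for B in (16, 512):                 # DM/B = 256 (first case) and 8 (second case)
        print(B, read_pass_bound(N, D, M, B), io_read_bound(N, D, M, B))
```

With these sample values the first case has sqrt(DM) = 64 as the base of the logarithm and the second case has DM/B = 8, so the two branches of the minimum are both exercised.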
3.3.4 Sorting on a general PMD
In this section we consider a general PMD where the underlying parallel machine
can either be structured (e.g., the mesh, the hypercube, etc.) or unstructured (e.g., SMP, a
cluster of workstations, etc.).
We can apply LMM on a general PMD, in which case the number of I/O operations will remain the same, i.e., $\frac{N}{DB}\left[\frac{\log(N/DM)}{\log(\min\{\sqrt{DM},\, DM/B\})} + 1\right]^2$. As has become clear from our discussion on the mesh, we need mechanisms for k-k routing and k-k sorting. Let $R_M$ and $S_M$ denote the time needed for performing one M-M routing and one M-M sorting on the parallel machine, respectively. Then, in each pass through the data, the total communication time will be $\frac{N}{DB}(R_M + S_M)$, implying that the total communication time for the entire algorithm will be
$$\leq \frac{N}{DB}(R_M + S_M)\left[\frac{\log(N/DM)}{\log(\min\{\sqrt{DM},\, DM/B\})} + 1\right]^2.$$
Thus we get the following general Theorem:
[Theorem 3.3.4.1]
Sorting on a general PMD model can be performed in $\frac{N}{DB}\left[\frac{\log(N/DM)}{\log(\min\{\sqrt{DM},\, DM/B\})} + 1\right]^2$ I/O operations. The total communication time is $\leq \frac{N}{DB}(R_M + S_M)\left[\frac{\log(N/DM)}{\log(\min\{\sqrt{DM},\, DM/B\})} + 1\right]^2$.
CHAPTER 4
PARALLEL MACHINE WITH MULTIPLE FILES (PMF)
In this chapter, we present a more practical parallel computing model called PMF.
We show why this model has some advantages and report our experimental evaluation of
the PMF.
4.1 Introduction of the PMF Model
A PMF model is nothing but multiple computers managed in a network file
system. The underlying parallel machine is a network of workstations. In this model,
input data will be partitioned into several files which are stored as source data (see Fig.
4.1). Computers can read and write data from these files.
Figure 4.1. Logical PMF Model. (Diagram: several computers, each labeled comp, connected to the source data.)
The sorting problem on the PMF can be defined as follows. There are a total of N records to be sorted, and there are P processors. The problem is to sort the N records into descending or ascending order. We partition the input data into P parts and store them in P files, each of size N/P. We index the files and processors as F1, F2, ..., Fp and P1, P2, ..., Pp, respectively. After sorting, the smallest N/P records will be output in F1, the next smallest N/P records will be output in F2, and so on.
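The input and output layout of the sorting problem can be sketched in a few lines of Python. This is only an illustration under our reading of the problem statement: files are modeled as in-memory lists, P is assumed to divide N, each output file is assumed to receive N/P records, and the function names are hypothetical.

```python
from typing import List, Sequence


def partition_into_files(records: Sequence[float], P: int) -> List[List[float]]:
    """Split N records into P consecutive 'files' of N/P records each."""
    n = len(records)
    if n % P != 0:
        raise ValueError("this sketch assumes P divides N")
    size = n // P
    return [list(records[i * size:(i + 1) * size]) for i in range(P)]


def expected_sorted_layout(records: Sequence[float], P: int) -> List[List[float]]:
    """Target state: the first file holds the N/P smallest records, the next file the next N/P, ..."""
    return partition_into_files(sorted(records), P)


if __name__ == "__main__":
    data = [5, 1, 4, 2, 3, 0]
    print(partition_into_files(data, 3))    # initial source files
    print(expected_sorted_layout(data, 3))  # layout the sort must produce
```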
The LMM sort plays a vital role in the PMF model. The LMM algorithm consists of shuffling, unshuffling, and local sorting steps. In each step, the data are partitioned into many parts, and each processor can pick any part from the partition and do the sorting, unshuffling, shuffling, and cleaning. This property is essential in the PMF model because processors do not need to communicate to get data from each other.
Since there is no communication between processors, the I/O time becomes critical. But in the PMF model, if we increase the number of processors, the I/O time will be reduced. Using 2n processors will reduce the I/O time by half compared with using n processors.
The main advantage of this parallel computing model is that it sorts data fully in parallel without any communication between processors.
4.2 Sorting on the PMF Model
In this section, we show in detail how to implement LMM sort on the PMF model, and how to partition the data so that processors can read and write data in parallel without any communication between processors. LMM sort is a recursive algorithm. Here, we do not do the merge recursively, but carry it out for just one level. Meanwhile, we use Quick sort or Heap sort for our local sorting. The detailed description of sorting follows.
The input data size is N, and there are q processors. We create q files, each of size N/q. Also suppose the memory size is M and it is much less than N/q. The value of m can be chosen freely. We have varied the value of m to see its influence on the run time, but the run time seems to be nearly the same. Here, for easy implementation, we choose m to be a divisor of N/q and a multiple of q. Without loss of generality, we suppose M is a divisor of N/q and (N/q)/M = r.
Step 1 We mark the files as F1, F2, ..., Fq. Also we mark the processors as P1, P2, ..., Pq. At first, processor Pi will input the first M data from Fi, sort it, unshuffle it into m parts with each part of size a = M/m, and then put it back to its original place (row 1) in Fi, for 1 ≤ i ≤ q. Then Pi inputs the second M data from Fi, sorts it, unshuffles it into m parts, and then puts it back to its original place (row 2) in Fi. This procedure is repeated r times. Each file Fi then holds r unshuffled sequences (see Fig. 4.2), for 1 ≤ i ≤ q. So after Step 1, there will be a total of r*q unshuffled sequences.
Figure 4.2. Unshuffle Result. (Diagram: rows 1 through r of a file; each row is a run of M records unshuffled into parts 1 through m, each of size a.)
As we notice, the sorting and unshuffling are done in parallel. There is no communication between processors since each processor only deals with its corresponding file. Meanwhile, if we increase the number of processors, we also increase the number of files correspondingly, which results in less data in each file. So if the data size is fixed, increasing the number of processors will reduce the number of I/O operations per processor.
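Step 1 can be sketched for a single file as follows. The sketch assumes the usual LMM meaning of unshuffling a sorted run into m parts (part i takes elements i, i+m, i+2m, ...), keeps everything in memory rather than on disk, and uses Python's built-in sort in place of Quick sort or Heap sort; the function names are hypothetical.

```python
from typing import List


def unshuffle(run: List[float], m: int) -> List[List[float]]:
    """Split a sorted run into m parts; part i takes elements i, i+m, i+2m, ..."""
    if len(run) % m != 0:
        raise ValueError("this sketch assumes m divides the run length")
    return [run[i::m] for i in range(m)]


def step1(file_records: List[float], M: int, m: int) -> List[List[List[float]]]:
    """Return r rows; each row is one sorted run of M records unshuffled into m parts."""
    if len(file_records) % M != 0:
        raise ValueError("this sketch assumes M divides the file size")
    rows = []
    for start in range(0, len(file_records), M):
        run = sorted(file_records[start:start + M])  # local sort of one memory load
        rows.append(unshuffle(run, m))               # m parts of size a = M/m
    return rows


if __name__ == "__main__":
    print(step1([4, 3, 2, 1, 8, 7, 6, 5], M=4, m=2))
```

Because each part is a subsequence of a sorted run, every part is itself sorted, which is what allows the parts to be merged in the next step.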
Step 2 In this step, we want to merge part 1 from all q files, part 2 from all q files, ..., and part m from all q files (see Fig. 4.2). This can also be done in parallel. But we need to create another q files to contain the merged results in order to read data and output data fully in parallel. Let us call these additional q files D1, D2, ..., Dq. As we mentioned above, our choice of m is a multiple of q. Suppose k = m/q. The scheme is as follows:
First step: Processor j reads part j from file F(j mod q).
Processor j reads part j from file F((j+1) mod q).
Processor j reads part j from file F((j+2) mod q).
...
Processor j reads part j from file F((j+q-1) mod q).
Notice that from each file, processor j will read in r copies of part j (one per row), so there are a total of r*q sequences of part j to merge. After processor j has merged these r*q sequences, it outputs the merged result to the first place (row 1) of Dj (see Fig. 4.3). We also notice that reading, merging, and outputting are all done in parallel: when processor j is reading data from file F(j mod q), processor (j+1) is reading data from file F((j+1) mod q); when processor j is reading data from F((j+1) mod q), processor (j+1) is reading data from file F((j+2) mod q). Meanwhile, different processors output merged data into different files.
Second step: Processor j reads part (q+j) from file F(j mod q).
Processor j reads part (q+j) from file F((j+1) mod q).
...
Processor j reads part (q+j) from file F((j+q-1) mod q).
After reading in the r*q sequences, processor j merges them and puts the merged result in the second place (row 2) of Dj (see Fig. 4.3).
...
kth step: Processor j reads part ((k-1)*q+j) from file F(j mod q).
Processor j reads part ((k-1)*q+j) from file F((j+1) mod q).
...
Processor j reads part ((k-1)*q+j) from file F((j+q-1) mod q).
After reading in the r*q sequences, processor j merges them and puts the merged result in the kth place (row k) of Dj (see Fig. 4.3).
After k merging steps, processor j will have generated k merged sequences, which are output in file Dj. The size of each merged sequence can be easily computed as mergeSize = r*a*q = N/m, and we must make sure that mergeSize < M (our choice of m must meet this requirement).
Figure 4.3. Merging Result. (Diagram: file Dj has k rows, each of size mergeSize; row 1 holds the merge result of part j, row 2 the merge result of part (q+j), ..., and row k the merge result of part ((k-1)*q+j), all produced by processor j.)
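The rotating read order of Step 2 can be illustrated with a short Python sketch. It uses 0-based indices for processors, files and parts (a convention of this sketch, whereas the text above numbers them from 1), keeps the data in memory, and uses `heapq.merge` from the standard library for the multiway merge; the function names are hypothetical.

```python
import heapq
from typing import List


def read_schedule(q: int) -> List[List[int]]:
    """schedule[s][j] = index of the file that processor j reads in sub-step s."""
    return [[(j + s) % q for j in range(q)] for s in range(q)]


def parts_for_processor(j: int, q: int, m: int) -> List[int]:
    """Part indices merged by processor j: j, q+j, ..., (k-1)*q+j with k = m/q."""
    return list(range(j, m, q))


def merge_part(files: List[List[List[List[float]]]], part: int) -> List[float]:
    """Merge the given part from every row of every file into one sorted sequence.

    files[f][row][part] is a sorted list, as produced by Step 1.
    """
    pieces = [row[part] for f in files for row in f]
    return list(heapq.merge(*pieces))


if __name__ == "__main__":
    for step in read_schedule(4):
        # In every sub-step the q processors touch q different files.
        assert sorted(step) == list(range(4))
        print(step)
    print(parts_for_processor(1, q=2, m=6))
```

Each row of the schedule is a permutation of the file indices, which is the property that lets all processors read at the same time without contending for a file.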
From the analysis above, we see that each merging step in Step 2 is done in parallel. We can also compute the I/O time for each processor. The logical I/O time is as follows.
I/O time = k*(input data time for each step + output data time for each step)
= k*(r*q + 1)
= m*(r + 1/q).
If the data size is fixed and we double the number of processors, both r and 1/q are halved, so it is clear from the above equation that the I/O time will be reduced by 1/2.
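The halving claim can be checked with exact arithmetic. The sketch below evaluates k*(r*q + 1) with r = (N/q)/M and k = m/q; the function name is hypothetical, and the sample values of N, M and m are the ones reported later in the experiments (data size and memory size in bytes).

```python
from fractions import Fraction


def step2_logical_io(N: int, q: int, M: int, m: int) -> Fraction:
    """Logical I/O count per processor in Step 2: k * (r*q + 1)."""
    r = Fraction(N, q * M)   # sorted runs (rows) per file
    k = Fraction(m, q)       # merge sub-steps per processor
    return k * (r * q + 1)


if __name__ == "__main__":
    N, M, m = 13271040, 138240, 120
    for q in (1, 2, 4, 8):
        io = step2_logical_io(N, q, M, m)
        # Same value written as m * (r + 1/q).
        assert io == m * (Fraction(N, q * M) + Fraction(1, q))
        print(q, io)
```

Since r*q = N/M does not depend on q, the expression equals (m/q)*(N/M + 1), so each doubling of q halves it exactly.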
Step 3 In this step, we shuffle the m merged sequences. We want to read data, shuffle data, and output data in parallel. The scheme is as follows:
We partition file Dj (1 ≤ j ≤ q) into q parts (see Fig. 4.4).
Figure 4.4. Partition File into q Parts. (Diagram: each of the rows 1 through k of file Dj is divided into parts 1 through q.)
Each processor will be responsible for its own part. So processor j will shuffle only part j. By partitioning the data in this way, processors can read data, shuffle data, and output data in parallel.
First step: Processor j reads part j from file D(j mod q).
Second step: Processor j reads part j from file D((j+1) mod q).
...
qth step: Processor j reads part j from file D((j+q-1) mod q).
We notice that the processors read data in parallel: when processor j is reading part j from file D(j mod q), processor (j+1) is reading part (j+1) from file D((j+1) mod q); when processor j is reading part j from file D((j+1) mod q), processor (j+1) is reading part (j+1) from file D((j+2) mod q). After each processor j has read its data, it will perform the shuffling and then output the result to file Fj.
The above analysis is just for easy understanding of how to shuffle data in parallel. In actual practice, we also need to partition each of the q parts into r cells, with each cell holding data of size a, where r = (N/q)/M and a = M/m, as already mentioned in Step 1. This partition is necessary because if each processor j inputs part j from all q files at once, the size of the input data is definitely larger than the memory size.
Now, we show the detailed implementation of Step 3. We need to partition each of the q parts into r cells (see Fig. 4.5).
Figure 4.5. Partition Parts into r Cells. (Diagram: in file Dj, each of the parts 1 through q of every row is divided into cells c1, c2, ..., cr, where ci represents cell i.)
First step: Processor j reads c1 from part j of file D(j mod q).
Processor j reads c1 from part j of file D((j+1) mod q).
Processor j reads c1 from part j of file D((j+2) mod q).
...
Processor j reads c1 from part j of file D((j+q-1) mod q).
After processor j reads in the data, it shuffles them and then outputs the result to the first place of file Fj.
Second step: Processor j reads c2 from part j of file D(j mod q).
Processor j reads c2 from part j of file D((j+1) mod q).
Processor j reads c2 from part j of file D((j+2) mod q).
...
Processor j reads c2 from part j of file D((j+q-1) mod q).
After processor j reads in the data, it shuffles them and then outputs the result to the second place of file Fj.
...
rth step: Processor j reads cr from part j of file D(j mod q).
Processor j reads cr from part j of file D((j+1) mod q).
Processor j reads cr from part j of file D((j+2) mod q).
...
Processor j reads cr from part j of file D((j+q-1) mod q).
After processor j reads in the data, it shuffles them and then outputs the result to the rth place of file Fj.
After r steps, the shuffling of the data is finished. From the analysis above, each processor reads data, shuffles data, and outputs data in parallel.
Logical I/O time = r*(input data time for one step + output data time for one step)
= r*(k*q + 1)
= r*((m/q)*q + 1)
= r*(m + 1).
If we double the number of processors, r will be reduced by 1/2. So doubling the number of processors will also reduce the I/O time by 1/2.
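The shuffle itself, and the reason it can be done one cell at a time, can be sketched as follows. The sketch assumes the usual LMM meaning of shuffling (interleaving the m merged sequences element by element, the inverse of unshuffling) and works on in-memory lists; the function names are hypothetical.

```python
from typing import List, Sequence


def shuffle(seqs: Sequence[Sequence[float]]) -> List[float]:
    """Interleave equal-length sequences: s0[0], s1[0], ..., s0[1], s1[1], ..."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    return [s[t] for t in range(length) for s in seqs]


def shuffle_in_cells(seqs: Sequence[Sequence[float]], a: int) -> List[float]:
    """Same result as shuffle(), produced one cell of `a` records per sequence at a time."""
    length = len(seqs[0])
    if length % a != 0:
        raise ValueError("this sketch assumes the cell size divides the sequence length")
    out: List[float] = []
    for start in range(0, length, a):
        # One cell from each of the m sequences: m*a = M records held at once.
        cells = [s[start:start + a] for s in seqs]
        out.extend(shuffle(cells))
    return out


if __name__ == "__main__":
    merged = [[1, 3, 5, 7], [2, 4, 6, 8]]
    assert shuffle(merged) == shuffle_in_cells(merged, 2)
    print(shuffle_in_cells(merged, 2))
```

Because the interleaved output is ordered first by position and then by sequence, shuffling cell by cell and concatenating the results gives the same sequence as shuffling everything at once.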
Step 4 In this step, we clean the dirty sequences. We know the size of the dirty sequence is no more than l*m = (N/M)*m. The cleaning can be very simple. Each processor j cleans the dirty sequence of file Fj. Since each processor cleans a different file, the cleaning can easily proceed in parallel.
Logical I/O time = 2*(N/q)/(l*m)
= (2*M)/(q*m).
If we double the number of processors, the I/O time will be reduced by 1/2.
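The cleaning procedure is not spelled out in this step. One common way to clean a sequence in which every element is at most d positions away from its sorted position is to sort overlapping windows of length 2d that advance by d records. The Python sketch below shows that technique; it is an assumption of this illustration, not necessarily the procedure used in the experiments, and the function name is hypothetical.

```python
from typing import List, Sequence


def clean_up(z: Sequence[float], d: int) -> List[float]:
    """Sort a sequence whose elements are each at most d positions out of place.

    Windows of length 2d are sorted in turn, each starting d records after
    the previous one, so neighbouring windows overlap by d records.
    """
    if d < 1:
        raise ValueError("d must be at least 1")
    out = list(z)
    n = len(out)
    start = 0
    while True:
        out[start:start + 2 * d] = sorted(out[start:start + 2 * d])
        if start + 2 * d >= n:
            break
        start += d
    return out


if __name__ == "__main__":
    # Sorted data disturbed only inside blocks of 4, so no element is more than 3 places off.
    z = [3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8]
    print(clean_up(z, 4))
```

Each window holds 2d records, so with d = l*m the amount of data handled at a time is bounded by the dirty-sequence length rather than by the file size.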
After Step 4, the sorting is finished. As we can see from the above analysis, there is no communication between processors. All the sorting procedures are done in parallel. Meanwhile, if we double the number of processors, the I/O time will be reduced by 1/2.
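The four steps can be put together in a single-process simulation that checks the logic of one-level LMM sorting. This is only a sketch: it works on one in-memory list instead of q files, ignores the file layout and the parallel schedule, assumes that M divides N and m divides M, and uses overlapping-window sorting for the clean-up (an assumption of the sketch). The function name is hypothetical.

```python
import heapq
from typing import List, Sequence


def lmm_one_level_sort(data: Sequence[float], M: int, m: int) -> List[float]:
    """One-level LMM sort: runs -> unshuffle -> merge -> shuffle -> clean up."""
    n = len(data)
    if n == 0 or n % M != 0 or M % m != 0:
        raise ValueError("this sketch assumes non-empty data, M | N and m | M")
    # Step 1: form sorted runs of M records and unshuffle each into m parts.
    runs = [sorted(data[s:s + M]) for s in range(0, n, M)]
    parts = [[run[i::m] for i in range(m)] for run in runs]
    # Step 2: merge part i of every run into one sorted sequence of length N/m.
    merged = [list(heapq.merge(*(p[i] for p in parts))) for i in range(m)]
    # Step 3: shuffle (interleave) the m merged sequences.
    z = [seq[t] for t in range(n // m) for seq in merged]
    # Step 4: clean up; the dirty length is at most l*m with l = N/M runs.
    d = len(runs) * m
    start = 0
    while True:
        z[start:start + 2 * d] = sorted(z[start:start + 2 * d])
        if start + 2 * d >= n:
            break
        start += d
    return z


if __name__ == "__main__":
    data = [(i * 37) % 960 for i in range(960)]   # a fixed permutation of 0..959
    assert lmm_one_level_sort(data, M=96, m=4) == sorted(data)
    print("sorted correctly")
```

In the PMF setting the same logical steps are distributed over q processors and 2q files as described above; the simulation only confirms that the sequence of operations yields a sorted result.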
4.3 Computing the Speed Up
In this section, we report our experimental evaluation of the PMF. We employ 2, 4, and 8 processors to sort data of different sizes and compute the real-time speed up by comparing with the result of using only 1 processor. In this simulated application, we fix the memory size to be 138240 bytes. The value of m is 120. The data we generate are random floating-point numbers. We use two different sorting algorithms, Quick sort and Heap sort, to sort the local data.
4.3.1 Using Quick Sort
Here, we use quick sort for our local sorting. The testing results are shown in the following tables. We also show charts of the data to illustrate the speed up.
Figure 4.6. Quick Sort Speed Up Chart Using 2 Processors. (Chart with series for 1 processor and 2 processors; vertical axis from 0 to 120.)
Table 4.1. Quick Sort Results Of Using 2 Processors
Input Data Size     1 Processor      2 Processors     Speed Up
3317760 bytes       41.2 seconds     22.1 seconds     1.86
6635520 bytes       66.4 seconds     34.1 seconds     1.94
9953280 bytes       87.2 seconds     44.5 seconds     1.97
13271040 bytes      104.8 seconds    53.2 seconds     1.97
Figure 4.7. Quick Sort Speed Up Chart Using 4 Processors. (Chart with series for 1 processor and 4 processors; vertical axis from 0 to 120.)
Table 4.2. Quick Sort Results Of Using 4 Processors
Input Data Size     1 Processor      4 Processors     Speed Up
3317760 bytes       41.2 seconds     13.4 seconds     3.07
6635520 bytes       66.4 seconds     21.6 seconds     3.07
9953280 bytes       87.2 seconds     27.2 seconds     3.20
13271040 bytes      104.8 seconds    33.1 seconds     3.16
Table 4.3. Quick Sort Results Of Using 8 Processors
Input Data Size     1 Processor      8 Processors     Speed Up
3317760 bytes       41.2 seconds     9.70 seconds     4.24
6635520 bytes       66.4 seconds     14.2 seconds     4.67
9953280 bytes       87.2 seconds     18.4 seconds     4.74
13271040 bytes      104.8 seconds    23.1 seconds     4.53
Figure 4.8. Quick Sort Speed Up Chart Using 8 Processors. (Chart with series for 1 processor and 8 processors; vertical axis from 0 to 120.)
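The speed up column of Tables 4.1 to 4.3 is the ratio of the 1-processor time to the p-processor time. The Python sketch below recomputes it from the timings listed in those tables and also reports the parallel efficiency (speed up divided by p), a derived quantity that is not reported in the tables. The names are hypothetical, and recomputed values can differ from the tabulated ones in the last digit.

```python
# Timings in seconds copied from Tables 4.1 to 4.3: {input size in bytes: {processors: seconds}}.
QUICK_SORT_SECONDS = {
    3317760: {1: 41.2, 2: 22.1, 4: 13.4, 8: 9.70},
    6635520: {1: 66.4, 2: 34.1, 4: 21.6, 8: 14.2},
    9953280: {1: 87.2, 2: 44.5, 4: 27.2, 8: 18.4},
    13271040: {1: 104.8, 2: 53.2, 4: 33.1, 8: 23.1},
}


def speed_up(t1: float, tp: float) -> float:
    """Ratio of the single-processor time to the p-processor time."""
    return t1 / tp


def efficiency(t1: float, tp: float, p: int) -> float:
    """Speed up divided by the number of processors."""
    return speed_up(t1, tp) / p


if __name__ == "__main__":
    for size, times in QUICK_SORT_SECONDS.items():
        for p in (2, 4, 8):
            s = speed_up(times[1], times[p])
            e = efficiency(times[1], times[p], p)
            print(f"{size:>9} bytes  p={p}  speed up={s:.2f}  efficiency={e:.2f}")
```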
4.3.2 Using Heap Sort
Here, we use heap sort for our local sorting. The testing results are shown in the following tables. We also show charts to illustrate the speed up.
Figure 4.9. Heap Sort Speed Up Chart Using 2 Processors. (Chart with series for 1 processor and 2 processors; horizontal axis: input data sizes 3317760, 6635520, 9953280, and 13271040; vertical axis from 0 to 120.)
Table 4.4. Heap Sort Results Of Using 2 Processors
Input Data Size     1 Processor      2 Processors     Speed Up
3317760 bytes       44.6 seconds     23.2 seconds     1.92
6635520 bytes       68.1 seconds     35.8 seconds     1.90
9953280 bytes       90.2 seconds     46.8 seconds     1.92
13271040 bytes      113.1 seconds    57.9 seconds     1.95
Figure 4.10. Heap Sort Speed Up Chart Using 4 Processors. (Chart with series for 1 processor and 4 processors; vertical axis from 0 to 120.)
Table 4.5. Heap Sort Results Of Using 4 Processors
Input Data Size     1 Processor      4 Processors     Speed Up
3317760 bytes       44.6 seconds     14.2 seconds     3.14
6635520 bytes       68.1 seconds     21.8 seconds     3.11
9953280 bytes       90.2 seconds     28.1 seconds     3.21
13271040 bytes      113.1 seconds    34.8 seconds     3.25
Figure 4.11. Heap Sort Speed Up Chart Using 8 Processors. (Chart with series for 1 processor and 8 processors; vertical axis from 0 to 120.)
Table 4.6. Heap Sort Results Of Using 8 Processors
Input Data Size     1 Processor      8 Processors     Speed Up
3317760 bytes       44.6 seconds     9.79 seconds     4.56
6635520 bytes       68.1 seconds     14.2 seconds     4.79
9953280 bytes       90.2 seconds     18.9 seconds     4.76
13271040 bytes      113.1 seconds    23.5 seconds     4.81
CHAPTER 5
CONCLUSION
5.1 Major Results
Parallel sorting algorithms have been widely studied because sorting is a classical and fundamental problem. Many sequential sorting algorithms have also been presented. Even though there are many sorting algorithms, many of them are less practical due to the large constants in their run time.
In this thesis, we have investigated a straightforward model of computing with multiple disks (which can be thought of as a special case of the HMM). This model, the PMD, can be thought of as a realization of prior models such as the PDS. We have also presented a sorting algorithm for the PMD. Here, we use the LMM sort algorithm on the PMD. But for local sorting, we have to use k-k routing and k-k sorting algorithms because of the communication involved. The I/O time and communication time are evaluated on the PMD. From the analysis of the result, the communication time is still significant.
In our research, reducing the communication time and the I/O time are the major concerns. Here, we presented the PMF model. The underlying parallel machine is a network of workstations. The data are partitioned and stored in several files which are managed by a network file system. The LMM sort algorithm plays a vital role in the PMF model. The LMM sort consists of sorting, shuffling, and unshuffling. These steps require the data to be partitioned into many parts. This property makes it possible to let different processors read and write different parts of the data from different files in parallel. Because of this, there is no communication between processors, and hence we overcome the communication overhead. Meanwhile, if we increase the number of processors, the I/O time will also be improved. Our experimental results for sorting indicate that we can get decent speedups in practice using the PMF model.
5.2 Future Work
From our testing results, we can see that when we use two processors, the speed up is almost 2. But when we use four processors, the speed up is close to 3, and when we use 8 processors, the speed up is only close to 5. The results are not as good as our analysis predicts. We believe this problem is caused by the initial start-up communication delay: the more processors we use, the more delay is incurred. Meanwhile, the speeds of different processors can differ, and this will also have a bad effect on the speed up. Our future research will be concerned with these problems.
REFERENCES
[1] A. Aggarwal and C. G. Plaxton, Optimal Parallel Sorting in Multi-Level Storage, Proc. Fifth Annual ACM Symposium on Discrete Algorithms, New York, 1994, pp. 659-668.
[2] A. Aggarwal and J. S. Vitter, The Input/Output Complexity of Sorting and Related Problems, Communications of the ACM, 31(9), 1988, pp. 1116-1127.
[3] L. Arge, The Buffer Tree: A New Technique for Optimal I/O-Algorithms, Proc. 4th International Workshop on Algorithms and Data Structures (WADS), New York, 1995, pp. 334-345.
[4] R. Barve, E. F. Grove, and J. S. Vitter, Simple Randomized Mergesort on Parallel Disks, Technical Report CS-1996-15, Department of Computer Science, Duke University, Durham, NC, October 1996.
[5] M. Kaufmann, S. Rajasekaran, and J. F. Sibeyn, Matching the Bisection Bound for Routing and Sorting on the Mesh, Proc. 4th Annual ACM Symposium on Parallel Algorithms and Architectures, New York, 1992, pp. 31-40.
[6] M. Kunde, Block Gossiping on Grids and Tori: Deterministic Sorting and Routing Match the Bisection Bound, Proc. First Annual European Symposium on Algorithms, Springer-Verlag Lecture Notes in Computer Science 726, New York, 1993, pp. 272-283.
[7] T. Leighton, Tight Bounds on the Complexity of Parallel Sorting, IEEE Transactions on Computers C-34(4), 1985, pp. 344-354.
[8] M. H. Nodine and J. S. Vitter, Large Scale Sorting in Parallel Memories, Proc. Third Annual ACM Symposium on Parallel Algorithms and Architectures, New York, 1991, pp. 29-39.
[9] M. H. Nodine and J. S. Vitter, Deterministic Distribution Sort in Shared and Distributed Memory Multiprocessors, Proc. Fifth Annual ACM Symposium on Parallel Algorithms and Architectures, New York, 1993, pp. 120-129.
[10] M. H. Nodine and J. S. Vitter, Greed Sort: Optimal Deterministic Sorting on Parallel Disks, Journal of the ACM, 42(4), 1995, pp. 919-933.
[11] V. S. Pai, A. A. Schaffer, and P. J. Varman, Markov Analysis of Multiple-Disk Prefetching Strategies for External Merging, Theoretical Computer Science, 128(2), 1994, pp. 211-239.
[12] M. D. Pearson, Fast Out-of-Core Sorting on Parallel Disk Systems, Technical Report PCS-TR99-351, Dartmouth College, Computer Science, Hanover, NH, June 1999, ftp://ftp.cs.dartmouth.edu/TR/TR99-351.ps.Z.
[13] S. Rajasekaran, Sorting and Selection on Interconnection Networks, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 21, 1995, pp. 275-296.
[14] S. Rajasekaran, A Framework For Simple Sorting Algorithms On Parallel Disk Systems, Proc. 10th Annual ACM Symposium on Parallel Algorithms and Architectures, New York, 1998, pp. 88-97.
[15] S. Rajasekaran, Selection Algorithms for the Parallel Disk Systems, Proc. International Conference on High Performance Computing, New York, 1998.
[16] C. D. Thompson and H. T. Kung, Sorting on a Mesh-Connected Parallel Computer, Communications of the ACM, 20(4), 1977, pp. 263-271.
[17] J. S. Vitter and E. A. M. Shriver, Algorithms for Parallel Memory I: Two-Level Memories, Algorithmica, 12(2-3), 1994, pp. 110-147.
BIOGRAPHICAL SKETCH
Xiaoming Jin was born in Nanjing, P.R. China. He is a Master of Science degree student in the Department of Computer and Information Science and Engineering at the University of Florida. He is also a Master of Science degree student in the Department of Mathematics at the University of Florida. He received his B.S. degree in mathematical science from Liberty University in Virginia.
